Updates to Grafana Dashboard and example alerts

More fun in the Grafana dashboard. The rows organize the panels and are
collapsible. Also tested with multiple nodes and fixed some labeling
issues that appear when there is more than one node.

Update the Grafana alerting README info and add some fun prose about one
of the alerts, along with some reorganizing of the code for clarity.

Finally, drop the time to fire for the alerts: it's better to have them be a bit touchy so users can verify they work than to leave them unsure whether they fire at all.
This commit is contained in:
Elijah DeLee
2022-10-04 12:21:16 -04:00
committed by Rebeccah
parent 560b952dd6
commit d50c97ae22
3 changed files with 1592 additions and 1094 deletions


@@ -36,9 +36,18 @@ GRAFANA=true PROMETHEUS=true EXTRA_SOURCES_ANSIBLE_OPTS="-e scrape_interval=1 ad
 We are configuring alerts in grafana using the provisioning files method. This feature is new in Grafana as of August, 2022. Documentation can be found: https://grafana.com/docs/grafana/latest/administration/provisioning/#alerting however it does not fully show all parameters to the config.
-One way to understand how to build rules is to build them in the UI and use chrometools to inspect the payload as you save the rules. It appears that the "data" portion of the payload for each rule is the same syntax as needed in the provisioning file config. To reload the alerts without restarting the container, from within the container you can send a POST with `curl -X POST http://admin:admin@localhost:3000/api/admin/provisioning/alerting/reload`. Keep in mind the grafana container does not contain `curl`. You can install it with the command `apk add curl`.
+One way to understand how to build rules is to build them in the UI and use chrometools to inspect the payload as you save the rules. It appears that the "data" portion of the payload for each rule is the same syntax as needed in the provisioning file config. To reload the alerts without restarting the container, from your terminal you can send a POST with `curl -X POST http://admin:admin@localhost:3001/api/admin/provisioning/alerting/reload`.
 Another way to export rules is explore the api.
 1. Get all the folders: `GET` to `/api/folders`
 2. Get the rules `GET` to `/api/ruler/grafana/api/v1/rules/{{ Folder }}`
+You can do this via curl or in the web browser.
+### Included Alerts
+#### Alert if remaining capacity low and pending jobs exist
+We want to know if jobs are in pending but we lack capacity in the cluster to run them. Our approach is to sum all remaining capacity in the cluster and compare it to the total capacity of the cluster. If less than 10% of our capacity is remaining and we have pending jobs, and this is true for more than 180s, we will fire the alert.
+This alert is named "capacity_below_10_percent" and can be found in this directory in https://github.com/ansible/awx/blob/devel/tools/grafana/alerting/alerts.yml
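The arithmetic behind this alert can be sanity-checked outside Grafana. A minimal sketch of the same comparison (the function name and sample numbers are ours, not part of AWX; the `> 1` pending-jobs comparison mirrors the provisioned math expression rather than the looser "we have pending jobs" wording above):

```python
def capacity_alert_should_fire(remaining_per_node, total_capacity, pending_jobs):
    """Mirror of the capacity_below_10_percent condition: fire when the
    summed remaining capacity across nodes drops below 10% of total
    cluster capacity while more than one job is pending."""
    last_remaining_capacity = sum(remaining_per_node)       # sum(awx_instance_remaining_capacity)
    ten_percent_total_capacity = total_capacity * 0.10      # $last_total_capacity*.10
    return (ten_percent_total_capacity > last_remaining_capacity) and pending_jobs > 1

# Hypothetical cluster: 3 nodes nearly full out of 300 total capacity, 5 jobs pending
print(capacity_alert_should_fire([10, 5, 5], 300, 5))    # True: 20 remaining < 30 threshold
print(capacity_alert_should_fire([50, 40, 30], 300, 5))  # False: plenty of capacity left
```

With the short `for:` duration in this commit, a condition like this should flip to firing quickly once you saturate a test cluster, which is the point of keeping the alerts touchy.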


@@ -2,15 +2,21 @@
 apiVersion: 1
 groups:
 - folder: awx
-  interval: 60s
+  interval: 10s
   name: awx_rules
   orgId: 1
+  exec_err_state: Alerting
+  no_data_state: NoData
   rules:
-  - condition: if_failures_too_high
-    dashboardUid: awx
+  - for: 5m
+    noDataState: OK
+    panelId: 2
+    title: failure_rate_exceeded_20_percent
+    uid: failure_rate_exceeded_20_percent
+    condition: compare
     data:
     - refId: total_errors
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 600
         to: 0
@@ -19,7 +25,7 @@ groups:
         editorMode: code
         expr: >-
           max(delta(awx_instance_status_total{instance="awx1:8013",
-          status="failed|error"}[30m]))
+          status=~"failed|error"}[30m]))
         hide: false
         intervalMs: 1000
         legendFormat: __auto
@@ -27,11 +33,11 @@ groups:
         range: true
         refId: total_errors
     - refId: max_errors
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -60,7 +66,7 @@ groups:
         refId: max_errors
         type: reduce
     - refId: total_success
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 600
         to: 0
@@ -80,11 +86,11 @@ groups:
         range: true
         refId: total_success
     - refId: max_success
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -113,11 +119,11 @@ groups:
         refId: max_success
         type: reduce
     - refId: compare
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -158,15 +164,19 @@ groups:
         maxDataPoints: 43200
         refId: compare
         type: math
-    for: 30m
+  - for: 60s
     noDataState: OK
-    panelId: 2
-    title: failure_rate_exceeded_20_percent
-    uid: failure_rate_exceeded_20_percent
-  - condition: if_redis_queue_too_large
+    panelId: 1
+    title: redis_queue_too_large_to_clear_in_2_min
+    uid: redis_queue_too_large_to_clear_in_2_min
+    condition: redis_queue_growing_faster_than_insertion_rate
     dashboardUid: awx
     data:
-    - datasourceUid: awx_prometheus
+    - refId: events_insertion_rate_per_second
+      relativeTimeRange:
+        from: 300
+        to: 0
+      datasourceUid: awx_prometheus
       model:
@@ -177,11 +187,11 @@ groups:
         range: true
         refId: events_insertion_rate_per_second
       queryType: ""
-      refId: events_insertion_rate_per_second
+    - refId: mean_event_insertion_rate
       relativeTimeRange:
-        from: 300
+        from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -208,11 +218,11 @@ groups:
         refId: mean_event_insertion_rate
         type: reduce
       queryType: ""
-      refId: mean_event_insertion_rate
+    - refId: redis_queue_size
       relativeTimeRange:
-        from: 0
+        from: 300
         to: 0
-    - datasourceUid: awx_prometheus
+      datasourceUid: awx_prometheus
       model:
         datasource:
           type: prometheus
@@ -226,11 +236,11 @@ groups:
         range: true
         refId: redis_queue_size
       queryType: ""
-      refId: redis_queue_size
+    - refId: last_redis_queue_size
       relativeTimeRange:
-        from: 300
+        from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -257,11 +267,12 @@ groups:
         refId: last_redis_queue_size
         type: reduce
       queryType: ""
-      refId: last_redis_queue_size
+    - refId: redis_queue_growing_faster_than_insertion_rate
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -282,44 +293,35 @@ groups:
           name: Expression
           type: __expr__
           uid: __expr__
-        expression: '($last_redis_queue_size > ($mean_event_insertion_rate * 120))'
+        expression: "($last_redis_queue_size > ($mean_event_insertion_rate * 120))"
         hide: false
         intervalMs: 1000
         maxDataPoints: 43200
+        refId: redis_queue_growing_faster_than_insertion_rate
         type: math
-      queryType: ""
-      refId: redis_queue_growing_faster_than_insertion_rate
-      relativeTimeRange:
-        from: 0
-        to: 0
-    for: 60s
+  - for: 60s
     noDataState: OK
-    panelId: 1
-    title: redis_queue_too_large_to_clear_in_2_min
-    uid: redis_queue_too_large_to_clear_in_2_min
-  - condition: if_capacity_is_too_low
-    dashboardUid: awx
-    no_data_state: OK
-    exec_err_state: Error
+    panelId: 3
+    uid: capacity_below_10_percent
+    title: capacity_below_10_percent
+    condition: pending_jobs_and_capacity_compare
     data:
     - refId: remaining_capacity
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
-        from: 1800
+        from: 300
         to: 0
       datasourceUid: awx_prometheus
       model:
-        editorMode: builder
-        expr: awx_instance_remaining_capacity{instance="awx1:8013"}
+        editorMode: code
+        expr: sum(awx_instance_remaining_capacity)
         hide: false
         intervalMs: 1000
         legendFormat: __auto
         maxDataPoints: 43200
         range: true
         refId: remaining_capacity
-    - refId: if_capacity_is_too_low
-      queryType: ''
+    - refId: last_remaining_capacity
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
@@ -328,14 +330,63 @@ groups:
         conditions:
         - evaluator:
             params:
-            - 20
-            - 0
-            type: lt
+            - 3
+            type: outside_range
           operator:
-            type: when
+            type: and
           query:
             params:
-            - remaining_capacity
+            - total_capacity
+          reducer:
+            params: []
+            type: percent_diff
+          type: query
+        datasource:
+          type: __expr__
+          uid: "-100"
+        expression: remaining_capacity
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: last
+        refId: last_remaining_capacity
+        type: reduce
+    - refId: total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 600
+        to: 0
+      datasourceUid: awx_prometheus
+      model:
+        datasource:
+          type: prometheus
+          uid: awx_prometheus
+        editorMode: code
+        expr: sum(awx_instance_capacity{instance="awx1:8013"})
+        hide: false
+        intervalMs: 1000
+        legendFormat: __auto
+        maxDataPoints: 43200
+        range: true
+        refId: total_capacity
+    - refId: last_total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - capacity_below_10%
           reducer:
             params: []
             type: avg
@@ -344,12 +395,142 @@ groups:
           name: Expression
           type: __expr__
           uid: __expr__
-        expression: remaining_capacity
+        expression: total_capacity
         hide: false
         intervalMs: 1000
         maxDataPoints: 43200
-        refId: if_capacity_is_too_low
-        type: classic_conditions
-    for: 30m
-    title: if_capacity_is_too_low
-    uid: if_capacity_is_too_low
+        reducer: last
+        refId: last_total_capacity
+        type: reduce
+    - refId: 10_percent_total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - last_total_capacity
+          reducer:
+            params: []
+            type: avg
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: "$last_total_capacity*.10"
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        refId: 10_percent_total_capacity
+        type: math
+    - refId: pending_jobs
+      queryType: ""
+      relativeTimeRange:
+        from: 600
+        to: 0
+      datasourceUid: awx_prometheus
+      model:
+        datasource:
+          type: prometheus
+          uid: awx_prometheus
+        editorMode: builder
+        expr: awx_pending_jobs_total{instance="awx1:8013"}
+        hide: false
+        intervalMs: 1000
+        legendFormat: __auto
+        maxDataPoints: 43200
+        range: true
+        refId: pending_jobs
+    - refId: last_pending_jobs
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - pending_jobs_and_capacity_compare
+          reducer:
+            params: []
+            type: avg
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: pending_jobs
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: last
+        refId: last_pending_jobs
+        type: reduce
+    - refId: pending_jobs_and_capacity_compare
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - 10_percent_total_capacity
+          reducer:
+            params: []
+            type: last
+          type: query
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - pending_jobs
+          reducer:
+            params: []
+            type: last
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: "($10_percent_total_capacity > $last_remaining_capacity) && $last_pending_jobs > 1"
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: mean
+        refId: pending_jobs_and_capacity_compare
+        type: math
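The redis_queue_too_large_to_clear_in_2_min rule asks whether the current callback-receiver backlog could be drained within two minutes at the observed event insertion rate. Its math expression can be sanity-checked outside Grafana; a minimal sketch (the function name and sample numbers are ours, not part of AWX):

```python
def redis_queue_alert_should_fire(last_redis_queue_size, mean_event_insertion_rate):
    """Mirror of the rule's math expression:
    ($last_redis_queue_size > ($mean_event_insertion_rate * 120)).
    120 is seconds: fire when the queue holds more than 2 minutes of work
    at the mean insertion rate."""
    return last_redis_queue_size > mean_event_insertion_rate * 120

# At 50 events/s, the receiver clears at most 6000 events in 2 minutes
print(redis_queue_alert_should_fire(9000, 50))  # True: backlog would take 3 minutes
print(redis_queue_alert_should_fire(3000, 50))  # False: backlog clears in 1 minute
```

This is the same reasoning the rule encodes via reduce expressions over `callback_receiver_events_queue_size_redis` style metrics: reduce to a last value and a mean rate, then compare.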

File diff suppressed because it is too large.