Updates to Grafana Dashboard and example alerts

More fun in the Grafana dashboard. The rows organize the panels and are
collapsible. Also tested with multiple nodes and fixed some labeling
issues that appear when there is more than one node.

Update the Grafana alerting README info and add some fun prose about one
of the alerts, along with some reorganizing of the code for clarity.

Finally, drop the time to fire for the alerts: it's better to have them be a bit touchy so users can verify they work than to leave them unsure whether they fire at all.
This commit is contained in:
Elijah DeLee
2022-10-04 12:21:16 -04:00
committed by Rebeccah
parent 560b952dd6
commit d50c97ae22
3 changed files with 1592 additions and 1094 deletions


@@ -36,9 +36,18 @@ GRAFANA=true PROMETHEUS=true EXTRA_SOURCES_ANSIBLE_OPTS="-e scrape_interval=1 ad
 We are configuring alerts in grafana using the provisioning files method. This feature is new in Grafana as of August, 2022. Documentation can be found: https://grafana.com/docs/grafana/latest/administration/provisioning/#alerting however it does not fully show all parameters to the config.
-One way to understand how to build rules is to build them in the UI and use chrometools to inspect the payload as you save the rules. It appears that the "data" portion of the payload for each rule is the same syntax as needed in the provisioning file config. To reload the alerts without restarting the container, from within the container you can send a POST with `curl -X POST http://admin:admin@localhost:3000/api/admin/provisioning/alerting/reload`. Keep in mind the grafana container does not contain `curl`. You can install it with the command `apk add curl`.
+One way to understand how to build rules is to build them in the UI and use chrometools to inspect the payload as you save the rules. It appears that the "data" portion of the payload for each rule is the same syntax as needed in the provisioning file config. To reload the alerts without restarting the container, from your terminal you can send a POST with `curl -X POST http://admin:admin@localhost:3001/api/admin/provisioning/alerting/reload`.
 Another way to export rules is explore the api.
 1. Get all the folders: `GET` to `/api/folders`
 2. Get the rules `GET` to `/api/ruler/grafana/api/v1/rules/{{ Folder }}`
+You can do this via curl or in the web browser.
+### Included Alerts
+#### Alert if remaining capacity low and pending jobs exist
+We want to know if jobs are in pending but we lack capacity in the cluster to run them. Our approach is to sum all remaining capacity in the cluster and compare it to the total capacity of the cluster. If less than 10% of our capacity is remaining and we have pending jobs, and this is true for more than 180s, we will fire the alert.
+This alert is named "capacity_below_10_percent" and can be found in this directory in https://github.com/ansible/awx/blob/devel/tools/grafana/alerting/alerts.yml
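The arithmetic behind this alert can be sanity-checked outside Grafana. A minimal sketch of the same comparison (the function name and sample numbers are ours, not part of AWX; the `> 1` pending-jobs comparison mirrors the provisioned math expression rather than the looser "we have pending jobs" wording above):

```python
def capacity_alert_should_fire(remaining_per_node, total_capacity, pending_jobs):
    """Mirror of the capacity_below_10_percent condition: fire when the
    summed remaining capacity across nodes drops below 10% of total
    cluster capacity while more than one job is pending."""
    last_remaining_capacity = sum(remaining_per_node)       # sum(awx_instance_remaining_capacity)
    ten_percent_total_capacity = total_capacity * 0.10      # $last_total_capacity*.10
    return (ten_percent_total_capacity > last_remaining_capacity) and pending_jobs > 1

# Hypothetical cluster: 3 nodes nearly full out of 300 total capacity, 5 jobs pending
print(capacity_alert_should_fire([10, 5, 5], 300, 5))    # True: 20 remaining < 30 threshold
print(capacity_alert_should_fire([50, 40, 30], 300, 5))  # False: plenty of capacity left
```

With the short `for:` duration in this commit, a condition like this should flip to firing quickly once you saturate a test cluster, which is the point of keeping the alerts touchy.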


@@ -2,15 +2,21 @@
 apiVersion: 1
 groups:
 - folder: awx
-  interval: 60s
+  interval: 10s
   name: awx_rules
   orgId: 1
+  exec_err_state: Alerting
+  no_data_state: NoData
   rules:
-  - condition: if_failures_too_high
-    dashboardUid: awx
+  - for: 5m
+    noDataState: OK
+    panelId: 2
+    title: failure_rate_exceeded_20_percent
+    uid: failure_rate_exceeded_20_percent
+    condition: compare
     data:
     - refId: total_errors
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 600
         to: 0
@@ -19,7 +25,7 @@ groups:
         editorMode: code
         expr: >-
           max(delta(awx_instance_status_total{instance="awx1:8013",
-          status="failed|error"}[30m]))
+          status=~"failed|error"}[30m]))
         hide: false
         intervalMs: 1000
         legendFormat: __auto
@@ -27,11 +33,11 @@ groups:
         range: true
         refId: total_errors
     - refId: max_errors
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -60,7 +66,7 @@ groups:
         refId: max_errors
         type: reduce
     - refId: total_success
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 600
         to: 0
@@ -80,11 +86,11 @@ groups:
         range: true
         refId: total_success
     - refId: max_success
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -113,11 +119,11 @@ groups:
         refId: max_success
         type: reduce
     - refId: compare
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-      datasourceUid: '-100'
+      datasourceUid: "-100"
       model:
         conditions:
         - evaluator:
@@ -158,15 +164,19 @@ groups:
         maxDataPoints: 43200
         refId: compare
         type: math
-    for: 30m
+  - for: 60s
     noDataState: OK
-    panelId: 2
-    title: failure_rate_exceeded_20_percent
-    uid: failure_rate_exceeded_20_percent
-  - condition: if_redis_queue_too_large
+    panelId: 1
+    title: redis_queue_too_large_to_clear_in_2_min
+    uid: redis_queue_too_large_to_clear_in_2_min
+    condition: redis_queue_growing_faster_than_insertion_rate
     dashboardUid: awx
     data:
-    - datasourceUid: awx_prometheus
+    - refId: events_insertion_rate_per_second
+      relativeTimeRange:
+        from: 300
+        to: 0
+      datasourceUid: awx_prometheus
       model:
@@ -177,11 +187,11 @@ groups:
         range: true
         refId: events_insertion_rate_per_second
       queryType: ""
-      refId: events_insertion_rate_per_second
+    - refId: mean_event_insertion_rate
       relativeTimeRange:
-        from: 300
+        from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -208,11 +218,11 @@ groups:
         refId: mean_event_insertion_rate
         type: reduce
       queryType: ""
-      refId: mean_event_insertion_rate
+    - refId: redis_queue_size
       relativeTimeRange:
-        from: 0
+        from: 300
         to: 0
-    - datasourceUid: awx_prometheus
+      datasourceUid: awx_prometheus
       model:
         datasource:
           type: prometheus
@@ -226,11 +236,11 @@ groups:
         range: true
         refId: redis_queue_size
       queryType: ""
-      refId: redis_queue_size
+    - refId: last_redis_queue_size
       relativeTimeRange:
-        from: 300
+        from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -257,11 +267,12 @@ groups:
         refId: last_redis_queue_size
         type: reduce
       queryType: ""
-      refId: last_redis_queue_size
+    - refId: redis_queue_growing_faster_than_insertion_rate
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
-    - datasourceUid: -100
+      datasourceUid: -100
       model:
         conditions:
         - evaluator:
@@ -282,44 +293,35 @@ groups:
           name: Expression
           type: __expr__
           uid: __expr__
-        expression: '($last_redis_queue_size > ($mean_event_insertion_rate * 120))'
+        expression: "($last_redis_queue_size > ($mean_event_insertion_rate * 120))"
         hide: false
         intervalMs: 1000
         maxDataPoints: 43200
+        refId: redis_queue_growing_faster_than_insertion_rate
         type: math
-      queryType: ""
-      refId: redis_queue_growing_faster_than_insertion_rate
-      relativeTimeRange:
-        from: 0
-        to: 0
-    for: 60s
+  - for: 60s
     noDataState: OK
-    panelId: 1
-    title: redis_queue_too_large_to_clear_in_2_min
-    uid: redis_queue_too_large_to_clear_in_2_min
-  - condition: if_capacity_is_too_low
-    dashboardUid: awx
-    no_data_state: OK
-    exec_err_state: Error
+    panelId: 3
+    uid: capacity_below_10_percent
+    title: capacity_below_10_percent
+    condition: pending_jobs_and_capacity_compare
     data:
     - refId: remaining_capacity
-      queryType: ''
+      queryType: ""
       relativeTimeRange:
-        from: 1800
+        from: 300
         to: 0
       datasourceUid: awx_prometheus
       model:
-        editorMode: builder
-        expr: awx_instance_remaining_capacity{instance="awx1:8013"}
+        editorMode: code
+        expr: sum(awx_instance_remaining_capacity)
         hide: false
         intervalMs: 1000
         legendFormat: __auto
         maxDataPoints: 43200
         range: true
         refId: remaining_capacity
-    - refId: if_capacity_is_too_low
-      queryType: ''
+    - refId: last_remaining_capacity
+      queryType: ""
       relativeTimeRange:
         from: 0
         to: 0
@@ -328,14 +330,63 @@ groups:
         conditions:
         - evaluator:
             params:
-            - 20
-            - 0
-            type: lt
+            - 3
+            type: outside_range
           operator:
-            type: when
+            type: and
           query:
             params:
-            - remaining_capacity
+            - total_capacity
+          reducer:
+            params: []
+            type: percent_diff
+          type: query
+        datasource:
+          type: __expr__
+          uid: "-100"
+        expression: remaining_capacity
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: last
+        refId: last_remaining_capacity
+        type: reduce
+    - refId: total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 600
+        to: 0
+      datasourceUid: awx_prometheus
+      model:
+        datasource:
+          type: prometheus
+          uid: awx_prometheus
+        editorMode: code
+        expr: sum(awx_instance_capacity{instance="awx1:8013"})
+        hide: false
+        intervalMs: 1000
+        legendFormat: __auto
+        maxDataPoints: 43200
+        range: true
+        refId: total_capacity
+    - refId: last_total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - capacity_below_10%
           reducer:
             params: []
             type: avg
@@ -344,12 +395,142 @@ groups:
           name: Expression
           type: __expr__
           uid: __expr__
-        expression: remaining_capacity
+        expression: total_capacity
         hide: false
         intervalMs: 1000
         maxDataPoints: 43200
-        refId: if_capacity_is_too_low
-        type: classic_conditions
-    for: 30m
-    title: if_capacity_is_too_low
-    uid: if_capacity_is_too_low
+        reducer: last
+        refId: last_total_capacity
+        type: reduce
+    - refId: 10_percent_total_capacity
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - last_total_capacity
+          reducer:
+            params: []
+            type: avg
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: "$last_total_capacity*.10"
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        refId: 10_percent_total_capacity
+        type: math
+    - refId: pending_jobs
+      queryType: ""
+      relativeTimeRange:
+        from: 600
+        to: 0
+      datasourceUid: awx_prometheus
+      model:
+        datasource:
+          type: prometheus
+          uid: awx_prometheus
+        editorMode: builder
+        expr: awx_pending_jobs_total{instance="awx1:8013"}
+        hide: false
+        intervalMs: 1000
+        legendFormat: __auto
+        maxDataPoints: 43200
+        range: true
+        refId: pending_jobs
+    - refId: last_pending_jobs
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - pending_jobs_and_capacity_compare
+          reducer:
+            params: []
+            type: avg
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: pending_jobs
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: last
+        refId: last_pending_jobs
+        type: reduce
+    - refId: pending_jobs_and_capacity_compare
+      queryType: ""
+      relativeTimeRange:
+        from: 0
+        to: 0
+      datasourceUid: "-100"
+      model:
+        conditions:
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - 10_percent_total_capacity
+          reducer:
+            params: []
+            type: last
+          type: query
+        - evaluator:
+            params:
+            - 0
+            - 0
+            type: gt
+          operator:
+            type: and
+          query:
+            params:
+            - pending_jobs
+          reducer:
+            params: []
+            type: last
+          type: query
+        datasource:
+          name: Expression
+          type: __expr__
+          uid: __expr__
+        expression: "($10_percent_total_capacity > $last_remaining_capacity) && $last_pending_jobs > 1"
+        hide: false
+        intervalMs: 1000
+        maxDataPoints: 43200
+        reducer: mean
+        refId: pending_jobs_and_capacity_compare
+        type: math
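The redis_queue_too_large_to_clear_in_2_min rule asks whether the current callback-receiver backlog could be drained within two minutes at the observed event insertion rate. Its math expression can be sanity-checked outside Grafana; a minimal sketch (the function name and sample numbers are ours, not part of AWX):

```python
def redis_queue_alert_should_fire(last_redis_queue_size, mean_event_insertion_rate):
    """Mirror of the rule's math expression:
    ($last_redis_queue_size > ($mean_event_insertion_rate * 120)).
    120 is seconds: fire when the queue holds more than 2 minutes of work
    at the mean insertion rate."""
    return last_redis_queue_size > mean_event_insertion_rate * 120

# At 50 events/s, the receiver clears at most 6000 events in 2 minutes
print(redis_queue_alert_should_fire(9000, 50))  # True: backlog would take 3 minutes
print(redis_queue_alert_should_fire(3000, 50))  # False: backlog clears in 1 minute
```

This is the same reasoning the rule encodes via reduce expressions over `callback_receiver_events_queue_size_redis` style metrics: reduce to a last value and a mean rate, then compare.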

File diff suppressed because it is too large.