Cheat Sheet - Prometheus

USE Method in Prometheus

CPU Utilisation

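# node_exporter before 0.16 (metric named node_cpu)
1 - avg(rate(node_cpu{job="default/node-exporter",mode="idle"}[1m]))
# node_exporter 0.16 and later (metric renamed to node_cpu_seconds_total)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))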

CPU Saturation

sum(node_load1{job="default/node-exporter"})
  / 
sum(node:node_num_cpu:sum)

With the following recording rule:

    - expr: |-
        count by (instance) (sum by (instance, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        ))
      record: node:node_num_cpu:sum

Memory Utilisation

1 - sum( 
  node_memory_MemFree_bytes{job="…"} +
  node_memory_Cached_bytes{job="…"} +
  node_memory_Buffers_bytes{job="…"} 
) 
  /
sum(node_memory_MemTotal_bytes{job="…"})

Memory Saturation

1e3 = 1000. Multiplying by 1e3 scales a value up by a factor of a thousand (for example, to convert seconds to milliseconds).

1e3 * sum(
  rate(node_vmstat_pgpgin{job="…"}[1m]) +
  rate(node_vmstat_pgpgout{job="…"}[1m])
)

RED Method in Prometheus

Rate (QPS)

sum(rate(request_duration_seconds_count{job="…"}[1m]))

Errors

sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))

Duration (Latency)

histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="…"}[1m])) by (le))
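
To make the latency query less of a black box, here is a simplified Python sketch of how a quantile can be estimated from cumulative le buckets by linear interpolation. It is not Prometheus's actual implementation: the function name, the bucket layout and the sample counts are assumptions for illustration, and it expects at least two buckets sorted by upper bound, ending with +Inf.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last upper bound being +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical per-second bucket rates, as produced by sum(rate(...)) by (le).
sample_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(estimate_quantile(0.99, sample_buckets))
```

The sketch shows why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile lands in.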

Four Golden Signals

  • Latency - time taken to service a request
  • Traffic - how much demand is placed on your system
  • Errors - rate of requests that are failing
  • Saturation - how “full” your service is

Misc.

  • Instant Query = queries only the last data point; best displayed in a table panel

Grouping

count(device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})

Here, device_info is a dummy gauge that is always set to 1. Its series are selected by the region and firmware labels, using the Grafana template values.

group_left(region, firmware) joins these two labels onto the other metric, matching on (on) the instance label. The other metric is device_boot_time, but its value is not relevant here: count simply counts every device that reports a boot time (which is all devices).

This gives a count of all of the devices in the region that match the region and firmware.
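
The join described above can be mimicked in a few lines of Python. This is only a sketch of the matching logic, not of how Prometheus evaluates it; the function name and the sample devices are hypothetical.

```python
import re


def count_matching_devices(boot_times, device_info, region, firmware_regex):
    """Count boot-time series whose instance has a selected info series."""
    # PromQL regex matchers are fully anchored, hence fullmatch.
    pattern = re.compile(firmware_regex)
    selected = {
        instance
        for instance, labels in device_info.items()
        if labels["region"] == region and pattern.fullmatch(labels["firmware"])
    }
    # "on (instance)": keep only boot-time series with a selected partner.
    return sum(1 for instance in boot_times if instance in selected)


# Hypothetical sample data, keyed by the instance label.
boot_times = {"dev-1": 1.7e9, "dev-2": 1.7e9, "dev-3": 1.7e9}
device_info = {
    "dev-1": {"region": "massachusetts", "firmware": "1.0.2"},
    "dev-2": {"region": "massachusetts", "firmware": "2.1.0"},
    "dev-3": {"region": "texas", "firmware": "1.0.2"},
}

print(count_matching_devices(boot_times, device_info, "massachusetts", ".*"))
```

Passing ".*" as the firmware pattern matches every firmware value, which is the same trick the All option relies on.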

In order to get the All option to work for firmware, a few things are needed:

  • Select All in the dashboard’s variable settings
  • Set a custom value for the option of .*
  • Use the pattern match operator in the info-block selector of =~

See https://github.com/GoogleCloudPlatform/community/blob/100c6bf74a5d67d174da1957fe3267d83426f658/tutorials/cloud-iot-prometheus-monitoring/dashboards/Region%20Dashboard-1545166109026.json

Get the top 5 longest-running devices:

topk(5, time() - device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})

Crashing Devices

topk(10, changes(device_boot_time[1h]))

The changes function counts how many times the value of a gauge changed within the given range. Because device_boot_time changes on every boot, this can be used to track the number of times a device reboots in an hour. For this demo, our simulated devices reboot 8 times per hour on average. If you use the dashboard selectors to choose Massachusetts as the region, you will see that one device is crashing much more frequently, and the dashboard uses conditional formatting to show this as red.
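
As a rough illustration of what changes counts, the Python sketch below compares each sample with the one before it. The function name and the sample timestamps are made up for the example.

```python
def count_changes(samples):
    """Return how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Hypothetical boot-time samples scraped over one hour: the value only
# moves when the device reboots and reports a new boot timestamp.
boot_time_samples = [1000, 1000, 1000, 1450, 1450, 2100, 2100, 2100]
print(count_changes(boot_time_samples))  # 2 reboots seen in the window
```

A reboot that happens and is followed by another reboot between two scrapes is only counted once, so the result is a lower bound.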

    alerts:
      groups:
        - name: device_alerts
          rules:
          - alert: LidLeftOpen
            expr: (time() - (lid_open_start_time *  lid_open_status)) > 900 and (time() - (lid_open_start_time *  lid_open_status)) < 9000
            for: 30s
            labels:
              severity: page
              system: mechanical
            annotations:
              summary: "Lid open on {{ $labels.instance }}"
              description: "Lid has remained open for more than 15 minutes on {{ $labels.instance }}"
          - alert: DeviceRebooting
            expr: changes(device_boot_time[1h]) > 20
            for: 2m
            labels:
              severity: debug
              system: software
            annotations:
              summary: "{{ $labels.instance }} rebooting"
              description: "{{ $labels.instance }} has been rebooting more than 20 times an hour"

Uptime from start timestamp

The metric value should be the number of seconds since the Unix epoch. The following tells you how many hours have passed since the workflow started:

(time() - argo_workflows_custom_start_time_gauge_workflow) / 3600
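
The same arithmetic in Python, as a sanity check of the unit conversion. The function name and the timestamps are hypothetical.

```python
def hours_since(start_epoch_seconds, now_epoch_seconds):
    """Hours elapsed between two Unix timestamps given in seconds."""
    return (now_epoch_seconds - start_epoch_seconds) / 3600


# A workflow that started 7200 seconds before "now" has run for 2 hours.
print(hours_since(1_700_000_000, 1_700_007_200))  # 2.0
```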

What range should I use with rate()?

The general rule for choosing the range is that it should be at least 4x the scrape interval. This is to allow for various races, and to be resilient to a failed scrape.

Source: https://www.robustperception.io/what-range-should-i-use-with-rate
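
A quick arithmetic check of the 4x rule, assuming a hypothetical helper and some common scrape intervals:

```python
def min_rate_range_seconds(scrape_interval_seconds, factor=4):
    """Smallest range for rate() under the 'at least 4x' rule of thumb."""
    return factor * scrape_interval_seconds


for interval in (15, 30, 60):
    print(f"scrape every {interval}s -> use at least [{min_rate_range_seconds(interval)}s]")
```

With a 15s scrape interval this gives [60s], that is [1m]; with a 60s interval it gives [240s], that is [4m].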

Dashboard Design

Your services will have a rough tree structure. Have a dashboard per service and walk the tree from the top when you have a problem. Similarly, for each service, have dashboards per subsystem. Rule of thumb: limit each dashboard to 5 graphs and each graph to 5 lines.

Heatmap

Query: rate(foo_metric_bucket[10m])
Legend format: {{le}}
Format as: Heatmap

Time Range Variable

Variable Definition

Name: big_routine_job
Type: Query
Refresh: On Time Range Change
Query: query_result(topk(3, avg_over_time(go_goroutines[$__range])))
Regex: /job="([^"]+)"/

Dashboard Query

sum(rate(http_requests_total{job="$big_routine_job"}[10m])) by (handler)

Refining Rate

rate(requests[5m])
sum(rate(requests[5m])) by (service_name)
sum(rate(requests{service_name="catalogue"}[5m])) by (instance)

Histogram

Derive the average request duration over a rolling 5-minute period:

rate(request_duration_sum[5m])
  /
rate(request_duration_count[5m])
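
Why the division works: both rates are taken over the same window, so the window length cancels and what remains is the increase in total duration divided by the increase in request count. A small Python sketch with made-up counter readings (it ignores counter resets, which rate() handles for you):

```python
import math


def average_duration(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two readings of the counters."""
    requests = count_end - count_start
    if requests == 0:
        return math.nan  # no requests in the window, so no average
    return (sum_end - sum_start) / requests


# Hypothetical readings 5 minutes apart: 100 requests took 30 seconds in total.
print(average_duration(120.0, 150.0, 1000, 1100))
```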

Top 5 Docker images by CPU

topk(5,
  sum by (image) (
    rate(container_cpu_usage_seconds_total{
      id=~"/system.slice/docker.*"}[5m]
    )
  )
)

Alerts

- alert: "Node CPU Usage"
  expr: (100 - (avg by (instance)
        (irate(node_cpu{app="prometheus-node-exporter", mode="idle"}[5m]))
        * 100)) > 70
  for: 30s
  annotations:
    miqTarget: "ExtManagementSystem"
    url: "https://www.example.com/high_node_cpu_fixing_instructions"
    description: "{{ $labels.instance }}: CPU usage is above 70%
          (current value is: {{ $value }})"
  labels:
    severity: "warning"

Kubernetes

CPU Utilization, Saturation, and Errors

The nodes in your cluster have resources. The most important resources your nodes provide in a Kubernetes cluster are CPU, Memory, Network and Disk. Let’s apply the USE method to all of these.

To calculate CPU utilization by host in your Kubernetes cluster, sum the rates of all the modes except idle, iowait, guest, and guest_nice. The PromQL looks like this:

sum(rate(
         node_cpu{mode!="idle",
                  mode!="iowait",
                  mode!~"^(?:guest.*)$"
                  }[5m])) by (instance)

A metric that the node-exporter gives us for saturation is the Unix load average. Loosely, the load average is the number of processes running plus those waiting to run.

node_load1, node_load5 and node_load15 represent the 1, 5 and 15 minute load averages. This metric is a gauge and is already averaged for you. As a standalone metric it is somewhat useless without knowing how many CPUs your node has. Is a load average of 10 good or bad? It depends. If you divide the load average by the number of CPUs on the node, you get an approximation of the CPU saturation of your system.

node_exporter does not expose a count of node CPUs directly, but if you count just one of the above CPU modes, say “system”, you can get a CPU count by node:

count(node_cpu{mode="system"}) by (node)

Now you can normalize the node_load1 metric by the number of CPUs on the node expressed as a percentage:

sum(node_load1) by (node) / count(node_cpu{mode="system"}) by (node) * 100
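
To see why the same load average can be good or bad, here is the normalization worked through in Python for two hypothetical nodes (the function name and CPU counts are assumptions):

```python
def cpu_saturation_percent(load1, cpu_count):
    """1-minute load average as a percentage of the node's CPU count."""
    return load1 / cpu_count * 100


# A load average of 10 overloads a 4-CPU node but is light for 32 CPUs.
print(cpu_saturation_percent(10, 4))   # 250.0
print(cpu_saturation_percent(10, 32))  # 31.25
```

Values above 100 mean more runnable processes than CPUs, so some of them are waiting.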

The node_exporter does not reveal anything about CPU errors.

Memory Utilization, Saturation and Errors

It would seem that memory utilization and saturation are somewhat easier to reason about, as the amount of physical memory on a node is known. Again, nothing is easy! node_exporter gives us 43 node_memory_* metrics to work with!

The amount of available memory on a Linux system is not just the reported “free” memory metric. Unix systems rely heavily on memory that is not in use by applications to share code (buffers) and to cache disk pages (cached). So one measure of available memory is:

sum(node_memory_MemFree + node_memory_Cached + node_memory_Buffers)

Newer Linux kernels (3.14 and later) expose a better free memory metric, node_memory_MemAvailable.

Divide that by the total memory on the node and you get the fraction of memory that is available, then subtract from 1 to get a measure of node memory utilization:

1 - sum(node_memory_MemAvailable) by (node) 
/ sum(node_memory_MemTotal) by (node)
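
The same calculation in Python with made-up numbers, to show that the result is a fraction between 0 and 1 (multiply by 100 for a percentage). The function name and memory sizes are hypothetical.

```python
def memory_utilisation(mem_available_bytes, mem_total_bytes):
    """Fraction of node memory in use, from MemAvailable and MemTotal."""
    return 1 - mem_available_bytes / mem_total_bytes


GIB = 1024 ** 3
# A node with 16 GiB of memory and 4 GiB available is 75% utilized.
print(memory_utilisation(4 * GIB, 16 * GIB))  # 0.75
```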

References