USE Method in Prometheus
CPU Utilisation
With node_exporter < 0.16 (metric name node_cpu):
1 - avg(rate(node_cpu{job="default/node-exporter",mode="idle"}[1m]))
With node_exporter >= 0.16 (metric name node_cpu_seconds_total):
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))
CPU Saturation
sum(node_load1{job="default/node-exporter"})
/
sum(node:node_num_cpu:sum)
With the following recording rule:
- expr: |-
    count by (instance) (sum by (instance, cpu) (
      node_cpu_seconds_total{job="node-exporter"}
    ))
  record: node:node_num_cpu:sum
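For context, a minimal sketch of how that rule could sit in a complete rule file (the group name here is illustrative):
groups:
- name: node-exporter.rules
  rules:
  - record: node:node_num_cpu:sum
    expr: |-
      count by (instance) (sum by (instance, cpu) (
        node_cpu_seconds_total{job="node-exporter"}
      ))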
Memory Utilisation
1 - sum(
node_memory_MemFree_bytes{job="…"} +
node_memory_Cached_bytes{job="…"} +
node_memory_Buffers_bytes{job="…"}
)
/
sum(node_memory_MemTotal_bytes{job="…"})
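On node_exporter 0.16+ with a kernel that exposes MemAvailable (see the Kubernetes section below), a simpler sketch of the same idea is:
1 - sum(node_memory_MemAvailable_bytes{job="…"})
  /
  sum(node_memory_MemTotal_bytes{job="…"})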
Memory Saturation
1e3 is scientific notation for 1000 (multiplying by 1e3 is sometimes used to scale units, e.g. seconds to milliseconds).
1e3 * sum(
    rate(node_vmstat_pgpgin{job="…"}[1m])
  + rate(node_vmstat_pgpgout{job="…"}[1m])
)
RED Method in Prometheus
Rate (QPS)
sum(rate(request_duration_seconds_count{job="…"}[1m]))
Errors
sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))
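A common refinement, not shown above, is to express errors as a fraction of total traffic using the same counter:
sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))
/
sum(rate(request_duration_seconds_count{job="…"}[1m]))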
Duration (Latency)
histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="…"}[1m])) by (le))
Four Golden Signals
- Latency - time taken to service a request
- Traffic - how much demand is placed on your system
- Errors - the rate at which requests are failing
- Saturation - how “full” your service is
Misc.
- Instant Query = queries only the last datapoint; best displayed in a table panel
Grouping
count(device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})
Here, device_info is a dummy gauge whose value is always 1. We select it using the Grafana template variable values for region and firmware. Those two labels are then copied onto the other metric with group_left, joining on (instance). The joined metric is device_boot_time, but its value is not actually relevant here: all we are doing is using count to count every device that has a boot time (which is all devices). This gives a count of all of the devices that match the selected region and firmware.
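To make the join concrete, here is a sketch of the series involved (the label values shown are illustrative, not taken from the demo data):
# device_info is the dummy gauge, value always 1, carrying the interesting labels:
#   device_info{instance="device-1", region="Massachusetts", firmware="1.0.2"}  1
# device_boot_time has one series per device; its value is the boot timestamp:
#   device_boot_time{instance="device-1"}  1.6e9
# The join copies region and firmware onto device_boot_time, matching on instance:
device_boot_time * on (instance) group_left(region, firmware) device_info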
In order to get the All option to work for firmware, a few things are needed (an example follows this list):
- Select All in the dashboard’s variable settings
- Set a custom All value of .* for the variable
- Use the pattern-match operator =~ in the device_info selector
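With those settings, choosing All simply expands the selector to a match-everything regex, e.g. (values illustrative):
device_info{region="Massachusetts", firmware=~".*"}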
Get the top 5 longest-running devices:
topk(5, time() - device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})
Crashing Devices
topk(10, changes(device_boot_time[1h]))
The changes function counts how many times the value of a series has changed over the range, so it can be used to track the number of times a device reboots in an hour. For this demo, our simulated devices reboot 8 times per hour on average. If you use the dashboard selectors to choose Massachusetts as the region, you will see that one device is crashing much more frequently, and the dashboard uses conditional formatting to show this as red.
alerts:
  groups:
  - name: device_alerts
    rules:
    - alert: LidLeftOpen
      expr: (time() - (lid_open_start_time * lid_open_status)) > 900 and (time() - (lid_open_start_time * lid_open_status)) < 9000
      for: 30s
      labels:
        severity: page
        system: mechanical
      annotations:
        summary: "Lid open on {{ $labels.instance }}"
        description: "Lid has remained open for more than 15 minutes on {{ $labels.instance }}"
    - alert: DeviceRebooting
      expr: changes(device_boot_time[1h]) > 20
      for: 2m
      labels:
        severity: debug
        system: software
      annotations:
        summary: "{{ $labels.instance }} rebooting"
        description: "{{ $labels.instance }} has been rebooting more than 20 times an hour"
Uptime from start timestamp
The metric value should be the number of seconds since the epoch. The following will tell you how many hours it has been since the workflow started:
(time() - argo_workflows_custom_start_time_gauge_workflow) / 3600
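The same timestamp can also drive an alert; a rough sketch, with an illustrative name and threshold, assuming the metric carries an instance label:
- alert: WorkflowRunningTooLong
  expr: (time() - argo_workflows_custom_start_time_gauge_workflow) > 86400
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Workflow on {{ $labels.instance }} has been running for more than 24 hours"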
What range should I use with rate()?
The general rule for choosing the range is that it should be at least 4x the scrape interval. This is to allow for various races, and to be resilient to a failed scrape.
Source: https://www.robustperception.io/what-range-should-i-use-with-rate
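For example, with a 15s scrape interval, 4 × 15s = 60s, so [1m] is the smallest range that satisfies the rule:
# scrape_interval: 15s  →  minimum range = 4 × 15s = 1m
rate(http_requests_total[1m])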
Dashboard Design
Your services will have a rough tree structure; have a dashboard per service, and walk the tree from the top when you have a problem. Similarly, for each service, have dashboards per subsystem. Rule of thumb: limit yourself to 5 graphs per dashboard and 5 lines per graph.
Heatmap
Query: rate(foo_metric_bucket[10m])
Legend format: {{le}}
Format as: Heatmap
Time Range Variable
Variable Definition
Name: big_routine_job
Type: Query
Refresh: On Time Range Change
Query: query_result(topk(3, avg_over_time(go_goroutines[$__range])))
Regex: /job="([^"]+)"/
Dashboard Query
sum(rate(http_requests_total{job="$big_routine_job"}[10m])) by (handler)
Refining Rate
rate(requests[5m])
sum(rate(requests[5m])) by (service_name)
sum(rate(requests{service_name="catalogue"}[5m])) by (instance)
Histogram
Derive average request duration over a rolling 5 minute period
rate(request_duration_sum[5m])
/
rate(request_duration_count[5m])
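To get a single average across all instances, sum the numerator and denominator before dividing (averaging the per-instance ratios would weight instances incorrectly):
sum(rate(request_duration_sum[5m]))
/
sum(rate(request_duration_count[5m]))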
Top 5 Docker images by CPU
topk(5,
sum by (image) (
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)
Alerts
- alert: "Node CPU Usage"
expr: (100 - (avg by (instance)
(irate(node_cpu{app="prometheus-node-exporter", mode="idle"}[5m]))
* 100)) > 70
for: 30s
annotations:
miqTarget: "ExtManagementSystem"
url: "https://www.example.com/high_node_cpu_fixing_instructions"
description: "{{ $labels.instance }}: CPU usage is above 70%
(current value is: {{ $value }})"
labels:
severity: "warning"
Kubernetes
CPU Utilization, Saturation, and Errors
The nodes in your cluster have resources. The most important resources your nodes provide in a Kubernetes cluster are CPU, Memory, Network and Disk. Let’s apply the USE method to all of these.
To calculate CPU utilization by host in your Kubernetes cluster, we want to sum all the modes except for idle, iowait, guest, and guest_nice. The PromQL looks like this:
sum(rate(
  node_cpu{mode!="idle",
           mode!="iowait",
           mode!~"^(?:guest.*)$"
  }[5m])) by (instance)
A metric that the node-exporter gives us for saturation is the Unix load average. Loosely, the load average is the number of processes running plus those waiting to run.
node_load1, node_load5 and node_load15 represent the 1, 5 and 15 minute load averages. This metric is a gauge and is already averaged for you. As a standalone metric it is somewhat useless without knowing how many CPUs your node has. Is a load average of 10 good or bad? It depends. If you divide the load average by the number of CPUs on the node, you get an approximation of the CPU saturation of your system.
node_exporter does not expose a count of node CPUs directly, but if you count just one of the above CPU modes, say “system”, you can get a CPU count by node:
count(node_cpu{mode="system"}) by (node)
Now you can normalize the node_load1 metric by the number of CPUs on the node expressed as a percentage:
sum(node_load1) by (node) / count(node_cpu{mode="system"}) by (node) * 100
The node_exporter does not reveal anything about CPU errors.
Memory Utilization, Saturation and Errors
It would seem that memory utilization and saturation are somewhat easier to reason about, as the amount of physical memory on a node is known. Again, nothing is easy! node_exporter gives us 43 node_memory_* metrics to work with!
The amount of available memory on a Linux system is not just the reported “free” memory metric. Unix systems rely heavily on memory that is not in use by applications to share code (buffers) and to cache disk pages (cached). So one measure of available memory is:
sum(node_memory_MemFree + node_memory_Cached + node_memory_Buffers)
Newer Linux kernels (after 3.14) expose a better free memory metric, node_memory_MemAvailable.
Divide that by the total memory available on the node and you get a percentage of available memory, then subtract from 1 to get a measure of node memory utilization:
1 - sum(node_memory_MemAvailable) by (node)
/ sum(node_memory_MemTotal) by (node)