# Cheat Sheet - Prometheus

## USE Method in Prometheus

### CPU Utilisation

1 - avg(rate(node_cpu{job="default/node-exporter",mode="idle"}[1m]))
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))


### CPU Saturation

sum(node_load1{job="default/node-exporter"})
/
sum(node:node_num_cpu:sum)


With the following recording rule:

    - expr: |-
        count by (instance) (sum by (instance, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        ))
      record: node:node_num_cpu:sum


### Memory Utilisation

1 - sum(
node_memory_MemFree_bytes{job="…"} +
node_memory_Cached_bytes{job="…"} +
node_memory_Buffers_bytes{job="…"}
)
/
sum(node_memory_MemTotal_bytes{job="…"})


### Memory Saturation

1e3 = 1000; multiplying by it scales a value up a thousandfold (for example, from seconds to milliseconds).

1e3 * sum(
rate(node_vmstat_pgpgin{job="…"}[1m]) +
rate(node_vmstat_pgpgout{job="…"}[1m])
)


## RED Method in Prometheus

### Rate (QPS)

sum(rate(request_duration_seconds_count{job="…"}[1m]))


### Errors

sum(rate(request_duration_seconds_count{job="…", status_code!~"2.."}[1m]))


### Duration (Latency)

histogram_quantile(0.99, sum(rate(request_duration_seconds_bucket{job="…"}[1m])) by (le))
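
To see roughly what histogram_quantile does with the bucket counts, here is a minimal Python sketch of linear interpolation over cumulative `le` buckets. The function name `estimate_quantile` and the sample bucket values are made up for illustration; it is a simplification, not Prometheus's actual implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes 0 < q <= 1, that the last bucket has upper bound +Inf, and that
    the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Hypothetical per-bucket request counts over the rate window.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.75, example_buckets))
```

Because the estimate is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.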


# Four Golden Signals

• Latency - time taken to service a request
• Traffic - how much demand is placed on your system
• Errors - rate of requests that are failing
• Saturation - how full your service is

# Misc.

• Instant Query = queries only the most recent data point; best displayed in a table panel

## Grouping

count(device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})


Here, device_info is a dummy gauge that is always set to 1; its job is to carry the region and firmware labels. The selector on device_info filters by the Grafana template values.

The query joins device_info onto device_boot_time, matching on the instance label, and group_left(region, firmware) copies those two labels onto the result. The value of device_boot_time is not relevant here: count simply counts every device that has a boot time greater than 0 (which is all devices).

This gives a count of all devices that match the selected region and firmware.
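
The join described above can be sketched in Python as a many-to-one match on instance. This is only an illustration of the idea: the helper `join_group_left`, the device names, and the label values are hypothetical, and real PromQL vector matching has more rules than this sketch covers.

```python
def join_group_left(many, one, on, copy_labels):
    """Multiply each sample on the "many" side by its match on the "one" side.

    Samples are (labels_dict, value) pairs. Labels named in copy_labels are
    copied from the "one" side, mimicking group_left(...).
    """
    index = {}
    for labels, value in one:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            raise ValueError(f"duplicate series on the 'one' side for {key}")
        index[key] = (labels, value)

    joined = []
    for labels, value in many:
        match = index.get(tuple(labels.get(name) for name in on))
        if match is None:
            continue  # no partner series: the sample is dropped
        one_labels, one_value = match
        merged = dict(labels)
        for name in copy_labels:
            if name in one_labels:
                merged[name] = one_labels[name]
        joined.append((merged, value * one_value))
    return joined


boot_times = [({"instance": "dev-1"}, 1700000000.0),
              ({"instance": "dev-2"}, 1700000500.0),
              ({"instance": "dev-3"}, 1700000900.0)]
# device_info after the region/firmware selector has been applied.
device_info = [({"instance": "dev-1", "region": "MA", "firmware": "1.0"}, 1.0),
               ({"instance": "dev-2", "region": "MA", "firmware": "1.1"}, 1.0)]

result = join_group_left(boot_times, device_info, on=["instance"],
                         copy_labels=["region", "firmware"])
print(len(result))  # plays the role of count(...) over the joined series
```

Devices without a matching device_info series drop out of the join, which is why filtering the info gauge is enough to filter the count.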

In order to get the All option to work for firmware, a few things are needed:

• Enable the All option in the dashboard’s variable settings
• Set the custom value for the All option to .*
• Use the regex match operator =~ in the device_info selector

## Get top 5 longest running services:

topk(5, time() - device_boot_time * on (instance) group_left(region, firmware) device_info{region="$region", firmware=~"$firmware"})


## Crashing Devices

topk(10, changes(device_boot_time[1h]))


The changes function returns the number of times the value of a series changed within the given range. Because device_boot_time changes on every reboot, this tracks the number of times a device reboots in an hour. For this demo, our simulated devices reboot 8 times per hour on average. If you use the dashboard selectors to choose Massachusetts as the region, you will see that one device is crashing much more frequently, and the dashboard uses conditional formatting to show this in red.
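
As a rough illustration of what changes counts, the Python sketch below compares each sample with the one before it. The function name `count_changes` and the sample values are hypothetical.

```python
def count_changes(samples):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


# Hypothetical boot-time samples scraped over one hour: the device
# rebooted twice, so the boot timestamp changed twice.
boot_time_samples = [1700000000, 1700000000, 1700001200, 1700001200, 1700002900]
print(count_changes(boot_time_samples))
```

A reboot that happens and is followed by another reboot between two scrapes shows up as a single change, so the count is a lower bound.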

    alerts:
      groups:
        rules:
          - expr: (time() - (lid_open_start_time * lid_open_status)) > 900 and (time() - (lid_open_start_time * lid_open_status)) < 9000
            for: 30s
            labels:
              severity: page
              system: mechanical
            annotations:
              summary: "Lid open on {{ $labels.instance }}"
              description: "Lid has remained open for more than 15 minutes on {{ $labels.instance }}"
          - expr: changes(device_boot_time[1h]) > 20
            for: 2m
            labels:
              severity: debug
              system: software
            annotations:
              summary: "{{ $labels.instance }} rebooting"
              description: "{{ $labels.instance }} has been rebooting more than 20 times an hour"


## Uptime from start timestamp

The metric value should be the number of seconds since the Unix epoch. The following tells you how many hours have passed since the workflow started:

(time() - argo_workflows_custom_start_time_gauge_workflow) / 3600


## What range should I use with rate()?

The general rule for choosing the range is that it should be at least 4x the scrape interval. This is to allow for various races, and to be resilient to a failed scrape.
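
The Python sketch below shows why the 4x rule gives some headroom. It relies on the fact that rate() needs at least two samples inside the range; the helper names are hypothetical and the sample count is an approximation that ignores scrape alignment.

```python
def min_rate_range_seconds(scrape_interval_seconds, factor=4):
    """Smallest range, in seconds, suggested by the 4x rule of thumb."""
    return factor * scrape_interval_seconds


def samples_in_window(range_seconds, scrape_interval_seconds, failed_scrapes=0):
    """Approximate number of samples a range selector would contain."""
    return max(range_seconds // scrape_interval_seconds - failed_scrapes, 0)


interval = 15  # assumed scrape interval in seconds
safe_range = min_rate_range_seconds(interval)

# With a 4x range, one failed scrape still leaves enough samples for rate().
print(safe_range, samples_in_window(safe_range, interval, failed_scrapes=1))
# With a 2x range, one failed scrape leaves a single sample and no result.
print(2 * interval, samples_in_window(2 * interval, interval, failed_scrapes=1))
```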

## Dashboard Design

Your services will have a rough tree structure. Have a dashboard per service, and walk the tree from the top when you have a problem. Similarly, for each service, have dashboards per subsystem. Rule of thumb: limit each dashboard to 5 graphs and each graph to 5 lines.

## Heatmap

Query: rate(foo_metric_bucket[10m])
Legend format: {{le}}
Format as: Heatmap


## Time Range Variable

### Variable Definition

Name: big_routine_job
Type: Query
Refresh: On Time Range Change
Query: query_result(topk(3, avg_over_time(go_goroutines[$__range])))
Regex: /job="([^"]+)"/

### Dashboard Query

sum(rate(http_requests_total{job="$big_routine_job"}[10m])) by (handler)


## Refining Rate

rate(requests[5m])
sum(rate(requests[5m])) by (service_name)
sum(rate(requests{service_name="catalogue"}[5m])) by (instance)


## Histogram

Derive the average request duration over a rolling 5-minute period:

rate(request_duration_sum[5m])
/
rate(request_duration_count[5m])
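
The arithmetic behind this ratio can be sketched in Python. Both rates are taken over the same window, so the window length cancels and what remains is the increase in total duration divided by the increase in request count. The function name and counter values are made up for illustration, and counter resets are ignored.

```python
def average_duration(sum_start, sum_end, count_start, count_end):
    """Average seconds per request between two readings of the counters."""
    requests = count_end - count_start
    if requests == 0:
        return float("nan")  # no requests in the window, like 0/0 in PromQL
    return (sum_end - sum_start) / requests


# Hypothetical counter readings five minutes apart:
# 100 new requests took 30 seconds in total.
print(average_duration(120.0, 150.0, 1000, 1100))
```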


## Top 5 Docker images by CPU

topk(5,
sum by (image) (
rate(container_cpu_usage_seconds_total{
id=~"/system.slice/docker.*"}[5m]
)
)
)


    - alert: "Node CPU Usage"
      expr: (100 - (avg by (instance)
        (irate(node_cpu{app="prometheus-node-exporter", mode="idle"}[5m]))
        * 100)) > 70
      for: 30s
      annotations:
        miqTarget: "ExtManagementSystem"
        url: "https://www.example.com/high_node_cpu_fixing_instructions"
        description: "{{ $labels.instance }}: CPU usage is above 70% (current value is: {{ $value }})"
      labels:
        severity: "warning"


# Kubernetes

## CPU Utilization, Saturation, and Errors

The nodes in your cluster have resources. The most important resources your nodes provide in a Kubernetes cluster are CPU, Memory, Network and Disk. Let’s apply the USE method to all of these.

To calculate CPU utilization by host in your Kubernetes cluster, sum all the modes except idle, iowait, guest, and guest_nice. The PromQL looks like this:

sum(rate(
node_cpu{mode!="idle",
mode!="iowait",
mode!~"^(?:guest.*)$"
}[5m])) BY (instance)


A metric that the node-exporter gives us for saturation is the Unix load average. Loosely, the load average is the number of processes running plus those waiting to run.

node_exporter does not expose a count of node CPUs directly, but if you count just one of the above CPU modes, say “system”, you can get a CPU count by node:

count(node_cpu{mode="system"}) by (node)


Now you can normalize the node_load1 metric by the number of CPUs on the node expressed as a percentage:

sum(node_load1) by (node) / count(node_cpu{mode="system"}) by (node) * 100
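
As a quick worked example of this normalization, here is a small Python sketch; the function name `load_percent` and the numbers are hypothetical.

```python
def load_percent(load1, cpu_count):
    """1-minute load average as a percentage of the node's CPU count."""
    return load1 / cpu_count * 100


# A load of 6 on an 8-CPU node is 75% of capacity; a load of 12 is 150%,
# meaning more processes want to run than there are CPUs.
print(load_percent(6.0, 8), load_percent(12.0, 8))
```

Values above 100 suggest that processes are queuing for CPU time, which is the saturation signal you are looking for.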


The node_exporter does not reveal anything about CPU errors.

## Memory Utilization, Saturation and Errors

It would seem that memory utilization and saturation are somewhat easier to reason about, as the amount of physical memory on a node is known. Again, nothing is easy! node_exporter gives us 43 node_memory_* metrics to work with!

The amount of available memory on a Linux system is not just the reported “free” memory metric. Unix systems rely heavily on memory that is not in use by applications to share code (buffers) and to cache disk pages (cached). So one measure of available memory is:

sum(node_memory_MemFree + node_memory_Cached + node_memory_Buffers)


Newer Linux kernels (3.14 and later) expose a better free memory metric, node_memory_MemAvailable.

Divide that by the total memory on the node and you get the fraction of memory that is available, then subtract it from 1 to get a measure of node memory utilization:

1 - sum(node_memory_MemAvailable) by (node)
/ sum(node_memory_MemTotal) by (node)
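
The two ways of estimating available memory can be combined in a short Python sketch. It assumes a dictionary of memory values keyed by the /proc/meminfo field names; the function name and the sample numbers are hypothetical.

```python
def memory_utilization(meminfo):
    """Fraction of memory in use, given values that share the same unit."""
    total = meminfo["MemTotal"]
    available = meminfo.get("MemAvailable")
    if available is None:
        # Older kernels: approximate available memory from its parts.
        available = meminfo["MemFree"] + meminfo["Cached"] + meminfo["Buffers"]
    return 1 - available / total


newer_kernel = {"MemTotal": 16, "MemAvailable": 4}
older_kernel = {"MemTotal": 8, "MemFree": 1, "Cached": 2, "Buffers": 1}
print(memory_utilization(newer_kernel), memory_utilization(older_kernel))
```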