Cheat Sheet - Monitoring

SLA Monitoring

If you are not familiar with these terms, I would strongly recommend reading the article from Google’s SRE book on Service Level Objectives first.

In summary:

  • SLAs: Service Level Agreement [1]
    • What service you commit to provide to users, with possible penalties if you are not able to meet it.
    • Example: “99.5%” availability.
    • Keyword: contract
  • SLOs: Service Level Objective
    • What you have internally set as a target, driving your measuring threshold (for example, on dashboards and alerting). In general, this should be stricter than your SLA.
    • A natural structure for SLOs is thus: SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
    • Example: “99.9%” availability (the so-called “three 9s”).
    • Keyword: thresholds
  • SLIs: Service Level Indicators
    • What you actually measure, to ascertain whether your SLOs are on/off-target.
    • Examples: error ratios, latency, QPS, availability
    • Keyword: metrics
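
To make the relationship concrete, here is a minimal sketch in Python (the numbers are made up): the SLI is what you measure, the SLO is the stricter internal target, and the SLA is the looser external commitment.

```python
# Made-up monthly figures for a single service.
measured_availability = 0.9987   # SLI: what we actually measured this month
slo_target = 0.999               # SLO: internal objective (stricter)
sla_commitment = 0.995           # SLA: contractual commitment (looser)

print(f"SLO met: {measured_availability >= slo_target}")      # False -> investigate
print(f"SLA met: {measured_availability >= sla_commitment}")  # True  -> no penalty owed
```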

How to measure

From the above, it’s clear that we must have service metrics to tell us when the service is considered (un)available. There are several approaches for this:

| Methodology | RED | USE |
| --- | --- | --- |
| Meaning | Rate, Errors, Duration | Utilization, Saturation, Errors |
| Used for | Request-driven things like endpoints | Resources like queues, caches, CPUs, disks |
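
As a rough sketch of what each methodology tracks, the snippet below keeps RED counters for a request-driven endpoint and USE gauges for a queue; all names are hypothetical and not tied to any particular monitoring library.

```python
import time

# RED: for a request-driven endpoint, track Rate, Errors, Duration.
red = {"requests": 0, "errors": 0, "durations_s": []}

def handle_request(handler) -> None:
    red["requests"] += 1                     # Rate: count every request
    start = time.monotonic()
    try:
        handler()
    except Exception:
        red["errors"] += 1                   # Errors: count failures
        raise
    finally:
        red["durations_s"].append(time.monotonic() - start)  # Duration

# USE: for a resource (queue, cache, CPU, disk), track Utilization, Saturation, Errors.
use_queue = {
    "utilization": 0.72,   # fraction of consumer capacity currently in use
    "saturation": 130,     # backlog: messages waiting in the queue
    "errors": 4,           # failed enqueue/dequeue operations
}
```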

Services tend to fall into a few broad categories in terms of the SLIs they find relevant:

  • User-facing serving systems, such as frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?

  • Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? See Data Integrity: What You Read Is What You Wrote for an extended discussion of these issues.

  • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion? (Some pipelines may also have targets for latency on individual processing stages.)

All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Correctness is important to track as an indicator of system health, even though it’s often a property of the data in the system rather than the infrastructure per se, and so usually not an SRE responsibility to meet.

For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime:

Time-based availability

\[availability = {uptime \over (uptime + downtime)}\]

Using this formula over the period of a year, we can calculate the acceptable number of minutes of downtime to reach a given number of nines of availability:
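
For example, assuming a 365-day year for the arithmetic, the sketch below turns an availability target into an annual downtime budget:

```python
# Annual downtime budget implied by an availability target (assumes a 365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at the given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):7.1f} min/year of downtime")

# 99.900% ->   525.6 min/year of downtime  (~8.8 hours)
# 99.990% ->    52.6 min/year of downtime
# 99.999% ->     5.3 min/year of downtime
```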

For distributed systems a time-based metric for availability is usually not meaningful because we are looking across globally distributed services. Sufficient fault isolation makes it very likely that we are serving at least a subset of traffic for a given service somewhere in the world at any given time (i.e., we are at least partially “up” at all times). Therefore, instead of using metrics around uptime, we define availability in terms of the request success rate.

Aggregate availability

Aggregate availability shows how this yield-based metric is calculated over a rolling window (i.e., proportion of successful requests over a one-day window).

\[availability = {successful\ requests \over total\ requests}\]
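
A minimal sketch of this request-based calculation, assuming per-minute rollups of request outcomes (the Bucket structure is hypothetical):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Bucket:
    """Hypothetical one-minute rollup of request outcomes."""
    successful: int
    total: int

def aggregate_availability(window: deque) -> float:
    """Proportion of successful requests over the rolling window."""
    total = sum(b.total for b in window)
    return sum(b.successful for b in window) / total if total else 1.0

# Keep the last 24 hours of one-minute buckets (1440 entries) in the window.
window = deque([Bucket(successful=995, total=1000), Bucket(successful=1000, total=1000)], maxlen=1440)
print(f"availability: {aggregate_availability(window):.4%}")  # 99.7500%
```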

Comparison

| Comparison | Time-based Availability | Aggregate Availability |
| --- | --- | --- |
| System Type | Consumer | Infrastructure |
| Used for | Monoliths, user-facing systems | Distributed services, batch, pipeline, storage, caching layers, and transactional systems |
| Measured as | Proportion of system uptime | Request success rate |
| Question to ask yourself | How much time was the system unavailable? | What is the number of errors across all service instances? Helps with analyzing system performance by revealing bottlenecks or errors. |
| Methodology to use | RED | USE |

Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise. See Service Level Objectives for more details.

Define your SLIs and SLOs

Now that we have an idea of what SLIs, SLOs, and SLAs are, it’s time for an example.

SLIs

Expression: [Metric Identifier][Operator][Metric Value]

  • 95th percentile home page latency over 5 minutes < 500ms
  • Home page request response codes != 5xx
  • Home page requests served in < 100ms
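
To make the first SLI above concrete, here is a small sketch that evaluates it from raw latency samples; the sample values are made up and the percentile uses a simple nearest-rank method.

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) of the given samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies (ms) observed for the home page over the last 5 minutes.
last_5m_latencies_ms = [120.0, 180.0, 240.0, 310.0, 450.0, 470.0, 490.0]

sli_value = percentile(last_5m_latencies_ms, 95)
print(f"p95 latency = {sli_value} ms, under 500ms: {sli_value < 500}")  # 490.0 ms, True
```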

SLOs

Expression: [Success Objective][SLI][Period]

  • 99% of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
  • 99% of home page request response codes != 5xx over the last 7 days
  • 95% of home page requests served in < 100ms over the last 24 hours
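
And a sketch of checking the second SLO above against a week of hypothetical per-request response codes:

```python
# Hypothetical response codes served by the home page over the last 7 days.
last_7d_response_codes = [200] * 99_500 + [503] * 500

good = sum(1 for code in last_7d_response_codes if not 500 <= code < 600)
attainment = good / len(last_7d_response_codes)

success_objective = 0.99  # "99% of ... over the last 7 days"
print(f"attainment: {attainment:.3%}, SLO met: {attainment >= success_objective}")
# attainment: 99.500%, SLO met: True
```
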
A general process for getting from a system to its SLOs:

  1. Identify system boundaries
  2. Define capabilities exposed by each system
  3. Plain-english definition of “available” for each capability
  4. Define corresponding technical SLIs
  5. Start measuring to get a baseline
  6. Define SLO targets (per SLI or per capability)
  7. Iterate and tune

Data Tier (a message bus system)

| Tier | Capability | SLI | SLO |
| --- | --- | --- | --- |
| Ingest/Routing | Ingest | Percent of well-formed payloads accepted | 99.9% |
| Ingest/Routing | Routing | Time to deliver message to correct destination | 99.5% of messages in under 5 seconds |
| Horizontally Scaled Data Tier | Query Data | Latency | |
| Horizontally Scaled Data Tier | Query Data | Correctness/Error rate | |
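
As an illustration, the ingest-tier SLIs in the table could be computed roughly as below; the counters and latency list are hypothetical.

```python
# Ingest: percent of well-formed payloads accepted (hypothetical counters).
well_formed_received = 1_000_000   # well-formed payloads submitted to ingest
well_formed_accepted = 999_120     # of those, how many were accepted
ingest_sli = well_formed_accepted / well_formed_received
print(f"well-formed payloads accepted: {ingest_sli:.3%}  (SLO: 99.9%)")

# Routing: time to deliver each message to the correct destination (seconds).
delivery_latencies_s = [0.4, 1.2, 2.8, 3.1, 4.9, 6.5]
routing_sli = sum(1 for t in delivery_latencies_s if t < 5) / len(delivery_latencies_s)
print(f"messages delivered in under 5s: {routing_sli:.1%}  (SLO: 99.5%)")
```
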
  1. Note: SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.