Cheat Sheet - Monitoring

SLA Monitoring

If you are not familiar with these terms, I would strongly recommend reading the article from Google’s SRE book on Service Level Objectives first.

In summary:

  • SLAs: Service Level Agreement [1]
    • What service you commit to provide to users, with possible penalties if you are not able to meet it.
    • Example: “99.5%” availability.
    • Keyword: contract
  • SLOs: Service Level Objective
    • What you have internally set as a target, driving your measuring threshold (for example, on dashboards and alerting). In general, this should be stricter than your SLA.
    • A natural structure for SLOs is thus: SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
    • Example: “99.9%” availability (the so-called “three 9s”).
    • Keyword: thresholds
  • SLIs: Service Level Indicators
    • What you actually measure, to ascertain whether your SLOs are on/off-target.
    • Examples: error ratios, latency, QPS, availability
    • Keyword: metrics
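
To make the relationship concrete, here is a minimal sketch in Python (the numbers are made up): the SLI is what you measure, the SLO is the stricter internal target, and the SLA is the looser external commitment.

```python
# Made-up monthly figures for a single service.
measured_availability = 0.9987   # SLI: what we actually measured this month
slo_target = 0.999               # SLO: internal objective (stricter)
sla_commitment = 0.995           # SLA: contractual commitment (looser)

print(f"SLO met: {measured_availability >= slo_target}")      # False -> investigate
print(f"SLA met: {measured_availability >= sla_commitment}")  # True  -> no penalty owed
```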

How to measure

From the above, it’s clear that we must have service metrics to tell us when the service is considered (un)available. There are several approaches for this:

| Methodology | RED | USE |
| --- | --- | --- |
| Meaning | Rate, Errors, Duration | Utilization, Saturation, Errors |
| Used for | Request-driven things like endpoints | Resources like queues, caches, CPUs, disks |
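
As a rough sketch of what each methodology tracks, the snippet below keeps RED counters for a request-driven endpoint and USE gauges for a queue; all names are hypothetical and not tied to any particular monitoring library.

```python
import time

# RED: for a request-driven endpoint, track Rate, Errors, Duration.
red = {"requests": 0, "errors": 0, "durations_s": []}

def handle_request(handler) -> None:
    red["requests"] += 1                     # Rate: count every request
    start = time.monotonic()
    try:
        handler()
    except Exception:
        red["errors"] += 1                   # Errors: count failures
        raise
    finally:
        red["durations_s"].append(time.monotonic() - start)  # Duration

# USE: for a resource (queue, cache, CPU, disk), track Utilization, Saturation, Errors.
use_queue = {
    "utilization": 0.72,   # fraction of consumer capacity currently in use
    "saturation": 130,     # backlog: messages waiting in the queue
    "errors": 4,           # failed enqueue/dequeue operations
}
```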

Services tend to fall into a few broad categories in terms of the SLIs they find relevant:

  • User-facing serving systems, such as frontends, generally care about availability, latency, and throughput. In other words: Could we respond to the request? How long did it take to respond? How many requests could be handled?

  • Storage systems often emphasize latency, availability, and durability. In other words: How long does it take to read or write data? Can we access the data on demand? Is the data still there when we need it? See Data Integrity: What You Read Is What You Wrote for an extended discussion of these issues.

  • Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency. In other words: How much data is being processed? How long does it take the data to progress from ingestion to completion? (Some pipelines may also have targets for latency on individual processing stages.)

All systems should care about correctness: was the right answer returned, the right data retrieved, the right analysis done? Correctness is important to track as an indicator of system health, even though it’s often a property of the data in the system rather than the infrastructure per se, and so usually not an SRE responsibility to meet.

For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime:

Time-based availability

\[availability = {uptime \over (uptime + downtime)}\]

Using this formula over the period of a year, we can calculate the acceptable number of minutes of downtime to reach a given number of nines of availability:
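
For example, assuming a 365-day year for the arithmetic, the sketch below turns an availability target into an annual downtime budget:

```python
# Annual downtime budget implied by an availability target (assumes a 365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at the given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):7.1f} min/year of downtime")

# 99.900% ->   525.6 min/year of downtime  (~8.8 hours)
# 99.990% ->    52.6 min/year of downtime
# 99.999% ->     5.3 min/year of downtime
```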

For distributed systems a time-based metric for availability is usually not meaningful because we are looking across globally distributed services. Sufficient fault isolation makes it very likely that we are serving at least a subset of traffic for a given service somewhere in the world at any given time (i.e., we are at least partially “up” at all times). Therefore, instead of using metrics around uptime, we define availability in terms of the request success rate.

Aggregate availability

Aggregate availability shows how this yield-based metric is calculated over a rolling window (i.e., proportion of successful requests over a one-day window).

\[availability = {successful\ requests \over total\ requests}\]
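
A minimal sketch of this request-based calculation, assuming per-minute rollups of request outcomes (the Bucket structure is hypothetical):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Bucket:
    """Hypothetical one-minute rollup of request outcomes."""
    successful: int
    total: int

def aggregate_availability(window: deque) -> float:
    """Proportion of successful requests over the rolling window."""
    total = sum(b.total for b in window)
    return sum(b.successful for b in window) / total if total else 1.0

# Keep the last 24 hours of one-minute buckets (1440 entries) in the window.
window = deque([Bucket(successful=995, total=1000), Bucket(successful=1000, total=1000)], maxlen=1440)
print(f"availability: {aggregate_availability(window):.4%}")  # 99.7500%
```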

Comparison

| Comparison | Time-based Availability | Aggregate Availability |
| --- | --- | --- |
| System Type | Consumer | Infrastructure |
| Used for | Monoliths, user-facing systems | Distributed services, batch, pipeline, storage, caching layers, and transactional systems |
| Measured as | Proportion of system uptime | Request success rate |
| Question to ask yourself | How much time was the system unavailable? | What is the number of errors across all service instances? Helps with analyzing system performance by revealing bottlenecks or errors. |
| Methodology to use | RED | USE |

Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise. See Service Level Objectives for more details.

Define your SLIs and SLOs

Now that we have an idea of what SLIs, SLOs, and SLAs are, it’s time for an example.

SLIs

Expression: [Metric Identifier][Operator][Metric Value]

  • 95th percentile home page latency over 5 minutes < 500ms
  • Home page request response codes != 5xx
  • Home page requests served in < 100ms
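
To make the first SLI above concrete, here is a small sketch that evaluates it from raw latency samples; the sample values are made up and the percentile uses a simple nearest-rank method.

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) of the given samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies (ms) observed for the home page over the last 5 minutes.
last_5m_latencies_ms = [120.0, 180.0, 240.0, 310.0, 450.0, 470.0, 490.0]

sli_value = percentile(last_5m_latencies_ms, 95)
print(f"p95 latency = {sli_value} ms, under 500ms: {sli_value < 500}")  # 490.0 ms, True
```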

SLOs

Expression: [Success Objective][SLI][Period]

  • 99% of 95th percentile home page latency over 5 minutes < 500ms over the trailing month
  • 99% of home page request response codes != 5xx over the last 7 days
  • 95% of home page requests served in < 100ms over the last 24 hours
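
And a sketch of checking the second SLO above against a week of hypothetical per-request response codes:

```python
# Hypothetical response codes served by the home page over the last 7 days.
last_7d_response_codes = [200] * 99_500 + [503] * 500

good = sum(1 for code in last_7d_response_codes if not 500 <= code < 600)
attainment = good / len(last_7d_response_codes)

success_objective = 0.99  # "99% of ... over the last 7 days"
print(f"attainment: {attainment:.3%}, SLO met: {attainment >= success_objective}")
# attainment: 99.500%, SLO met: True
```
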
A general process for getting from a system to its SLOs:

  1. Identify system boundaries
  2. Define capabilities exposed by each system
  3. Plain-english definition of “available” for each capability
  4. Define corresponding technical SLIs
  5. Start measuring to get a baseline
  6. Define SLO targets (per SLI or per capability)
  7. Iterate and tune

Data Tier (a message bus system)

| Tier | Capability | SLI | SLO |
| --- | --- | --- | --- |
| Ingest/Routing | Ingest | Percent of well-formed payloads accepted | 99.9% |
| Ingest/Routing | Routing | Time to deliver message to correct destination | 99.5% of messages in under 5 seconds |
| Horizontally Scaled Data Tier | Query Data | Latency | |
| Horizontally Scaled Data Tier | Query Data | Correctness/Error rate | |
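
As an illustration, the ingest-tier SLIs in the table could be computed roughly as below; the counters and latency list are hypothetical.

```python
# Ingest: percent of well-formed payloads accepted (hypothetical counters).
well_formed_received = 1_000_000   # well-formed payloads submitted to ingest
well_formed_accepted = 999_120     # of those, how many were accepted
ingest_sli = well_formed_accepted / well_formed_received
print(f"well-formed payloads accepted: {ingest_sli:.3%}  (SLO: 99.9%)")

# Routing: time to deliver each message to the correct destination (seconds).
delivery_latencies_s = [0.4, 1.2, 2.8, 3.1, 4.9, 6.5]
routing_sli = sum(1 for t in delivery_latencies_s if t < 5) / len(delivery_latencies_s)
print(f"messages delivered in under 5s: {routing_sli:.1%}  (SLO: 99.5%)")
```
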
  1. Note: SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial—a rebate or a penalty—but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.