## Introduction - Compound (a.k.a composite) SLAs

Cloud providers publish individual Service Level Agreements (SLAs) for each service they provide. We want to calculate the compound availability of a system built from the following components:

- Azure Traffic Manager: 99.99% or ‘four nines’.
- SQL Azure: 99.99% or ‘four nines’.
- Azure App Service: 99.95% or ‘three nines five’.

When these services are combined in an architecture, any one component could suffer an outage, so the overall availability is not equal to that of the individual component services.

### Serial Compound Availability

In this example there are three possible failure modes:

- SQL Azure is down
- App Service is down
- Both are down

### Serial and Parallel Availability

This architecture has a large number of failure modes, principally:

- SQL Server in RegionA is down
- SQL Server in RegionB is down
- App Service in RegionA is down
- App Service in RegionB is down
- Traffic Manager is down
- Combinations of the above

## Calculation

We can treat this as a math problem, with the SLA being the probability of a service being OK. We can then rely on the rules of probability to get an overall figure.

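For the first case (Serial Compound Availability), the probability that App Service (A) and SQL Azure (B) are down at the same time is the product of their individual failure probabilities. Both are taken as 0.0005 here, i.e. 99.95% availability, with the SQL Azure figure simplified to match App Service: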

```
P(A)*P(B) = 0.0005 * 0.0005 = 0.00000025
```

A first estimate of the probability that either of them is down is the sum of their probabilities:

```
P(A)+P(B) = 0.001
```

That sum counts the case where both are down twice. When the two events are independent, the probability that at least one of them is down is therefore:

```
P(A or B) = P(A) + P(B) - P(A)*P(B) = 0.001 - 0.00000025 = 0.00099975
```

So the overall SLA would be `1 - 0.00099975 = 0.99900025`, which in percent is `99.900025 %`.

A shortcut is to multiply the two availabilities directly: `0.9995 * 0.9995 = 0.99900025`. This gives the same result because `1 - (P(A) + P(B) - P(A)*P(B)) = (1 - P(A)) * (1 - P(B))`.
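As a quick check of the serial case, here is a minimal Python sketch that computes the availability both ways. The variable names are illustrative only, and both components are assumed to sit at 99.95% and to fail independently, as in the text above.

```python
import math

# Assumption (from the text above): both components are taken at 99.95 %.
app_service = 0.9995
sql_azure = 0.9995

# Failure probability of each component.
p_a = 1 - app_service
p_b = 1 - sql_azure

# Long form: probability that at least one component is down
# (inclusion-exclusion for independent events), then back to availability.
p_any_down = p_a + p_b - p_a * p_b
long_form = 1 - p_any_down

# Short form: both components must be up at the same time.
short_form = app_service * sql_azure

print(f"long form : {long_form:.8f}")
print(f"short form: {short_form:.8f}")
print("equal     :", math.isclose(long_form, short_form))
```

The two forms only agree if the failures really are independent; correlated outages would need a different model.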

Let’s assume that each service was unavailable for one hour today (4.166666% of a day). This gives:

```
0.04166666 + 0.04166666 - (0.04166666 * 0.04166666) = 0.081597222
```

So the probability of being OK is `1 - 0.0816 = 0.9184`, or `91.84%` in percent. Over the day, the expected downtime is:

```
24 * 0.0816 = 1.96 h
```

This is less than the worst case of 2 hours because there’s a chance both are down at the same time.

Keeping that in mind, you may notice that the availability of each service is `95.83%`, and `0.958333333 * 0.958333333 = 0.918402778`, which is our `91.84%` from above (sorry for the full decimals here, but they are needed for the demonstration).
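The one-hour example can be reproduced with a few lines of Python. This is only a sketch: `outage_fraction` is a hypothetical helper, and the outages are assumed to be independent and to fall at random within the day.

```python
HOURS_PER_DAY = 24.0


def outage_fraction(hours_down: float, period_hours: float = HOURS_PER_DAY) -> float:
    """Hypothetical helper: fraction of the period during which a service was down."""
    return hours_down / period_hours


# Each service was down for one hour today.
p_a = outage_fraction(1.0)
p_b = outage_fraction(1.0)

# Probability that at least one of them is down (independent events).
p_any_down = p_a + p_b - p_a * p_b

# Same result as multiplying the two availabilities (23/24 each).
availability = 1 - p_any_down

# Expected downtime of the combined system over the day, in hours.
expected_downtime_h = HOURS_PER_DAY * p_any_down

print(f"P(any down)       : {p_any_down:.9f}")
print(f"availability      : {availability:.4%}")
print(f"expected downtime : {expected_downtime_h:.2f} h")
```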

Now for the second case (Serial and Parallel Availability), we’ll start from the compound probability of each region (sorry, I ignored the different figure for SQL to keep things reasonable). We assume that the region itself has no independent failure probability and that each region is isolated, so a DB failure takes down only its own region.

We have the Traffic Manager OK probability `P(T) = 0.9999` and each app+DB pair with an OK probability `P(G) = 0.99900025`.

The number of regions matters, because we multiply the failure probabilities of the regions to get the probability of both being down at the same time: `0.00099975 * 0.00099975 = 0.0000009995000625`. Subtracting this from 1 gives `0.9999990004999375`, so the availability of at least one region is about `99.99990005 %`.

Now that we have the overall availability of the regions, its product with that of the Traffic Manager gives us the overall availability of the system:

```
0.9999 * 0.9999990004999375 = 0.99989900059988750625
```

The overall availability is `99.989900 %`.
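The same chain of steps can be written out in Python as a sanity check. This is a sketch under the assumptions stated above: two isolated regions, independent failures, and 99.95% for both App Service and SQL. The variable names are illustrative.

```python
# Availability figures used in the text above.
traffic_manager = 0.9999
region = 0.9995 * 0.9995  # app + DB in series within one region

# Probability that a single region is down.
p_region_down = 1 - region

# Both regions are down only if each one is down (independent events).
p_both_down = p_region_down * p_region_down

# Availability of "at least one region is up".
regions = 1 - p_both_down

# Traffic Manager sits in series in front of the regions.
system = traffic_manager * regions

print(f"P(both regions down): {p_both_down:.16f}")
print(f"regions availability: {regions:.10%}")
print(f"system availability : {system:.6%}")
```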

## Putting it together

For serial compound availability, multiply the probabilities of each component being available:

99.95% * 99.95% = 99.900025% ≈ 99.9%

Calculating it in parallel is a little more complicated, as we need to consider the percentage *un*availability. The calculation is done as follows:

- Multiply the *un*availability of the two regions together: 0.1% * 0.1% = 0.0001%
- Convert that back to availability: 100% - 0.0001% = 99.9999%
- Multiply the Traffic Manager availability by the availability of the two regions: 99.99% * 99.9999% = 99.9899%
- The result is the whole system availability: 99.9899%, which is close to 99.99%

Here’s a summary of the calculation:

Component | Availability | Excel Formula |
---|---|---|
C1) Traffic Manager | 99.99% | 0.9999 |
C2) App Service | 99.95% | 0.9995 |
C3) SQL Azure | 99.95% | 0.9995 |
C4) Region A | 99.900025% | =C2*C3 |
C5) Region B | 99.900025% | =C2*C3 |
C6) Compound | 99.99990% | =1-((1-C4)*(1-C5)) |
System | 99.9899% | =C1*C6 |
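For readers who prefer code to a spreadsheet, the table can be reproduced with two small helpers. This is a hedged sketch: `serial` and `parallel` are hypothetical names introduced for illustration, and both assume that component failures are independent.

```python
import math
from typing import Iterable


def serial(availabilities: Iterable[float]) -> float:
    """All components must be up: multiply the availabilities."""
    return math.prod(availabilities)


def parallel(availabilities: Iterable[float]) -> float:
    """At least one component must be up: 1 minus the product of the unavailabilities."""
    return 1 - math.prod(1 - a for a in availabilities)


# Same cell names as in the summary table.
c1 = 0.9999              # Traffic Manager
c2 = 0.9995              # App Service
c3 = 0.9995              # SQL Azure (simplified to 99.95 %, as in the table)
c4 = serial([c2, c3])    # Region A
c5 = serial([c2, c3])    # Region B
c6 = parallel([c4, c5])  # Compound: at least one region up
system = serial([c1, c6])

for name, value in [("Region A", c4), ("Region B", c5), ("Compound", c6), ("System", system)]:
    print(f"{name:<10} {value:.6%}")
```

Because the helpers take any iterable, they can be nested to describe other layouts, for example a third region passed to `parallel`.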

Another explanation is available in Azure’s docs (link courtesy of Raj Rao).

References: