Service Level Objectives (SLOs) are fast becoming the industry standard for measuring reliability and for helping teams decide when to prioritize it. One of the first steps in adopting an SLO culture is identifying the metrics that matter without drowning in noise and alert fatigue. This article explores how to apply the black box concept to aggregate granular metrics into Service Level Indicators (SLIs) that treat the user experience as the indicator of system reliability.
To SLI or not to SLI
In general terms, SLOs define targets for the appropriate level of reliability of a given product, such as a service or a website. SLOs are applied to, or informed by, SLIs. An SLI is a measurement derived from a metric, a piece of data representing some property of a service. And this is where we, as engineers, can get lost in the details: our perpetual proximity to the systems we build and support often leads us to think of system reliability in technical terms or metrics (e.g. response time, error rate, throughput). While these are certainly valuable metrics, user experience may be compromised even when the error rate is zero and response times are well within the SLO. Consider, for example, the response data. Even if well formed, it may be stale, or flat-out wrong. An error-free and quick response is of no value to a user who expects current and correct data. Error rate and response time remain valuable metrics and SLIs, but focusing exclusively on them lets higher-level issues go undetected.
We could add freshness and correctness SLIs, but doing so increases the number of signals we monitor. And with each additional signal, i.e. an SLI with its associated SLO and error budget, we increase alert frequency and make reliability reports unnecessarily complex. In other words, adding SLIs may address a particular aspect of system reliability, but it also introduces additional complexity.
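To make the gap concrete, here is a minimal sketch (the response fields and thresholds are illustrative assumptions, not a real API) of a response that passes the classic error-rate and latency checks while still failing the user on freshness:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical response from a service: technically healthy, but stale.
response = {
    "status_code": 200,                                              # no error
    "duration_ms": 45,                                               # fast
    "data_timestamp": datetime.now(timezone.utc) - timedelta(hours=6),
}

# The classic checks: both pass.
error_free = response["status_code"] < 500
fast_enough = response["duration_ms"] <= 300  # assumed 300 ms latency SLO

# The freshness criterion the user implicitly cares about: fails.
fresh = datetime.now(timezone.utc) - response["data_timestamp"] <= timedelta(minutes=5)

print(error_free, fast_enough, fresh)  # True True False
```

Measured only by error rate and latency, this response looks perfect; the six-hour-old payload goes unnoticed.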
Tales of Black and White Boxes
So let’s take a step back and borrow a concept from a related discipline: quality engineering, in particular software testing. Tests commonly fall into one of two categories: black box tests or white box tests.
In systems theory, the black box is an abstraction representing a system that can be viewed solely in terms of its inputs (stimuli) and outputs (reactions), without any knowledge of its internal workings. A given input is expected to result in a particular output, without any consideration of the processing steps. In testing, common examples include end-to-end tests.
White box tests, by contrast, are designed with knowledge of an application’s internal structures and workings, and to test them. Common examples include unit and integration tests.
User Journey as Black Box
Now that we understand the concept of black box vs. white box tests, let’s apply it to our SLIs. As mentioned above, a good SLI considers the entire user journey. Conceptually, a user journey aligns with the black box paradigm: for a given input, a particular output is expected. For example: requests to our API (the ‘input’) result in responses that provide fresh data to clients within a given time frame (the ‘output’, including success criteria). Several aspects of this SLI are worth noting:
- the SLI is applied at a system level;
- the SLI aggregates lower-level metrics implicitly and explicitly; and
- the SLI is binary: it is either true or false.
These aspects combine into an SLI that represents the user experience (system level) and supports binary pass/fail attribution against an SLO target. In other words, the user journey is measured as a black box SLI.
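As a sketch of these three aspects (the journey fields, thresholds, and sample data are assumptions for illustration), a binary black box SLI can be evaluated per user journey and then aggregated into SLO attainment:

```python
from dataclasses import dataclass

@dataclass
class Journey:
    """One observed user journey: request in, response out."""
    duration_s: float   # time from input to output
    data_fresh: bool    # output met the freshness criterion
    data_correct: bool  # output met the correctness criterion

def sli_good(j: Journey, max_duration_s: float = 2.0) -> bool:
    """Binary black box SLI: the journey met all success criteria, or it didn't."""
    return j.duration_s <= max_duration_s and j.data_fresh and j.data_correct

def slo_attainment(journeys: list[Journey]) -> float:
    """Fraction of journeys that passed the SLI, to compare against an SLO target."""
    return sum(sli_good(j) for j in journeys) / len(journeys)

journeys = [
    Journey(0.4, True, True),
    Journey(1.8, True, True),
    Journey(0.3, False, True),   # fast and error-free, but stale: SLI fails
    Journey(2.5, True, True),    # too slow: SLI fails
]
print(slo_attainment(journeys))  # 0.5
```

Note that the lower-level criteria (duration, freshness, correctness) exist only inside the evaluation; what is reported and alerted on is the single pass/fail signal per journey.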
White box to Black box: an example
Let’s consider a concrete example. A user requests a new account on a website. After the request is processed successfully, the user receives a confirmation email with an activation link. The user follows the link to activate the new account and log in. This workflow is visualized in the following sequence diagram.
Of particular interest are the account creation and the user notification via email. Both steps occur asynchronously. The event processing engine in particular, where the request for account creation is queued, offers several opportunities for insightful SLIs: queue length, average processing time, and so on. Those SLIs, however, fall into the white box category: they contribute to the user experience, yet are opaque to the user. The user journey begins with the initial request for account creation (input) and ends with the email containing the activation link (output). Rephrasing the example from earlier, a user-focused (black box) SLI could be ‘a request for a new account (the ‘input’) results in an email with a valid activation link being sent within 1 minute (the ‘output’, including success criteria)’. This single high-level SLI aggregates several lower-level metrics: it measures many things by measuring only a few.
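A sketch of how this black box SLI might be computed from two event streams, matching each account request to the activation email it triggered. The event shapes, identifiers, and timestamps are assumptions; only the 1-minute threshold comes from the example:

```python
# request_id -> time the account request was received (epoch seconds)
requests = {
    "r1": 1000.0,
    "r2": 1010.0,
    "r3": 1020.0,
}
# request_id -> time the activation email was sent
emails = {
    "r1": 1030.0,   # 30 s later: good
    "r2": 1095.0,   # 85 s later: too slow
    # r3: no email ever sent, e.g. the processing queue is stuck
}

def journey_good(request_id: str, deadline_s: float = 60.0) -> bool:
    """Black box SLI: input = account request, output = activation email in time."""
    sent = emails.get(request_id)
    return sent is not None and sent - requests[request_id] <= deadline_s

results = {rid: journey_good(rid) for rid in requests}
print(results)  # {'r1': True, 'r2': False, 'r3': False}
```

Notice that the SLI never inspects the queue itself: a stuck queue, a slow mailer, or a crashed worker all surface identically, as journeys that failed.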
Let’s switch to the engineering mindset mentioned in the introduction and assume the processing queue is stuck. The high-level black box SLI does not capture queue-specific metrics, which might suggest that a more granular SLI for queue size is needed. However, a stuck queue will show up as error budget burn on the aggregate SLO associated with the high-level black box SLI. Monitoring and observability tools allow engineers to diagnose and troubleshoot particular issues, such as a stuck queue, while understanding their impact on system reliability (via the higher-level black box SLO’s error budget and burn rate). The solution to the stuck processing queue in this example is not an SLI dedicated to the queue, but reliability-focused work to diagnose and correct the root cause of the queue getting stuck.
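For illustration (the SLO target and the sample counts are assumptions), here is how the stuck queue would surface as error budget burn on the black box SLO, rather than requiring a dedicated queue SLI:

```python
def burn_rate(good: int, total: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to plan.

    A burn rate of 1.0 means the budget lasts exactly one SLO window;
    higher values mean the budget is being spent faster than planned.
    """
    error_budget = 1.0 - slo_target
    observed_error_rate = 1.0 - good / total
    return observed_error_rate / error_budget

# Normal operation: a handful of slow or failed journeys.
print(burn_rate(good=9990, total=10000))  # ~1.0, burning exactly on budget

# Queue stuck: activation emails stop going out, journeys fail en masse.
print(burn_rate(good=9000, total=10000))  # ~100, a burn-rate alert fires
```

The alert is triggered by the user-visible symptom (journeys failing); the queue metrics then serve as diagnostic white box signals during troubleshooting.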
This article introduced an SLI thought model that borrows a common paradigm from quality engineering. It offers a different way to think about SLIs and supports the fundamental objective of SLIs and the reliability stack they inform: ensuring a positive and reliable user experience by measuring reliability and providing quantitative support for decisions on prioritizing development efforts. Only a happy user is a continuous user.
If this kind of challenge excites you, then maybe you should join us!