Implementing SLOs-as-Code - a Case Study

By managing Service Level Objectives as code, we can co-locate SLO definitions and ownership with the product code and team. This supports horizontal scaling of SLO ownership while integrating with the code management process, and creating an audit trail for SLOs.

Implementing SLOs-as-Code - a Case Study

Introduction

When you measure the reliability of hundreds or even thousands of products and services across an enterprise, ownership of Service Level Objectives (SLOs) should not reside with a single team. There are a few aspects that make scaling SLO ownership horizontally across teams much more efficient: a standardized SLO management platform, process architecture, and automation across the enterprise. Managing SLOs as code alongside product code is a step towards 360° product ownership and enables us to automate SLO updates via continuous integration.

Objectives and Architecture

The larger objective behind the SLOs-as-code approach is to move towards 360° product ownership. At Procore, teams own their product’s code and processes, including aspects like testing, performance, reliability, automated deployment pipelines, and deployment configurations. Adding SLO ownership to a team's portfolio will also help improve product reliability.

Implementing SLOs as code involves three architectural components, which will be discussed in detail in their own respective sections. In summary, these components are:

Figure 1: SLOs as code Architecture
  • SLO objects and definitions: Standardized SLOs and related resources facilitate centralized management and automation.
  • Automation: SLOs as code are our single and reproducible source of truth, which we automate via Continuous Integration.
  • A common platform: Nobl9 is the SLO management platform upon which SLOs, related resources and automation are built.

SLO Objects and Definitions

Procore's Observability team has designed our SLO-as-code approach to scale with Procore's growing number of teams and services. Choosing YAML as the source of truth allows Procore a scalable approach for the company through centralized automation. Following the examples put forth by openslo.com and embracing a ubiquitous language like YAML helps avoid adding the complexities of Terraform for development teams and is easier to embed in every team's directories. We used a Gitops approach to Infrastructure as Code (IaC) to create and maintain our Nobl9 resources.

The Nobl9 resources can be defined as YAML configuration (config) files. In particular, one can  declaratively define a resource's properties (in the config file) and have a tool read and process that into a live and running resource. It's important to draw a distinction between the resource and its configuration, as we'll be discussing both throughout this article. All resources, from Projects (the primary grouping of resources in Nobl9) to SLOs, can be defined through YAML.

Figure 2. A simple visualization of Nobl9 object relationships. A Project is a top-level object. Projects contain Services, SLOs are attached to Services, and SLOs can trigger alerts via project-scoped Alert Policies. Finally, Role Bindings define who has access to the Project contents.

Procore adopted a hybrid approach to organizing our Nobl9 configuration so that the Observability team can review systemic changes while teams still own changes to their service SLOs. A separate central repository is the source of truth for all other Nobl9 configurations, such as Project and Alert Policy definitions. As projects have a one-to-many relationship with their services, it would quickly become a game of "guess where" if the project configuration were defined within one of its service repositories. The central repository is owned by the Observability team and allows us to manage permissions through pull request reviews submitted by product teams for their non-SLO Nobl9 resources. Once requests are merged, our automation applies the changes. The SLO definitions are co-located with the service's code. Teams self-regulate these resources to maintain their SLOs internally.

Automation

Automating the creation and modification of Nobl9 resources is essential to Procore — it makes iterating on the deployment process quick and painless for our engineering teams. Automation removes human error and possible complications that usually come along with manually applying config files via CLI tools.

With that in mind, the Observability team created a CI job/workflow that our engineering teams copy into their project repo during their SLOs-as-code onboarding. Procore’s CI job utilizes the Nobl9 sloctl docker image, so we do not have to install the sloctl CLI tool in our CI containers. We configured this job to only apply the configs that have been added or updated, which helps to future-proof our pipeline as we scale the number of our SLO configs.

Figure 3: SLOs-as-code Automation Workflow

This requires the following steps in the workflow:

  1. An engineer creates or modifies a config and commits the changes to GitHub.
  2. The engineer opens a PR, gets reviews, and merges their changes into the main branch.
  3. CI picks up the changes in GitHub and kicks off the following steps:
    3.1 Using git diff and regex pattern matching, we figure out what configs have been added or changed from the most recent merge to main:
        #!/bin/bash
        ADDED_OR_CHANGED_CONFIGS=$(git diff --name-only --diff-filter=d 
        HEAD^ HEAD | { grep -E "^${NOBL9_RESOURCE_DIR}\/.*\.y[a]?ml$" 
        || :; })
    
    3.2 We loop over the added/changed configs and run sloctl apply using the nobl9/sloctl docker image, which updates our resources in Nobl9:
        #!/bin/bash
        for file in “$ADDED_OR_CHANGED_CONFIGS”
        do
            docker run nobl9/sloctl:v0.0.80 apply -f “$file”
        done
    

We use the same workflow in our centralized Infrastructure as Code (IaC) repo that manages our projects, services, role-binding and alert policy configs.

Common Platform

Our platform requires another key feature for SLOs as code to work: We need to be able to consolidate different observability data sources and telemetry streams that power SLIs. This platform needs to support standardized data formats to define SLOs and related entities, such as alert policies for different conditions and trends, while allowing for the management of these formats via automation. We evaluated three vendors that offered SLO solutions, and only Nobl9 met all our requirements, which extended beyond the SLOs-as-code topic of this article.

Observations

Adopting SLOs as code can be impeded by three particular barriers of entry: new tooling (Nobl9 platform), data and format (SLO configs and resources), and process. To drive adoption, we have designed a self-service model supported by an open-door consulting practice. While this is an ongoing learning process for all involved, we are seeing early signs of success of an approach that uses:

  • Detailed, step-by-step instructions, including code snippets and templates.
  • Screencast tutorials supplementing the instructions.
  • Initial face-to-face, white-glove meetings to explain concepts and provide demos. The personal experience and conversation in our experience created considerable enthusiasm and goodwill among our teams. It proved to be a much better approach than just pointing them to documentation.
  • Fast feedback cycles with onboarding teams to improve instructions.

Quite frequently, teams stated their primary challenges were not the technical aspects of SLOs as code discussed in this article but how to design and improve on meaningful SLOs. We addressed this by authoring articles like Black Box SLIs and other internal resources for guidance. While posts can be a useful tool, we found that working with teams and discussing how their product impacts the user experience was more beneficial to modeling their SLOs using observability data.

Last but not least, we are integrating with our Embedded SRE team to help onboard more services. When a product team engages with ESRE on their deployment strategy and configuration, ESRE also discusses onboarding to SLOs-as-code and basic SLOs, such as error rate or duration.

Summary

In this article, we describe how Procore adopted an Infrastructure as Code approach to SLOs. We present the architecture, data structures, and automation we developed and implemented to co-locate SLO ownership with the product teams and allow teams to manage SLOs as code.

If challenges like this excite you, consider joining Procore’s Engineering Team!