We are undergoing an infrastructure evolution at Procore, which has adopted a microservices architecture. In order to meet the increased demand to run more microservices, the Procore Cloud Platform Orchestration team has been tasked with provisioning and maintaining a growing number of Kubernetes clusters.

With the increased scale of Kubernetes workloads has come a new challenge. We needed to ensure the various components and add-ons that make up a Procore Kubernetes Cluster (a “cluster”) stay aligned across clusters and environments to provide a consistent user experience. To accomplish this, we modified our codebase to make deploying new clusters and maintaining the existing ones easier. Specifically, we (1) reduced the size of our existing codebase, (2) enforced consistency across our clusters, and (3) boiled down the differences between clusters to a single override file for each cluster.

Reduce the Size of the Codebase

Our first modification was reducing the size of the terraform codebase we use to provision new clusters. To accomplish this, we applied the “Don’t Repeat Yourself” (DRY) software development principle. Before the refactoring, each directory defining a cluster contained a separate file for each cluster component. With dozens of individual modules defining each cluster, it was difficult to determine which features, and which versions of those features, were running across all our different clusters.

To tackle this issue, we rolled all of our individual modules up into two wrapper modules. The first, named procore_cluster_base, wraps the low-level networking and the configuration of the EKS cluster itself. The cluster base module outputs information about the cluster that configures the helm and kubernetes terraform providers consumed by the second wrapper module. That second module, called procore_cluster_addons, contains the configuration for higher-level features such as networking, storage, monitoring, and secrets management.
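
To give a feel for the shape of this, here is a minimal sketch of how the base wrapper and provider wiring might look. The source path and output names (cluster_endpoint, cluster_ca_data, cluster_auth_token) are hypothetical, not our exact interface:

module "procore_cluster_base" {
  # Hypothetical relative source path for illustration.
  source = "../modules/procore_cluster_base"

  cluster_name = "example"
}

# The base module's outputs feed the kubernetes and helm providers
# that the addons wrapper module relies on.
provider "kubernetes" {
  host                   = module.procore_cluster_base.cluster_endpoint
  cluster_ca_certificate = base64decode(module.procore_cluster_base.cluster_ca_data)
  token                  = module.procore_cluster_base.cluster_auth_token
}

provider "helm" {
  kubernetes {
    host                   = module.procore_cluster_base.cluster_endpoint
    cluster_ca_certificate = base64decode(module.procore_cluster_base.cluster_ca_data)
    token                  = module.procore_cluster_base.cluster_auth_token
  }
}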

Each of our two procore_cluster_ wrapper modules uses symlinks to make relative references to the individual modules contained within them. When the code is updated for one of our individual component modules, the changes are rolled up into the wrapper modules. The change is then versioned in source control, and any cluster using the wrapper modules is updated with the change by moving to the new version.
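
Concretely, a cluster picks up a component change by bumping the wrapper module’s source to the new release. The repository URL and version tag below are made up for illustration:

module "procore_cluster_addons" {
  # Bumping ?ref= to a new tag rolls the cluster forward to that release.
  source = "git::https://github.example.com/procore/terraform-modules.git//procore_cluster_addons?ref=v1.4.0"

  cluster_name = "example"
}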

Enforce Consistency Across Clusters

With our cluster definitions centralized in two wrapper modules, cluster configurations no longer drift by running different code versions of each underlying component. However, pulling each existing cluster into this lock-step state was a challenge because of the drift we had already accumulated.

The first step in aligning the clusters’ state was renaming each individual terraform module to its new address as a component of a wrapper module. Terraform uses a resource instance address to track existing remote objects, and since we did not want to destroy and recreate our clusters, we needed to rename our modules using the `terraform state mv` command. For each cluster, we wrote a script that ran `terraform state mv` for each module. The starting names of the modules were slightly different between clusters; however, since every cluster was moved to the same wrapper modules, we were able to clean up the naming inconsistencies between cluster components.
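
The migration script for each cluster was essentially a series of address moves. A minimal sketch, using illustrative module names rather than our real ones:

#!/usr/bin/env bash
set -euo pipefail

# Move each standalone component module under its new wrapper module address.
terraform state mv 'module.vpc' 'module.procore_cluster_base.module.vpc'
terraform state mv 'module.eks' 'module.procore_cluster_base.module.eks'
terraform state mv 'module.external_dns' 'module.procore_cluster_addons.module.external_dns'
terraform state mv 'module.cluster_autoscaler' 'module.procore_cluster_addons.module.cluster_autoscaler'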

The riskiest part of the project was renaming our terraform modules correctly so that `terraform plan` would not destroy our existing resources. To mitigate the risk, we temporarily switched to a local backend to verify the state changes made by the migration script without altering our remote backend state.

terraform {
  backend "local" {
    path = "ALTERED_CLUSTER_STATE.json"
  }
}

When we were fully satisfied with the `terraform plan` output from the local backend state, we switched the backend config back to the s3 remote state and pushed the local state file up to the remote bucket. A local terraform backend is not generally recommended for shared infrastructure, but in this particular use case it served as an intermediate step that let us safely verify our changes without altering our existing remote state.
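
Switching back simply meant restoring the remote backend block; the bucket name and key below are placeholders:

terraform {
  backend "s3" {
    # Placeholder bucket, key, and region.
    bucket = "example-terraform-state"
    key    = "clusters/example/terraform.tfstate"
    region = "us-east-1"
  }
}

One way to upload the verified local state is `terraform state push ALTERED_CLUSTER_STATE.json`; `terraform init` will also offer to migrate existing state when it detects the backend change.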

Once we verified our state changes and pushed the new state up to our remote backend, we needed to apply the new `terraform plan` to each cluster. Because of the accumulated drift, each cluster’s plan was slightly different; some clusters needed updates to align with the latest version of our cluster terraform module codebase. Cluster by cluster, we ran `terraform apply` to bring each one up to date with the latest version of each individual component module, all wrapped up in our new wrapper modules. With the clusters now aligned on the latest code, let’s take a look at the top-level configuration of each cluster.

Contain Cluster Value Overrides in a Single Point of Truth

Although the clusters are aligned in using the same underlying source modules, there are still minor differences between them. For example, each cluster has a different name, and clusters span multiple accounts and regions, which must be specified on a per-cluster basis. We decided to use a single `locals.tf` file per cluster to define these differences. In this file, we pass in the values for a particular cluster and use booleans prefixed with `enable_` to turn on features that only run on certain clusters.
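
As an illustration, a cluster’s `locals.tf` might look something like this; the account, region, and flag names are examples rather than our real configuration:

locals {
  cluster_name = "example"
  aws_account  = "123456789012"
  aws_region   = "us-west-2"

  # Feature flags that turn on add-ons for this particular cluster.
  enable_external_dns = true
  enable_gpu_nodes    = false
}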

Local values have a few advantages over other terraform value types. They are more flexible because the author can interpolate one local into another, which reduces hardcoding the same string in many places across the `locals.tf` file.

locals {
  cluster_name   = "example"
  cluster_domain = "${local.cluster_name}.example.com"
}

Terraform also allows data block references as local values, which further helps consolidate each cluster’s unique definitions into a single file in our cluster configuration.
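
A small sketch of that pattern, assuming an illustrative `aws_vpc` data source:

data "aws_vpc" "cluster" {
  tags = {
    Name = local.cluster_name
  }
}

locals {
  # Look the VPC up once here, then reference local.vpc_id everywhere else.
  vpc_id = data.aws_vpc.cluster.id
}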

Top-level cluster configuration files

This brought our refactored cluster definition down to a manageable number of files: a few files for the boilerplate terraform setup, the two wrapper module definitions, and the `locals.tf` file defining all the values passed into that particular cluster.

Conclusion

Let’s have a look at the before and after for our cluster configuration. The actual numbers varied slightly between clusters, but this is what we found on average. We reduced the number of terraform modules needed to create a cluster from twenty-three to two. We deleted a net one hundred and eighty-three lines of code per cluster. And finally, we reduced the number of files defining each cluster at the top level from twenty-five to nine.

By following the DRY principle to reduce the configuration needed per cluster, we have provided a consistent platform to our service-author users to run the services that make up Procore.

If challenges like Infrastructure as Code and cloud orchestration excite you, come join us!