Introduction to Site Reliability Engineering

Note: The tutorial GitHub repository can be found here.

There is an old adage, “you can’t improve what you can’t measure”.

For years, engineers cared about individual servers and the applications that ran on top of them. Web applications were built with simple architectures and deployed to colocated servers in an artisanal fashion. Then the web started to change: systems suddenly needed to become more resistant to failure as more and more people began to rely on web applications as part of their everyday lives. As these systems have scaled both horizontally and vertically, become more resilient to failure, and taken on more design considerations like message queues and service-oriented architectures, our mindset on how to validate them has also evolved. Practices like structured logging, profiling, and distributed tracing have become more commonplace, helping us debug production systems and understand problems before they become too difficult to fix.

Adding telemetry to an application helps you understand how the application is performing and visualize the end-user experience. Adding specific metrics around the operations that are important to your business allows application owners to determine exactly when they need to take action on the services they build and own in an engineering organization. Let's discuss the fundamentals of Site Reliability Engineering through SLOs, SLIs, and SLAs.

SLOs, SLIs, and SLAs

Web applications are now built to scale and be resilient, but we have to remember the reason we are building them: our customers. Many development jobs exist because of the value of building a product for a customer, and Service Level Indicators (SLIs) and Service Level Objectives (SLOs) help to demonstrate that business value. Site Reliability Engineers use SLIs to measure how a service is behaving and SLOs to set a target for those measurements, and they take action if the objective is not being met. The latency of a request is an example of an SLI. An example of an SLO is “99% of requests will be completed in 250ms”. If these SLOs aren’t being met, feature work should stop until the problems behind the missed objectives are fixed. If your web application stops performing in a way that’s acceptable to users, you’ll very likely lose revenue and their trust. There have been many studies on the correlation between web speed and customer retention and satisfaction. SLIs and SLOs can directly help to improve the performance and reliability of your web application, and in turn keep your users happy and the invoices paid.

Fundamental SLIs and SLOs drive the creation of Service Level Agreements (SLAs). After you’ve set an SLO for your service, you can confidently set a baseline Service Level Agreement that you can share with your business stakeholders to offer to your customers or partners.

Tracking Reliability Using SLOs and SLIs

There are many approaches for tracking reliability of a service. Tom Wilkie’s RED method for metrics is a popular choice based on its simplicity and ease of implementation.  We will be using this method for our example application.  RED is an acronym that is used for the following principles:

Requests:  The number of requests that your system receives.

Errors: The number of errors that your system emits.

Duration: The amount of time it takes to complete a request in your system.

The Four Golden Signals by Google follows some similar principles, but also includes saturation in your system.  If you’d like to include saturation (determining exactly how "full" your service is) in your chart, you should be able to retrieve it from your traffic ingress.

Once we have these metrics available to us, we can quickly determine the health of the system.  Knowing how many requests come in, how many errors we are producing, and the duration of requests over a particular timeframe will help us to understand the overall health of our system and eventually offer a relevant SLA.

Example Application

We can see an example application at the following URL:

https://github.com/bobstrecansky/SLIsSLOsWhatAreThose

Our example application is broken down into 5 separate Docker containers, using some popular open source tools that are freely available for download:

  • goapp - the application that we are instrumenting SLIs and SLOs around
  • prometheus - the Prometheus monitoring application that is used both for scraping Prometheus exporters and displaying information
  • alertmanager - an alerting tool that we can use to alert whomever we choose, based on specific criteria from Prometheus
  • grafana - a graphing tool that we use to display graphs of our metrics and visualize error budgets and service level objectives
  • loadtest - a simple load testing Docker container (using the hey application) that generates some artificial load in our test system

Let’s discuss each of these containers.

goapp

In this example application, we have exposed two endpoints:

This example is meant to be simple so that we can keep our focus on SLIs and SLOs around this service.  Your application will have more complex endpoints that could call for additional monitoring. 

This example exposes two Prometheus metrics: a CounterVec, used as a counter for the total number of incoming requests, and a HistogramVec, used as a histogram for request latency.
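If counters and histograms are new to you, the following Go sketch shows the idea behind them using only the standard library. It is a simplified stand-in and not the Prometheus client library that the example application uses: the requestCounter and histogram types are hypothetical, and the bucket bounds are arbitrary. The one detail worth noticing is that histogram buckets are cumulative, so each bucket counts every observation at or below its upper bound.

```go
package main

import "fmt"

// requestCounter is a simplified stand-in for a labeled counter:
// one monotonically increasing count per status code.
type requestCounter map[int]uint64

func (c requestCounter) inc(status int) { c[status]++ }

// histogram is a simplified stand-in for a latency histogram with
// cumulative buckets: counts[i] counts observations <= bounds[i].
type histogram struct {
	bounds []float64 // upper bounds in seconds, ascending
	counts []uint64  // cumulative count per bound
	total  uint64    // all observations, like an implicit +Inf bucket
	sum    float64   // sum of all observed values
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// observe records one value in every bucket whose upper bound covers it.
func (h *histogram) observe(seconds float64) {
	for i, b := range h.bounds {
		if seconds <= b {
			h.counts[i]++
		}
	}
	h.total++
	h.sum += seconds
}

func main() {
	requests := requestCounter{}
	latency := newHistogram([]float64{0.1, 0.25, 0.5, 1})

	// Hypothetical observations: two successes and one slow failure.
	observations := []struct {
		status  int
		seconds float64
	}{{200, 0.05}, {200, 0.2}, {500, 0.7}}

	for _, o := range observations {
		requests.inc(o.status)
		latency.observe(o.seconds)
	}
	fmt.Println("requests by status:", requests)
	fmt.Println("cumulative bucket counts:", latency.counts, "total:", latency.total)
}
```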

Descriptions for counters and histograms can be found on the Prometheus metric types page here.

We then start an HTTP server and serve our HTTP endpoints to the user.

prometheus

Our Prometheus instance is served at http://localhost:9090/. Prometheus is used as a mechanism to store and serve time series data that is created by applications. Alerts can be configured based on queries and are periodically evaluated.

An introduction to Prometheus can be found here.

alertmanager

Our Alertmanager instance is served at http://localhost:9093/. Alertmanager is a tool that routes and manages alerts from Prometheus to notification endpoints such as email, Slack, and PagerDuty. Based on the alerting criteria, we can notify a specific receiver and trigger whatever action we like. Examples of actions we can trigger through Alertmanager routes include sending automated emails, posting Slack notifications, paging, and restarting web servers through webhooks.

In our example, we are using Gmail to send an alert to an impacted party. More information about Alertmanager can be found here.

grafana

Our Grafana instance is served at http://localhost:3000/. Grafana is an open source graphing platform that is often used to visualize data through graphs and charts. More information about Grafana can be found here.

loadtest

Our load testing instance is a simple bash script that runs hey, a Go load test generator.  We make concurrent calls to our Go service endpoints using this tool.  You can see this bash script here.

Using our Example Application

Now that we understand what each of the containers in our application is used for, let’s give it a test!

To use the example project that has been created for this post, you’ll need to install Docker Engine 18.06.0 or higher.  Instructions to install Docker Engine for your Operating System can be found here.
Once you have Docker Engine installed, you can bring up our set of target applications by cloning the repository, cd’ing into the directory, and starting our project using docker compose:

[bob@blinky ~]$ git clone https://github.com/bobstrecansky/SLIsSLOsWhatAreThose
[bob@blinky ~]$ cd SLIsSLOsWhatAreThose/

For this example, we’ll also need to create a Gmail app password for our alerting. You don’t need to do this if you don’t want to see an example Alertmanager alert. To create this app password, you’ll need to visit https://myaccount.google.com/apppasswords and create a new app password. I’ve included a screenshot of the resulting screen below after creating an app password:


You’ll need to add the resulting credential in the Alertmanager configuration in your cloned repository.  

Note: make sure you don’t commit your password to your git repository; plaintext passwords shouldn’t be stored in git.

In the /SLIsSLOsWhatAreThose/alertmanager/alertmanager.yaml file, you’ll need to change the following bits of YAML. Set the to field to the email address you’d like to send the alert to; set the from, auth_username, and auth_identity fields to the email address that was used to set up the app password; and set auth_password to the app password that was created in the previous step.

to: EMAILADDRESS
from: EMAILADDRESS 
auth_username: EMAILADDRESS
auth_identity: EMAILADDRESS
auth_password: PASSWORD

Once you’ve done this, we are ready to start up our services.  We’ll bring up this service using docker-compose:

[bob@blinky SLIsSLOsWhatAreThose]$ docker-compose up -d --build

After you perform this action, you should be able to visit http://localhost:3000 on your machine. You’ll be able to log into the Grafana portal that appears with the following credentials:

Username: admin
Password: slislo


Once logged in, you’ll see a welcome screen like the one below.  


Click on the Home dropdown in the top right corner of the screen, and select the “RED Metrics Dashboard” dashboard in the menu on the left side of the screen.   


At this point we will see our RED Metrics Dashboard.  You may have to refresh the page using the refresh button in the top right corner of the screen, depending on how fast your data populated.  You can now see our graphs that show request rate, error rate, and latency distribution buckets:


The queries that are associated with these dashboards can be found within the dashboard by clicking the specific chart’s title and clicking edit.

The request count shows when there is an influx of requests in your system.  A sudden spike of traffic for a special event, a malicious attacker trying to DDoS your system, or an organic growth in traffic can all be monitored and alerted on with a request count monitor.

An error rate is helpful to determine if something in your system is failing.  The error rate can help to notice when incorrect or poorly performing code is introduced to the system or if you don’t have enough scale to handle all of your incoming requests in time.

The duration graph shows how long a request takes to process and return to an end user. Duration graphs are normally displayed in percentiles. The 50th percentile is the duration that half of all requests complete within, the 95th percentile is the duration that all but the slowest 5% complete within, and the 99th percentile is the duration that all but the slowest 1% complete within. Percentiles matter because they show how timings are distributed rather than collapsing them into a single average. You may have end users who get slow responses for a myriad of reasons, but if only a small percentage of your customers are seeing slow requests, it may not be as critical to fix as a slowdown that impacts a large percentage of them.

One Prometheus alerting rule is set up as an example to show how we can alert on particular behaviors. In our Prometheus alerting rules, we set a rule that sends an alert if more than 10% of responses are 500s for a duration of 1 minute. That expression can be found here. Our load test makes a batch of requests to the /error_response endpoint in order to trigger this alert.

If you take a look at the email account that you configured to receive our alert, you’ll notice that you’ve received an email with this alert:

Note: be patient, as this may take a couple of minutes to populate.


We can also see this alert appear in Alertmanager:


After your alert subsides (a couple of minutes later), you’ll also receive an email saying that the issue has been resolved:


These alerts are important, as they can help engineering teams take action on their services before they break their SLO (and in turn their SLA).  Rolling back bad code, scaling your service, or troubleshooting performance issues can all be direct actions taken from service level alerting.
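One way to judge how close a service is to breaking its SLO is to track how much of its error budget has been used. As a rough sketch, assuming the budget is simply the share of requests the SLO allows to fail, the calculation could look like the following in Go. The budgetConsumed function and the numbers are hypothetical, and the target is assumed to be strictly between 0 and 1.

```go
package main

import "fmt"

// budgetConsumed returns the fraction of the error budget used so far:
// 0 means untouched, 1 means exhausted, and above 1 means the SLO is broken.
// The target (for example 0.99) is assumed to be strictly between 0 and 1.
func budgetConsumed(total, failed int, target float64) float64 {
	if total == 0 {
		return 0
	}
	allowed := 1 - target // share of requests that may fail
	observed := float64(failed) / float64(total)
	return observed / allowed
}

func main() {
	// Hypothetical window: 10,000 requests, 50 failures, 99% SLO.
	used := budgetConsumed(10000, 50, 0.99)
	fmt.Printf("error budget used: %.0f%%\n", used*100)
}
```

An alert on budget consumption gives a team time to roll back or scale before the objective itself is missed.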

Summary

Requests, errors, and durations (as defined by the RED method) can help us to define SLIs and SLOs for our services. We can see how SLOs and SLIs can help to make a business case for how your app is performing. Understanding these patterns can help you to design and maintain a resilient system and give your business stakeholders confidence that it is running as expected. If you’d like to read more about SRE, Google’s SRE books are a great place to learn about how to implement SRE practices in your application.
