What is observability as code?

Observability as code (also known as o11y as code) is the process of automating the configuration of observability tools. You manage your infrastructure with code, so why not manage your observability the same way—and your dashboards as well? 

Why is observability as code important?

Observability as code brings several benefits to the field of observability, which is crucial for understanding and maintaining complex systems. Here are some reasons why observability as code is important:

Consistency and reproducibility

Defining observability configurations in code ensures consistency across different environments. This reduces the chances of errors caused by manual configuration discrepancies between development, testing, and production environments. It enables the recreation of observability setups in a predictable and reproducible manner. This is essential for troubleshooting and debugging issues that may arise in different environments.

Version control

Observability configurations are treated as code and can be stored in version control systems (e.g., Git), allowing teams to track changes, collaborate effectively, and roll back to previous configurations if needed. Version control also provides an audit trail, helping to understand when and why specific changes were made.

Automation

Observability as code facilitates the automation of configuration tasks. Automated processes ensure that configurations are applied consistently and quickly across the entire infrastructure. This is particularly valuable in dynamic and rapidly changing environments.

Infrastructure as Code (IaC) integration

Many organizations adopt Infrastructure as Code (IaC) practices to manage their infrastructure. Observability as code can seamlessly integrate with IaC tools, allowing observability configurations to be included in the same codebase as infrastructure definitions. This enhances the overall manageability and maintainability of the system.

Collaboration and documentation

Code serves as a form of documentation. Describing observability configurations in code helps teams have a clear and centralized source of information about the monitoring and logging setup. This aids in onboarding new team members and provides a shared understanding of how the system is observed.

Scalability

As systems grow in complexity, the manual management of observability configurations becomes increasingly challenging. Observability as code supports the scalability of observability practices by allowing teams to efficiently manage configurations for large and intricate systems.

Flexibility and agility

Code-based observability allows for more flexible and agile development processes. Changes to observability configurations can be made alongside code changes, ensuring that monitoring and logging are aligned with the evolving requirements of the application.

This three-part blog series is your guide to o11y as code, providing tips, examples, and guidance. In this series, we'll walk through examples of how you can automate the configuration of your observability tools, starting with dashboards here in part one. Part one covers the basics of Terraform, how to provision a sample app, and how to create dashboards as code.

By the end of the series, you'll have worked with a total of five examples of observability as code using New Relic and HashiCorp's Terraform: dashboards as code, alerts as code, synthetic monitoring as code, tags as code, and workloads as code. You'll be working with data from the sample FoodMe restaurant ordering app.

How did we get here? Infrastructure as code

Since infrastructure as code (also known as IaC) appeared on the scene more than a decade ago, it’s become a core requirement in the modern cloud era. The terminology “as code” means treating infrastructure configuration just like we treat code, pushing configuration into source control, then carefully pushing out changes again to the infrastructure layer.

With the rise of modern distributed systems, we also see more outages, and finding the root cause of the issue can be challenging when something goes wrong. Observability fits into the new paradigm because we need to determine the internal states of our systems from their outputs. Observability uses different system outputs such as tracing, logs, and metrics to understand the internal state of the distributed components, diagnose where the problems are, and get to the root cause.

Unfortunately, the operational practices we rely on didn’t change much, and developers and operations engineers might find they still look at hundreds of alerts or dashboards. Managing these by hand leads to non-repeatable, non-standardized dashboard configurations, and to alerts that are adjusted ad hoc to avoid alert fatigue. Both drift away from organizational best practices.

But we can use what we know about infrastructure as code to automate observability. Meet the new approach: observability as code, which treats observability configurations as code. As explained in Observability as code simplifies your life, observability as code represents a shift of intention to an auditable code-managed solution that reduces the work needed to maintain and develop a configuration. 

Understand the basics of Terraform

Terraform by HashiCorp is an infrastructure as code tool that you can use to define and manage infrastructure resources in human-readable configuration files. You can declaratively manage services and automate your changes to those services.

In most examples, a Terraform module is a set of Terraform configuration files in one directory. When you run Terraform commands directly from that single directory, it is considered the root module. Here's what it looks like, as shown in the Terraform docs:

.
├── LICENSE
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf

Terraform files used in this blog series

The examples in the tutorial exercises in this blog series focus on two important files:

  • The main.tf file contains the main set of configurations for your module. You can also create other configuration files and organize them in a way that makes sense for your project.
  • The variables.tf file contains the variable definitions for your module. If you want others to use your module, configure the variables as arguments in the module block.
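To make the second bullet concrete, here's a minimal sketch of a root module calling a child module and passing its variables as arguments. The module path, the module name, and the two variable names (app_name and dashboard_name) are hypothetical and only for illustration; they aren't part of the code samples in this series.

```hcl
# Root main.tf: call a child module and set its input variables.
# The source path and both variable names are hypothetical.
module "foodme_dashboards" {
  source = "./modules/dashboards"

  # Each argument here matches a variable block declared in the
  # child module's variables.tf file.
  app_name       = "FoodMe-Jan"
  dashboard_name = "FoodMe dashboards"
}
```

Anyone reusing the module only has to change these arguments; the module's own source code stays untouched.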

Example of a New Relic Terraform provider

Here’s an example of a New Relic provider in Terraform from Configuring the New Relic Terraform Provider.

# get the New Relic terraform provider
terraform {
  required_version = "~> 1.0"
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
    }
  }
}

# configure the New Relic provider
provider "newrelic" {
  account_id = <Your Account ID>
  api_key = <Your User API Key>    # usually prefixed with 'NRAK'
  region = "US"                    # Valid regions are US and EU
}

You can also use environment variables to configure the provider, which can simplify your provider block. The provider has key schema attributes, such as account_id, api_key, and region.
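As a rough sketch of the environment-variable approach, the provider block can be left empty when the New Relic provider's environment variables are set in your shell. This assumes you've exported NEW_RELIC_ACCOUNT_ID, NEW_RELIC_API_KEY, and NEW_RELIC_REGION before running Terraform; check the provider documentation for the exact names supported by your provider version.

```hcl
# Assumes these environment variables are exported before you run Terraform:
#   NEW_RELIC_ACCOUNT_ID  - your New Relic account ID
#   NEW_RELIC_API_KEY     - your User API key (usually prefixed with 'NRAK')
#   NEW_RELIC_REGION      - US or EU
provider "newrelic" {}
```

One benefit of this setup is that the API key never appears in a file that might be committed to source control.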

Terraform commands to remember

To initialize and run Terraform effectively, remember these four commands:

  • The terraform init command performs initialization steps to prepare the current working directory for use with Terraform. This command is safe to run multiple times, to update the working directory with configuration changes.
  • The terraform plan command creates an execution plan, which lets you preview the changes that Terraform will make to your infrastructure. You can use this command to check whether the proposed changes match what you expect before you apply the changes.
  • The terraform apply command creates an execution plan, prompts you to approve that plan, and then takes the indicated actions. Follow the prompts, and answer yes to apply the changes. Terraform will then provision the resources.
  • The terraform destroy command is a convenient way to remove all the remote objects managed by a particular Terraform configuration. Follow the prompts, and Terraform will delete all the resources.

For more information on Terraform commands, see Provisioning infrastructure with Terraform.

The examples in the next sections show key concepts in Terraform such as providers, data sources, and resources. You'll be automating the configuration of New Relic dashboards to view data from the sample FoodMe restaurant app.
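Providers and resources appear throughout the examples in this post, but data sources don't, so here's a small sketch of one. A data source reads an object that already exists instead of creating it. This example assumes the sample app reports to New Relic as an APM application named FoodMe-XXX (replace that with your own app name); the output name is a placeholder.

```hcl
# Look up an existing APM application in New Relic (read-only).
# "FoodMe-XXX" is a placeholder for your own app name.
data "newrelic_entity" "foodme" {
  name   = "FoodMe-XXX"
  domain = "APM"
  type   = "APPLICATION"
}

# Expose the entity GUID so other resources or modules can reference it.
output "foodme_entity_guid" {
  value = data.newrelic_entity.foodme.guid
}
```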

This blog post demo uses the newrelic_one_dashboard resource. As an alternative, if you want to use the newrelic_one_dashboard_json resource, see the Creating dashboards with Terraform and JSON templates tutorial.
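For comparison, a minimal sketch of the JSON-based alternative might look like the following. It assumes a dashboard definition has been saved next to your configuration as dashboard.json, which is a hypothetical file name; the resource label is also a placeholder.

```hcl
# Create a dashboard from a JSON definition instead of nested HCL blocks.
# dashboard.json is a hypothetical file in the same directory as this module.
resource "newrelic_one_dashboard_json" "from_json" {
  json = file("${path.module}/dashboard.json")
}
```

The JSON approach can be convenient when you already have a dashboard built in the UI, while newrelic_one_dashboard keeps every widget visible and reviewable in HCL.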

Before you begin provisioning your first Terraform module

For this tutorial, we’re going to provision a sample app. But before you provision your first Terraform module, you’ll need to get your account ID and your user key, and point to the correct data center.

This video walkthrough covers prerequisite work.

Provision the sample app

Before we work on implementing observability as code, let’s start by provisioning our sample app! 

1. Generate your unique URL for the FoodMe example app with this Glitch link: glitch.com/edit/#!/remix/nr-devrel-o11yascode

2. Set the environment variables. Go to .env and insert these values:

  • LICENSE_KEY: Insert your New Relic ingest license key.
  • APP_NAME: Append your name or initials to the app name in the form FoodMe-XXX (for example, FoodMe-Jan).

3. Preview your URL.

Go to Tools (bottom of the panel), and select Preview in a new window.

4. Record your URL.

Note your newly generated URL. You’ll use this later on in part two of the series for synthetic monitoring as code.

5. Generate some workloads. Now that you’re in the sample app, enter an example name and delivery address, then select Find Restaurants! After you are on the main page, click around to generate some workloads for the sample app. We'll need some data to look at in the dashboards.

Create dashboards as code

Now we're ready for our first observability as code example: dashboards as code. With New Relic custom dashboards, you can collect and visualize the specific data that you want to see and display in New Relic. You'll learn how to automate configuring dashboards in New Relic using Terraform.

There are three main steps. To see everything we are covering in this section, watch this video. For more details, go to Getting started with New Relic and Terraform. You can also work along with these steps with code samples in GitHub and the hands-on workshop in Instruqt.

In Terraform, each resource block describes one or more observability objects, such as dashboards, alerts, notification workflows, or workloads. We'll use examples from Resource: newrelic_one_dashboard:

1. Create a resource block and declare a type (newrelic_one_dashboard) with a given name (exampledash). Together, the type and the name form the identifier for the resource, so the combination must be unique within a module. Here's a simple example for deploying dashboards as code in New Relic, based on Resource: newrelic_one_dashboard.

# New Relic One Dashboard
resource "newrelic_one_dashboard" "exampledash" {
  # The title of the dashboard.
  name = "New Relic Terraform Example"

  # A nested block that describes a page
  page {
    # The name of the page.
    name = "New Relic Terraform Example"

    # A nested block that describes a Billboard widget
    widget_billboard {
      title  = "Requests per minute"
      row    = 1
      column = 1
      width  = 6
      height = 3

      # A nested block that describes a NRQL Query
      nrql_query {
        query = "FROM Transaction SELECT rate(count(*), 1 minute)"
      }
    }
  }
}

For more details on the available attributes, see the attribute reference for the newrelic provider in the Terraform documentation.

For more details on New Relic Query Language (NRQL), see syntax, clauses, and functions.

2. Next, you'll include a variables.tf file in Terraform. You can customize Terraform modules with input variables instead of modifying the source code of the module. Then it's easy to share and reuse modules across other configurations in Terraform. At the end of this section, you'll see an example variables.tf file.

3. Finally, you'll combine what we covered about the New Relic provider, the resources, the main.tf file, and the corresponding variables.tf file to deploy dashboards as code.
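To illustrate how an input variable customizes a dashboard without touching the resource itself, here's a minimal sketch. The variable app_name and the resource label variable_example are hypothetical and aren't used in the complete example files later in this post.

```hcl
# Hypothetical input variable; this block would normally live in variables.tf.
variable "app_name" {
  description = "Name of the APM application to chart"
  type        = string
  default     = "FoodMe-XXX"
}

# The dashboard title and the NRQL query both pick up the variable's value.
resource "newrelic_one_dashboard" "variable_example" {
  name = "Dashboard for ${var.app_name}"

  page {
    name = "Overview"

    widget_billboard {
      title  = "Requests per minute"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        query = "FROM Transaction SELECT rate(count(*), 1 minute) WHERE appName = '${var.app_name}'"
      }
    }
  }
}
```

Changing the default, or overriding it when you run Terraform, points the same dashboard definition at a different app.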

The next two example main.tf and variables.tf files use concepts described in Google Site Reliability Engineering, The Four Golden Signals: latency, traffic, errors, and saturation. These examples are based on code samples in the Getting Started with the New Relic Provider documentation.

Example main.tf file complete code

# get the New Relic terraform provider
terraform {
  required_version = "~> 1.0"
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
    }
  }
}

# configure the New Relic provider
provider "newrelic" {
  account_id = var.nr_account_id
  api_key    = var.nr_api_key    # usually prefixed with 'NRAK'
  region     = var.nr_region     # Valid regions are US and EU
}

# resource to create, update, and delete dashboards in New Relic
resource "newrelic_one_dashboard" "dashboard_name" {
  name = "O11y_asCode-FoodMe-Dashboards-TF"

  # determines who can see the dashboard in an account
  permissions = "public_read_only"

  page {
    name = "Dashboards as Code"

    widget_markdown {
      title = "Golden Signals - Latency"
      row = 1
      column = 1
      width = 4
      height = 3

      text = "## The Four Golden Signals - Latency\n---\n#### The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. \n\n#### For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. \n\n#### On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors."
    }

    widget_line {
      title = "Golden Signals - Latency - FoodMe - Line"
      row = 1
      column = 5
      width = 4
      height = 3

      nrql_query {
        query = "SELECT average(apm.service.overview.web) * 1000 as 'Latency' FROM Metric WHERE appName like '%FoodMe%' since 30 minutes ago TIMESERIES AUTO"
      }
    }

    widget_stacked_bar {
      title = "Golden Signals - Latency - FoodMe - Stacked Bar"
      row = 1
      column = 9
      width = 4
      height = 3

      nrql_query {
        query = "SELECT average(apm.service.overview.web) * 1000 as 'Latency' FROM Metric WHERE appName like '%FoodMe%' since 30 minutes ago TIMESERIES AUTO"
      }
    }

    widget_markdown {
      title = "Golden Signals - Errors"
      row = 4
      column = 1
      width = 4
      height = 3

      text = "## The Four Golden Signals - Errors\n---\n\n#### The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, \"If you committed to one-second response times, any request over one second is an error\").\n \n#### Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. \n\n#### Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content."
    }

    widget_area {
      title = "Golden Signals - Errors - FoodMe - Area"
      row = 4
      column = 5
      width = 4
      height = 3

      nrql_query {
        query = "SELECT (count(apm.service.error.count) / count(apm.service.transaction.duration))*100 as 'Errors' FROM Metric WHERE (appName like '%FoodMe%') AND (transactionType = 'Web') SINCE 30 minutes ago TIMESERIES AUTO"
      }
    }

    widget_billboard {
      title = "Golden Signals - Errors - FoodMe - Billboard Compare With"
      row = 4
      column = 9
      width = 4
      height = 3

      nrql_query {
        query = "SELECT (count(apm.service.error.count) / count(apm.service.transaction.duration))*100 as 'Errors' FROM Metric WHERE (appName like '%FoodMe%') AND (transactionType = 'Web') SINCE 30 minutes ago COMPARE WITH 30 minutes ago"
      }
    }

    widget_markdown {
      title = "Golden Signals - Traffic"
      row = 7
      column = 1
      width = 4
      height = 3

      text = "## The Four Golden Signals - Traffic\n---\n\n#### A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. \n\n#### For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). \n\n#### For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. \n\n#### For a key-value storage system, this measurement might be transactions and retrievals per second."
    }

    widget_table {
      title = "Golden Signals - Traffic - FoodMe - Table"
      row = 7
      column = 5
      width = 4
      height = 3

      nrql_query {
        query = "SELECT rate(count(apm.service.transaction.duration), 1 minute) as 'Traffic' FROM Metric WHERE (appName LIKE '%FoodMe%') AND (transactionType = 'Web') FACET path SINCE 30 minutes ago"
      }
    }

    widget_pie {
      title = "Golden Signals - Traffic - FoodMe - Pie"
      row = 7
      column = 9
      width = 4
      height = 3

      nrql_query {
        query = "SELECT rate(count(apm.service.transaction.duration), 1 minute) as 'Traffic' FROM Metric WHERE (appName LIKE '%FoodMe%') AND (transactionType = 'Web') FACET path SINCE 30 minutes ago"
      }
    }

    widget_markdown {
      title = "Golden Signals - Saturation"
      row = 10
      column = 1
      width = 4
      height = 3

      text = "## The Four Golden Signals - Saturation\n---\n\n#### How \"full\" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.\n\n#### In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., \"Give me a nonce\" or \"I need a globally unique monotonic integer\") that rarely change configuration, a static value from a load test might be adequate. \n\n#### As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.\n\n#### Finally, saturation is also concerned with predictions of impending saturation, such as \"It looks like your database will fill its hard drive in 4 hours.\""
    }

    widget_line {
      title = "Golden Signals - Saturation - CPU & Memory - Multi-Queries"
      row = 10
      column = 5
      width = 4
      height = 3

      nrql_query {
        query = "SELECT rate(sum(apm.service.cpu.usertime.utilization), 1 second) * 100 as 'cpuUsed' FROM Metric WHERE appName LIKE '%FoodMe%' SINCE 30 minutes ago TIMESERIES AUTO"
      }

      nrql_query {
        query = "SELECT average(apm.service.memory.physical) * rate(count(apm.service.instance.count), 1 minute) / 1000 as 'memoryUsed %' FROM Metric WHERE appName LIKE '%FoodMe%' SINCE 30 minutes ago TIMESERIES AUTO"
      }
    }

    widget_line {
      title = "Golden Signals - Saturation - Memory - Line Compare With"
      row = 10
      column = 9
      width = 4
      height = 3

      nrql_query {
        query = "SELECT average(apm.service.memory.physical) * rate(count(apm.service.instance.count), 1 minute) / 1000 as 'memoryUsed %' FROM Metric WHERE appName LIKE '%FoodMe%' SINCE 30 minutes ago COMPARE WITH 20 minutes ago TIMESERIES AUTO"
      }
    }
  }
}
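If you also keep an outputs.tf file, as in the module layout shown earlier, a small sketch like the following can print the dashboard's URL after terraform apply. It relies on the permalink attribute exported by the newrelic_one_dashboard resource; the output name dashboard_permalink is just a placeholder.

```hcl
# outputs.tf: show where the new dashboard lives once it has been created.
# "dashboard_name" matches the resource label used in the main.tf above.
output "dashboard_permalink" {
  description = "URL for viewing the dashboard in New Relic"
  value       = newrelic_one_dashboard.dashboard_name.permalink
}
```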

Example variables.tf file complete code

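# your unique New Relic account ID
variable "nr_account_id" {
  type    = string
  default = "XXXXX"
}

# your User API key (marked sensitive so Terraform hides it in plan output)
variable "nr_api_key" {
  type      = string
  default   = "XXXXX"
  sensitive = true
}

# valid regions are US and EU
variable "nr_region" {
  type    = string
  default = "US"
}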
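Rather than editing the defaults above, one option is to supply your values in a separate terraform.tfvars file, which Terraform loads automatically from the working directory. This is a sketch with placeholder values; keep the file out of version control because it contains your API key.

```hcl
# terraform.tfvars: values for the variables declared in variables.tf.
# All values below are placeholders; replace them with your own.
nr_account_id = "XXXXX"
nr_api_key    = "XXXXX"
nr_region     = "US"
```

Terraform also reads environment variables prefixed with TF_VAR_ (for example, TF_VAR_nr_api_key), which avoids writing the key to disk at all.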

What the final result looks like

Now that you have deployed dashboards as code, your final result should look like this in New Relic: