DataDog is an awesome SaaS monitoring platform. We have 100+ developers leveraging the platform to collect their metrics, create dashboards and send alerts.
As with anything, if you don't maintain and clean your tools, things can become a little messy after a while. Dashboards get named wildly different things with no standards. Alerts aren't deleted for decommissioned services. Team names change and alerts are suddenly pointing to the wrong Slack channel.
Something has to be done to improve the situation. Setting up some standards or rules around DataDog usage can help, but there is a fine line to walk between freedom and standardization. Be too strict or harsh on people and they no longer find the tool nice to use, instead thinking of it as a pain in the ass.
Take too much freedom away, and you get “shadow IT” situations with people using their own tools or going their own way.
With this in mind, I set out to manage DataDog from code. Here is what this post will cover:
- Deciding on Terraform
- Terraform with DataDog Basics
- Repository Structure & Separation of Concerns
- Defining Applications and Teams
- Running Terraform
- Setting Environment Variables for Terraform
- Types of DataDog Monitors
- Wrapping Up
Deciding on Terraform
There were a few options around for managing DataDog from code:
- DogPush - Manage DataDog monitors in YAML
- Barkdog - Manage DataDog monitors using Ruby DSL, and updates monitors according to DSL
- Interferon - A Ruby gem enabling you to store your alerts configuration in code
- DogWatch - Ruby gem designed to provide a simple method for creating DataDog monitors in Ruby
- Ansible DataDog Monitor Module - Manages monitors within Datadog via Ansible
- Terraform DataDog Provider - Supports creating monitors, users, timeboards and downtimes
I ended up deciding to go with Terraform, mainly for two reasons:
- Timeboards can be created using the same Terraform DSL and process as monitors.
- Terraform is far more widely used, so from a "googling of problems" perspective it is much easier to find answers (the same goes for Ansible).
Terraform with DataDog Basics
The DataDog Blog recently published a post called Managing Datadog with Terraform.
It covers the basics and gives you an introduction to Terraform. Once you've had a read, head back over to this post for some more in-depth usage.
Repository Structure & Separation of Concerns
With the workflow in mind, I set up a repository structure with one directory per application.
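Here's a sketch of the layout (the application names other than `mssql` are illustrative):

```
datadog-terraform/
├── mssql/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── vars.tf
└── webapp/
    ├── main.tf
    ├── terraform.tfvars
    └── vars.tf
```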
With this structure, you would run the `terraform` commands from inside each application's directory.
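For example, from the `mssql` directory:

```sh
cd mssql/
terraform plan
```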
The `terraform.tfstate` file will get stored in the application's directory, which means each application will have its own state file.
The reason for this is "separation of concerns", or reducing your "blast radius". If you have 100 apps and someone makes a mistake in one of them, you don't want Terraform to nuke the other 99 apps and screw up their configuration or state.
Defining Applications and Teams
Now that we have our repository structure, let's zoom in on a specific application, for example `mssql`.
terraform.tfvars
`terraform.tfvars` is the standard file name for Terraform variable values. We will want to use these variables all over the rest of our configuration.
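A sketch of what ours contains for `mssql` (the owner and notify values are illustrative):

```hcl
# terraform.tfvars - variable values for the mssql application
application = "mssql"
owner       = "team-data"
notify      = ["@slack-team-data", "@pagerduty-dba"]
```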
vars.tf
`vars.tf` is the standard file name for Terraform input variable declarations. This is where we define which variables are allowed to be passed into `main.tf`, which creates the resources.
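A minimal sketch of those declarations, matching the variables used in this post:

```hcl
# vars.tf - input variable declarations
variable "datadog_api_key" {}
variable "datadog_app_key" {}

variable "application" {}
variable "owner" {}

variable "notify" {
  type = "list"
}
```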
When you run Terraform, it will automatically find the `terraform.tfvars` file and use all the variables it knows about. Terraform will then prompt you to input any declared variables that haven't been given a value. You can also set Terraform variables using environment variables or pass them in on the command line. More details on variables in Terraform can be found here.
main.tf
`main.tf` is where the actual Terraform resources go.
This file will contain:
- The `provider` block for DataDog, to which we need to pass `api_key` and `app_key`
- The `datadog_monitor` resources which will create our actual monitors
You can find the DataDog Terraform Provider documentation here.
Here is the full file:
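A sketch of how that comes together (the monitor names and messages are illustrative, and the second query assumes a `mssql-datawarehouse` role tag):

```hcl
# main.tf - DataDog provider plus two example monitors
provider "datadog" {
  api_key = "${var.datadog_api_key}"
  app_key = "${var.datadog_app_key}"
}

resource "datadog_monitor" "common_free_disk" {
  name    = "[${var.owner}] ${var.application} - Free Disk Space (common)"
  type    = "metric alert"
  message = "Disk usage is over 75%. Notify: ${join(" ", var.notify)}"
  query   = "avg(last_1h):system.disk.in_use{role:mssql-common} by {device,host} > 0.75"
}

resource "datadog_monitor" "datawarehouse_free_disk" {
  name    = "[${var.owner}] ${var.application} - Free Disk Space (datawarehouse)"
  type    = "metric alert"
  message = "Disk usage is over 75%. Notify: ${join(" ", var.notify)}"
  query   = "avg(last_1h):system.disk.in_use{role:mssql-datawarehouse} by {device,host} > 0.75"
}
```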
The main concepts in the `main.tf` file:
- We pass in `datadog_api_key` and `datadog_app_key` via the Terraform command line so we don't have these in our git repository.
- The application owner and application name are pulled from the variables provided in `terraform.tfvars`. This means if the team that owns the application changes, we can update it once inside `terraform.tfvars` and it updates across all of our checks.
- As we decided `notify` would be a list (an array), we use one of Terraform's built-in interpolation functions, `join`. This joins each item in the list with a space and puts it inside the message so DataDog can notify multiple destinations.
- We give each `datadog_monitor` resource a unique name (e.g. `common_free_disk` and `datawarehouse_free_disk`). This is how Terraform keeps track of each resource, allowing us to change things like the DataDog monitor `name` later.
Running Terraform
Now that we have our files set up, we can run Terraform.
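Passing the keys in on the command line keeps them out of the repository (the key values here are placeholders):

```sh
terraform plan \
  -var "datadog_api_key=xxxxxxxx" \
  -var "datadog_app_key=xxxxxxxx"
```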
Terraform will now tell you what actions will be taken against DataDog.
If you are happy with what it is going to do:
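```sh
# apply the planned changes, again passing in the keys (placeholders)
terraform apply \
  -var "datadog_api_key=xxxxxxxx" \
  -var "datadog_app_key=xxxxxxxx"
```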
With that, you should now have your monitors created in DataDog.
Setting Environment Variables for Terraform
If you don't want to pass in the `datadog` variables each time, you can set the following environment variables:
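Terraform reads any environment variable named `TF_VAR_<variable name>` as the value for that variable:

```sh
# picked up automatically by Terraform (values are placeholders)
export TF_VAR_datadog_api_key="xxxxxxxx"
export TF_VAR_datadog_app_key="xxxxxxxx"
```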
Types of DataDog Monitors
DataDog provides many types of monitors you can create, including `host`, `metric`, `process` and more.
Creating monitors for all of them via Terraform requires knowing the query behind the monitor. These queries match up with the DataDog Monitor API.
Here are a few examples:
- Type: `metric alert`
  - Query: `avg(last_1h):system.disk.in_use{role:mssql-common} by {device,host} > 0.75`
- Type: `service check`
  - Query: `'process.up'.over('role:sensu_server','process:redis-server').by('host','process').last(2).count_by_status()`
- Type: `query alert`
  - Query: `avg(last_2h):anomalies(sum:order.count{environment:production}.as_rate(),'adaptive', 2, direction='below') >= 0.5`
- Type: `service check`
  - Query: `'http.can_connect'.over('environment:production','url:http://www.google.com').last(4).count_by_status()`
  - Thresholds: `thresholds { critical = 3 }`
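To show how one of these maps into Terraform, here's the HTTP check above sketched as a `datadog_monitor` resource (the resource name and message are illustrative):

```hcl
resource "datadog_monitor" "google_http_check" {
  name    = "[${var.owner}] ${var.application} - Google HTTP Check"
  type    = "service check"
  message = "Cannot connect to google.com. Notify: ${join(" ", var.notify)}"
  query   = "'http.can_connect'.over('environment:production','url:http://www.google.com').last(4).count_by_status()"

  # alert after 3 consecutive failed checks
  thresholds {
    critical = 3
  }
}
```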
Wrapping Up
Terraform is an awesome way to automate your infrastructure and services from code. Using Terraform to provision DataDog makes it easy to standardize, re-use and update your monitors quickly.
The most important part of using Terraform is the upfront planning. This entails splitting resources into logical groups so the blast radius is small if something does explode.
I created a datadog-terraform-example repository with the code from this blog to get you started.
Good luck automating your DataDog!