Understanding the performance of your infrastructure is extremely important, especially when running production systems. There is nothing worse than a customer calling to say they are experiencing slowness with one of their applications while you have no idea where to start looking.
In the 2014 State of DevOps survey, one of the questions asked was “How is your organization notified of failure?”
Here was the multiple choice question asked:
Through the survey, one of the top practices that correlated with performant IT teams was Monitor system and application health:
Logging and monitoring systems make it easy to detect failures and identify the events that contributed to them. Proactive monitoring of system health based on threshold and rate-of-change warnings enables us to preemptively detect and mitigate problems.
If you want some more information about performant DevOps teams and the methods they used to test teams, I recommend the talk What We Learned from Three Years of Sciencing the Crap Out of DevOps.
Monitoring performance counters on Windows in any centrally managed way has always been tricky. In 2014 I wrote a PowerShell module called Graphite-PowerShell-Functions to send performance counters to Graphite, which turned out to be pretty popular.
Thankfully, things are getting easier. Let’s take a look at using InfluxDB to store our metrics, Telegraf to transmit the metrics and Grafana to display them.
By the end of the article, you should be able to make a dashboard that looks something like this:
Requirements
You will need a Linux machine which will host the InfluxDB and Grafana installations. I will be using Ubuntu 14.04 x64 for this.
Preparing the Ubuntu Machine
There is nothing special that needs to be performed on the Ubuntu server before installing InfluxDB or Grafana. Just make sure all the packages are up to date, for example by running sudo apt-get update followed by sudo apt-get upgrade.
UTC Time
The other thing I would recommend is setting the time zone of the Ubuntu server to UTC. It is a good idea to standardize on UTC as the time zone for all your metrics. InfluxDB uses UTC so stick to it.
(You can read about some of the struggles when you don’t use UTC here).
Install InfluxDB
InfluxDB is an open source distributed time series database with no external dependencies. It’s useful for recording metrics, events, and performing analytics. I recommend having a read of the key concepts of InfluxDB over at their documentation page.
Let’s download and install the InfluxDB .deb
InfluxDB listens on 2 main ports:
- TCP port 8083 is used for InfluxDB’s Admin panel
- TCP port 8086 is used for client-server communication over InfluxDB’s HTTP API
Once installed, go to http://Your-Linux-Server-IP:8083 in the browser and confirm you can access the InfluxDB admin panel:
Install Grafana
Grafana is a beautiful, open source metrics dashboard and graph editor. It can read data from multiple sources, for example Graphite, Elasticsearch and OpenTSDB, as well as InfluxDB. Take a look at the Grafana live demo site to see what it can do.
First we will download and install the Grafana .deb. You can find the latest version over at http://grafana.org/download/.
Grafana’s web interface listens on TCP port 3000 by default.
Go to http://Your-Linux-Server-IP:3000 in the browser and confirm you can access the Grafana login page:
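If you prefer to check from a script rather than a browser, the three ports mentioned so far can be probed with a plain TCP connection. This is only a rough sketch: the function and dictionary names are my own, and a successful connection only tells you that something is listening, not that the service is healthy.

```python
import socket

# Ports as described above: InfluxDB admin panel, InfluxDB HTTP API, Grafana.
SERVICE_PORTS = {
    "InfluxDB admin panel": 8083,
    "InfluxDB HTTP API": 8086,
    "Grafana web interface": 3000,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_services(host):
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICE_PORTS.items()}
```

Calling check_services with the address of your Linux server returns a dictionary you can print or act on. A False for port 8083 or 8086 points at InfluxDB, and a False for port 3000 points at Grafana or a firewall in between.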
Telegraf
Telegraf is an agent written in Go for collecting metrics from the system it’s running on, or from other services, and writing them into InfluxDB or other outputs.
We will be using the win_perf_counters plugin for telegraf to collect Windows performance counters and send them over to InfluxDB. More information on the plugin can be found at the telegraf GitHub page.
Install the Telegraf Client
As the Windows agent is still in an experimental phase, head over to its GitHub page at https://github.com/influxdata/telegraf to grab the latest version.
At the time of writing the latest version could be found at http://get.influxdb.org/telegraf/telegraf-0.11.1-1_windows_amd64.zip.
Extract the zip file into a directory - I used C:\telegraf.
Inside you will see 2 files:
- telegraf.exe - this is the application. It is written in Go which compiles nicely into a single .exe file
- telegraf.conf - all the configuration options for telegraf
Configure Telegraf
Basic Configuration
Open the telegraf.conf file in a text editor - I would recommend one which supports TOML syntax highlighting such as Atom.
The Windows version of telegraf has a configuration file setup to collect some common Windows performance counters by default, so we do not need to change very much for it to work.
The first thing we will change is the collection interval. This is how often the performance counters will be read. I am setting mine to 5 seconds. This configuration option is under the [agent] section:
Next, under the [[outputs.influxdb]] section, we need to update the urls option to point to our InfluxDB server at http://Your-Linux-Server-IP:8086.
Deciding What To Capture
As this is a Hyper-V server, I wanted to collect some Hyper-V specific metrics. I found two articles, a post by Ben Armstrong about Dynamic Memory Performance Counters with Hyper-V and Measuring Performance on Hyper-V on MSDN.
These were the parts that stuck out from the articles:
Use the following rule of thumb when measuring disk latency on the Hyper-V host operating system using the “\Logical Disk(*)\Avg. Disk sec/Read” or “\Logical Disk(*)\Avg. Disk sec/Write” performance monitor counters:
1ms to 15ms = Healthy
15ms to 25ms = Warning or Monitor
26ms or greater = Critical, performance will be adversely affected
and
My favorite performance counter is the “Average Pressure” counter under the “Hyper-V Dynamic Memory Balancer” category. This gives you a very simple view of the overall memory allocation of your system
As long as this number is under 100, you know that there is enough memory in your system to service your virtual machines. Ideally this value should be at 80 or lower. The closer this gets to 100, the closer you are to running out of memory. Once this number goes over 100 then you can pretty much guarantee that you have virtual machines that are paging in the guest operating system.
Depending on the type of server you are trying to monitor, you will want to do the same and research a few important performance counters you should be keeping an eye on.
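The two rules of thumb quoted above are easy to turn into code if you want to reuse them outside of a dashboard. The sketch below is one possible reading of them. The function names and status labels are made up for illustration, and the handling of boundary values (exactly 15 ms, the gap between 25 ms and 26 ms, readings under 1 ms, a pressure of exactly 100) is my assumption because the quotes do not spell it out.

```python
def disk_latency_status(latency_ms):
    """Classify a disk latency reading in milliseconds."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    # Assumption: readings below 1 ms and exactly 15 ms count as healthy.
    if latency_ms <= 15:
        return "Healthy"
    if latency_ms <= 25:
        return "Warning"
    # Assumption: anything above 25 ms is treated as critical.
    return "Critical"


def memory_pressure_status(average_pressure):
    """Classify the Hyper-V Dynamic Memory Balancer Average Pressure value."""
    if average_pressure <= 80:
        return "Ideal"
    if average_pressure < 100:
        return "Getting close"
    # Assumption: exactly 100 is grouped with values over 100.
    return "Guests likely paging"
```

The same cut-off values are the ones you would later enter as thresholds on a dashboard panel.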
Adding Additional Counters
Now that we have worked out what needs to be monitored, let’s add the counters to the configuration file.
First we will add \LogicalDisk(*)\Avg. Disk sec/Read and \LogicalDisk(*)\Avg. Disk sec/Write.
The configuration file already includes LogicalDisk monitoring, so we just need to add Avg. Disk sec/Write and Avg. Disk sec/Read into the Counters array for LogicalDisk in its section of the file.
After doing this, the configuration for the LogicalDisk counters looks like this:
Next, we want to add the Hyper-V Dynamic Memory Balancer counter. I wasn’t sure of its full path, so I used PowerShell to find it:
From here I found the full counter path was \Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure (JSON adds the double slashes). This was added to the configuration file:
Save the telegraf.conf file.
To run telegraf, open a shell and start it with the following command:
If all went well you should see telegraf starting to collect your metrics and send them over to InfluxDB.
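One way to confirm that data is really arriving is to ask InfluxDB which measurements it knows about through its HTTP API on port 8086. The following is a minimal sketch, assuming the database is called telegraf and that no InfluxDB credentials have been configured. The helper names are my own, and the sample response in the comment is only an illustration of the JSON layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, database, query):
    """Build a URL for InfluxDB's /query endpoint."""
    params = urlencode({"db": database, "q": query})
    return base_url.rstrip("/") + "/query?" + params


def measurement_names(response_body):
    """Extract measurement names from a SHOW MEASUREMENTS JSON response.

    Illustrative layout:
    {"results": [{"series": [{"name": "measurements",
                              "columns": ["name"],
                              "values": [["win_cpu"], ["win_disk"]]}]}]}
    """
    payload = json.loads(response_body)
    names = []
    for result in payload.get("results", []):
        for series in result.get("series", []):
            names.extend(row[0] for row in series.get("values", []))
    return names


def list_measurements(base_url, database="telegraf", timeout=5.0):
    """Query a live InfluxDB server and return its measurement names."""
    url = build_query_url(base_url, database, "SHOW MEASUREMENTS")
    with urlopen(url, timeout=timeout) as response:
        return measurement_names(response.read().decode("utf-8"))
```

Calling list_measurements with http://Your-Linux-Server-IP:8086 should return names such as win_cpu once Telegraf has written its first batch. An empty list usually means nothing has been written to that database yet.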
Troubleshooting
If you get an error saying 2016/03/28 19:48:01 toml: line 1: parse error, this is because you used plain old Notepad and its line endings broke things. Use a real text editor!
Installing Telegraf as a service
If you are happy with how Telegraf is functioning, you can install it as a service so it starts itself when the system reboots. Follow the instructions here.
Viewing the Data in Grafana
Now that you have some metrics being sent into InfluxDB, you can use Grafana to view them.
Open up http://Your-Linux-Server-IP:3000 and log in using the default credentials:
- Username: admin
- Password: admin
Configure a Data Source
Grafana needs to have a data source added so it knows where to look for the metrics.
Click on Data Sources on the left and then Add new at the top.
Choose the type InfluxDB 0.9.x for the data source and enter the URL for InfluxDB. Keep in mind that Grafana is running on the same box as InfluxDB, so you can just use http://localhost:8086.
Keep access as proxy.
The default database for the telegraf agent is telegraf. The Grafana form will not let you save unless you enter a User and Password, so just enter something random as we have not configured any InfluxDB credentials.
Create a Dashboard
To display our data, we will need to create a dashboard. Select Home from the top menu and click New.
Add a Graph
In the new dashboard page you will see a little green rectangle over on the left. Click it and choose Add Panel > Graph.
Click on the Metrics tab, and down on the bottom right of the page is the data source dropdown. Choose the data source we added, called InfluxDB.
In the data selection section, choose From win_cpu and match the rest of the fields up to the image below to get a graph of the CPU usage.
You can read more about querying data from InfluxDB in Grafana in the Grafana docs.
Next, click on the General tab and enter a name for the graph.
Head over to the Axes & Grid tab. There are a ton of options here. As this is a graph to show CPU usage of one or more Hyper-V servers, I chose to set it up as follows:
- As we are looking at the % Processor Time performance counter, set the Left Y Unit to be percent (0-100).
- Set some threshold levels - these just give a nice visual representation of when you should be worried about the graph entering the danger zone.
- You can also display additional values under the graph next to your metrics. In this example I enabled Min, Max and Avg.
Save the Dashboard
Click Back to dashboard and then up the top of the page, choose the Cog icon > Settings.
Give the dashboard a name and save it - I chose Hyper-V Dashboard and entered the hyper-v tag.
Create a Table
I added a Table panel to track disk latency on the Hyper-V server:
The query that I used for this was as follows:
You will notice I used a math function and multiplied the performance counter by 1000. As this performance counter records in seconds with millisecond precision, I had to multiply by 1000 to get a millisecond value for the counter.
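The conversion itself is nothing more than a multiplication, but it is worth seeing with numbers. In this small sketch the readings are hypothetical values of the kind the disk latency counters might report, and the function name is my own.

```python
def seconds_to_milliseconds(seconds):
    """Convert a counter value recorded in seconds to milliseconds."""
    return seconds * 1000


# Hypothetical readings in seconds, keyed by disk instance.
sample_readings = {"C:": 0.004, "D:": 0.021}

for disk, seconds in sample_readings.items():
    print(f"{disk} {seconds_to_milliseconds(seconds):.1f} ms")
```

Without the multiplication, a perfectly normal reading of a few milliseconds would show up in the table as a tiny fraction like 0.004, which is hard to compare against thresholds expressed in milliseconds.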
From there I went to the Options tab and set the Unit value to milliseconds (ms) and set the thresholds that were recommended by Microsoft.
Create a Single Value Display
Finally I added a Single Value panel to track Hyper-V memory pressure.
The query that I used for this was as follows:
I then went to the Options tab and set the Postfix of the metric to be avg pressure. I also enabled Background coloring and set the Thresholds as recommended by Ben’s blog post.
Wrapping Up
InfluxDB and Telegraf provide an excellent and simple way to ship Windows performance counters off the server, and Grafana lets us display these metrics in beautiful dashboards.
Hopefully this starts you on your journey to graphing performance data for your systems.
Keep an eye out for another post shortly which will discuss some more advanced usage including using annotations on the graphs so you can correlate events in your infrastructure to system performance.