Understanding the performance of your infrastructure is extremely important, especially when running production systems. There is nothing worse than a customer calling to say they are experiencing slowness with one of their applications when you have no idea where to start looking.
In the 2014 State of DevOps survey, one of the questions asked was: how is your organization notified of failure?
Here was the multiple choice question asked:
Through the survey, one of the top practices that correlated with performant IT teams was Monitor system and application health:
Logging and monitoring systems make it easy to detect failures and identify the events that contributed to them. Proactive monitoring of system health based on threshold and rate-of-change warnings enables us to preemptively detect and mitigate problems.
If you want some more information about performant DevOps teams and the methods they used to test teams, I recommend the talk What We Learned from Three Years of Sciencing the Crap Out of DevOps.
Monitoring performance counters on Windows in any centrally managed way has always been tricky. In 2014 I wrote a PowerShell module called Graphite-PowerShell-Functions to send performance counters to Graphite, which turned out to be pretty popular.
By the end of the article, you should be able to make a dashboard that looks something like this:
- Install InfluxDB
- Install Grafana
- Viewing the Data in Grafana
- Wrapping Up
You will need a Linux machine which will host the InfluxDB and Grafana installations. I will be using Ubuntu 14.04 x64 for this.
Preparing the Ubuntu Machine
There is nothing special that needs to be performed on the Ubuntu server before installing InfluxDB or Grafana. Just make sure all the packages are up to date:
The other thing I would recommend is setting the time zone of the Ubuntu server to UTC. It is a good idea to standardize on UTC as the time zone for all your metrics. InfluxDB uses UTC so stick to it.
(You can read about some of the struggles when you don’t use UTC here).
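To make the UTC recommendation concrete, here is a minimal Python sketch that normalizes a timezone-aware timestamp to UTC before it is recorded. The function name `to_utc` and the sample timestamp are made up for illustration; nothing here is part of InfluxDB or telegraf.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Convert a timezone-aware timestamp to UTC.

    Naive timestamps are rejected, because without an offset there is
    no reliable way to know which zone they were recorded in.
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc)


# Hypothetical example: a reading taken on a machine running at UTC+13.
local_reading = datetime(2016, 3, 28, 19, 48, 1,
                         tzinfo=timezone(timedelta(hours=13)))
print(to_utc(local_reading).isoformat())  # 2016-03-28T06:48:01+00:00
```

If every machine stores and sends UTC, no conversion like this is needed at graphing time, which is the point of standardizing.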
InfluxDB is an open source distributed time series database with no external dependencies. It’s useful for recording metrics, events, and performing analytics. I recommend having a read of the key concepts of InfluxDB over at their documentation page.
Let’s download and install InfluxDB:
InfluxDB listens on 2 main ports:
- TCP port 8083 is used for InfluxDB’s Admin panel
- TCP port 8086 is used for client-server communication over InfluxDB’s HTTP API
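If the admin panel does not load, it can help to check whether the two ports are reachable at all from another machine. The following is a minimal Python sketch, assuming only that the ports listed above are in use; the helper name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `port_is_open("Your-Linux-Server-IP", 8083)` and `port_is_open("Your-Linux-Server-IP", 8086)` from the Windows machine should both return `True` once InfluxDB is running. If they return `False`, a firewall between the two hosts is one likely cause.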
Once installed, go to http://Your-Linux-Server-IP:8083 in the browser and confirm you can access the InfluxDB admin panel:
Grafana is a beautiful open source metrics dashboard and graph editor. It can read data from multiple sources, for example Graphite, Elasticsearch and OpenTSDB, as well as InfluxDB. Take a look at the Grafana live demo site to see what it can do.
First we will download and install the Grafana .deb. You can find the latest version over at http://grafana.org/download/.
Grafana’s web interface listens on TCP port 3000 by default.
Go to http://Your-Linux-Server-IP:3000 in the browser and confirm you can access the Grafana web interface:
Telegraf is an agent written in Go for collecting metrics from the system it’s running on, or from other services, and writing them into InfluxDB or other outputs.
We will be using the win_perf_counters plugin for telegraf to collect Windows performance counters and send them over to InfluxDB. More information on the plugin can be found at the telegraf GitHub page.
Install the Telegraf Client
As the Windows agent is still in an experimental phase, head over to its GitHub page at https://github.com/influxdata/telegraf to grab the latest version.
At the time of writing the latest version could be found at http://get.influxdb.org/telegraf/telegraf-0.11.1-1_windows_amd64.zip.
Extract the zip file into a directory, I used
Inside you will see 2 files:
- telegraf.exe - this is the application. It is written in Go, which compiles nicely into a single executable
- telegraf.conf - all the configuration options for telegraf
The Windows version of telegraf has a configuration file setup to collect some common Windows performance counters by default, so we do not need to change very much for it to work.
The first thing we will change is the collection interval. This is how often the performance counters will be read. I am setting mine to 5 seconds. This configuration option is under the
Next, under the [[outputs.influxdb]] section, we need to update the urls option to point to our InfluxDB server at
Deciding What To Capture
As this is a Hyper-V server, I wanted to collect some Hyper-V specific metrics. I found two articles, a post by Ben Armstrong about Dynamic Memory Performance Counters with Hyper-V and Measuring Performance on Hyper-V on MSDN.
These were the parts that stuck out from the articles:
Use the following rule of thumb when measuring disk latency on the Hyper-V host operating system using the “\Logical Disk(*)\Avg. Disk sec/Read” or “\Logical Disk(*)\Avg. Disk sec/Write” performance monitor counters:
1ms to 15ms = Healthy
15ms to 25ms = Warning or Monitor
26ms or greater = Critical, performance will be adversely affected
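The rule of thumb above can be written down as a small function. This is only a sketch: the function name and the labels are made up, and the boundary handling is an assumption, because the quoted ranges overlap at 15ms and leave a gap between 25ms and 26ms. Here anything under 15ms is healthy, 15ms to 25ms inclusive is a warning, and anything above 25ms is critical. The counter reports seconds, so the value is converted to milliseconds first.

```python
def classify_disk_latency(avg_disk_sec: float) -> str:
    """Classify an 'Avg. Disk sec/Read' or 'Avg. Disk sec/Write' sample.

    avg_disk_sec is the raw counter value, in seconds.
    Boundary handling is an assumption, not part of the quoted guidance.
    """
    latency_ms = avg_disk_sec * 1000.0  # counter is in seconds
    if latency_ms < 15:
        return "healthy"
    if latency_ms <= 25:
        return "warning"
    return "critical"


# Hypothetical samples: 4ms, 20ms and 40ms.
for sample in (0.004, 0.020, 0.040):
    print(sample, classify_disk_latency(sample))
```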
My favorite performance counter is the “Average Pressure” counter under the “Hyper-V Dynamic Memory Balancer” category. This gives you a very simple view of the overall memory allocation of your system.
As long as this number is under 100, you know that there is enough memory in your system to service your virtual machines. Ideally this value should be at 80 or lower. The closer this gets to 100, the closer you are to running out of memory. Once this number goes over 100 then you can pretty much guarantee that you have virtual machines that are paging in the guest operating system.
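The guidance on Average Pressure can be sketched the same way. The function name and labels below are made up for illustration, and treating a value of exactly 100 as critical is an assumption, since the quoted text only describes values under and over 100.

```python
def classify_memory_pressure(average_pressure: float) -> str:
    """Classify a 'Hyper-V Dynamic Memory Balancer\\Average Pressure' sample.

    80 or lower is the ideal range, values below 100 still leave enough
    memory, and 100 or more (assumption for exactly 100) means guests
    are likely paging.
    """
    if average_pressure <= 80:
        return "ideal"
    if average_pressure < 100:
        return "watch"
    return "critical"


# Hypothetical samples.
for sample in (60, 90, 120):
    print(sample, classify_memory_pressure(sample))
```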
Depending on the type of server you are trying to monitor, you will want to do the same and research a few important performance counters you should be keeping an eye on.
Adding Additional Counters
Now that we have worked out exactly what needs to be monitored, let’s add the counters to the configuration file.
First we will add \Logical Disk(*)\Avg. Disk sec/Read and \Logical Disk(*)\Avg. Disk sec/Write.
The configuration file already includes LogicalDisk monitoring, so we just need to add Avg. Disk sec/Write and Avg. Disk sec/Read into the Counters array for LogicalDisk in its section of the file.
After doing this, the configuration for the LogicalDisk counters looks like this:
Next, we want to add the Hyper-V Dynamic Memory Balancer counter. I wasn’t sure of its full path, so I used PowerShell to find it:
From here I found the full counter path was \Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure (JSON adds the double slashes). This was added to the configuration file:
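On the double slashes: a backslash is an escape character inside a JSON string, so each literal backslash in a counter path is written twice when the path is encoded as JSON. A small Python sketch shows the effect. Python is used here only for illustration and is not part of the telegraf setup.

```python
import json

# The counter path as Windows reports it: two literal backslashes.
counter_path = r"\Hyper-V Dynamic Memory Balancer(System Balancer)\Average Pressure"

# Encoding as JSON doubles every backslash in the textual form.
encoded = json.dumps(counter_path)
print(encoded)

# Decoding gives back the original path with single backslashes.
decoded = json.loads(encoded)
print(decoded == counter_path)  # True
```

The doubled form is only how the path is spelled inside a quoted, escaped string. The counter itself still has single backslashes.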
To run telegraf, open and then we will start telegraf with the following command:
If all went well you should see telegraf starting to collect your metrics and send them over to InfluxDB.
If you get an error saying 2016/03/28 19:48:01 toml: line 1: parse error, this is because you used plain old Notepad and its line endings broke things. Use a real text editor!
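If you have already saved the file with Notepad, a quick check can show what it did to the bytes. The sketch below is an assumption-laden illustration: it looks for Windows line endings (CRLF) and also for a UTF-8 byte-order mark, which Notepad can add at the start of a file. I have not confirmed which of the two the parser objects to. The function names are made up.

```python
import codecs


def diagnose_config(data: bytes) -> dict:
    """Report byte-level features that commonly trip up config parsers."""
    return {
        "has_utf8_bom": data.startswith(codecs.BOM_UTF8),
        "has_crlf": b"\r\n" in data,
    }


def clean_config(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n")


# Hypothetical file content, as Notepad might have saved it.
sample = codecs.BOM_UTF8 + b'key = "value"\r\n'
print(diagnose_config(sample))
print(clean_config(sample))
```

To use it on the real file, read telegraf.conf in binary mode, pass the bytes through `clean_config`, and write the result back, keeping a copy of the original first.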
Installing Telegraf as a service
If you are happy with how Telegraf is functioning, you can install it as a service so it starts itself when the system reboots. Follow the instructions here.
Viewing the Data in Grafana
Now that you have some metrics being sent into InfluxDB, you can use Grafana to view them.
Go to http://Your-Linux-Server-IP:3000 and log in using the default credentials:
Configure a Data Source
Grafana needs to have a data source added so it knows where to look for the metrics.
Click Data Sources on the left and then Add new at the top.
Choose the type InfluxDB 0.9.x for the data source and enter the URL for InfluxDB. Keep in mind that Grafana is running on the same box as InfluxDB, so you can just use
Keep access as
The default database for the telegraf agent is telegraf. The Grafana form will not let you save unless you enter a User and Password, so just enter something random, as we have not configured any InfluxDB credentials.
Create a Dashboard
To display our data, we will need to create a dashboard. Select Home from the top menu and click
Add a Graph
In the new dashboard page you will see a little green rectangle over on the left; click it and choose Add Panel >
Click on the Metrics tab, and down on the bottom right of the page is the data source dropdown. Choose the data source we added, called
In the data selection section, choose From win_cpu and match the rest of the fields up to the image below to get a graph of the CPU usage.
You can read more about querying data from InfluxDB in Grafana in the Grafana docs.
Next, click on the General tab and enter a name for the graph.
Head over to the Axes & Grid tab. There are a ton of options here. As this is a graph to show CPU usage of one or more Hyper-V servers, I chose to set it up as follows:
- As we are looking at the % Processor Time performance counter, set the Left Y Unit to be
- Set some threshold levels - these just give a nice visual representation of when you should be worried about the graph entering the danger zone.
- You can also display additional values under the graph next to your metrics; in this example I enabled
Save the Dashboard
Click Back to dashboard and then, up the top of the page, choose the Cog icon >
Give the dashboard a name and save it - I chose Hyper-V Dashboard and entered the
Create a Table
I added a Table panel to track disk latency on the Hyper-V server:
The query that I used for this was as follows:
You will notice I used a math function and multiplied the performance counter by 1000. This performance counter records its value in seconds, so I had to multiply by 1000 to get a millisecond value for the counter.
From there I went to the Options tab and set the Unit value to milliseconds (ms) and set the thresholds that were recommended by Microsoft.
Create a Single Value Display
Finally I added a Single Value panel to track Hyper-V memory pressure.
The query that I used for this was as follows:
I then went to the Options tab and set the Postfix of the metric to be avg pressure. I also enabled Background coloring and set the Thresholds as recommended by Ben’s blog post.
InfluxDB and Telegraf provide an excellent and simple way to ship Windows performance counters off the server, and Grafana lets us display these metrics in beautiful dashboards.
Hopefully this starts you on your journey to graphing performance data for your systems.
Keep an eye out for another post shortly which will discuss some more advanced usage including using annotations on the graphs so you can correlate events in your infrastructure to system performance.