Metrics

You're Blind, And You Don't Know It

Rationale

When do you most need to know some weird esoteric metric about your process, with a week's worth of history? When it's down, and that metric is the key to fixing it. Since it's tricky to go back in time to start collecting that metric, you'll need to be collecting it all the time, and storing it for a while. Don't worry, it's cheap.

First, though, let's talk about metrics and how to figure out which ones you need.

Types of Metrics

Alarming Metrics

These are the most obvious ones. Disk space is the classic example: if it's at 100%, odds are that's a problem. Any metric you can get out of your host or application that indicates a fault, you want.

First off, these dovetail with alarming, so you'll want to generate alerts from these metrics. However, you also want to keep their history. How long has disk space been over 95%? Was it a sudden spike, or a long-term trend? Do you need to find an abuser and tell them to knock it off? Or do you need to scale up your system?
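Those history questions are easy to answer once the samples are stored. Here is a minimal sketch, assuming samples are kept as (timestamp, value) pairs sorted by time; the function names and the sample data are made up for illustration and don't come from any particular monitoring tool.

```python
from datetime import datetime, timedelta

def time_over_threshold(samples, threshold):
    """How long the most recent unbroken run of samples has been at or above threshold.

    samples: list of (timestamp, value) pairs, sorted by timestamp.
    """
    run_start = None
    for ts, value in samples:
        if value >= threshold:
            if run_start is None:
                run_start = ts      # a new run above the threshold begins here
        else:
            run_start = None        # dropped below, so forget the earlier run
    if run_start is None:
        return timedelta(0)
    return samples[-1][0] - run_start

def growth_per_hour(samples):
    """Average change in value per hour between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    seconds = (t1 - t0).total_seconds()
    return (v1 - v0) * 3600 / seconds if seconds else 0.0

# Hypothetical data: disk usage climbing one point per minute from 90% to 99%.
start = datetime(2011, 1, 1)
disk_pct = [(start + timedelta(minutes=i), 90 + i) for i in range(10)]

print(time_over_threshold(disk_pct, 95))   # 0:04:00
print(growth_per_hour(disk_pct))
```

A large growth figure over a short window suggests a sudden spike (go find the abuser); a small one sustained over weeks suggests a trend (go buy disks).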

You want to collect alarming metrics at a high frequency (every minute, or more often) and keep them for a moderate amount of time (3 months).

Billing Metrics

These are the metrics that most directly relate to your load: requests handled, bandwidth used, and obviously any metrics you actually DO bill on. You'll want to set up trending on these metrics, and alarm on abnormalities. If your usage drops, that impacts your bottom line, so you want to know. But a drop may not be a direct failure; it could just be that no one surfs the web during the Super Bowl. So you usually won't set up any absolute threshold alarms on these metrics.
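One way to alarm on abnormalities without an absolute threshold is to compare the current value against a baseline built from history. The following is only a sketch of that idea: the function name, the 50% tolerance, and the choice of the median as a baseline are all assumptions for illustration, not a recommendation from any specific tool.

```python
from statistics import median

def is_abnormal(current, history, tolerance=0.5):
    """True if current deviates from the median of history by more than tolerance.

    history: past values for a comparable period, for example the same hour
             on each of the previous few weeks (an assumed choice).
    tolerance: allowed relative deviation; 0.5 means plus or minus 50%.
    """
    baseline = median(history)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > tolerance

# Hypothetical requests-per-hour figures for the same hour in prior weeks.
previous_weeks = [1000, 1100, 950, 1050]

print(is_abnormal(400, previous_weeks))   # True: far below the usual level
print(is_abnormal(900, previous_weeks))   # False: within normal variation
```

A check like this still fires during the big game, which is the point: it tells you usage is unusual, and a human decides whether that's a failure.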

Quality of Service metrics also fall in this category. How long do requests take to finish? What percentage of requests finish? What percentage of requests fail? (You may think requests failing is an alarming metric. Wait until you get to scale, where 0.001% of requests failing means that one fails every 30 seconds, and see if you still think that.)

You want to collect billing metrics at a medium frequency (every 15 minutes or every hour), and keep them forever. You might hear the phrase "SOX Compliance" if you don't keep them long enough.
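To put a number on "scale" in that aside, here is the arithmetic as a small sketch. The function names are hypothetical; the only inputs from the text are the 0.001% failure rate and the 30-second gap between failures.

```python
def implied_request_rate(failure_fraction, seconds_between_failures):
    """Requests per second needed for failures to occur at the given spacing."""
    return 1.0 / (failure_fraction * seconds_between_failures)

def seconds_between_failures(requests_per_second, failure_fraction):
    """Average gap between failed requests at a given load and failure fraction."""
    return 1.0 / (requests_per_second * failure_fraction)

# 0.001% expressed as a fraction is 0.00001.
rate = implied_request_rate(0.00001, 30)
print(round(rate))          # roughly 3,333 requests per second
print(86400 // 30)          # 2880 failures per day at one every 30 seconds
```

At a few thousand requests per second, a pager that fires on every failed request would go off thousands of times a day, which is why the failure percentage gets trended rather than alarmed on directly.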

Debugging Metrics

These are the weird esoteric metrics. If you have access to the application's author, get their help in finding the debugging metrics. Typical metrics in this category are queue lengths, congestion counts, work item counters, CPU usage, and interrupt counts on a host.

You will never alarm on them, but you'll find yourself looking through them during or after an outage, trying to figure out what's happening.

You want to collect debugging metrics at a high frequency (every 1-5 minutes) and keep them for a fairly short time. Really, no one is going to care about the interrupt rate of a box more than a month back. (Now, if it's as cheap to keep metrics as I promised above, you might as well keep them. Someone might care.)
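Pulling the three categories together, the collection and retention guidance can be written down as data, which also makes it easy to see how few samples are involved. This is a sketch under stated assumptions: "3 months" and "a month" are approximated as 90 and 30 days, billing uses the 15-minute option, debugging uses the 1-minute end of the 1-5 minute range, and the names POLICIES and samples_kept are invented for the example.

```python
from datetime import timedelta

# Collection interval and retention per metric type, taken from the text above.
# A retention of None stands for "keep forever".
POLICIES = {
    "alarming":  {"interval": timedelta(minutes=1),  "retention": timedelta(days=90)},
    "billing":   {"interval": timedelta(minutes=15), "retention": None},
    "debugging": {"interval": timedelta(minutes=1),  "retention": timedelta(days=30)},
}

def samples_kept(policy, per=timedelta(days=365)):
    """Samples stored for one metric over its retention window.

    For keep-forever policies, report samples per `per` (one year by default).
    """
    window = policy["retention"] or per
    return window // policy["interval"]

for name, policy in POLICIES.items():
    print(name, samples_kept(policy))
# alarming 129600
# billing 35040
# debugging 43200
```

Even the busiest category comes to roughly 130,000 samples per metric, which is the sense in which keeping metrics around is cheap.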

Stupid Unit Tricks

Let's talk about a sample metric: Web Hits.

Your webserver tells you the total number of web hits since the application started. This is a billing metric, in the taxonomy above, and is certainly one you'll want to keep. So just fetch that number every minute and store it, right?

Wrong.

You'll get a nice upwardly moving line that wraps every so often. You won't get the number you want, which is hits over the period.

This will happen a lot. The best format for an application to expose usage is a pure counter. It's safe against race conditions between clients: reading it never resets it, so any number of clients can query it. It's real time. It's accurate. If you're writing the application, keep exporting pure counters.

But the number you want to store in your metrics system is the rate. Make sure your metrics collection software does the conversion for you. And make sure it handles rollover/wrapping while you're at it.
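The conversion itself is small. Here is a minimal sketch of turning two counter readings into a rate, assuming the counter is an unsigned integer of known width (32 bits by default here) and that a reading lower than the previous one means it wrapped. The function name and parameters are hypothetical, not the API of any particular collector.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, counter_bits=32):
    """Per-second rate between two readings of an ever-increasing counter.

    prev_time and curr_time are timestamps in seconds.
    counter_bits is the assumed width of the counter; a smaller current
    value is treated as a single wrap past 2**counter_bits.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("readings must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter rolled over between readings: add back the modulus.
        delta += 2 ** counter_bits
    return delta / elapsed

# Two readings a minute apart: 600 new hits in 60 seconds.
print(counter_rate(100, 0, 700, 60))              # 10.0

# The counter wrapped: 100 hits before the maximum, then 200 after it.
print(counter_rate(2**32 - 100, 0, 200, 60))      # 5.0
```

One caveat with this sketch: a counter that restarts from zero when the application restarts looks exactly like a wrap, and would produce one absurdly large rate. A real collector needs some way to tell the two apart, such as discarding rates above a plausible maximum.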

Software Considerations

Scaling

Middleware

Metric Count

Monitored Applications

Software

Nagios

Big Brother

Cassandra

Copyright: Douglas Kilpatrick, 2011