Metric Storage

How to keep your metric data forever. Cheap.

Rationale

As discussed elsewhere, there is great value in collecting various metrics about the systems you have in production. And as discussed elsewhere, if you CAN keep them for years, you'll quickly discover why you want to.

Plus there's that whole SOX compliance bugbear.

So, here I'm going to talk about HOW to keep huge numbers (5 billion/day) of data points forever.

Solutions

Relational Database

Whatever you're doing, Don't Do This. It won't work. Not with the schema you're thinking of.

The basic problem is that you keep too much data. For each data point, you'll have a row. And that row will identify the system that the data point refers to, the time it was collected, and the actual value. So, at a minimum, you've multiplied the size of your data by 3. And you'll need an insert rate that will cause your DB sales rep to salivate. So just don't go down that path.
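
To put numbers on it: 5 billion points a day works out to roughly 58,000 inserts per second, sustained, before any headroom. And the naive schema stores three values for every one you actually care about. A minimal sketch of that row-per-point layout, in Python, with names that are purely my own illustration:

    # The "don't do this" layout: one row per data point, so every value
    # drags a series identifier and a timestamp along with it.
    naive_rows = [
        # (series_id,        timestamp,  value)
        ("webserver01.hits", 1300000000, 4213),
        ("webserver01.hits", 1300000060, 4377),
        ("webserver01.hits", 1300000120, 4298),
    ]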

RRD

RRD is a much saner solution. Rather than keeping 3 times the data, you keep series. So the "what is being measured" is kept once per series, not once per data point. The "when was it measured" is inferred from the series structure, not stored per data point. So now we are basically back to keeping just the data. This means you can keep a LOT more data. Definitely the right direction.

Now you just run into the fact that RRD is trying to be your all-in-one metrics collection, storage, and viewing platform. Because it's an all-in-one solution, it's hard to split out the components so they can scale separately. I'm personally not a fan of it, but a lot of people use it, and use it successfully. Just be aware that it may stop scaling at some point, and that the initial ramp-up takes a bit longer than you'd expect.

Columnar Database

The basic idea of a columnar database is that it's a normal relational database that, oversimplifying things significantly, stores data in parallel arrays, rather than arrays of structures.

The real effect is that it is a lot faster for certain types of data. And periodic time-series data fits. I haven't worked with any of them enough to actually tell you how to spec out a solution, but I've heard good things. If you have the cash.
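
The parallel-arrays point is easier to see in code than in prose. Here's the concept in Python (the concept only, not any particular product's storage format):

    # Row-oriented ("array of structures"): each record stored together.
    rows = [(1300000000, 4213), (1300000060, 4377), (1300000120, 4298)]

    # Column-oriented ("parallel arrays"): each field stored contiguously.
    # A scan or compression pass over just the values touches far less data.
    timestamps = [1300000000, 1300000060, 1300000120]
    values     = [4213, 4377, 4298]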

OpenTSDB

OpenTSDB is what you want. HBase isn't my favorite DB to base things on, because Java clusters can be a pain to set up, but it's got all the features you want, it's architected how you want, and it's easier than doing it yourself. Also, I don't think it took off.

Custom Tool

So you're looking for an excuse to reinvent this particular wheel?

Metric data is time-series data. That is, there is an implicit periodicity in your data: every minute you have a new value, you won't have more than one value in a minute, and you won't have irregular pauses between data points.

So take advantage of that regularity. Don't store each point with a timestamp; store a single timestamp and an array of data points. To allow yourself to store an unlimited amount of data, break each series up into blocks. Each block is part of a series, has the timestamp for the first data point, and has a simple array of data.
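
A minimal sketch of that block structure in Python, assuming a fixed one-minute period (the field names are mine):

    from dataclasses import dataclass, field

    PERIOD_SECONDS = 60  # assumed fixed collection period

    @dataclass
    class Block:
        """One chunk of one series: a timestamp, then just the data."""
        series_id: str   # "what is being measured", stored once per block
        start_time: int  # timestamp of the first data point, stored once
        values: list = field(default_factory=list)

        def timestamp_of(self, index):
            # "When was it measured" is inferred from position, not stored.
            return self.start_time + index * PERIOD_SECONDS

    blk = Block("webserver01.hits", 1300000000, [4213, 4377, 4298])
    blk.timestamp_of(2)  # -> 1300000120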

One could implement this as a custom application using an embedded DB to store the blocks, or you could store the blocks in a relational database for sanity. The keys here are:

    • Store the timestamp once per block, not per data point

    • Do batched write operations. Do not do one write per new data point. If you have a large system, you might be accepting millions of data points each second (and need to have headroom). You really don't want to write a block out each time, even if you are spreading the load across a cluster. (See the write-path sketch after this list.)

    • Don't store all data points in a single underlying datastore. For example, if you store the blocks in a SQL database, don't put them all in the same database ... have a database per application, or some other arbitrary sharding method. Figure out how to route data points to the instance that has that series. (The write-path sketch below shows one hash-based routing scheme.)

    • Do figure out how to back them up. SOX sucks.

    • Compression: eh, maybe not the most important thing, but at scale it will become so. You can save a lot of space if you remember/realize that the most common data points are either "0" or "+1s/s" counters that snuck in by accident, and optimize your compression algorithm accordingly. For example, you could keep an "it's zero" bitmask in the header, and maybe an "it's monotonically increasing" bitmask (a toy version is sketched at the end of this list).
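
Here's a rough sketch of the write path, covering the batching and routing points above. The shard count, batch size, and flush function are all assumptions on my part; hashing is just one arbitrary routing scheme, and a real version would also flush on a timer so quiet shards don't sit on stale data:

    import hashlib
    from collections import defaultdict

    NUM_SHARDS = 16          # assumed; match your instance count
    FLUSH_THRESHOLD = 1000   # assumed batch size per shard

    def shard_for(series_id):
        # Route every point of a given series to the same shard.
        return hashlib.md5(series_id.encode()).digest()[0] % NUM_SHARDS

    class BatchingWriter:
        """Buffer points per shard, then write in bulk: many points, one write."""
        def __init__(self, flush_fn):
            self.flush_fn = flush_fn          # writes one batch to one shard
            self.buffers = defaultdict(list)

        def accept(self, series_id, timestamp, value):
            shard = shard_for(series_id)
            self.buffers[shard].append((series_id, timestamp, value))
            if len(self.buffers[shard]) >= FLUSH_THRESHOLD:
                self.flush(shard)

        def flush(self, shard):
            batch, self.buffers[shard] = self.buffers[shard], []
            self.flush_fn(shard, batch)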
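
And a toy version of the "it's zero" bitmask idea: the block header records which slots are zero, and only the survivors get stored. Monotonically increasing counters could get the same treatment with delta encoding:

    def compress_block(values):
        # Set a header bit for each zero; store only the non-zero values.
        zero_mask, survivors = 0, []
        for i, v in enumerate(values):
            if v == 0:
                zero_mask |= 1 << i
            else:
                survivors.append(v)
        return zero_mask, survivors

    def decompress_block(zero_mask, survivors, length):
        out, it = [], iter(survivors)
        for i in range(length):
            out.append(0 if zero_mask & (1 << i) else next(it))
        return out

    # compress_block([0, 0, 7, 0, 3]) -> (0b01011, [7, 3])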

Aggregation/Grouping

You have a limited number of applications you monitor, but a huge number of application instances. And a gargantuan number of measurements. So you want to start looking at your metrics from an application perspective, and drill down.

However, you collect metrics from the bottom. So you need to boil them up/aggregate them. I smell Middleware! I also smell integration with your CMDB.

Let's say you have a cluster of web servers. You're collecting hits/minute from each of them, and adding it all together for a cluster-wide hits/min measurement. You'll be adding hosts to and removing them from the cluster fairly often, so obviously manually maintaining the equation that calculates the aggregate isn't going to happen. It needs to be automated, which means it needs to be integrated with your configuration management software.

This is where some sales guy will mention discovery. Don't buy it: discovery is wrong. But I'll talk about that later. For now, accept that they need to be integrated somehow.

So the easiest solution is to query your CMDB, and use rules to generate the equations that add up the metric data points. And yes, that is the easiest solution. The next easiest is arbitrary tags used for aggregation ("everything published to the bob.widgets metric gets aggregated by sum/avg/p50/p99 automatically"), but that leaves you with some annoying data-cruft lifecycle issues.
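
A sketch of the CMDB-driven version. The cmdb.hosts_in_cluster() lookup and the fetch_series() helper are hypothetical stand-ins for whatever your CMDB and datastore actually expose:

    def cluster_aggregate(cmdb, fetch_series, cluster, metric, window):
        """Roll a per-host metric up to the cluster level. Membership comes
        from the CMDB, so adding or removing a host updates the aggregate
        without anyone hand-editing an equation."""
        hosts = cmdb.hosts_in_cluster(cluster)   # hypothetical CMDB call
        per_host = [fetch_series(host, metric, window) for host in hosts]
        # Sum across hosts at each time slot; the series are assumed aligned.
        return [sum(slot) for slot in zip(*per_host)]

    # e.g. cluster_aggregate(cmdb, fetch, "webcluster", "hits_per_min", window)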

Exploration

So you're now keeping your metrics. How do you find them? If you know exactly what you want, it will be easy enough, but exploring the possible dataset can be trickier. I suggest you keep the catalog of what you collect completely separate from the datastore that actually holds the data.

First question: Why are you looking at the data again?

    • Something is wrong, and you want to figure out what

    • You want to bill people/pass an audit.

In the first case, you'll want to start off at the application layer and visually search your way down to the problem. In the second case, you want to write a program.

So you'll have two patterns of data search:

    • Let me browse the data that's currently collected (you)

    • Let me browse all data ever (a program)

So rather than store information in the datastore about which series are active, keep that information in your CMDB. And hook your visualization tool up to your CMDB, so that you start out looking at just the active stuff, without the metrics datastore needing to know the difference.

Trending

Copyright 2011, Douglas Kilpatrick