Alarms

How to keep your NOC happy(er) and out of Jail.

Intro

Your sysadmins or NOC workers are busy. They have to keep track of a lot of different services and hosts. They are desperately understaffed (usually by design) for the workload.

They do not have time to look at the graphs of the metrics you are pulling from your applications and hosts. At least not until there is an outage. So they don't want metrics or dashboards to be their interface to the system. They don't want to be bothered unless something is broken. So they want alarms.

General Alarm Advice

Alarms are like criticism. They need to be:

    • Specific

    • Actionable

Actionable

If an alarm goes off, something needs to be fixed. If something doesn't need to be fixed, no alarm should go off.

This sounds easy enough, but the problem is whiny developers. "The program shouldn't core. So if it does, alarm." I understand your viewpoint, but if there's nothing to fix ("did it just core once? Ignore it") then DON'T ALARM.

Remember, the person who gets the alarm knows nothing about your app. You've been sweating it night and day for the last 3 years. He (or she) had to check the Wiki to find out what it did when the alarm came in.

Unless Revenue is affected, it doesn't matter.

So does this mean you need to turn off all those alarmable conditions you came up with? No. It means you need to separate them out into two conditions:

    • It's Broken! The Sky is Falling! The Sky Is Falling!!

    • Interesting. If this keeps happening, we probably ought to tell someone

The first condition, you alarm. Obviously. The second, you turn into a metric, and apply an alarm threshold to the metric. The traditional threshold is X-in-Y (5 events in 10 minutes). In fact, that one is common enough that your alarm display system probably has it built in. It kinda sucks, but has the advantage of being easy to understand. For more discussion, see "Metric Derived Alarms" below.
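
For the record, a minimal sketch of that X-in-Y check. The event source and the raise_alarm() hook are placeholders for whatever your shop actually uses:

    # Minimal sketch of an X-in-Y threshold (5 events in 10 minutes).
    import time
    from collections import deque

    WINDOW_SECONDS = 10 * 60
    MAX_EVENTS = 5

    events = deque()  # timestamps of recent "interesting" events

    def record_event(now=None):
        """Call this each time the interesting-but-not-broken condition fires."""
        now = now or time.time()
        events.append(now)
        # Drop anything that has aged out of the window.
        while events and events[0] < now - WINDOW_SECONDS:
            events.popleft()
        if len(events) >= MAX_EVENTS:
            raise_alarm("%d events in the last %d minutes"
                        % (len(events), WINDOW_SECONDS // 60))

    def raise_alarm(detail):
        # Placeholder: hand the alarm to whatever your NOC actually watches.
        print("ALARM:", detail)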

Specific

The alarm needs to have enough context so that the Sysadmin can figure out what to fix.

If the alarm is about an application on a host, mention which host. If it's a disk failure, mention which disk. Don't say "the network is down". If the admin can check Google and verify the network is in fact up, he'll ignore the alarm that was actually trying to say that half of your datacenter went out.

Reasonable in Number

A given NOC worker is only going to be able to work on about 5 alarms at a time. So if there are more than 5 alarms on the screen, some of them are going to be ignored. This will make you angry, as your NOC is apparently not doing their job. This will then make them angry, and they will pull out nice statistics showing the obscene number of alarms they get a day.

Event Triggered Alarms

Event-triggered alarms are the ones that come from log scraping. The kernel complained about a hard drive. A ping timed out. A process crashed, etc.

Some of these events will be immediately alarmable. Most won't. Either way, you'll want to publish a "condition per minute" metric. That way, if you alarm on one, you have a history of your alarms. If you threshold, you'll see trends.
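
A sketch of one way to do the per-minute counting, assuming some publish_metric() hook into your metrics pipeline (it's a placeholder here):

    # Bucket events into a per-minute counter and publish the finished buckets.
    import time
    from collections import defaultdict

    counts = defaultdict(int)  # (condition name, minute bucket) -> count

    def saw_condition(name):
        minute = int(time.time() // 60)
        counts[(name, minute)] += 1

    def flush(publish_metric):
        """Call once a minute; ships and clears the completed buckets."""
        current = int(time.time() // 60)
        for (name, minute), n in list(counts.items()):
            if minute < current:
                publish_metric("%s.per_minute" % name, n)
                del counts[(name, minute)]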

Log file scraping

A common way of detecting edge-triggered alarms is log scraping. It works, but it sucks. If the log scraper runs on an interval, you need to keep log files short to keep it fast, and you need to coordinate log file rolling. Or you can have a persistent daemon that "tail"s the log file, but then you still need to coordinate log file rolling, and you have an extra process on the box.
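
A bare-bones sketch of the tail-style daemon; the pattern is made up, and it deliberately punts on the log-rolling coordination that makes this approach annoying:

    # Tail-style scraper. A real daemon also has to reopen the file when
    # it gets rolled; this sketch ignores that on purpose.
    import re
    import time

    PATTERN = re.compile(r"I/O error|segfault")  # example conditions to count

    def follow(path, on_match):
        with open(path) as f:
            f.seek(0, 2)  # start at end of file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1.0)
                    continue
                if PATTERN.search(line):
                    on_match(line)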

The preferred mechanism is to have a dedicated syslog box. Configure everything to log via syslog (or your in-house sucks-less replacement), and do the alarming/scraping on the syslog box. This lets you keep the logs as required by SOX, and gives you one place to do the monitoring and counting. On the downside, you add another failure point into the alarming.
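
The scraping on the syslog box can be as dumb as a UDP listener that counts pattern matches. A sketch; the port and patterns are made up (real syslog is UDP 514 and needs root):

    # Listen for UDP syslog messages and match alarmable patterns.
    import re
    import socket

    PATTERN = re.compile(r"DEGRADED|OOM killer|core dumped")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 5514))

    while True:
        data, addr = sock.recvfrom(8192)
        msg = data.decode("utf-8", errors="replace")
        if PATTERN.search(msg):
            # Feed a counter or an alarm here instead of printing.
            print("match from %s: %s" % (addr[0], msg.strip()))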

Active Monitoring

At some point, you need to do this. You have a system; it has a function. Test that function, on an automated basis. You'll want to do this yourself, because outside people won't get you alerts that it's broken fast enough. (Having an outsider do it would be preferred, if they can be responsive enough, as they are going to be more representative of your customers, by which I mean reliably outside your firewall.)

Also, don't alarm on a single failure. Don't alarm on a bare failure rate either. When you get a huge number of requests a second, or a sufficiently complex system, you will have the occasional failure, and it won't mean a real problem. And if you do have a system that can occasionally fail (but work again on the next request), you don't want a single failure at 3am (which, with no other traffic, is a 100% failure rate) to cause an alarm either.

Typically, you want to alarm on a scaled combination of rate and count. The higher the traffic, the lower the threshold. For example, you might not alarm at all unless there are 10 hits, alarm at a 4% failure rate if there are between 10 and 20 hits, and at 3% if there are more than 20 hits.
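
A sketch of that exact scaled threshold; the cut-offs are the made-up numbers from the example, not gospel:

    # Scaled rate-plus-count threshold: ignore windows with fewer than 10
    # hits, alarm at 4% failures for 10-20 hits, and at 3% above 20 hits.
    def should_alarm(hits, failures):
        if hits < 10:
            return False          # too little traffic to mean anything
        rate = failures / float(hits)
        if hits <= 20:
            return rate >= 0.04   # 4% threshold for 10-20 hits
        return rate >= 0.03       # 3% threshold above 20 hits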

Make sure that your automated test tool doesn't just report success or failure. You also need it to report metrics, like latency, success rate, etc. You'll want to alarm if any of these are significantly outside of bounds.
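
A sketch of a probe that reports metrics instead of just pass/fail; the URL and the publish_metric() hook are placeholders:

    # Time a request and publish latency and success as metrics.
    import time
    import urllib.request

    def probe(url="https://example.com/healthcheck", publish_metric=print):
        start = time.time()
        ok = False
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                ok = (resp.status == 200)
        except Exception:
            ok = False
        latency_ms = (time.time() - start) * 1000.0
        publish_metric("probe.latency_ms", latency_ms)
        publish_metric("probe.success", 1 if ok else 0)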

Metric Derived Alarms

You've collected some metric, and want to alarm if it's "weird". There are a couple of ways you can do this, none of which are what you want. But what you want (alarm if it's weird) is an AI project, so you're about as likely to get it as a pony.

The simplest solution (and thus the most supported) is to alarm if your metric crosses some arbitrary threshold. Only slightly more complex is to alarm if some formula crosses some threshold. If your program cores once, you don't care. If it cores constantly (> 30/min), you do. Simple thresholds like this are best for metrics that should not follow a user curve.

For metrics that do follow a user curve, you'll want to alarm if they change rapidly. The True Strength Index is used in financial circles, but is handy for pointing out discontinuities in a graph. It will frequently spike when something breaks, and spike again when it's fixed. So while TSI is quite handy for picking out when something broke, it's less useful as an alarm condition because it doesn't work as a level-triggered alarm.
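
For reference, a sketch of TSI over a metric series, using the standard 25/13 smoothing periods:

    # True Strength Index: double-smoothed momentum divided by
    # double-smoothed absolute momentum, scaled to +/-100.
    def ema(values, period):
        if not values:
            return []
        k = 2.0 / (period + 1)
        out, prev = [], values[0]
        for v in values:
            prev = v * k + prev * (1 - k)
            out.append(prev)
        return out

    def tsi(series, long_period=25, short_period=13):
        if len(series) < 2:
            return []
        deltas = [b - a for a, b in zip(series, series[1:])]
        num = ema(ema(deltas, long_period), short_period)
        den = ema(ema([abs(d) for d in deltas], long_period), short_period)
        return [100.0 * n / d if d else 0.0 for n, d in zip(num, den)]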

There are also other variants of running averages, or curve-drop detection, you can use. They also don't work as level-triggered alarms, and so make poor alarms, as you can't detect when the problem is over.

An alternative is to compare the current numbers against averages of historical numbers. This is about as close to an "it looks funny" result as you are going to get. The problem is that this approach typically produces a lot of false positives. The Super Bowl on? Not a normal Sunday. Daylight saving time kicks in? Not a normal week. You'll get a lot of alarms for things that do look weird, but have an explanation outside of your system.
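
A sketch of the historical comparison: look at the same hour of the week from previous weeks, and flag the current value if it's several standard deviations out. The window and sigma count are arbitrary:

    # history: values for this hour-of-week from past weeks.
    import statistics

    def looks_weird(current, history, sigmas=3.0):
        if len(history) < 4:
            return False  # not enough history to judge
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return current != mean
        return abs(current - mean) > sigmas * stdev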

So what's left? The ugly work of going through your metrics and assigning custom, per-metric thresholds or threshold equations. Yes, it's tedious. Yes, it's annoying. There's nothing better, so get started.

Alarm systems

Edge Triggered vs. Level Triggered

An alarm is edge triggered if you can detect the edges of the alarmable condition, but cannot tell, at any given moment, whether you are in the alarmable condition. It is level triggered if you can easily check whether you are still in the alarmable condition, but may not have any way of telling exactly when it started.

SNMP traps are all edge triggered: if you lose the trap, you've lost the alarm. If you are using SMART to send out email/alarms on a failing hard drive, you are edge triggered. If you cat /proc/mdstat looking for degraded RAID sets, you are level triggered.
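
A sketch of that level-triggered mdstat check; run it from cron and it can tell you at any time whether you are still degraded, which is exactly what a trap can't do:

    # Read /proc/mdstat and look for a degraded member, which shows up
    # as "_" in the [UU]-style status field.
    import re

    def raid_degraded(path="/proc/mdstat"):
        with open(path) as f:
            text = f.read()
        # A healthy two-disk set shows "[UU]"; a degraded one shows "[U_]".
        return bool(re.search(r"\[U*_+U*\]", text))

    if __name__ == "__main__":
        print("RAID degraded!" if raid_degraded() else "RAID OK")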

Edge triggered is the assumed state of the art. Netcool, and other alarm software, is usually set up to assume edge-triggered alarms.

Level triggered is better, if you have the option. Edge triggered fires faster, but it's harder to debug, and it's easier to lose the initial event.

If your app developer insists on creating edge-triggered alarms ("Alarm if you see this line in the log file"), get them to publish the alarm condition as a metric (app.bad_harddrive=1). Then you can turn it into a more reliable level-triggered condition.

Reducing alarm count

At some point, you'll find your NOC is ignoring some of your alarms. You ask why, and they tell you it's because they get an unbelievable number of alarms per day, and obviously can't handle them all. This is going to be partly true, and partly not-so-much.

The true part is that any major outage will generate a huge number of alarms. You have a bad switch: the hosts behind it are reported as down by the host monitoring software, the VIPs in front of the web servers report them down, and the router reports a problem with the switch. Now you have 40 alarms for one fault. The consultants will sell you correlation engines. Don't bother; they don't work.

The false part is that some of your alarms will be level triggered (hopefully most), and the NOC will report, as a metric, the total number of messages the alarm software receives, not the number of distinct alarmed conditions. Make sure your alarming software can report the number of distinct alarms, not just the number of messages. You'll need it as a valid metric (and the NOC will generally want it as a valid metric as well).
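
If your alarm software can't do it, counting distinct alarms yourself is trivial; a sketch, assuming each message can be reduced to a (node, code) pair:

    # Deduplicate on (node, alarm code) so a flapping check doesn't
    # inflate the numbers the NOC reports back at you.
    def alarm_counts(messages):
        """messages: iterable of (node, code) tuples, one per message received."""
        distinct = set()
        total = 0
        for node, code in messages:
            total += 1
            distinct.add((node, code))
        return {"messages": total, "distinct_alarms": len(distinct)}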

So: how do you really reduce alarm counts? The first step is to make sure priorities are set correctly. A host down isn't a highest-priority event, no matter what the host. (If you disagree, you need to figure out your disaster recovery plans.) A network outage is likely more important. Root causes of large outages need to have higher priorities. Then your "correlation engine" is just a priority sort of alarms, plus getting the NOC to address alarms in priority order.

The second way is to set your thresholds correctly. This will be more effective than you'd expect, and can drop the number of stupid alarms by a factor of 10 in a medium-size operation.

Then, if you still have insanity... consider some correlation engine. But it won't work. Don't say I didn't warn you.

Software

I'm somewhat familiar with Netcool, but not much else. From my experience with Netcool, I have to suggest writing a custom app instead. Alarms have a pretty simple DB schema, outside of any integration with your CMDB:

    • An alarm code, which links to the recovery procedure.

    • An alarm node, the thing that's broken.

    • Some way of going from node+code to the on-call victim.

    • Some historical tracking of the distinct alarm(s). (Publish metrics, aggregated by node, and separately aggregated by alarm code.)

This is a two-hour Ruby-on-Rails project. Less than that if you have someone who knows what they are doing. Add in a Perl process to catch alarm messages (YAML/JSON), and call it a day.
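
To make the point about how simple the schema is, here's a sketch using Python's sqlite3; table and column names are illustrative, not a recommendation:

    # Minimal alarm schema: codes, alarms, and an on-call mapping.
    import sqlite3

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS alarm_codes (
        code         TEXT PRIMARY KEY,
        recovery_url TEXT             -- link to the recovery procedure
    );
    CREATE TABLE IF NOT EXISTS alarms (
        id          INTEGER PRIMARY KEY,
        node        TEXT NOT NULL,    -- the thing that's broken
        code        TEXT NOT NULL REFERENCES alarm_codes(code),
        first_seen  TEXT NOT NULL,
        last_seen   TEXT NOT NULL,
        cleared_at  TEXT              -- NULL while the condition persists
    );
    CREATE TABLE IF NOT EXISTS oncall (
        node        TEXT NOT NULL,
        code        TEXT NOT NULL,
        victim      TEXT NOT NULL     -- who gets paged for node+code
    );
    """

    conn = sqlite3.connect("alarms.db")
    conn.executescript(SCHEMA)
    conn.commit()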