I Hate Me Some SNMP
All the Efficiency of Assembler with all the ease of use of ... ... Assembler.
SNMP is a protocol commonly used for polling devices via the network. Most network devices can only really be polled via SNMP, and many other types of devices provide SNMP interfaces too.
So if you are creating a monitoring and management infrastructure, you have to deal with SNMP. Since you have to deal with SNMP, you may be tempted to tell your developers to make everything new monitorable via SNMP, so you just have one framework to work with. Please do not do this.
SNMP supports 3 main message types: GET, SET, and TRAP. TRAP is an alarm: "X is broke", sent when the thing in question breaks and never resent. Since SNMP is usually done over UDP, I hope you got it the first time. GET is used to get values from the monitored device, and SET is used to configure/control the device.
SNMP security isn't all it could be, by reason of uptake or argument over the security model, or just plain lack of encryption of the password ("Community String"). So don't use SET.
When you do a GET on an SNMP object, you get an OID-value pair. The OID (people often just call it "the MIB", after the Management Information Base that defines it) is a dot-delimited long number that looks something like ".1.3.6.1.4.1.2021.13.15.1.1.2.18". Each level of the MIB has a name somewhere, and so a MIB can be mapped into a dot-delimited name like "enterprises.ucdavis.diskTable.dskEntry.diskDevice.18".
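The number-to-name mapping can be sketched in a few lines. This is only an illustration: the `OID_NAMES` table and the `translate` helper are made up for this example, and the table holds just two arcs I am sure of (`enterprises` at 1.3.6.1.4.1 and `ucdavis` at 2021 beneath it). A real tool would load the names from the vendor's MIB files.

```python
# Hypothetical, hand-written fragment of the name tree. Real tools build this
# table by parsing MIB files; only two well-known arcs are listed here.
OID_NAMES = {
    (1, 3, 6, 1, 4, 1): "enterprises",
    (1, 3, 6, 1, 4, 1, 2021): "ucdavis",
}


def parse_oid(text):
    """Turn '.1.3.6.1.4.1.2021' into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def translate(text):
    """Name every level we know about; leave unknown levels numeric."""
    oid = parse_oid(text)
    labels = []
    for depth in range(1, len(oid) + 1):
        name = OID_NAMES.get(oid[:depth])
        if name is not None:
            labels.append(name)
        elif labels:
            # Below the first named level, fall back to the raw number.
            labels.append(str(oid[depth - 1]))
    # Nothing recognised at all: hand back the numeric form unchanged.
    return ".".join(labels) if labels else "." + ".".join(map(str, oid))


print(translate(".1.3.6.1.4.1.2021.13.15.1.1.2.18"))
```

Every level missing from the table stays a bare number, which is exactly the situation you are in when the vendor has not given you the MIB.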
The value supports a fairly wide variety of types: Integers, Counters, Strings, Network Addresses, Gauges, Opaques, Time ticks. So you have a name, and a value, and even some data to help you understand the value (range on gauges, type of the value, etc). What could possibly go wrong?
The first problem is that, because the MIB has to be laid out ahead of time, you have some serious issues figuring out how to deal with spaces where there are a small number of objects that exist but a huge number that could. Let's take hard drives: Your host has /dev/sda1 and /dev/sda2, but could have hda1-hdqx99, and sdb1-sdqx99, and mema-z, and md0-99, and ... and ...
So there are two approaches you can take when you lay out the MIB. You can enumerate all the objects, and thus waste a lot of space in your MIB definition, or you can do something like what they did in the MIB I referenced above. Let's look at the data for my sda1 partition. The net-snmp agent reports 4 metrics for the drive: NRead, NWritten, Reads, and Writes. In a text name-value-pair protocol, the data might look like this (well, if we went "it was hard to write, it should be hard to read" and forgot to add unit information. Then it might look like this):
disk.sda1.NRead = 521216
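Pulling the device name back out of a line like that is about as much work as it sounds. A minimal sketch, assuming a three-part `subsystem.device.metric` name and an optional unit in parentheses; the regular expression and the `parse_metric` helper are my own invention for illustration, not part of any existing protocol.

```python
import re

# Assumed line shape: name, optional "(unit)", "=", integer value.
LINE = re.compile(
    r"^(?P<name>\S+)\s*(?:\((?P<unit>[^)]*)\))?\s*=\s*(?P<value>-?\d+)\s*$"
)


def parse_metric(line):
    """Split 'disk.sda1.NRead = 521216' into its parts."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric line: {line!r}")
    # Raises ValueError as well if the name has fewer than three parts.
    subsystem, device, metric = match.group("name").split(".", 2)
    return {
        "subsystem": subsystem,
        "device": device,
        "metric": metric,
        "unit": match.group("unit"),  # None when the sender forgot units
        "value": int(match.group("value")),
    }


print(parse_metric("disk.sda1.NRead = 521216"))
```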
But the way that works is I assume the reader can parse the name portion and pull out the device name. Since SNMP uses an object id instead of a name, you have to do something else. Something like
UCD-DISKIO-MIB::diskIOIndex.18 = INTEGER: 18
Ignore the first line, because I really don't know why that first line (1.3.6.1.4.1.2021.13.15.1.1.1.X) is there, since the value is just the last part of the MIB/disk index. .2.X is the "diskIODevice.X" line. That tells you the name of the device. Then you have the 4 metrics we had before. And since they are in .3.X, .4.X, .5.X, and .6.X, they are not next to each other as you walk the MIB tree. So you have to put them together later.
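"Putting them together later" looks roughly like the sketch below. It assumes the walk arrives as (OID tuple, value) pairs in column-major order, and that columns .2 through .6 carry the device name and the four metrics. Apart from the 521216 reading for sda1, the sample values are made up, and `rows_from_walk` is a hypothetical helper rather than any library's API.

```python
from collections import defaultdict

# Assumed prefix of the disk I/O table entry; column and index follow it.
BASE = (1, 3, 6, 1, 4, 1, 2021, 13, 15, 1, 1)
COLUMNS = {2: "device", 3: "nread", 4: "nwritten", 5: "reads", 6: "writes"}


def rows_from_walk(varbinds):
    """Group (oid_tuple, value) pairs from a walk into one dict per index."""
    rows = defaultdict(dict)
    for oid, value in varbinds:
        if oid[:len(BASE)] != BASE or len(oid) != len(BASE) + 2:
            continue  # not a cell of this table
        column, index = oid[-2], oid[-1]
        if column in COLUMNS:  # column .1 (the index itself) is skipped
            rows[index][COLUMNS[column]] = value
    return dict(rows)


# Illustrative walk output: all of column .1 first, then all of .2, and so on.
walk = [
    (BASE + (1, 18), 18), (BASE + (1, 19), 19),
    (BASE + (2, 18), "sda1"), (BASE + (2, 19), "sda2"),
    (BASE + (3, 18), 521216), (BASE + (3, 19), 1024),
    (BASE + (4, 18), 2048), (BASE + (4, 19), 0),
    (BASE + (5, 18), 40), (BASE + (5, 19), 2),
    (BASE + (6, 18), 7), (BASE + (6, 19), 0),
]

for index, row in sorted(rows_from_walk(walk).items()):
    print(index, row)
```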
And since the number of drives in the device can change between polls, you can't assume that the .18 you get this cycle is the same .18 you get next cycle. Since it will take multiple queries to get all the metrics you are going to get from a device, you have to worry about the index changing between queries. That's not too much of a worry if you are getting metrics for hard drives, but what if the MIB describes a routing table? How is that even going to work?
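One common defensive trick, sketched here under my own assumptions, is to read the index-to-name column both before and after the metric column and throw the sample away if the two disagree. The `consistent_sample` helper and the sample numbers (other than 521216) are hypothetical. The check is only a heuristic: it cannot catch a table that changed and then changed back between the two reads.

```python
def consistent_sample(names_before, metrics, names_after):
    """Return {device_name: value}, or None if the index mapping moved.

    names_before / names_after: {index: device_name}, read around the metrics.
    metrics: {index: value} for a single metric column.
    """
    if names_before != names_after:
        return None  # table was renumbered mid-poll; retry next cycle
    # Re-key by device name so later cycles do not depend on the index.
    return {
        names_before[index]: value
        for index, value in metrics.items()
        if index in names_before
    }


before = {18: "sda1", 19: "sda2"}
nread = {18: 521216, 19: 1024}

print(consistent_sample(before, nread, {18: "sda1", 19: "sda2"}))
print(consistent_sample(before, nread, {18: "sda2"}))  # a drive went away
```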
The thing that gets the metrics needs to know the MIB-to-name mapping, and the thing that produces the metrics needs to have the mapping too. This means that to start monitoring a new device, you have to go back to the developer and get a precise description of the MIB so that you can figure out what the numbers mean. Without that, you're stuck noting that "1.3.6.1.4.1.2021.13.15.1.1.3.18" is "521216", and wondering what that means. If it's a gauge, at least you know the maximum and the minimum. Hope that helps.
As with any system that involves a mapping from name to number, updates to the mapping need to be synchronized. With SNMP, you're really not supposed to ever change your MIBs, but as we all know, the difference between theory and practice is that in theory there is no difference. So when this happens, make sure you have a flag day. Or at least notify your customers, and make the new MIB have a MIB version or something. Or perhaps you just hate your customers, that's a popular approach.
SNMP devices also frequently alarm over SNMP. A frequent alarm would be something like "Fan died", or "Interface Hot". As is normal for edge-triggered alarms, it is sent out once and not sent out again until it's fixed and breaks a second time. As a result of the popularity of SNMP, many alarming displays (Netcool) assume edge-triggered alarming.
The problem with edge-triggered alarming is that it's fragile. If you have a bug that allows you to miss the message, you'll never see it again. If an app crashes at the wrong time, you'll never see it again. If the network is too hot and the UDP datagram is dropped, you'll never see it again.
Did I mention that one of the common traps is the "Network Congestion" one? Did I mention that SNMP uses UDP? Does anyone else see the problem?
The correct way to do alarming is by scanning metrics. Instead of a "Fan Died" alarm, you send in a "FansDead=1" gauge metric. (See here, where I mention how cheap it is to keep metrics.) When that number is non-zero, you alarm until it's zero again. A glitch, and you alarm the next minute. You miss a metrics report? No problem, you'll alarm on the next minute's.
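Here is a minimal sketch of that scanning loop, assuming one metrics report per minute arrives as a dict and a lost report shows up as `None`. The names `WATCHED`, `scan`, and `run` are invented for this illustration.

```python
# Gauges that should be zero when all is well (illustrative list).
WATCHED = ("FansDead",)


def scan(report, active):
    """Return the set of alarms that should be raised after this report."""
    if report is None:
        # Missed report: keep whatever we believed, and look again next time.
        return set(active)
    # Level-triggered: the alarm state is recomputed from the current values.
    return {name for name in WATCHED if report.get(name, 0) != 0}


def run(reports):
    """Feed a sequence of per-minute reports through scan()."""
    active, history = set(), []
    for report in reports:
        active = scan(report, active)
        history.append(sorted(active))
    return history


# The fan dies during minute 2, and that minute's report is lost.
reports = [{"FansDead": 0}, None, {"FansDead": 1}, {"FansDead": 1}, {"FansDead": 0}]
print(run(reports))
```

The lost report costs one minute of delay, not the alarm. With a single trap, the same loss would have cost the whole event.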
SNMP has 3 versions in use. Er, I'll get back to that.
Version 1: Plain Text everything. The Community String (password) is in plain text floating around on the network. Devices are generally configured to not allow any SETs, because everyone knows the password.
Version 2: Adds a bulk GET request. This is very important. Adds an acknowledged Trap (the INFORM), but AFAIK no one uses this. It also added in a security model. Or two. Incompatible security models. To the point where the version people actually use is SNMPv2c, which still uses plain text community strings. But at least the protocol added better ways to iterate through object stats. So it's still a large improvement.
Version 3: Adds encryption! And Authentication! And Integrity! And no one uses it!
So, your SNMP messages are going to all float over the network unencrypted, with plain text passwords. Which means you can't use any of the SET operations. But that's fine, doing SET via UDP seems pretty silly anyway.
SNMP support on devices is frequently one of those "Oh yeah! We oughta support that" features. And so sometimes the functionality isn't fully tested when the device is shipped. And so, sometimes, probing a specific MIB can result in the device crashing.
No, I'm not making this up.
So your home-grown internal NMS code becomes this hodge-podge of special cases. Conversions, bug fixes, unit hackery, workarounds for vendor bugs. And, since you can't change a MIB, some portion of them will never be fixed.
"disk.sda1.read (bytes) = 521216" How hard is that? Now, you can send a bunch of these over a TCP stream pretty quickly. If you want encryption, use SSL. CPU is cheap these days, even on routers, for something that only happens once a minute or less.
Remote syslog. Supports UDP, supports TCP, and everyone knows how to set this up. Just please, if you have control over the thing being monitored, don't use log files for alarming. If you don't, take solace in the fact that almost everything that uses SNMP TRAPs can also log to syslog instead, and do your alarming from the TCP-delivered syslog log files.
Collect all the metrics. They are cheap. Mine them for alarms. Keep alarming until a new metric shows up that shows the problem went away.
Copyright 2008 Douglas Kilpatrick, with help from Steve K. and Will P.