Nagios is my hero
Our company is growing fast. When I started in March 2003, we were hosting our
website with some company in Texas. I assumed the position of "Systems
Administrator" (I knew Linux fairly well at the time, but I had no formal
training in administration) when we moved from the hosting company to our first
dedicated server colocated at Xmission.
Since that time, we moved from a shared rack to our own cabinet, now housing 10
servers (one is currently powered off, because if we turn it on, we risk blowing
our power circuit... yes, we are waiting on a power upgrade).
As the number of servers we're using adds up, so does the stress of having to
manage it all. There are a lot of little things to keep track of to make sure
everything is running smoothly, and doing so can sometimes be a lot of work,
especially when my official title here is "Programmer", and not "Systems
Administrator". Earlier this month, I toured another company's data center and I
learned of something that, as of yesterday, is going to make my life a
LOT easier: Nagios.
From their site: "Nagios is an Open Source host, service and network monitoring program." To me, that doesn't quite sum up the capabilities of this awesome system. Here's what we now use it for:
- Monitoring the RAID setups on every server (we currently use 3ware, LSI, and software RAID setups; each one is monitored separately). If a drive/array goes bad, our admin staff gets an email, and I get a text message.
- Monitoring disk usage on every server; you can set a warning and a critical threshold - for example, if disk usage goes over 80% you get a warning, if it goes over 90% you get a critical notice.
- Monitoring system load (thresholds here are also completely customizable)
- Monitoring MySQL replication status
- Each server is PINGed periodically to make sure it is still up. If a machine goes down, a notification is sent out
- Each web server is monitored to make sure it's receiving HTTP connections
- Each server is monitored to make sure it's receiving SSH connections
- Monitoring MySQL status (number of connections, slow queries, etc)
- Each service on the mail server (POP, IMAP, SMTP, the mail queue) is monitored
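To make the warning/critical idea from the disk usage item concrete, here is a minimal Python sketch of how a check might turn a usage percentage into a state. The function names `classify_usage` and `disk_percent_used` are made up for illustration, and the 80/90 defaults are just the example numbers from the list above; this is not how Nagios's own disk plugin is written.

```python
import shutil


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage onto a Nagios-style state name.

    Hypothetical helper: anything over `crit` is CRITICAL, anything
    over `warn` is WARNING, everything else is OK.
    """
    if percent_used > crit:
        return "CRITICAL"
    if percent_used > warn:
        return "WARNING"
    return "OK"


def disk_percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

Calling `classify_usage(disk_percent_used("/"))` gives the state for the root filesystem. The same two-threshold pattern would apply to system load or the MySQL connection count.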
For every item you monitor, you can specify when, how, and how often you get
notified of events (if at all). You can also tell the system you want to be
notified when something returns to normal.
Nagios uses plugins to monitor different items. It comes with a bunch to monitor
commonly used systems and services (like http, ftp, disk usage, etc). There is
also a community website where people post plugins they have written:
http://www.nagiosexchange.org. If the plugin you need isn't included and you
can't find one on nagiosexchange, plugins are super easy to write in almost any
language.
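To give an idea of how little a plugin needs, here is a rough Python sketch of one that checks whether a TCP port (say SSH or HTTP) accepts connections. A plugin reports its result by printing one line of status text and exiting with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The names `check_tcp` and `main`, the script name in the usage message, and the one-second warning threshold are my own assumptions for illustration, not anything that ships with Nagios.

```python
import socket
import time

# Exit codes a Nagios plugin uses to report its state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0, timeout=5.0):
    """Try to open a TCP connection and return (exit_code, status_line).

    Hypothetical check: a refused or timed-out connection is CRITICAL,
    a slow answer (over `warn_seconds`) is WARNING.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"CRITICAL - cannot connect to {host}:{port} ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} answered in {elapsed:.3f}s"
    return OK, f"OK - {host}:{port} answered in {elapsed:.3f}s"


def main(argv):
    """Parse HOST and PORT from argv, print one status line, return the code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_tcp_sketch.py HOST PORT")
        return UNKNOWN
    try:
        port = int(argv[2])
    except ValueError:
        print(f"UNKNOWN - port must be a number, got {argv[2]!r}")
        return UNKNOWN
    code, message = check_tcp(argv[1], port)
    print(message)
    return code
```

Saved as a script, it would end with `sys.exit(main(sys.argv))` (plus `import sys`) so that the exit code reaches Nagios. The check logic lives in a function that returns the code rather than exiting, which makes it easy to try out by hand before wiring it in.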
Not only that, Nagios gives you a nifty web interface to see what's going on. Here are some screenshots:
The Status Map

... I
mostly included this screenshot because it looks cool. It looks tons cooler when
you have hundreds of machines being monitored - if one of the machines has a
problem, the green circle around it shows as red.
The Host Overview

As you can see
in that last screenshot, we currently have one critical notice. That's a degraded
array that, even though I had written a script to check for such a situation, we
wouldn't have known about without installing Nagios (for whatever reason, the
script I wrote failed to notify us). That array is currently rebuilding thanks
to this sweet system.
In conclusion, if you're feeling growing pains at
your company when it comes to monitoring your servers, I highly recommend you
give this a try. I'll warn you: It's not exactly hard to set up, but it is
rather tedious. In my opinion, it's worth the time you'll spend.
Restored from VimTips archive
This article was restored from the VimTips archive. There are probably missing images and broken links (and even some Flash references), but it was still important to me to bring these articles back.