

Network Monitoring Nightmares

Network Monitoring Systems (NMS) are often cumbersome, ugly, hard to maintain, painful to install, and tedious to configure. In many cases, more energy is spent working around problems with the NMS than actually monitoring the network it's deployed on! And if you think this is a problem that can be solved by purchasing a commercial product, you're in for a real shock. So what do we do about the state of network monitoring?



First, some background. Not background on network monitoring software; there are far too many packages out there. Instead, this is background on the problem faced by a typical IT or Operations department. It starts with a network. The network initially has only a few servers used by a few people, and everything hums along. When there's a problem, the person next to you notices and asks you to look into it. Because there are so few devices making this network run, it's easy to find where the problem is and typically easy to fix. But then the network grows. You need more storage, more servers, more switches. Next you get into complex networks designed for maximum stability, redundant connections, high-availability products, load balancers, routers, firewalls, and so on. Before you know it your tiny network has grown into a multi-datacenter goliath and you're spending most of your time trying to put out fires and find faults in your design. And like everything in life, the more complex you make things, the harder they are to unravel when you need to find the source of an error.

Now you need a network monitoring system. Some piece of software that can watch each individual component on your networks to make sure they're operating properly. This system also needs to watch services running on servers to make sure they don't fail or return unexpected results. And of course, there are countless scripts and applications running in the background to make sure the plates keep spinning. The picture is simply too large for one person to watch manually, and you'll never catch problems before users are impacted. No matter how good you think you are.


Step 1. - We can do this ourselves!

The first step most administrators take is to write a collection of checks themselves. Most of the time they pick a language they already know and get to work. First they check that they can reach the web servers. Child's play for all but the newest admin. Open a connection, retrieve a page, make sure the page downloaded, and we pass. Otherwise we fail. Chalk up a success for the admin team!

But then the web sites change, the file moves, and the test fails even though the server is up and working fine. So the admin checks to make sure any page is returned. Eventually, the site changes again and the page that's being successfully returned is actually an error! Now we think the site is working fine, but it turns out customers have been reporting errors to the support line for the past three hours! This game of cat-and-mouse goes on and on, and the collection of scripts gets larger and larger until finally the administrators decide it's becoming too complex to manage the network monitoring system themselves. Now they start step two: searching for a monitoring product.
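To make the cat-and-mouse game concrete, here's a rough sketch in Python of the two kinds of check described above. Everything in it is illustrative: the function names (fetch, naive_check, content_check) and the marker strings are my own assumptions, not taken from any particular monitoring tool.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return (status, body) for a URL; status is None if the connection failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code and a body.
        return exc.code, exc.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on.
        return None, ""


def naive_check(status, body):
    """The 'any page is returned' check: pass if anything at all came back."""
    return status is not None and len(body) > 0


def content_check(status, body, must_contain, must_not_contain=()):
    """A stricter check: require a 200, expected text, and no known error text."""
    if status != 200:
        return False
    if must_contain not in body:
        return False
    return not any(marker in body for marker in must_not_contain)
```

The fetch helper is only there to show where the status and body would come from. The interesting part is that an error page served as a normal response sails through naive_check, while content_check catches it, but only for as long as the marker text you chose still matches the real page. That's exactly why these scripts keep growing.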


Step 2. - Just pick a monitoring system!

Now that the administrators (there is a team of them now) have admitted that their time would be best spent on administration of their company's infrastructure and not on writing a monitoring system, the search for an NMS begins. Because the team is used to doing things manually or modifying the monitoring system they've cobbled together every time there's a change, they will almost certainly pick a product that offers a small feature set and requires lots of manual intervention. This isn't because they want a system that's difficult to use, but rather because it's what they're accustomed to and they don't know any better yet. So they suffer through the initial setup and maintenance.

Eventually, someone will notice that a system exists with more features and more automatic functionality, and that it would make the job of monitoring much easier for everyone. Unfortunately, so much effort has been invested in the current solution, and it has collected so much historical data, that it's deemed too difficult to switch NMS products. This typically happens several times, and every time the story is the same. But inevitably something drastic happens (a failure in the monitoring system, a loss of data, or a lack of expandability) and the team agrees the time has come to once again change monitoring systems.


Step 3. - Maybe we should pay for this?

Once it has been deemed that the monitoring system is critical to the business, a project can be created and actually assigned money. This can go one of two ways depending on what the administration group looks like- completely commercial or mostly commercial.

A completely commercial system would be an HP OpenView or an IBM Tivoli system. These are large packages that have tons and tons of functionality, professional development, lovely graphical views, fantastic charts and graphs, prediction models, event correlation engines, inventory modules, expensive support contracts, and serious system requirements. It's typically not enough to just buy the monitoring system- you need database software to manage all of the information generated, too! And that can add thousands to the price tag. But, as long as there is money in the budget, and the sales people do their jobs, these solutions seem like they have endless capabilities and it's a no-brainer to go with a completely commercial offering. But the price tag is often so high that the sticker shock is insurmountable. Even worse, if you do end up with a completely commercial offering, you quickly find out that you're essentially on your own to write the components that check your environment again! Now you're back to step one, but you're tens or hundreds of thousands of dollars poorer. And you've learned a very valuable and very expensive lesson.

By contrast, a mostly commercial system is typically a product that has a free or free/open source component that is expanded on by the commercial branch. These companies lure you in by offering extra value, product support, training courses, development resources, plugin packs, and things of that nature. Most of the mostly commercial offerings compare themselves directly to the completely commercial products, and sometimes they even offer truly better products. Sometimes they can be rougher around the edges than the highly polished products offered by HP and IBM, but for the most part they offer exactly the same functionality at a reduced cost. But like everything, the buyer needs to beware. The development teams at these companies are usually pretty busy building new features and fixing bugs. Getting new features developed for your organization can be difficult or take a very long time, and this can be a hard lesson to learn when you've spent tens of thousands of dollars on a product that you're beginning to realize does barely more than the product you've just replaced. What's worse is that you're beginning to realize that the features in the commercial offering are barely worth purchasing, and you probably could have used the free offering instead.



Step 4. - Just settle.

So here we are. Your team has spent several man-years trying to solve a problem that the success of your business has created. If you've made it to step three, you probably realized the network monitoring system space is a wasteland of half-baked, half-functional solutions. There are countless solutions to choose from, but you now realize all of them have the same shortcomings and none of them address the one killer feature you need.

What's worse is that if you've purchased a solution, you no doubt realize that the cost of developing extensions or plugins for the monitoring system is so high that it's effectively out of your reach. So, once again, instead of renewing your support contracts or paying a programmer to write the same things you wrote what feels like forever ago, you decide it's time to look at the market again.

I've been living this nightmare for the past decade. I've used every piece of monitoring software you can imagine, and I've used many of them more than once. Little has changed in the past 10 years, which is amazing! There are few software industries that have the same approach and concepts that they had 10 years ago, but somehow monitoring software just seems to become stagnant the instant it's released. The products all look the same, they feel the same, and for the most part they all have the same failures and successes. What's worse is that several newcomers base their products on offerings that haven't seen development in so long that the projects have effectively been abandoned!

So what do you do now? Where do you go from here? Well, if you're like the majority of administration groups, you settle. You'll find a package that does most of what you want, you'll find someone skilled enough to add the functionality you need, and you'll be frustrated every single day you use it. That's the sad state of monitoring, and we're all in the same boat.



I'd personally like to call on Google, Yahoo, Akamai, Facebook, and other massive networks to tell the rest of us what they use. They must be going through the same pains as startups, and with the skilled people they have onboard they must have found a solution. So what is it? What's their silver bullet? Are they willing to release the code for their tools, to host talks about monitoring, to teach us their ways? I certainly hope so, because almost every group of admins could benefit from their knowledge.