SpamapS.org – Full Frontal Nerdity

Clint Byrum's Personal Stuff

Nagios is from Mars, and MySQL is from Venus (Monitoring Part 2)

In my previous post about Nagios, I showed how the rich Nagios charm simplifies adding basic monitoring to Juju environments. But, we need more than that. Services know more about how to verify they are working than a monitoring system could ever guess.

So, for that, we have the ‘monitors’ interface. It is currently sparsely documented, as it is extremely new, but the idea is simple:

  • Providing charms define, generically, what a monitoring system should look at.
  • Requiring charms monitor whatever they can, and ignore what they cannot.

This is defined through a YAML file. Here is the example.monitors.yaml included with the nagios charm:

# Version of the spec, mostly ignored but 0.3 is the current one
version: '0.3'
# Dict with just 'local' and 'remote' as parts
monitors:
    # local monitors need an agent to be handled. See nrpe charm for
    # some example implementations
    local:
        # procrunning checks for a running process named X (no path)
        procrunning:
            # Multiple procrunning can be defined, this is the "name" of it
            nagios3:
                min: 1
                max: 1
                executable: nagios3
    # Remote monitors can be polled directly by a remote system
    remote:
        # do a request on the HTTP protocol
        http:
            nagios:
                port: 80
                path: /nagios3/
                # expected status response (otherwise just look for 200)
                status: 'HTTP/1.1 401'
                # Use as the Host: header (the server address will still be used to connect() to)
                host: www.fewbar.com
        mysql:
            # Named basic check
            basic:
                username: monitors
                password: abcdefg123456

There are two main classes of monitors: local and remote. This is in reference to the service unit’s location. Local monitors are intended to run inside the same machine/container as the service. Remote monitors are, quite obviously, meant to run outside the machine/container. So, above, you see remote monitors for the mysql protocol and http, and a local monitor to see whether processes are running.

The MySQL charm now includes some of these monitors:

version: '0.3'
monitors:
    local:
        procrunning:
            mysqld:
                name: MySQL Running
                min: 1
                max: 1
                executable: mysqld
    remote:
        mysql:
            basic:
                user: monitors

The remote part is fed directly to Nagios, which knows how to monitor MySQL remotely, and so translates it into a check_mysql command. The local bits are ignored by Nagios. But when we also relate the subordinate NRPE charm to a MySQL service, we have an agent which understands local monitors. It actually converts those into remote monitors of type ‘nrpe’, which Nagios does understand. So upon relating NRPE to Nagios, each subordinate unit feeds its unique NRPE monitors back to Nagios, and they are added to the target units’ monitors.
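To give a concrete sense of what that translation produces, the remote mysql ‘basic’ monitor above ends up as roughly the stock check_mysql plugin invocation below. This is a sketch, not the exact command definition the charm writes out; the address is a placeholder, and the credentials come from the relation data:

# Roughly what Nagios ends up executing for the remote mysql 'basic' monitor.
# 10.0.0.5 is a placeholder for the target unit's address passed over the relation.
/usr/lib/nagios/plugins/check_mysql -H 10.0.0.5 -u monitors -p abcdefg123456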

Honestly, this all sounds very complicated. But luckily, you don’t really have to grasp it to take advantage of it in a charm. The whole point is this: all you need to do is write a monitors.yaml and add the monitors relation to your charm with a joined hook like this:

#!/bin/bash
# .. Anything you need to do to enable the monitoring host to access ports/users/etc goes here
relation-set monitors="$(cat monitors.yaml)" \
    target-id=${JUJU_UNIT_NAME//\//-} \
    target-address=$(unit-get private-address)
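For the curious, the other side of the relation is not much bigger. Here is a minimal, hypothetical sketch of what a monitoring charm’s monitors-relation-changed hook could do with that data; the real nagios charm does more, and the directory and regeneration step below are made up purely for illustration:

#!/bin/bash
# Hypothetical sketch of the requiring (monitoring) side of the relation.
set -e
target_id=$(relation-get target-id)
target_address=$(relation-get target-address)
# Stash the declared monitors where this charm's config generator can find them.
# (The path is illustrative, not what the nagios charm actually uses.)
mkdir -p /var/lib/monitors.d
relation-get monitors > "/var/lib/monitors.d/${target_id}.yaml"
# ... then regenerate checks against ${target_address} from everything in monitors.d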

If you have local things you want to hand to a monitoring agent, you can use the ‘local-monitors’ interface, which is basically the same as monitors but is only ever used in container-scoped relations required by subordinate charms such as NRPE or collectd.

Now you can easily provide monitors to any monitoring system. If Nagios doesn’t support what you want to monitor, it’s fairly easy to add support. And as more monitoring systems are charmed and have the monitors interface added, your charm will be more useful out of the box.

In the next post, which will wrap up this series on monitoring, I’ll talk about how to add monitors support to some other monitoring systems such as collectd, and also how to write a subordinate charm to communicate your monitors to an external monitoring service.

September 5, 2012 at 5:50 am Comments (0)

Juju and Nagios, sittin’ in a tree.. (Part 1)

Monitoring. Could it get any more nerdy than monitoring? Well, I think we can make monitoring cool again…

 

If you’re using Juju, Nagios is about to get a lot easier to leverage in your environment. Anyone who has ever tried to automate their Nagios configuration knows that it can be daunting. Nagios is so flexible and has so many options that it’s hard to get right when doing it by hand. Automating it requires even more thought. Part of this is because monitoring itself is a bit hard to genericize. There are lots of types of monitors. Nagios really focuses on two of these:

  • Service monitoring – Make a script that pretends to be a user and see if your synthetic monitor sees what you expect.
  • Resource monitoring – Look at the counters and metrics afforded a user of a normal system.

The trick is that service monitoring wants to interrogate the real services from outside the machine, while resource monitoring wants to see things that are only visible with privileged access. This is why we have NRPE, or “Nagios Remote Plugin Executor” (and NSCA, and Munin, but ignore those for now). NRPE is a little daemon that runs on a server and will run a Nagios plugin script, returning the result when asked by Nagios. With this you get those privileged things, like how much RAM and disk space is used.

Normally when you want to use Nagios, you need to sit down and figure out how to tell it to monitor all of your stuff. This involves creating generic objects, figuring out how to get your list of hosts into Nagios’s config files, and how to get the classifications for said hosts into Nagios. Does anybody trying to make sure their pager goes off when things are broken actually want to learn Nagios? So, here’s how to get Nagios into your Juju environment. First, let’s assume you have deployed a stack of applications:

juju deploy mysql wikidb                # single MySQL db server
juju deploy haproxy wikibalancer        # and single haproxy load balancer
juju deploy -n 5 mediawiki wiki-app     # 5 app-server nodes to handle mediawiki
juju deploy memcached wiki-cache        # memcached
juju add-relation wikidb:db wiki-app:db # use wikidb service as r/w db for app
juju add-relation wiki-app wikibalancer # load balance wiki-app behind haproxy
juju add-relation wiki-cache wiki-app   # use wiki-cache service for wiki-app

This gives one a nice stack of services that is pretty common in most applications today, with a DB and cache for persistent and ephemeral storage and then many app nodes to scale the heavy lifting.

Now you have your app running, but what about when it breaks? How will you find out? Well this is where Nagios comes in:

juju deploy nagios                          # custom nagios charm
juju add-relation nagios wikidb             # monitor wikidb via nagios
juju add-relation nagios wiki-app           # ""
juju add-relation nagios wikibalancer       # ""

You should now have Nagios monitoring things. You can check it out by exposing the service and then browsing to the hostname of the Nagios instance at ‘http://x.x.x.x/nagios3’. You can find out the password for the ‘nagiosadmin’ user by catting a file that the charm leaves for this purpose:

juju ssh nagios/0 sudo cat /var/lib/juju/nagios.passwd
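If you haven’t yet exposed the Nagios service so that its web interface is reachable from outside the environment, that is the usual Juju step:

juju expose nagios                        # open the charm's declared ports to the outside world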

Now, the checks are very sparse at the moment. This is because we have used the generic monitoring interface which can just monitor the basic things (SSH, ping, etc). We can add some resource monitoring by deploying NRPE:

juju deploy nrpe                          # create a subordinate NRPE service
juju add-relation nrpe wikibalancer       # Put NRPE on wikibalancer
juju add-relation nrpe wiki-app           # Put NRPE on wiki-app
juju add-relation nrpe:monitors nagios:monitors # Tells Nagios to monitor all NRPEs

Now we will get memory stats, root filesystem, etc.

You may have noticed we left off wikidb. That is because you will get an ambiguous relation error when you try this:

juju add-relation nrpe wikidb # Put NRPE on wikidb

ERROR Ambiguous relation 'nrpe mysql'; could refer to:
  'nrpe:general-info mysql:juju-info' (juju-info client / juju-info server)
  'nrpe:local-monitors mysql:local-monitors' (local-monitors client / local-monitors server)

This is because mysql has special support to be able to specify its own local monitors in addition to those in the usual basic group (more on this in part 2). To get around this we use:

juju add-relation nrpe:local-monitors wikidb:local-monitors

 

This is a perfect example of how Juju’s encapsulation around services pays off for re-usability. By wrapping a service like Nagios in a charm, we can start to really develop a set of best practices for using that service and collaborate around making it better for everyone.

Of course, Chef and Puppet users can get this done with existing Nagios modules. Puppet, in particular, has really great Nagios support. However, I want to take a step back and explain why I think Juju has a place alongside those methods and will accelerate systems engineering in new directions.

While there is some level of encapsulation in the methods that Chef and Puppet put forth, components are not fully encapsulated in the way they interact with the rest of a Chef or Puppet system. In most cases, you still have to edit your own service configs to add specific Nagios integration. This works for the custom case, but it does not make it easy for users to collaborate on the way to deploy well-known systems. It will also be hard to swap out components for new, better methods as they emerge. Every time you mention Nagios in your code, you are pushing Nagios deeper into your systems engineering.

With the method I’ve outlined above, any charmed service can be monitored for basic stats (including the 80 or so that are in the official charm store). You might ask, though: what about custom Nagios plugins, or more elaborate but still somewhat generic service checks? That is all coming. I will show some examples in my next post about this. I will also go on later to show how Nagios + NRPE can be replaced with collectd, or some other system, without changing the charms that have implemented rich monitoring support.

So, while this at least starts to bring the official Nagios charm up to par with configuration management’s rich Nagios abilities, it also sets the stage for replacing Nagios with other things. The key difference here is that, as you’ll see in the next few parts, none of the charms will have to mention “Nagios”. They’ll just describe what things to monitor, and Nagios, collectd, or whatever other system you have in place will find a way to interpret that and monitor it.

August 7, 2012 at 11:47 pm Comments (0)

UDS Maverick – Day 2 highlights

  • btrfs – BTRFS is pretty awesome: with filesystem-level snapshotting and compression, it promises to make some waves on the server and small devices. Unfortunately, it is still marked as EXPERIMENTAL by its own developers, and there are known bugs. However, you can choose to play with it in Ubuntu 10.04, which should be helpful for people finding and submitting bugs so the developers can feel better about people using it. There is a desire to have it as the default filesystem for the next Ubuntu LTS release, which is pretty exciting.
  • Monitoring is too easy – Any time I see 10+ implementations of the same idea, I figure it’s probably something that is easy enough that people tend to write their own instead of searching for a solution. Monitoring and graphing seem to be in this category, with many solutions such as Nagios, OpenNMS, Zenoss, Munin, Ganglia… the list goes on and on. We talked a lot about what to do in Ubuntu Server to make sure this is done well and makes sense, and basically ran out of time. The best part of the session, though, was that we decided to focus on solving the data collection problem first, so each server takes responsibility for itself, and then allow centralized aggregation at another level.
  • Server Community – There is some desire to have people test Ubuntu Server before a release, especially for the LTS releases. A beta program was proposed, but there is some doubt (my own included) that this will actually get people to test before the .0 release. Basically, I have to think that server admins aren’t interested in even trying something in an unstable state. They’ll take the .0 and build a new server rev, but they’re not going to go around upgrading stable servers. This definitely needs more thought and discussion.

I’m sitting in the first session for Wednesday now, listening to a discussion about the next 6 months of Ubuntu Enterprise Cloud and Eucalyptus development. Very exciting stuff!


May 12, 2010 at 7:57 am Comments (0)

Ubuntu Developer Summit Day 1 survived

After about 16 hours in the air and waiting on the tarmac, I arrived here in Brussels, Belgium for my first day on the job at Canonical.

I actually really love the feeling one gets when pushed to the limits of sleep deprivation. For me, my ego tends to shrink and go away after this long without sleep. I did catch a few winks on the plane, but they were mostly drunken winks, so they weren’t quite as restful as, say, stretching out on a pile of broken glass. With the sun hanging in the air while my body wanted it to be underfoot, safely blocked out by a ball of mud, magma, and water, I arrived feeling pretty much like I was in outer space.

That feeling was rather fitting, given that the first Canonical employee I met at lunch was none other than Mark Shuttleworth, who actually *has* been in outer space. It was quite random: I grabbed a place in the salad line, and there he was. We had a pretty good discussion ranging from why people still choose CentOS to Darwin’s Theory of Sexual Selection. It’s really awesome knowing that the guy at the helm gets what we’re doing.

The afternoon was spent in sessions, and I have some quick takeaways from them:

  • Puppet integration in Ubuntu Server is about to get really damn good. Some things that have been discussed are making client registration automatic and decoupled from hostname, allowing re-provisioning without reconfiguring anything.
  • PPAs for volatile software that releases nightly builds are continuing to flesh out. This makes upstream bug reports much easier: when they say “try the latest nightly build,” you can, in fact, try it without suddenly shifting from the packaged version to a custom-compiled one, or having to build your own package. I think one challenge for that is going to be making sure that users know that the PPA exists.

Well, I got some sleep, so now I’m up at 0-dark-thirty and ready to attend some sessions. My favorites for the day are:

https://blueprints.edge.launchpad.net/ubuntu/+spec/server-maverick-monitoring-framework

Monitoring is near and dear to me. Mathiaz has some awesome ideas about how to make it fault tolerant.

https://wiki.ubuntu.com/Specs/ARMServer

A full rack of servers using only 7.5 kW… I can’t wait to see this presentation.

https://blueprints.edge.launchpad.net/ubuntu/+spec/server-maverick-hadoop-pig

I want to learn more about this and play with it.


May 11, 2010 at 12:11 pm Comments (0)