SpamapS.org – Full Frontal Nerdity

Clint Byrum's Personal Stuff

So what is Ensemble anyway?

Have you heard of Ensemble? Are you excited about cloud/service orchestration? What? OK, you're not alone if you're scratching your head.

Ensemble is an implementation of a new idea that has been taking shape over the last couple of years. Ever since Amazon hooked up a remote API to thousands of machines to provide access to their virtual infrastructure (and called it macaroni? err.. AWS), people have been dreaming up ways to take advantage of what is basically a robotic "NOC guy". No longer do you have to pre-rack servers or frantically call your vendor to get servers shipped next-day to your colo. Right?

Naturally, the system administrators who would normally be in charge of racking servers applied their existing tools to the job, with mixed success. Config management is really good at modelling identical hosts, but with virtual hosts instantly available, this left those thinking at a higher level wanting more. Chef in particular implemented a nice set of tools and functionality to allow this high-level "service" definition, with its knife tooling and simple Ruby API.

But how easy are Chef's cookbooks to share and use without modification? How easy are they to integrate together? Puppet has modules capable of similar functionality, and the recent integration of MCollective, plus Puppet Faces, has certainly added a lot of the same things Chef had to support this kind of application modelling. But again, the modules seem to require a lot of convention, assumption, and tweaking to be useful.

It's my opinion that this is very much like the way tarballs+autoconf became the de facto standard for distributing free software. It was *so much* better than writing a Makefile by hand, and it achieved an enormous amount of portability, so developers adopted it rapidly. In fact, it is still the dominant way to distribute portable open source applications.

But at some point, the limitations of this became clear. There was a need for something more concise that could distribute both the source and binaries built for a platform. There was some limited early success with tarballs built by convention. But then: enter RPM and dpkg. These included ways to express facts about software, like its dependencies, architecture, and the revisions made to it to work on the target platform. This allowed distributors of software to more easily maintain their systems, and enabled users to manage the software in their environments.

At that point, some smart guy figured out that we should be able to download and automatically configure all of the software needed for one application to work properly, just from its packaging information. To my mind, apt-get was my first experience with this, though FreeBSD ports authors may disagree there. Either way, this made it very easy for admins and users to install software without spending hours in the 7 levels of dependency hell.

In many ways, Service Orchestration is a way of bringing the benefits of packaging to the cloud. It should allow us to build out our cloud in a sane way, taking advantage of the knowledge that has been gained by others. For the bits that we need to finely tune, it should step aside and allow that without compromising the system.

Ensemble is an implementation of this idea, and Principia is a collection of “Formulas” for Ensemble. They are tightly coupled to Ubuntu, as they are in many ways meant to be the dpkg and apt-get for Ubuntu in the cloud.

It's pretty easy to try out Ensemble and Principia on Ubuntu. Right now you'll need an EC2 account with an access key set up, though we're working on making this work with just your local machine for rapid development.

It's been pointed out to me that the version of principia-tools that was available at the time of this writing didn't include /usr/share/principia-tools/tests. I've uploaded a fixed version to the ensemble PPA, so if you tried these instructions and failed, please try updating principia-tools. If that fails, you can get the tests with bzr branch lp:principia-tools.


sudo add-apt-repository ppa:ensemble/ppa
sudo apt-get update
sudo apt-get install principia-tools
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxx
export AWS_ACCESS_KEY_ID=0123456789ABCDEF
ensemble bootstrap
principia getall /some/path/for/formulas
/usr/share/principia-tools/tests/mediawiki.sh /some/path/for/formulas

What does this give you? Well, it should give you a 7-node MediaWiki cluster of t1.micros in the us-east-1 region of EC2. I just ran it, and now I have this:

machines:
  0: {dns-name: ec2-50-19-158-109.compute-1.amazonaws.com, instance-id: i-215dd84f}
  1: {dns-name: ec2-50-17-16-228.compute-1.amazonaws.com, instance-id: i-8d58dde3}
  2: {dns-name: ec2-72-44-49-114.compute-1.amazonaws.com, instance-id: i-9558ddfb}
  3: {dns-name: ec2-50-19-47-106.compute-1.amazonaws.com, instance-id: i-6d5bde03}
  4: {dns-name: ec2-174-129-132-248.compute-1.amazonaws.com, instance-id: i-7f5bde11}
  5: {dns-name: ec2-50-19-152-136.compute-1.amazonaws.com, instance-id: i-755bde1b}
  6: {dns-name: '', instance-id: i-4b5bde25}
services:
  demo-wiki:
    formula: local:mediawiki-62
    relations: {cache: wiki-cache, db: wiki-db, website: wiki-balancer}
    units:
      demo-wiki/0:
        machine: 2
        relations: {}
        state: null
      demo-wiki/1:
        machine: 6
        relations: {}
        state: null
  wiki-balancer:
    formula: local:haproxy-13
    relations: {reverseproxy: demo-wiki}
    units:
      wiki-balancer/0:
        machine: 4
        relations: {}
        state: null
  wiki-cache:
    formula: local:memcached-10
    relations: {cache: demo-wiki}
    units:
      wiki-cache/0:
        machine: 3
        relations: {}
        state: null
      wiki-cache/1:
        machine: 5
        relations: {}
        state: null
  wiki-db:
    formula: local:mysql-93
    relations: {db: demo-wiki}
    units:
      wiki-db/0:
        machine: 1
        relations: {}
        state: null

At the top, in the 'machines' section, you see the machines that Ensemble spun up in EC2. The numbers there correspond to the 'machine: #' in the service/unit definitions below. If you look through, you'll see that wiki-balancer is machine 4, which has a hostname of ec2-174-129-132-248.compute-1.amazonaws.com. If you go to that hostname, once all relations are up (I like to use 'watch ensemble status' to see when this happens), you should see a working MediaWiki. But not just a working MediaWiki, a scalable one. If you want to pour on the traffic, spin up 3 more demo-wiki units to handle the app server load:


ensemble add-unit demo-wiki
ensemble add-unit demo-wiki
ensemble add-unit demo-wiki

These will of course take a minute or two to spin up. Once they’re ready they’ll show up in the status output:

services:
  demo-wiki:
    formula: local:mediawiki-62
    relations: {cache: wiki-cache, db: wiki-db, website: wiki-balancer}
    units:
      demo-wiki/0:
        machine: 2
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/1:
        machine: 6
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/2:
        machine: 7
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/3:
        machine: 8
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started
      demo-wiki/4:
        machine: 9
        relations:
          cache: {state: up}
          db: {state: up}
          website: {state: up}
        state: started

How about a little test then? After I got to this point, I logged in as WikiSysop (change the password, folks! it's change-me) and imported the Wikipedia exports for "Ubuntu" and "EC2". After that I used HarvestMan to spider the site and then saved all the URLs in a file, urls.txt. Alright! Now let's fire up *siege* from a machine outside the cluster, but in the same availability zone / security group (so at least we're only dealing with EC2's latency and not my net connection), and see if we can take this cluster down!
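
(A quick aside: if you don't have HarvestMan handy, something like this can produce a similar urls.txt. It's a rough sketch using wget's spider mode; the hostname is the balancer from the status output above, and the recursion depth is arbitrary.)


# hypothetical: crawl the wiki through the balancer and collect unique URLs
wget --spider -r -l 2 http://ec2-174-129-132-248.compute-1.amazonaws.com/ 2>&1 \
  | grep -o 'http://[^ ]*' | sort -u > urls.txt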


$ siege -i -c 5 -f urls.txt
...
Transactions: 563 hits
Availability: 100.00 %
Elapsed time: 95.58 secs
Data transferred: 2.64 MB
Response time: 0.35 secs
Transaction rate: 5.89 trans/sec
Throughput: 0.03 MB/sec
Concurrency: 2.04
Successful transactions: 544
Failed transactions: 0
Longest transaction: 13.54
Shortest transaction: 0.00

This was, by the way, the best run I got out of the t1.micros. Sometimes it would get quite ugly:


Transactions: 892 hits
Availability: 99.55 %
Elapsed time: 221.69 secs
Data transferred: 3.64 MB
Response time: 0.61 secs
Transaction rate: 4.02 trans/sec
Throughput: 0.02 MB/sec
Concurrency: 2.45
Successful transactions: 849
Failed transactions: 4
Longest transaction: 27.41
Shortest transaction: 0.00

Let's try the whole thing over with m1.small. First I edit ~/.ensemble/environments.yaml and add an override for the default-instance-type:


ensemble: environments

environments:
  sample:
    type: ec2
    default-instance-type: m1.small
    control-bucket: ensemble-12345678901234567890
    admin-secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
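
Since default-instance-type only affects newly launched machines, the whole environment has to come down and back up before the change matters. Roughly (hedging a bit here: I believe the teardown subcommand in the ensemble CLI of the time was shutdown):


ensemble shutdown    # terminate the t1.micro environment
ensemble bootstrap   # bootstrap fresh, now on m1.small
/usr/share/principia-tools/tests/mediawiki.sh /some/path/for/formulas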

Then I re-run the whole test:


Transactions: 290 hits
Availability: 98.98 %
Elapsed time: 81.79 secs
Data transferred: 0.78 MB
Response time: 0.53 secs
Transaction rate: 3.55 trans/sec
Throughput: 0.01 MB/sec
Concurrency: 1.89
Successful transactions: 277
Failed transactions: 3
Longest transaction: 1.50
Shortest transaction: 0.00

Oops! I forgot to add my 3 extra nodes. Note that these two m1.smalls are already almost keeping up. Now as I add these, I keep siege running. It's pretty cool to watch the response times drop as nodes come online to carry some of the load.
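
In shell terms, the add-while-under-load dance is just one terminal running siege and another doing something like this (a sketch, same commands as above):


for i in 1 2 3; do ensemble add-unit demo-wiki; done   # three more app servers
watch ensemble status   # watch relations flip to 'up' as units join the pool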

Now with 5 m1.smalls:


Transactions: 273 hits
Availability: 100.00 %
Elapsed time: 54.27 secs
Data transferred: 0.99 MB
Response time: 0.47 secs
Transaction rate: 5.03 trans/sec
Throughput: 0.02 MB/sec
Concurrency: 2.38
Successful transactions: 260
Failed transactions: 0
Longest transaction: 19.92
Shortest transaction: 0.00

And with the concurrency raised from 5 to 10:


Transactions: 327 hits
Availability: 100.00 %
Elapsed time: 42.20 secs
Data transferred: 1.30 MB
Response time: 0.66 secs
Transaction rate: 7.75 trans/sec
Throughput: 0.03 MB/sec
Concurrency: 5.12
Successful transactions: 318
Failed transactions: 0
Longest transaction: 25.51
Shortest transaction: 0.00

And now if we add 2 more, for a total of 7 nodes, concurrency of 10 gets even better:


Transactions: 531 hits
Availability: 100.00 %
Elapsed time: 53.37 secs
Data transferred: 1.75 MB
Response time: 0.44 secs
Transaction rate: 9.95 trans/sec
Throughput: 0.03 MB/sec
Concurrency: 4.35
Successful transactions: 507
Failed transactions: 0
Longest transaction: 15.49
Shortest transaction: 0.00

And with 2 more (total of 9 units in demo-wiki serving the app):


Transactions: 354 hits
Availability: 100.00 %
Elapsed time: 34.41 secs
Data transferred: 1.23 MB
Response time: 0.41 secs
Transaction rate: 10.29 trans/sec
Throughput: 0.04 MB/sec
Concurrency: 4.22
Successful transactions: 337
Failed transactions: 0
Longest transaction: 11.45
Shortest transaction: 0.00

Anyway, this isn't a MediaWiki benchmark. This is to show you how easy it is to scale up and down in response to load with Ensemble. We all know that scaling out works; these graphs show it nicely:

[Graph: Response Time]
[Graph: Transactions per Second]

Notice how the transactions per second went up the whole time, but the response time jumped drastically with the increase in concurrency. This is where you need the ability to scale quickly, and where, if you can live with the other limitations of EC2 or any other IaaS provider, the cloud should actually win you business, since better response times mean more happy users.

Now that my siege is over, I can safely remove the unnecessary units one by one with 'ensemble remove-unit demo-wiki/9', etc. etc. There's still a lot of room for sugar to be added. We could say 'ensemble resize-service demo-wiki 5' and it might just pick 5 to keep and remove the rest, or add 3 to fulfill the request. There are also a ton of other ideas just bubbling up that are really exciting.
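
That resize-service sugar would really just be a wrapper around status, add-unit, and remove-unit. A hypothetical sketch (to be clear, resize-service doesn't exist; the unit-counting grep assumes the status YAML indentation shown above, and that unit numbers are dense starting from 0):


#!/bin/sh
# hypothetical resize-service: grow or shrink a service to N units
SERVICE=$1; TARGET=$2
CURRENT=$(ensemble status | grep -c "^      $SERVICE/")  # count unit entries
while [ "$CURRENT" -lt "$TARGET" ]; do
  ensemble add-unit "$SERVICE"
  CURRENT=$((CURRENT + 1))
done
while [ "$CURRENT" -gt "$TARGET" ]; do
  CURRENT=$((CURRENT - 1))
  ensemble remove-unit "$SERVICE/$CURRENT"
done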

Come say hi and hack on Ensemble with us on Freenode in #ubuntu-ensemble, and on the mailing list.

June 3, 2011 at 6:53 pm

Puppet Camp Report: Two very different days

I attended Puppet Camp in San Francisco this month, thanks to my benevolent employer Canonical’s sponsorship of the event.

It was quite an interesting ride. I'd consider myself an intermediate-level Puppet user, having only edited existing Puppet configurations and used it for proof-of-concept work, not actual giant deployments. I went in large part to get in touch with users and potential users of Ubuntu Server to see what they think of it now, and what they want out of it in the future. Puppet is also a really interesting technology that I think will be a key part of this march into the cloud that we've all begun.

The state of Puppet

This talk was given by Luke, and was a very frank discussion of where Puppet is and where it should be going. He also briefly discussed where Puppet Labs fits into this. In short, Puppet is stable and growing. A survey of Puppet users showed that the overwhelming majority are sysadmins, which is no surprise. Debian and Ubuntu have equal share amongst survey respondents, but RHEL and CentOS dominate the playing field.

As for the future, there were a couple of things mentioned. Puppet needs some kind of messaging infrastructure, and it seems MCollective will be it. They're not ready to announce anything, but it seems like a logical choice. There are also plans for centralized data services, to make the data Puppet has available to other tools.

MCollective

Given by MCollective's author, whose name escapes me, this was a live demo of what MCollective can do for you. It's basically a highly scalable messaging framework that is not necessarily tied to Puppet. You simply write an agent that subscribes to your messages. Currently only ActiveMQ is supported, but it speaks STOMP, so any queueing system that uses STOMP should be able to utilize the same driver.

Once you have these agents consuming messages, it's just a matter of getting creative with what they can do. He currently has some Puppet-focused agents and client code to pull data out of Puppet and act accordingly. Ultimately, you could do much of this with something like Capistrano and parallel ssh, but this seems to scale well. One audience member boasted that they have over 1000 nodes using MCollective to perform tasks.
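
For flavor, the client side of that is a handful of commands (a sketch using the stock MCollective client; older releases shipped these as separate mc-* scripts, and the hostname is made up):


mco ping                          # discover every node listening on the message bus
mco find --with-class /apache/    # nodes whose configuration includes an apache class
mco inventory web01.example.com   # facts, classes, and agents on one node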

The Un-Conference

Puppet Camp took the form of an "unconference", where there were just a few talks and a bunch of sessions based on what people wanted to talk about. I didn't propose anything, as I did not come with an agenda, but I was definitely interested in a few of the topics:

Puppet CA

My colleague at Canonical, Mathias Gug, proposed a discussion of the puppet CA mechanics, and it definitely interested me. Puppet uses the PKI system to verify clients and servers. The default mode of operation is for a new client to contact the configured puppet master, and submit a “CSR” or “Certificate Signing Request” to it. The puppet master administrator then verifies that the CSR is from one of their hosts, and signs it, allowing both sides to communicate with some degree of certainty that the certificates are valid.
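
In day-to-day terms, that verification step is a couple of commands on the puppet master (a sketch using the puppetca tool of this era; newer releases spell it puppet cert, and the hostname is made up):


puppetca --list                     # show CSRs waiting for review
puppetca --sign web01.example.com   # sign one after verifying it's really yours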

Well there’s another option, which is just “autosign”. This works great on a LAN where access is highly guarded, as it no longer requires you to verify that your machine submitted the CSR. However, if you have any doubts about your network security, this is dangerous. An attacker can use this access to download all of your configuration information, which could contain password hashes, hidden hostnames, and any number of other things that you probably don’t want to share.
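
Autosign itself is just a file of globs on the puppet master, which is exactly why it's so dangerous on a network you don't fully trust (sketch; the domain is a placeholder):


# /etc/puppet/autosign.conf -- any CSR whose CN matches gets signed, no questions asked
*.internal.example.com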

When you add the cloud to this mix, it's even more important that you not just trust any host. IaaS cloud instances come and go all the time, with different hostnames/IPs and properties. Mathias had actually proposed an enhancement to Puppet to add a unique ID attribute for CSRs made in the cloud, but there was a problem with the Ruby OpenSSL library that wouldn't allow these attributes to be added to the certificate. We discussed possibly generating the certificate beforehand using the openssl binary, but this doesn't look like it will work without code changes to Puppet. I am not sure where we'll go from there.
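
The pre-generation idea amounts to something like this with the openssl binary (a sketch; the CN is a made-up cloud instance ID, and embedding the extra unique-ID attribute is exactly the part that didn't work without patching Puppet):


# hypothetical: generate the keypair and CSR outside of Puppet
openssl genrsa -out client.key 2048
openssl req -new -key client.key -subj "/CN=i-12345678.internal" -out client.csr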

Puppet Instrumentation

I'm always interested to see what people are doing to measure their success. I think a lot of times we throw up whatever graph or alert monitoring comes pre-packaged with something, and figure we've done our part. There wasn't a real consensus on what the important things to measure are. As usual, sysadmins who are running Puppet are pressed for time, and measurement of their own processes often falls by the wayside under the pressure to measure everybody else.

Other stuff

There were a number of other sessions and discussions, but none that really jumped out at me. On the second day, an employee from Google's IT department gave a talk about Google's massive Puppet infrastructure. He explained that it is only used for IT support, not production systems, though he wasn't able to go into much more detail. Twitter also gave some info about how they use Puppet for their production servers, and there was an interesting discussion about the line between code and infrastructure deployment. This stemmed from a question I asked about why they didn't use their awesome BitTorrent-based "murder" code distribution system to deploy Puppet rules. The answer was "because murder is for code, and this is infrastructure".

Cloud10/Awstrial

So this was actually the coolest part of the trip. Early on the second day, during the announcements, the (sometimes hilarious) MC, Deepak, mentioned that there would be a beginner Puppet session later in the day. He asked that attendees to that session try to have a machine ready, so that the presenter, Dan Bode, could give them some examples to try out.

Some guys on the Canonical server team had been working on a project called "Cloud 10" for the release of Ubuntu 10.10, which was coming in just a couple of days. They had thrown together a Django app called awstrial that could be used to fire up EC2 or UEC images for free, for a limited period. The reason for this was to let people try Ubuntu Server 10.10 out for an hour on EC2. I immediately wondered, though: "Maybe we could just provide the Puppet beginner class with instances to try out!"

Huzzah! I mentioned this to Mathias, and he and I started bugging our team members about getting this set up. That was at 9:00am. By noon, 3 hours later, the app had been installed on a fresh EC2 instance, a DNS pointer had been created pointing to said instance, and the whole thing had been tweaked to reference Puppet Camp and allow users to have 3 hours instead of 55 minutes.

As lunch began, Mathias announced that users could go to “puppet.ec42.net” in a browser and use their Launchpad or Ubuntu SSO credentials to spawn an instance.

A while later, when the beginner class started, 25 users had signed on and started instances. Unfortunately, the instances died after 55 minutes due to a bug in the code, but ultimately, the users were able to poke around with these instances and try out stuff Dan was suggesting. This made Canonical look good, it made Ubuntu look good, and it definitely has sparked a lot of discussion internally about what we might do with this little web app in the future to ease the process of demoing and training on Ubuntu Server.

And what's even more awesome about working at Canonical? This little web app, awstrial, is open source. Sweet! So anybody can help us out making it better, and even show us more creative ways to use it.


October 21, 2010 at 4:54 pm

Puppet Camp: Learn More About Open Source Data Center Automation | Puppet Labs

I’ll be attending Puppet Camp in San Francisco tomorrow and Friday. Come say hi if you’ll be there too!


Puppet Camp: Learn More About Open Source Data Center Automation | Puppet Labs.


October 6, 2010 at 4:51 pm

Ubuntu Developer Summit Day 1 survived

After about 16 hours in the air and waiting on the tarmac, I arrived here in Brussels, Belgium for my first day on the job at Canonical.

I actually really love the feeling one gets when pushed to the limits of sleep deprivation. For me, my ego tends to shrink and go away after this long without sleep. I did catch a few winks on the plane, but they were mostly drunken winks, so they weren't quite as restful as, say, stretching out on a pile of broken glass. With the sun hanging in the air while my body wanted it to be underfoot, safely blocked out by a ball of mud, magma, and water, I arrived feeling pretty much like I was in outer space.

That feeling was rather fitting, given that the first Canonical employee I met at lunch was none other than Mark Shuttleworth, who actually *has* been in outer space. It was quite random: I grabbed a place in the salad line, and there he was. We had a pretty good discussion ranging from why people still choose CentOS to Darwin's theory of sexual selection. It's really awesome knowing that the guy at the helm gets what we're doing.

The afternoon was spent in sessions, and I have some quick takeaways from them:

  • Puppet integration in Ubuntu Server is about to get really damn good. Some things that have been discussed are making client registration automatic and decoupled from hostname, allowing re-provisioning without reconfiguring anything.
  • PPAs for volatile software that releases nightly builds continue to flesh out. This makes upstream bug reports much easier: when they say "try the latest nightly build", you can, in fact, try it, without suddenly shifting from the packaged version to a custom-compiled one, or trying to build your own package (a quick sketch follows this list). I think one challenge for that is going to be making sure that users know that the PPA exists.
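
For instance, hopping onto one of these nightly PPAs is just a few commands (the PPA name here is made up):


sudo add-apt-repository ppa:someproject/daily   # hypothetical nightly-build PPA
sudo apt-get update
sudo apt-get install someproject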

Well, I got some sleep, so now I’m up at 0-dark-thirty and ready to attend some sessions. My favorites for the day are:

https://blueprints.edge.launchpad.net/ubuntu/+spec/server-maverick-monitoring-framework

Monitoring is near and dear to me. Mathiaz has some awesome ideas about how to make it fault tolerant.

https://wiki.ubuntu.com/Specs/ARMServer

A full rack of servers using only 7.5 kW … can't wait to see this presentation.

https://blueprints.edge.launchpad.net/ubuntu/+spec/server-maverick-hadoop-pig

I want to learn more about this and play with it.


May 11, 2010 at 12:11 pm