Sysadmin 4 lyfe


How I built my own validation environment

We’re currently in the process of a massive upgrade of almost 200 Linux servers. While most of these servers have the same purpose, there are probably around 25 core servers that have different functions.

Due to a number of factors such as:

  • The inherent risk of a failed migration and roll-back
  • Lack of interest from the developers in including our local customisations of the platform in their releases
  • A large amount of extra functionality we’d developed without their help

it became clear to me that we needed to migrate our environment carefully, in stages, and only after validating the upgrade steps.

The fact that the new platform uses Puppet as the main source of software and configuration management made things easier in some respects and harder in others. I was able to port almost all of our customisations from custom scripts that ran when a virtual machine was built to Puppet manifests. Things got a little tricky, however, when those manifests needed to define configurations that our Development team (who build this platform for 6 deployments across 6 continents) had already defined. In those cases I had to write patches to their manifests that made these customisations configurable from Puppet’s site.pp file, and feed the patches back so they could be part of the next release.
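The patches described above follow a common layering pattern: the upstream manifest keeps its defaults, and a site-level setting overrides only the values that differ locally. The Python sketch below illustrates that precedence rule in isolation. It is not Puppet code and not the platform's actual mechanism, and the parameter names (`ntp_servers`, `motd_banner`) are hypothetical.

```python
from typing import Any, Dict, Mapping


def effective_config(upstream_defaults: Mapping[str, Any],
                     site_overrides: Mapping[str, Any]) -> Dict[str, Any]:
    """Return upstream defaults with site-level overrides applied.

    Overrides for parameters the upstream code does not know about are
    rejected, so a typo in the site file fails loudly instead of being
    silently ignored.
    """
    unknown = set(site_overrides) - set(upstream_defaults)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    merged = dict(upstream_defaults)  # copy so the defaults are not mutated
    merged.update(site_overrides)     # site values win over upstream ones
    return merged


if __name__ == "__main__":
    # Hypothetical parameter names, purely for illustration.
    upstream = {"ntp_servers": ["pool.ntp.org"], "motd_banner": "Standard build"}
    site = {"motd_banner": "Managed by the local ops team"}
    print(effective_config(upstream, site))
```

Rejecting unknown keys is a deliberate choice: with a plain merge, a misspelled override would quietly leave the upstream default in place.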

Over the past 2-3 months I’ve been writing custom Puppet manifests, testing our customisations and how they interact with the standard platform, validating the upgrade steps, finding bugs in the upgrade process and providing fixes back to our developers.

All of this was possible because we built a virtual “Validation” or “Pre-Production” network, completely isolated from our production hosts and accessible only through a multi-homed Debian box. This meant the hosts had slightly different DNS and HTTP proxy configurations, but Puppet made it very easy to update these once the hosts were promoted to production. We are lucky enough to have a Check Point VSX firewall capable of creating a virtual routing context, so we could give this network the same IP address range as Production (and give all our servers the same IPs they have in Production). This greatly simplified setting up and using the environment: we didn’t have to worry about re-IPing hosts at go-live, and we could use a DNS export from Production in our Pre-Production network.
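Since the two networks differed in only a few settings, such as DNS and HTTP proxy configuration, one way to keep promotion simple is to hold those values in a single lookup keyed by environment, so promoting a host means changing one key. A minimal Python sketch of that idea, assuming hypothetical environment names (`preprod`, `production`) and placeholder addresses from the RFC 5737 documentation range; it illustrates the approach and is not the actual Puppet code used here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NetworkSettings:
    dns_servers: Tuple[str, ...]
    http_proxy: str


# Hypothetical values using RFC 5737 documentation addresses. Both
# environments share one address range, as in the setup described above.
SETTINGS = {
    "preprod": NetworkSettings(("192.0.2.53",), "http://192.0.2.80:3128"),
    "production": NetworkSettings(
        ("192.0.2.10", "192.0.2.11"), "http://192.0.2.20:3128"
    ),
}


def settings_for(environment: str) -> NetworkSettings:
    """Look up the DNS and proxy settings for one environment."""
    try:
        return SETTINGS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


def resolv_conf(environment: str) -> str:
    """Render resolv.conf-style 'nameserver' lines for an environment."""
    servers = settings_for(environment).dns_servers
    return "".join(f"nameserver {ip}\n" for ip in servers)


if __name__ == "__main__":
    # Promoting a host is then a matter of changing one key.
    print(resolv_conf("preprod"), end="")
    print(resolv_conf("production"), end="")
```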

Once this Pre-Production network was built, and before I really started on the nitty-gritty details of Puppet, I cloned the 9 VMs for our future “Test” environment, which will be used to validate future releases.

This process has proved invaluable so far. While our colleagues on other continents are screaming and pulling their hair out over upgrade issues that have forced a roll-back, we’ve had only two major bugs (both causing the same symptom, I love those), which impacted the platform only intermittently and have since been fixed.

In my next post I will go into detail about how we’re using Puppet and SVN to provide a structured version and release management process.