27th
Jul '09

Redundancy helps with scheduled maintenance too


Many system architects think of redundant systems only as protection against failure, but redundancy also helps with scheduled maintenance.

A couple of weeks ago, three technical support personnel, our lead developer, and I worked late to make some changes to the core architecture of the SpamStopsHere service. One Saturday night, unbeknownst to any SpamStopsHere customers, we performed a well-planned system upgrade. We made no announcement, because system maintenance is a regular part of managing the service and, thanks to redundant systems, no downtime has ever resulted from it.

The system upgrade involved replacing every single node in the filter clusters with more powerful hardware that can process 10 times as many messages per minute. After several weeks of testing the new configuration, we switched the nodes from mostly dual-core CPUs running 32-bit operating systems with 4GB of memory and SATA drives to dual quad-core CPUs running 64-bit operating systems with 12GB of memory and solid state drives (SSDs).

Replacing the hardware of approximately 100 servers in one night with zero downtime was made painless by triply redundant systems, well-documented installation instructions, well-documented upgrade procedures, a shared Google Docs spreadsheet for coordination, and some pizza.
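For readers who want to picture how redundancy makes a swap like this safe, here is a minimal sketch in Python. It is not our actual procedure. It assumes that "triply redundant" means three interchangeable clusters, and that one cluster at a time can be drained, rebuilt, checked, and put back while the other two carry the load. The names `Cluster`, `rolling_replace`, `is_healthy`, and `min_in_service` are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Cluster:
    """Hypothetical model of one redundant cluster of filter nodes."""
    name: str
    nodes: List[str]
    in_service: bool = True


def rolling_replace(
    clusters: List[Cluster],
    new_nodes: Dict[str, List[str]],
    is_healthy: Callable[[Cluster], bool],
    min_in_service: int = 2,
) -> List[str]:
    """Replace each cluster's nodes in turn and return a log of the steps.

    Assumption: a cluster may only be drained while at least
    `min_in_service` other clusters are still handling traffic.
    """
    log: List[str] = []
    for cluster in clusters:
        others = sum(1 for c in clusters if c is not cluster and c.in_service)
        if others < min_in_service:
            raise RuntimeError(
                f"refusing to drain {cluster.name}: "
                f"only {others} other cluster(s) in service"
            )
        cluster.in_service = False  # drain: stop routing mail to this cluster
        log.append(f"drained {cluster.name}")
        cluster.nodes = list(new_nodes[cluster.name])  # swap in the new hardware
        log.append(f"replaced {cluster.name}")
        if not is_healthy(cluster):
            # Leave the cluster drained so a bad build never receives traffic.
            raise RuntimeError(f"{cluster.name} failed its health check")
        cluster.in_service = True  # restore only after the check passes
        log.append(f"restored {cluster.name}")
    return log


# Example run with three made-up clusters and a trivial health check.
clusters = [
    Cluster("a", ["old-a1", "old-a2"]),
    Cluster("b", ["old-b1", "old-b2"]),
    Cluster("c", ["old-c1", "old-c2"]),
]
replacements = {"a": ["new-a1"], "b": ["new-b1"], "c": ["new-c1"]}
steps = rolling_replace(clusters, replacements, is_healthy=lambda c: len(c.nodes) > 0)
```

The point of the guard at the top of the loop is that the sketch refuses to start a step that would leave too little capacity in service. With only two clusters and the default `min_in_service` of 2, it raises an error instead of draining anything.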

Replacing every single node, the core of the SpamStopsHere system, without a single glitch was both satisfying and enlightening. In the past, we had simply added more nodes to the clusters when more capacity was needed. The nodes ran on differing hardware because they were added gradually, and different hardware was available almost every time we added one. Some hardware performed better than others. This was the first time that we did a full replacement of every single node, rather than just adding more nodes.

Knowing that we have a well-established procedure to build the entire system from scratch in under 12 hours is quite exciting. Should we ever experience a catastrophic failure of all four data centers, we know that we can have the service restored in just a few hours, and back at 100% capacity in under 12 hours. We also have the satisfaction of knowing that we don’t need to rely on any one vendor. Should one of our data center vendors offer a service that we’re not completely satisfied with, we can very quickly move to another data center. Many hosted services are able to keep customers simply because of the imagined difficulty of switching vendors, and may provide shoddy customer service as a result. SpamStopsHere doesn’t need to worry about that. We have proven how quickly and easily we can reorganize our entire architecture when needed.

Replacing aging hardware with brand new systems with brand new life cycle clocks is also very satisfying. SpamStopsHere expects to spend fewer hours doing system maintenance for a while, and those hours can instead be invested in improving our service.

Although I would typically hesitate to retire this much hardware while it’s still usable and not ready for the landfill, all of the hardware that was taken out of service was leased equipment that our data center vendors will recycle and offer to other customers. Although some of it is rather old, most of it will be reformatted and offered at special rates to new customers. Our company’s carbon footprint benefits as well: our power consumption has dropped to 40% of our previous usage.

If you don’t have redundant systems, or haven’t tested your redundancy, we urge you to give it a try. You’ll have many more restful nights after you’ve done a complete test of your redundant systems. The confidence that comes from new hardware rather than aging systems is also satisfying, and I recommend taking advantage of it when the timing is right. Redundancy isn’t just for failures; it’s for doing software upgrades, hardware upgrades, and even complete hardware replacement, without service interruption.
