Data center redundancy
Recently, a large, well-known dedicated server provider in Texas had an explosion in the power control room of one of its data centers, causing the data center to lose power. SpamStopsHere had four servers in the affected data center. This wasn’t a big deal for us, as we often have more than one server down just for maintenance.
We here at SpamStopsHere feel for all of the organizations that had all of their servers located in the affected data center. We luckily have data center redundancy. Even if all of the servers that we have hosted at a single provider had been lost, we would have simply been operating near capacity if it had happened during peak hours. We could have quickly established some new servers at alternate data centers to replace the downed servers if needed.
However, the provider had a separate problem that did force us to operate at almost 100% capacity for one afternoon. Our 15 servers at the provider’s other data centers, which still had power, lost their DNS PTR records. A PTR record is a DNS record that resolves an IP address to a hostname. Many recipient e-mail servers would have rejected our e-mail had we attempted to deliver it from these servers, so we decided to take the 15 servers offline when this occurred on the night of Wednesday, June 4, 2008. For more information on this, please see my previous blog article.
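As a rough illustration of the PTR check described above, here is a minimal Python sketch that reports which sending servers have no reverse DNS entry. The function names and the example addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical and are not our actual tooling. It relies on the system resolver through `socket.gethostbyaddr`, so a local hosts file entry could mask a missing PTR record.

```python
import ipaddress
import socket


def reverse_lookup_name(ip):
    """Return the DNS name that is queried for a PTR lookup of `ip`."""
    return ipaddress.ip_address(ip).reverse_pointer


def ptr_hostname(ip, resolver=socket.gethostbyaddr):
    """Return the hostname the reverse lookup yields, or None if it fails.

    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname


def servers_missing_ptr(ips, resolver=socket.gethostbyaddr):
    """Return the addresses in `ips` that have no usable reverse DNS entry."""
    return [ip for ip in ips if ptr_hostname(ip, resolver) is None]


if __name__ == "__main__":
    # Hypothetical addresses; replace with the real sending IPs.
    for ip in servers_missing_ptr(["192.0.2.10", "192.0.2.11"]):
        print(f"{ip} has no PTR record (queried {reverse_lookup_name(ip)})")
```

Running a check like this on a schedule would flag servers that should be pulled from outbound delivery before recipient servers start rejecting their mail.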
Before implementing a contingency plan, such as replacing 15 servers, one must first determine how much time and money the contingency will take, then weigh that against the estimated time to resolve the original problem. Only then can one decide whether to shift effort to the contingency plan and scrap the original plan. Unfortunately, we had been consistently given wildly inaccurate estimates, which skewed our decision on whether to implement contingencies. After 36 hours and many false assurances that the PTR records problem would be resolved shortly, we decided to implement our contingency plans so as not to have another day of peak usage that brought us near capacity.
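The weighing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `slip_factor` parameter, and every number in the usage lines are hypothetical, and the sketch compares time only, leaving the monetary cost of the contingency out.

```python
def should_switch_to_contingency(contingency_hours, estimated_fix_hours, slip_factor=1.0):
    """Return True when the contingency should finish before the original fix.

    `slip_factor` (an assumption of this sketch) inflates the quoted estimate
    to reflect how far past estimates have overrun; 1.0 trusts it fully.
    """
    if contingency_hours < 0 or estimated_fix_hours < 0:
        raise ValueError("durations must be non-negative")
    if slip_factor < 1.0:
        raise ValueError("slip_factor must be at least 1.0")
    adjusted_fix_hours = estimated_fix_hours * slip_factor
    return contingency_hours < adjusted_fix_hours


# Hypothetical numbers: the provider quotes 2 hours, the contingency takes 8.
print(should_switch_to_contingency(8, 2))                 # estimate taken at face value
print(should_switch_to_contingency(8, 2, slip_factor=6))  # estimate heavily discounted
```

The point of the slip factor is that the decision is only as good as the estimate fed into it. When a third party controls both the fix and the estimate, discounting the estimate heavily pushes the answer toward starting the contingency early.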
I think that if I’ve learned one thing from this, it is that contingency plans should be implemented immediately when the estimated time of resolution is provided by, and in the hands of, a third party. I think the entire industry has likely learned the same thing. It’s likely that many organizations have also learned to weigh the costs of data center redundancy against the costs of downtime. It’s unfortunate that many organizations simply did not consider a disaster plan that involved the loss of a data center; they would have gladly paid the cost had they considered the possibility.