Saturday outage report

January 27th, 2014 by pdw

Edit: we’ve now received a report from Telecity and have updated this report to take it into account.

Further edit: explanation for extended outage in one rack added.

Summary

A power interruption occurred at around 8:09am on Saturday 25th January, affecting multiple floors in Sovereign House.
For the most part, the interruption was momentary (around 500ms), but long enough to cause a reboot of affected equipment.
One of our racks was without power until 10:38am, due to a tripped circuit breaker.
Our staff were onsite at 11:15am and then worked to restore services that had not come back up cleanly. One of these was our SOV DHCP server, whose outage will have affected any virtual servers configured to boot via DHCP.

Details

The power outage was caused by an interruption to the external mains power supply, followed by a failure of the DRUPS (Diesel Rotary Uninterruptible Power Supply) system that is supposed to ensure that power to the data centre is maintained during such a power cut.

The DRUPS system contains three separate units with sufficient capacity to cope with the failure of any one unit. Unfortunately, in this event, the unit that failed did so in a manner that triggered a shutdown of the other two. From the Telecity report:

… one of the units on DRUPS System 1 experienced a fault on its synchronisation card. This fault caused the unit to go into overload which, in turn, had a direct impact on the remaining two units. During the overload condition, the faulty unit back-fed the other two units which, for protection and per design, automatically shut down.

At this point the system went into raw mains bypass mode (i.e. bypassing the UPS systems, and connecting the data centre load directly to the mains). This occurred around 2 minutes after the original mains supply failure, by which point the mains supply had been restored, but there was a 500ms interruption as the bypass occurred.
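The redundancy reasoning above can be sketched in a few lines. This is only an illustration: the unit capacities and the load figure are hypothetical (the Telecity report gives no numbers), and the function names are our own. It shows why three units sized to survive any single failure still offer no protection when one fault shuts down the other two.

```python
def surviving_capacity(unit_capacities, failed_units):
    """Total capacity of the units that are still running."""
    return sum(
        capacity
        for index, capacity in enumerate(unit_capacities)
        if index not in failed_units
    )


def tolerates_any_single_failure(unit_capacities, load):
    """True if the load is still covered whichever single unit fails."""
    return all(
        surviving_capacity(unit_capacities, {index}) >= load
        for index in range(len(unit_capacities))
    )


# Hypothetical figures, chosen only to illustrate the design.
units_kw = [500.0, 500.0, 500.0]
load_kw = 900.0

# Independent failure of any one unit: the remaining two carry the load.
independent_ok = tolerates_any_single_failure(units_kw, load_kw)

# Cascading failure as described in the report: the faulty unit
# back-feeds the other two, which shut down to protect themselves.
cascade_ok = surviving_capacity(units_kw, {0, 1, 2}) >= load_kw

print(independent_ok, cascade_ok)
```

The point of the sketch is that spare capacity only guards against independent failures; it does nothing for a fault that propagates between units.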

This much is consistent with our observations: in all but one rack, the logs on our remote PDUs did not record an outage, even though the vast majority of the equipment attached to them did. The management interfaces in these PDUs draw very little electricity and are known to survive very short interruptions to the power supply.
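One way to picture this observation: a device stays up only if it can run on its own stored energy for at least as long as the interruption lasts. In the sketch below the 500ms figure is taken from this report, but the hold-up times are purely hypothetical values picked for illustration, not measurements of our equipment.

```python
INTERRUPTION_MS = 500  # approximate interruption length, from the report


def rides_through(hold_up_ms, interruption_ms=INTERRUPTION_MS):
    """True if a device can run on stored energy for the whole interruption."""
    return hold_up_ms >= interruption_ms


# Hypothetical hold-up times in milliseconds (assumed, not measured).
hold_up_times_ms = {
    "server power supply": 20,
    "PDU management interface": 1000,
}

outcome = {
    device: "stays up" if rides_through(hold_up) else "reboots"
    for device, hold_up in hold_up_times_ms.items()
}

for device, result in outcome.items():
    print(f"{device}: {result}")
```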

As noted above, one of our racks experienced a more extended outage. This was due to the circuit breaker on the power bar being tripped. This was noticed and rectified by data centre staff inspecting racks following the initial outage.
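For reference, the durations implied by the times in the summary can be worked out as follows. The times are approximate (the interruption occurred at "around" 8:09am), so the results are indicative only; the variable names are ours.

```python
from datetime import datetime

# Times taken from the summary, on Saturday 25th January 2014.
power_lost = datetime(2014, 1, 25, 8, 9)
rack_restored = datetime(2014, 1, 25, 10, 38)
staff_onsite = datetime(2014, 1, 25, 11, 15)

# How long the one rack with the tripped breaker was without power.
rack_outage_minutes = int((rack_restored - power_lost).total_seconds() // 60)

# How long after the initial interruption our staff were onsite.
staff_arrival_minutes = int((staff_onsite - power_lost).total_seconds() // 60)

print(f"Rack without power for about {rack_outage_minutes} minutes")
print(f"Staff onsite about {staff_arrival_minutes} minutes after the interruption")
```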

At this point, the faulty DRUPS unit is out of service, meaning that whilst the power supply is protected, there is no redundancy until the unit is repaired and tested.

Conclusion

Whilst we are certainly unhappy about the outage, at this point we have no cause to question our choice of data centre provider. Sovereign House is a major UK internet hub and a purpose-built six-floor data centre, built to the highest industry standards. With the best will in the world, there will always be faults that can take an entire DC, or significant parts of it, offline, and for this reason we always recommend that mission-critical applications are served from multiple sites. Independent routing ensured that our facilities at other sites were unaffected by the Sovereign House outage.

That said, the aftermath of the outage has revealed some areas in which we can improve. In particular, the extended outage of one rack had a knock-on effect on the connectivity of others. Following Sunday’s scheduled maintenance work, we’re now in a position to improve our network topology to make it more resilient. We are also planning improvements to our Virtual Server hosts and database servers to ensure that they can recover more quickly following such an outage, and we have already made changes to our support systems to make them more resilient.

Beyond directly fixing the affected units, Telecity are also planning improvements to their communications during such an incident. This will help us direct our efforts more effectively.

Notes

For the avoidance of doubt, this interruption was completely unrelated to the network upgrades scheduled for Sunday evening, which went ahead as planned.

Finally, thank you to all customers who monitored our status page during the outage.
