Router fails, no packets dropped.

January 29th, 2014 by

This morning one of our routers in our Cambridge data centre stopped reporting bandwidth data to our billing system. We investigated and whilst it was still routing packets without issue, it appeared to be experiencing hardware failure.

We’ve powered the router down, pending full investigation on our data centre visit this afternoon. Currently all traffic from our Cambridge site is being handled by our other router. This seamlessly failed over with no customer impact.

Depending on your choice of terminology, ‘redundancy has been reduced to N’ or ‘the network is at risk’. At Mythic Beasts we like to speak English, so this translates to: if something else fails before the router is restored to service, there is a risk of a network outage at our Cambridge data centre.

Update: on Friday 31st we fully restored our network to its usual redundant configuration by replacing the router with a similarly over-specified replacement. Customers may have received free bandwidth for some of this period.

Saturday outage report

January 27th, 2014 by

Edit: we’ve now received a report from Telecity, so have updated this report to take account of this.

Further edit: explanation for extended outage in one rack added.

Summary

  • A power interruption occurred at around 8:09am on Saturday 25th January, affecting multiple floors in Sovereign House.
  • For the most part, the interruption was momentary (around 500ms), but long enough to cause a reboot of affected equipment.
  • One of our racks was without power until 10:38am, due to a tripped circuit breaker.
  • Our staff were onsite at 11:15am, and then worked to restore services that had not come back up cleanly. One such server was our SOV DHCP server, whose outage will have affected any virtual servers configured to boot via DHCP.

Details

The power outage was caused by an interruption to the external mains power supply, followed by a failure of the DRUPS (Diesel Rotary Uninterruptible Power Supply) system that is supposed to ensure that power to the data centre is maintained during such a power cut.

The DRUPS system contains three separate units with sufficient capacity to cope with the failure of any one unit. Unfortunately, in this event, the unit that failed did so in a manner that triggered a shutdown of the other two. From the Telecity report:

… one of the units on DRUPS System 1 experienced a fault on its synchronisation card. This fault caused the unit to go into overload which, in turn, had a direct impact on the remaining two units. During the overload condition, the faulty unit back-fed the other two units which, for protection and per design, automatically shut down.

At this point the system went into raw mains bypass mode (i.e. bypassing the UPS systems, and connecting the data centre load directly to the mains). This occurred around 2 minutes after the original mains supply failure, by which point the mains supply had been restored, but there was a 500ms interruption as the bypass occurred.

This much is consistent with our observations: in all but one rack, the logs on our remote PDUs did not record an outage, but the vast majority of equipment attached to them did. The management interfaces in these PDUs draw very little electricity and are known to be able to survive very short power supply interruptions.

As noted above, one of our racks experienced a more extended outage. This was due to the circuit breaker on the power bar being tripped. This was noticed and rectified by data centre staff inspecting racks following the initial outage.

At this point, the faulty DRUPS unit is out of service, meaning that whilst the power supply is protected, there is no redundancy until the unit is repaired and tested.

Conclusion

Whilst we are certainly unhappy about the outage, at this point we have no cause to question our choice of data centre provider. Sovereign House is a major UK internet hub, and is a purpose-built 6 floor data centre, built to the highest industry standards. With the best will in the world, there will always be faults that can take an entire DC, or significant parts of it, off-line, and for this reason, we would always recommend that mission-critical applications are served from multiple sites. Independent routing ensured that our facilities at other sites were unaffected by the Sovereign House outage.

That said, the aftermath of the outage has revealed some areas in which we can improve. In particular, the extended outage of one rack had a knock-on effect on the connectivity of others. Following Sunday’s scheduled maintenance work, we’re now in a position to improve our network topology to make it more resilient. We are also planning improvements to our Virtual Server hosts and database servers to ensure that they can recover more quickly following such an outage, and we have already made changes to our support systems to make them more resilient.

Beyond directly fixing the affected units, Telecity are also planning improvements to their communications during such an incident. This will help us direct our efforts more effectively.

Notes

For the avoidance of doubt, this interruption was completely unrelated to the network upgrades scheduled for Sunday evening, which went ahead as planned.

Finally, thank you to all customers who monitored our status page during the outage.

More bits

January 10th, 2014 by

At the end of last year we took the decision to significantly upgrade our two connections to LINX – our busiest connections to the outside world.

This turned out to be a good plan as Mythic Beasts got a Christmas present in the form of a new company bandwidth record, thanks to two customers, Blinkbox Music and Raspberry Pi, getting a substantial spike in hits as people unwrapped their Christmas presents.

And it seems that the excitement of all the presents hasn’t worn off, as the Christmas Day record was toppled by a new all-time high yesterday. With the Blinkbox apps very high in the free music app charts, we’re not expecting it to stand for long.

Raspi.tv

January 9th, 2014 by

Here’s an unsolicited customer review of a migration of a dedicated server to one of our managed virtual machines from Alex at raspi.tv, who’s building a 9-inch HDMI 1080p screen.

New Year, New Server At mythic Beasts

You can find the original twitter conversation at @Mythic_Beasts.

Coping with Christmas

January 7th, 2014 by

Our latest blog post is on the Raspberry Pi website. Coping with Christmas

LINX now running at 2x10Gbps

November 29th, 2013 by

Today we’ve upgraded both of our connections to the London Internet Exchange (LINX) from 1Gbps to 10Gbps.

Over the past few weeks we’ve repeatedly broken the company bandwidth record. And since we’ve recently secured more peering agreements — including every major UK connectivity provider — a greater proportion of our traffic is now going out over LINX. So at peak times our bandwidth usage has been enough that in the unlikely event of a failure of one of the LINX LANs, we would have come close to running out of capacity on our other link. Clearly an upgrade was in order!
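The capacity reasoning above boils down to one comparison: the combined peak traffic of both links has to fit on whichever link survives. A minimal sketch of that check follows; the function name and the traffic figures are illustrative assumptions, not measured values.

```python
def survives_single_link_failure(peak_gbps_per_link, link_capacity_gbps):
    """Return True if the combined peak of all links fits on one surviving link."""
    return sum(peak_gbps_per_link) <= link_capacity_gbps


# Illustrative figures only, not measured traffic.
peaks = [0.45, 0.5]

print(survives_single_link_failure(peaks, 1.0))   # headroom on a 1Gbps link
print(survives_single_link_failure(peaks, 10.0))  # headroom on a 10Gbps link
```

With figures like these the 1Gbps case still passes, but only just, which is the "come close to running out" situation described above.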

Our network engineers performed the upgrade this morning, with no disruption as traffic was automatically and transparently rerouted during the brief downtime. After the upgrade, we have 10Gbps from our data centre in Telecity Sovereign House to LINX Juniper, and 10Gbps from our Harbour Exchange data centre to LINX Extreme.

In the event of the failure of either link or router, traffic will automatically reroute around our internal fibre ring to our other site and out to the peering exchange via our other connection. And, for the time being, we have plenty of capacity to spare.

Sender Verify vs Hotmail

November 26th, 2013 by

We aim to give our users the choice of a range of anti-spam measures. One of the options we provide is sender verify, a simple check whereby before you accept a mail, you check that the sender of that email exists, and would accept mail from you. You can argue about how effective this is as an anti-spam measure, but it seems a perfectly reasonable check to want to make, in the same way that many people choose to not answer their phone to those who withhold caller ID.

Unfortunately, some people object to you asking the question.

We recently had some complaints from users who said that they couldn’t receive mail from people with addresses hosted on Microsoft’s Hotmail servers, and sure enough, Hotmail have blacklisted one of our servers’ IPs for daring to enquire about whether particular sender addresses were valid. This affects not just hotmail.com, but various other Microsoft domains.

Sadly, Microsoft aren’t going to change their policy for us, so we needed to whitelist them. This isn’t entirely trivial as what matters is where the sender’s email address is hosted, which means looking up the MX records for that domain. Fortunately, Exim makes this easy enough, provided that you’re not offended by curly brackets. Adding the following condition to a sender verify ACL will disable the check for Hotmail hosted domains:

!condition = ${if forany{${lookup dnsdb{>: mxh=$sender_address_domain}{$value}fail}}{match {$item}{\Nmx.\.hotmail\.com\N}}}

I should note that for quite some time, we’ve used a dedicated IP address for performing our sender verify checks in order to minimise the impact of exactly this type of blacklisting. If we hadn’t done this, the blacklist would have made it impossible for any users to send mail to Hotmail-hosted addresses too. As it was, the problem only affected users who had elected to use sender verify on their domains.
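For anyone who is offended by curly brackets, the logic of the Exim condition above can be sketched in Python. This is an illustration of the matching step only, not something we run: the function name is made up, and the DNS lookup is left out, so the function takes a list of MX hostnames that has already been looked up.

```python
import re

# Same pattern as the Exim condition: "mx", any one character,
# then ".hotmail.com", matched anywhere in the MX hostname.
HOTMAIL_MX = re.compile(r"mx.\.hotmail\.com")


def skip_sender_verify(mx_hosts):
    """Return True if any MX host for the sender's domain looks Hotmail-hosted.

    mx_hosts is the list of MX hostnames for the sender's domain; the DNS
    lookup that produces it is deliberately omitted from this sketch.
    """
    return any(HOTMAIL_MX.search(host) for host in mx_hosts)


# Hypothetical MX lists, for illustration only.
print(skip_sender_verify(["mx1.hotmail.com", "mx2.hotmail.com"]))  # True
print(skip_sender_verify(["mail.example.org"]))                    # False
```

As in the ACL, the test is on where the sender's domain is hosted rather than on the domain name itself, so other domains whose MX records point at Hotmail are covered too.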

IPv6 Reverse DNS

November 20th, 2013 by

You can now configure reverse DNS for IPv6 through our customer control panel. If you’ve previously been handling reverse DNS for your allocation through delegation and would prefer to use the control panel, then please get in touch.
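If you haven’t dealt with IPv6 reverse DNS before, the PTR record for an address lives under ip6.arpa, with the address written out one hex digit at a time in reverse order. A small sketch using Python’s standard ipaddress module shows the name for a given address. The helper name is our own invention, and the address comes from the 2001:db8::/32 documentation range rather than a real allocation.

```python
import ipaddress


def ptr_name(address):
    """Return the ip6.arpa name under which the PTR record for address lives."""
    return ipaddress.IPv6Address(address).reverse_pointer


# 2001:db8::/32 is reserved for documentation; this is not a real allocation.
print(ptr_name("2001:db8::1"))
```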

If you’ve got a server with us and are interested in trying IPv6 and don’t already have an allocation then please email support and we’ll be happy to provide you with a block of addresses.

Tricky debugging

November 12th, 2013 by

After cloning a server for a customer we noticed that something was a little bit odd:

# md5sum /etc/sudoers

worked fine but:

# sudo -l

responded with:

sudo: unable to stat /etc/sudoers: Permission denied

How odd, we thought. Odder still was:

# su - username
Cannot execute /bin/bash: Permission denied

A bit of time with Google and strace revealed that we’d managed to set the permissions on / wrongly:

drwx------  27 root root  4096 Jun  4 11:48 ..

rather than:

drwxr-xr-x  27 root root  4096 Jun  4 11:48 ..

What amazed us was not that the machine didn’t work properly but that we could log in at all.
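The underlying rule is that any process not running as root needs search (execute) permission on every directory along a path, starting with /. A rough sketch of a check for this is below. It is a simplification: it only looks at the "other" permission bit and ignores owner, group and ACLs. The function name is made up for illustration, and it is meant to be run as root, like the commands above.

```python
import os
import stat


def blocked_dirs(path):
    """List the ancestor directories of path that lack the o+x (search) bit.

    Simplification: only the "other" permission bit is checked, so this
    answers "could an unrelated, unprivileged user traverse this path?"
    and ignores owner, group and ACLs.
    """
    blocked = []
    current = os.path.dirname(os.path.abspath(path))
    while True:
        if not os.stat(current).st_mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return blocked


print(blocked_dirs("/etc/sudoers"))
```

On a healthy system we would expect an empty list here. With / set to drwx------ as above, the list would include "/".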

If this is the sort of problem you’d be able to fix, you should look at our jobs page. If you’d like someone else to fix it for you then our Managed hosting is probably of a lot more interest.

Migrating the Science Media Centre

November 12th, 2013 by

Over the past week or so we’ve given the Science Media Centre a hand in moving their WordPress site into a virtual machine hosted by Mythic Beasts. They’re a charity who work with journalists, scientists and engineers to try to improve the quality of science reporting and remove the misleading rubbish that otherwise gets written. Mythic Beasts is a company founded by science graduates who are very easily angered by terrible science articles in the papers. We’re hoping the saving on destroyed laptops and monitors will easily cover all the management and consultancy services we’ve donated.

If we have fewer idiotic articles proving that Coffee cures cancer* and Coffee causes cancer*, and rather more articles about how our talented university friends are pioneering new cancer treatments, we’ll consider the time and effort we’ve put into helping them well spent.


* Actual links removed in the name of good taste. Here’s something more interesting to read, and if you’re still curious, you can look up coffee in the index.