rm -rf /var

August 10th, 2017 by

Within Mythic Beasts we have an internal chat room that uses IRC (this is like Slack but free and securely stores all the history on our servers). Our monitoring system is called Ankou, named after Death’s henchman that watches over the dead, and has an IRC bot that alerts through our chat room.

This story starts with Ankoubot, who was the first to notice something was wrong with the world.

15:25:31

15:25:31 ankoubot managed vds:abcdefg-ssh [NNNN-ssh]: 46.235.N.N => bad banner from `46.235.N.N’: [46.235.N.N – VDShost:vds-hex-f]
15:25:31 ankoubot managed vds:abcdefg-web [NNNN-web]: http://www.abcdefg.co.uk/ => Status 404 (<html> <head><title>404 Not Found</title></head> <body bgcolor=”white”> <center><h1>404 Not Found</h1></center> <hr><center>nginx/1.10.3</center> </body> </html…) [46.235.N.N www.abcdefg.co.uk VDShost:vds-hex-f]

15:31:42

15:31:42pete I can’t get ssh in, I’m on the console.

15:38:16

15:38:16pete This is an extremely broken install. ssh is blocked, none of the bind mounts work

Debugging is difficult because /var/log is missing. systemd appears completely unable to function and we have no functioning logging. Unable to get ssh to start and fighting multiple broken tools due to missing mounts, we restart the server and mail the customer explaining what we’ve discovered so far. This doesn’t help and it hangs attempting to configure NFS mounts.

15:53:53

15:56:35

Boot to recovery media completes ready for restore from backup.

15:58:34

16:05:07

16:08:36

16:08:36 ankoubot managed vds:abcdefg-ssh [NNNN-ssh]: back to normal
16:08:36 ankoubot managed vds:abcdefg-web [NNNN-web]: back to normal

16:14:22

16:31:42

Customer confirms everything is restored and functional and gives permission to anonymously write up the incident for our blog including the following quote.

Mythic Beasts had come highly recommended to me for the level of support provided, and when it came to crunch time they were reacting to the problem before I’d even raised a support ticket.
This is exactly what we were looking for in a managed hosting provider, and I’m really glad we made the choice. Hopefully however, I won’t be causing quiet the same sort of problem for a looooong while.

In total the customer was offline for slightly over 30 minutes, after what can best be described as a catastrophic administrator error.