Our backup report caught a warning from the backup on our monitoring server:
WARN - [child] mysqldump: Error 2013: Lost connection to MySQL server during query when dumping table `log` at row: 6259042 .... ERROR - mysqldump --all-databases .... exited with 3
We investigated and confirmed that this was a real error: the backup had been truncated. Because we think backups are very important, we looked into it immediately rather than adding it to the end of a very long task list, where it would be ignored in favour of more user-visible changes.
Our initial guess was a mismatch in max_allowed_packet between the server and the dump process, a problem that we’ve seen before. We set max_allowed_packet for mysqldump to the maximum allowed value, reran the backup manually and watched it fail again. Hypothesis disproven, and still no consistent backup.
The system log quickly showed that we were running out of memory. The out-of-memory killer had kicked in and decided to kill mysqld (an unfortunate choice, really). This was what had caused the dump to terminate early.
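For anyone wanting to make the same check, here is a minimal sketch of searching a kernel log for an out-of-memory kill of a named process. The function name oom_killed is ours for illustration, and the pattern assumes the Linux kernel's "Killed process <pid> (<name>)" wording, which can vary between kernel versions.

```shell
#!/bin/sh
# Sketch: succeed if the kernel log on stdin shows the OOM killer
# terminating the named process. oom_killed is a hypothetical helper name.
# Assumes the kernel message contains "Killed process <pid> (<name>)".
oom_killed() {
    process=$1
    grep -E -q "Killed process [0-9]+ \(${process}\)"
}

# Example (not run here): feed it the kernel ring buffer.
# dmesg | oom_killed mysqld && echo "mysqld was killed by the OOM killer"
```

Reading the log from standard input keeps the helper independent of where a given system stores kernel messages.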
Now that we understood the problem, we had several options: configure a MySQL slave and back up from the slave, move to a bigger MySQL server, or exclude the ephemeral data from the backup. We chose to exclude the ephemeral data; our backup is now complete and we’ve tested the restore.
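One way to exclude tables is mysqldump's --ignore-table option, which takes a database-qualified table name and can be repeated. The sketch below is not our actual backup package: the helper name ignore_table_flags and the monitoring.log table name are hypothetical, used only to show the shape of the command.

```shell
#!/bin/sh
# Sketch: turn a list of db.table names into --ignore-table flags,
# one per line. ignore_table_flags is a hypothetical helper name.
ignore_table_flags() {
    for table in "$@"; do
        printf '%s\n' "--ignore-table=${table}"
    done
}

# Example (not run here): skip one ephemeral table in a full dump.
# "monitoring.log" is an assumed database.table name.
# mysqldump --all-databases $(ignore_table_flags monitoring.log) | gzip > dump.sql.gz
```

Keeping the exclusion list in one place makes it easier to review what the backup deliberately leaves out.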
While working on this, our engineer noticed that there was an easy extra check we could make to ensure the integrity of a MySQL dump. When the dump is complete we run the moral equivalent of:
zcat "$dump" | tail -n 1 | grep -q '^-- Dump completed'
to check that there is a success message at the end of the dump file. This is an additional safety check. Previously we relied on mysqldump to tell us if it hit an error; now we require both that mysqldump reports success and that the written file passes an automated completeness test.
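As a rough sketch of how such a check might be wrapped up for reuse (this is not our backup package itself, and the function name dump_is_complete is hypothetical), the version below also verifies the gzip stream before looking for the completion marker. It assumes the dump was written with comments enabled, since mysqldump's --skip-comments option suppresses the "-- Dump completed" line.

```shell
#!/bin/sh
# Sketch: a dump counts as complete only if the gzip stream is intact
# and the final line is mysqldump's completion marker.
# dump_is_complete is a hypothetical helper name.
dump_is_complete() {
    dump=$1
    # Reject files that are not valid gzip or are cut off mid-stream.
    gzip -t "$dump" 2>/dev/null || return 1
    # The last line of a finished dump starts with "-- Dump completed".
    gzip -dc "$dump" | tail -n 1 | grep -q '^-- Dump completed'
}

# Example (not run here): fail the backup job if the check does not pass.
# dump_is_complete /path/to/dump.sql.gz || echo "backup incomplete" >&2
```

The function returns a non-zero status on any failure, so it can be combined with mysqldump's own exit status in the calling script.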
We pushed out our updated backup package with the additional check to all managed customers yesterday. On World Backup Day, we’d like to remind the entire Internet to check that your backups work. If that sounds boring, we’ll check your backups for you.