Last week 10Centuries suffered a total of 6h23m of downtime after the database server found itself filled to capacity with error logs. To make matters worse, the vast majority of this downtime was the result of me being asleep at the time of failure, and not having my phone configured to bypass Do Not Disturb for messages from a specific sender1. Without the incessant buzzing, there was no way for me to know about the problem, which shook a number of people's confidence in the 10Centuries project. It's this loss of face that bugs me more than the server downtime. Systems fail, but they can be brought back to a previous state incredibly easily. Confidence, like trust, takes a great deal of time to earn and only a split second to lose.
Ultimately the problem was completely my fault. The servers were not being hacked. There was no DDoS or other malicious activity going on. None of the 50-or-so people who were using the service that day did anything to contribute to the problem, either. It was instead created by a confluence of poor decisions:
- I installed server updates without first testing them in the development environment against heavier than average loads
- I failed to configure systemd properly on the database server last year when it was set up
- I failed to proactively respond to server error messages that had been popping up three or four times a day for two weeks, warning me of a configuration problem that affected MySQL
Had I properly managed just one of those three issues, then last week's downtime likely wouldn't have occurred at all.
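For anyone running into the same failure mode, a log-rotation policy alone would have kept the error logs from filling the disk. This is a minimal sketch rather than my exact configuration, and it assumes MySQL writes its error log to /var/log/mysql/error.log:

```
# /etc/logrotate.d/mysql-error — illustrative sketch, not the actual config.
# Assumes the error log lives at /var/log/mysql/error.log.
/var/log/mysql/error.log {
    daily
    rotate 7          # keep one week of history
    maxsize 500M      # rotate early if the file grows too quickly
    compress
    missingok
    notifempty
    create 640 mysql adm
    postrotate
        # ask mysqld to reopen its log file after rotation
        mysqladmin flush-logs 2>/dev/null || true
    endscript
}
```

The `maxsize` line is the important one here: a sudden burst of repeated errors rotates the file out early instead of being allowed to consume the whole volume.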
This episode of Doubtfully Daily Matigo goes into a bit more detail about how the problem snowballed after just a tiny rush of traffic2 hit the service, and what I've done to ensure the same problem does not happen again anytime soon.
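One simple safeguard worth mentioning (a sketch of the general idea, not necessarily what 10Centuries now runs) is a cron-driven disk check that raises an alert before the volume fills, rather than after. The threshold, monitored path, and notification mechanism below are all placeholders:

```shell
#!/bin/sh
# Illustrative disk-usage check; the threshold and alert mechanism
# are placeholders, not the actual 10Centuries setup.

check_usage() {
    # $1 = percent of the filesystem in use, $2 = alert threshold
    if [ "$1" -ge "$2" ]; then
        echo "WARNING: disk ${1}% full (threshold ${2}%)"
    fi
}

# In practice this would run from cron, along the lines of:
#   usage=$(df --output=pcent /var/lib/mysql | tail -1 | tr -dc '0-9')
#   check_usage "$usage" 80 | mail -s "disk alert" admin@example.com
check_usage 90 80   # → WARNING: disk 90% full (threshold 80%)
```

Paired with an alerting channel that bypasses Do Not Disturb, a check like this turns a full-disk outage into a routine heads-up.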