So much for my five nines of availability1 in 2019. Today I had a couple of minutes between meetings at the day job, so I decided to connect to the web server hosting 10Cv4 and install some operating system updates. This is something I've done hundreds, if not thousands, of times with various servers over the years. After the installation scripts completed, I saw that I was within the 38-minute "lull period" when traffic to the service is generally at its lowest for a Wednesday, so I issued a sudo shutdown -r now command, telling the server to reboot.
Less than 30 seconds later I was reconnected and checking available storage space when my phone notified me of an issue with 10C. The site was offline. I checked from the notebook and found that the service was indeed unresponsive. The server was running, as I was connected via SSH. Apache was running on the server. The database was also operating well. But no traffic was being received. I checked to ensure that the firewalls were configured correctly and that the IP address of the server hadn't changed2. I cycled the software. I rebooted the machine. I checked error logs, installation logs, and configuration files. Everywhere I looked, the server appeared to be fine.
By this time the service had been down for five minutes and a recovery plan needed to be enacted pronto. There were three viable options:
- Restore the VPS: This would essentially see me wipe the server clean and start with a fresh installation of 10Centuries. A backup would be pulled down and restored, returning the system to its previous state seconds before the reboot that brought the service down. Total recovery time: 90 minutes.
- Transfer 10Cv4 to the backup VM: As one would expect, I have a virtual machine image set up on the same server that is running 10Cv5. The machine could be brought online in less than 30 seconds, with the most recent database restored and ready less than 45 seconds after that. I test this process every morning and it consistently takes between 73 and 75 seconds to complete. Once done, I would need to ensure the routing and forwarding was properly configured on the v5 server, which could interfere with some of the Apache settings that allow v5 to do what it does. Total recovery time: 15 minutes.
- Migrate v4 to v5: With the virtual server in Osaka slated to be decommissioned in two weeks when the annual service package expires, the v4 service would have to be migrated to v5 in the very near future anyway. One could argue that it's better to rip off the band-aid now rather than buy time and delay the process any further. Total recovery time: the rest of the day.
Yes, I went with the third option.
While it may not seem like the wisest decision given the lack of complete documentation, the lack of notice, and the stunning lack of functional code in various parts of the system, forcing the migration to v5 should work out to be a net positive. There will be more incentive to complete the outstanding items, as if there wasn't enough already, and it will be possible to see how well the home network can handle the traffic. If problems crop up right away, then it will still be possible to renew the VPS service with the Osaka data centre3, set up a newer infrastructure, and move everything over as a single package.
This is the plan, anyway. And with everyone on the same version on the same server, there will be a single place to read updates rather than the plurality of timelines that has existed for the last eight months.
To the people who use 10Centuries on a semi-regular basis, I am very sorry for the downtime and hassle that will come from changing DNS records, workflow processes, source code, and preferences. One thing is for certain, though: once the migration is complete (along with a little more documentation and coding), people will prefer what v5 has to offer.
Five nines generally means a service is accessible and usable 99.999% of the year, which means the system must be down for less than 315.6 seconds per year. My servers can generally shut down and reboot in 23 seconds when everything is running properly, allowing for regular maintenance windows for security patches and other items to be installed.
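The downtime budget is simple arithmetic; a quick sketch of the maths (using a 365.25-day year, which is what gives the 315.6-second figure):

```python
# Maximum allowed downtime per year for a given availability target.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds

def downtime_budget(availability: float) -> float:
    """Return the maximum downtime per year, in seconds."""
    return SECONDS_PER_YEAR * (1 - availability)

print(round(downtime_budget(0.99999), 1))  # five nines: 315.6 seconds
print(round(downtime_budget(0.999), 1))    # three nines: 31557.6 seconds
```

At a 23-second reboot per maintenance window, that budget allows roughly a dozen clean reboots a year before five nines is blown.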
This would be weird, given that the 10Cv4 server is running in a data centre in Osaka with an IP that hasn't changed in years.
10Cv4 used a 2G VPS with 50GB of SSD for the web server and a 4G VPS with 100GB SSD for the database server.