Server Down

So much for my five nines of availability[1] in 2019. Today I had a couple of minutes between meetings at the day job, so I decided to connect to the web server hosting 10Cv4 and install some operating system updates. This is something that I've done hundreds if not thousands of times with various servers over the years. After the installation scripts completed I saw that I was within the 38-minute "lull period" where traffic to the service is generally at its lowest for a Wednesday and issued a sudo shutdown -r now command, telling the server to reboot.

Less than 30 seconds later I was reconnected and checking available storage space when my phone notified me of an issue with 10C. The site was offline. I checked with the notebook and found that the service was indeed unresponsive. The server was running, as I was connected via SSH. Apache was running on the server. The database was also operating well. But no traffic was being received. I checked to ensure that the firewalls were configured correctly, and that the IP address of the server hadn't changed[2]. I cycled the software. I rebooted the machine. I checked error logs, installation logs, and configuration files. Everywhere I looked, the server appeared to be fine.
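The situation above (SSH works, Apache is running, yet no web traffic arrives) comes down to whether the web ports can be reached from outside the machine. Here is a minimal sketch of that kind of check in Python, using only the standard library. It is an illustration, not the commands that were actually run, and the function names and the port list (22, 80, 443) are assumptions.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_origin(host: str, ports=(22, 80, 443)) -> dict:
    """Map each port to whether it accepted a TCP connection.

    The default ports (SSH, HTTP, HTTPS) are an assumption for illustration.
    """
    return {port: port_is_open(host, port) for port in ports}
```

Running check_origin() against the server's public address from a machine outside its network would show whether SSH answers while the web ports do not, which points at routing or firewalling rather than at Apache itself.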

Cloudflare's Dreaded Error 523

By this time the service had been down for five minutes and a recovery plan needed to be enacted pronto. There were three viable options:

  1. Restore the VPS: This would essentially see me wipe the server clean and start with a fresh installation of 10Centuries. A backup would be pulled down and restored, returning the system to its previous state seconds before the reboot that brought the service down. Total recovery time: 90 minutes.
  2. Transfer 10Cv4 to the backup VM: As one would expect, I have a virtual machine image set up on the same server that is running 10Cv5. The machine could be brought online in less than 30 seconds with the most recent database restored and ready less than 45 seconds after that. I test this process every morning and it consistently takes between 73 and 75 seconds to complete. Once done, I would need to ensure the routing and forwarding were properly configured on the v5 server, which could interfere with some of the Apache settings that allow v5 to do what it does. Total recovery time: 15 minutes.
  3. Migrate v4 to v5: With the virtual server in Osaka slated to be decommissioned in two weeks when the annual service package expires, the v4 service would have to be migrated to v5 in the very near future anyway. One could argue that it's better to rip off the band-aid now rather than buy time and delay the process any further. Total recovery time: the rest of the day.

Yes, I went with the third option.

While it may not seem like the wisest decision given the lack of complete documentation, the lack of notice, and the stunning lack of functional code in various parts of the system, forcing the migration to v5 should work out to be a net positive. There will be more incentive to complete the outstanding items, as if there wasn't enough already, and it will be possible to see how well the home network can handle the traffic. If problems crop up right away, then it will still be possible to renew the VPS service with the Osaka data centre[3], set up a newer infrastructure, and move everything over as a single package.

This is the plan, anyways. And with everyone on the same version on the same server, there will be a singular place to read updates rather than the plurality of timelines that has existed for the last eight months.

To the people who use 10Centuries on a semi-regular basis, I am very sorry for the downtime and hassle that will come from changing DNS records, workflow processes, source code, and preferences. One thing is for certain, though: once the migration is complete (along with a little more documentation and coding), people will prefer what v5 has to offer.

  1. Five nines generally means a service is accessible and usable 99.999% of the year, which means the system can be down for no more than 315.6 seconds per year. My servers can generally shut down and reboot in 23 seconds when everything is running properly, which leaves room for regular maintenance windows to install security patches and other items.

  2. This would be weird, given that the 10Cv4 server is running in a data centre in Osaka with an IP that hasn't changed in years.

  3. 10Cv4 used a 2G VPS with 50GB of SSD for the web server and a 4G VPS with 100GB SSD for the database server.
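The five-nines arithmetic in the first footnote can be checked with a short sketch. It assumes a 365.25-day year, which is what produces the 315.6-second figure. The function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def downtime_budget(availability_percent: float) -> float:
    """Seconds of downtime per year allowed at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def reboots_within_budget(budget_seconds: float, reboot_seconds: float) -> int:
    """Number of whole reboots of the given length that fit in the budget."""
    return int(budget_seconds // reboot_seconds)


budget = downtime_budget(99.999)
print(round(budget, 1))                   # 315.6
print(reboots_within_budget(budget, 23))  # 13
print(round(300 / budget * 100))          # 95
```

In other words, the budget covers about thirteen 23-second reboots a year, and the five minutes of downtime mentioned earlier already used roughly 95% of it.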

Sakura Earns a Customer for Life

I've been a happy customer of Sakura for the better part of two years now and it seems that my loyalty has actually paid off in a really big way … although not directly due to how long I've had an account with them. Earlier this year Sakura started phasing out their lower-tiered VPS services in order to make room for another higher tier in their four-choice brackets. When they did this, they also bumped up everybody's virtual servers to the next level up, as the prices for each tier remained the same. That means that across my 6 virtual servers, I managed to score an extra 4GB RAM and over 700 GB of extra disk space. But that's not the best part.

Zero Work On My Part

The most impressive aspect of this entire upgrade process has been the fact that I haven't had to lift a finger. The company sent out email notifications in March, April, and two weeks before the update to let everyone know exactly what was going to happen and when. They asked that everyone make a backup of their instances in the event something goes terribly wrong and let us know that we shouldn't have to do anything to receive the updates. They would be rolled out automatically at 7:00 AM Japan time on May 31st. All of our virtual machines would be shut down and then brought up again with the new resource limits.

And this is exactly what happened.

My sites were down for a grand total of 4 minutes while everything rebooted, resynchronized, and reconnected. Once done, the servers started communicating with each other like nothing had ever happened, and I was gifted with double the RAM in each and every instance. Beautiful.

The disk space is still the same, though, and I'll need to destroy the instance and build a new one to get the full amount of storage, but this really shouldn't be necessary as I only store logs and the websites themselves on the servers. Images and other media are stored elsewhere. All in all, this was a completely painless upgrade and it couldn't have gone any smoother … which is why I've recommended them to everyone I know who's looking for a reliable web server in Japan. Unfortunately, they don't have any referral promotions[1], but they do have some excellent prices for virtual servers with no extra charge for bandwidth.

Here's a quick breakdown of their virtual server offers:

So, even though I don't really have any incentive to say so, if you're looking for some incredibly reliable VPS services in Japan, you can't go wrong with Sakura. They don't have service in English, but even this can be worked around with a little bit of machine translation in the purchase process if you get lost.

Feel free to ask me any questions about the service, too, as I can probably give you the answer you're looking for. I haven't worked with anyone at Sakura, but I've pushed their servers hard enough over the last few months to know what can and should not be done to them.

Note: This is not a sponsored post. I'm just really, really happy with these guys right now.