Worse Than Failure

A "faulty server migration" is being blamed for the extinction-level removal of songs, photos and video from myspace. Tragic as it may seem, I very much doubt that the company has invested the resources required to recover the data, as it clearly does not "add value" for the current management team. What is disappointing about this affair isn't so much that it's the second time that myspace has suffered a catastrophic data loss[1], but that the company has given up on trying to recover the four years of data for people who still actively use the service. An argument could certainly be made to skip or delay data restoration for accounts that have been idle for over a year, and most people would likely agree that this makes sense. But never give up on your die-hard fans.

You Can Call It "Protection"

At the day job, management is making some similar noises about data. We're 16 months into a 30-month project that will see every location around the world move onto a single, cohesive set of cloud-based solutions. Students and instructors will finally have a consistent set of resources to use, regardless of where they happen to be on the globe. The company tried to do this once before with an in-house CMS and failed. This second attempt is being built on a business-oriented cloud platform with a larger group of people and expensive vendors. Failure is not going to be an option. That said, one of the decisions that was made early on is that we're really only going to import the last two years of data from the myriad of databases around the world. Some of our digital systems have student and lesson information going back to the late 90s. Does the company really want to toss all this away?

Some people are terrified of what might happen if two decades of information is intentionally left out of the new system, and rightly so. Businesses cannot always make the best decisions about the future without understanding the past. Importing everything into the new system would be cost-prohibitive[2], which means there needs to be another system set up somewhere that can contain all of the data that was not converted for the new software. But how does one go about putting several dozen databases from different platforms with different schemas into a unified system that can be effectively indexed, searched, and reported from?

This is where Microsoft's Azure Data Lake may make sense, and I've been pushing hard to make it happen.

The day job currently has systems that use SQL Server, MySQL, Oracle, FileMaker, Access, and Excel as a back end[3]. A couple of schools in Europe even managed to build some tools with NoSQL databases. There's no way for all of this to be logically put into a single, unified schema. Instead, a data lake could be used to store unstructured and semi-structured data from all of these systems. This would make it possible for reports to be generated against the larger data set, pulling from all regions or just the specific locations a person wants to query. Data going back to the 90s would reside in this data lake alongside data that was recorded yesterday. More than this, data that will be created in the future could also be put into this data lake through regular synchronization processes, making it possible to have a comprehensive source of reporting data.
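To make the idea a little more concrete, here is a minimal sketch of the "land it raw, filter it later" approach, assuming a plain folder stands in for the lake. It does not use Azure Data Lake or any of its APIs. The function names (`land_records`, `scan`), the `region=…/system=…` folder layout, and the sample rows are all hypothetical, invented purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def land_records(lake_root, region, system, records):
    """Append raw records to a region/system partition as JSON Lines."""
    partition = Path(lake_root) / f"region={region}" / f"system={system}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "records.jsonl", "a", encoding="utf-8") as fh:
        for record in records:
            # Keep the source row untouched; only wrap it with provenance.
            envelope = {"region": region, "system": system, "payload": record}
            fh.write(json.dumps(envelope) + "\n")


def scan(lake_root, regions=None):
    """Yield envelopes from every partition, or only from the named regions."""
    pattern = "region=*/system=*/records.jsonl"
    for path in sorted(Path(lake_root).glob(pattern)):
        # The folder two levels above the file is named "region=<name>".
        region = path.parts[-3].split("=", 1)[1]
        if regions is not None and region not in regions:
            continue
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                yield json.loads(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        # Two hypothetical sources with completely different column names.
        land_records(lake, "europe", "mysql",
                     [{"student_id": 7, "lesson": "A1", "year": 1999}])
        land_records(lake, "asia", "excel",
                     [{"Student": "S-12", "Class": "B2", "Date": "2019-03-01"}])

        everything = list(scan(lake))
        asia_only = list(scan(lake, regions={"asia"}))
        print(len(everything), "records in total")
        print([row["system"] for row in asia_only])
```

Nothing here forces the MySQL rows and the Excel rows into the same shape. Each record keeps its original columns inside `payload`, and the envelope adds just enough context to report across every region or narrow down to one. A real data lake would add indexing, access control, and a proper query engine on top of this.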

But there's more to the data lake idea than just reporting. My ultimate goal for this massively complex collection of data is not just to help the business answer questions, but to ensure the people who rely on our systems don't have to live through a myspace moment. Systems fail. Data gets lost accidentally. Vendors become undesirable. At no time should my employer have a single point of risk when it comes to our student (and instructor) data. Not having a backup strategy in place would be worse than failure.

One of the many services that I offer a lot of my freelance clients is the peace of mind of being their off-site backup keeper. Fortunately this is something I can manage pretty decently as few archives are over 50GB in size[4]. By using a data lake or something similar, I can ensure the day job has viable options should the unthinkable happen.

  1. The first time (that I can remember) myspace lost a bunch of data was in 2013 after a redesign that required everyone to rebuild their communities from scratch. This was one of the many problems that pushed the less-dedicated into Facebook's waiting arms.

  2. This is what I'm told, anyways. If it's true, then I'm quite upset with the senior executives who signed off on the vendor contracts, as they would have known full well what sort of lock-in we were getting into.

  3. Yes, I know that Excel is not a database. I know it should not be used as a "back end" for anything. Yet here we are …

  4. I generally burn backups to a DVD or Blu-ray disc, and my Blu-ray recorder only supports up to 50GB discs. I'd love to get a BDXL burner at some point, though, as fitting 100GB on a disc would free up a lot of media binders and reduce the amount of data I keep on the NAS at any point in time.