Why That Data Sucks

This past week I've invested far more hours than I should have needed to clean up a database in preparation for an upcoming migration. When getting data ready to move from one system to another, there is often a little bit of work that's required to ensure information is not lost and that the most important details are as complete and correct as possible. What I've been doing over the last few days, however, is on a completely different scale. The question that keeps repeating in my head is both simple and absurd: How can a company that deals with long-term, face-to-face interactions operate without ever knowing the names of their paying customers?

A Little Background

At the day job we are migrating our systems from internally-developed solutions to a rather large cloud vendor. This is a bit more complex than spinning up virtual machines and migrating our databases to off-site servers, though, as we're taking our SQL Server, MySQL, PostgreSQL, and ancient FileMaker-based systems and putting them into something that — I think — operates on top of an Oracle database. The work is generally pretty straightforward, though there is a great deal of verification and validation that is necessary to ensure the information we upload is correct.

It's the penultimate step, the verification and validation bit, where I seem to invest the bulk of my time; particularly when it comes to names and contact details.

Every company has its own little quirks about how it uses its databases. Here in Japan, one of the things that has long bugged me has been the way staff at the schools will change a person's name in the database to help with quick identification or search. So, if we have two people named "John Smith", one might be changed to "John Smith (Old)" or "John Smith (Student)". This would be shown in the search results when someone looks for names matching or resembling "John Smith". While this seems like a logical solution to a problem, what this means is that in the database we'll have a last name of "Smith (Old)" and a first name of "John". If there are any reports to print out, the comment in the parentheses is included.

The schools have come up with a whole lexicon of short codes, symbols, and words to help quickly identify customers of all kinds. The first time I ran into this on a large scale was when I started importing customer names from the big CMS into the LMS I developed a few years back. These comments would appear on a teacher's schedule, on attendance lists, and in printed reports that went to the student. This was something I adamantly refused to let happen, so wrote a little function that would strip the codes out of a name and present just the proper name. It's worked well for several years and the state of the data in the Japanese database, while not perfect, is consistent and reliable. There will not be any problem whatsoever ensuring that the names and other details that we upload to the new system will be devoid of these "meta notes".

Knock It Up a Notch

The database I've been working with this week, however, is not from any Japanese system. This means people from a whole different culture and background who have used the same software have created their own form of meta notes over the years … and it's terrifying.

One of the first things that I noticed when working with the database was that a person's entire name was written into the "First Name" column along with some extra details, such as the type of contract they have and maybe even the name of a colleague for when two people are taking Lessons on the same contract1. In the "Last Name" column there will be other details, such as a person's family name … or their full name … or their name in the native language … or the name of the employer plus their name … or the name of the employer, the type of contract, and the full name of the student. And I need to parse this out to have given names in their own column, family names in their own column, and names written in the native language written in a third.

But wait! There's more!

Some of the more interesting uses of the "Last Name" column is a school's habit to write the relationship of a student. These are some of the values that I have found in the database2:

  • Tom's sister's friend
  • The mother is taking the class
  • Afternoons at the cafe
  • John's new wife
  • The president of ABC Company

These are notes, but they're written in the "Last Name" column. In the first name will be the whole name, sometimes in the standard alphabet, sometimes in the native language, sometimes with both, and sometimes with the contract type thrown in as well. When a school is feeling particularly frugal, there might not be a name at all and instead something like "xxx" or just an empty string. As I've already said, I need to provide a proper list with family names, given names, and native language names — when they exist — in separate columns.

Most database people I've met over the years would take one look at this and send it back to the schools, telling them to "fix their crap data", otherwise nothing will be migrated. While I would love nothing more, this is not really an option. Instead I went and created a series of SQL queries that would clean the data as much as possible. After a few days of work, I'm generally confident in 95% of the data. I could go through the last 5% line by line and fix issues by hand when I find them, and I have done this with some of the most egregious issues, but it's not really the best use of my time. There are other countries and other databases that need attention as well.

Why in the … ?

Blaming the schools for their "crap data management policy" would be easy, but I really don't think the fault lies with the schools. People were compelled to do this in order to get around very real problems with the home-grown corporate software. Problems that could have been avoided had people paid attention, asked questions, and sought solutions.

The key problem is one that I've already mentioned: it's too hard to find the right person when searching by name. This is completely true, and every country has their equivalent of James Smith or Maria Garcia3. The solution is not to mess up a person's name field, though.

What are the options, though?

Having solved this problem a couple of different ways in the past, including in the soon-to-be-retired LMS, I see two relatively quick changes that could resolve the issue.

The first is to make it possible to assign tags to a person's record. The tags could be completely free-form and allow a good amount of text so that contract types, descriptions, and relationships could be easily recorded, searched, and displayed in a results list.

The second is to add a short comment field — distinct from the main comment fields — that would also be part of the search and returned for display in the results list. This option would generally require more processing power, but may be easier to implement within a database.

Either one of these options would ensure that printed reports, attendance lists, and other items showing a person's name are free of superfluous information. The company would win because printed materials would look more professional and teachers would win because they wouldn't have to try and parse the meaning behind the meta. Of course, I would win, too. With less "gunk" to filter out and process, I could more easily prepare data for migration from one system to another.

Over the years I've seen a lot of very strange things put into a database. By looking at the reasoning behind why, it becomes easier to think about how data can remain complete and valid while also solving genuine business problems. The hardest part is being vigilant and proactive when oddities in the data are discovered and reported.

  1. This is not at all required. The system can have a million people on the same contract. I have no idea why the people who used the database in question have such a fear of making new records.

  2. These are not the actual names, as that would be a giant breach of trust and would justify an immediate firing. The names have been changed, but the gist is completely accurate.

  3. These two names are among the most common in the world according to this blog post.