From Data to Information

For a lot of people it seems there is no difference between data and information. In reality, the two could not be more different. Data is primarily unstructured or semi-structured elements that exist without meaning or context. Information is structured data that has context and can be used to make decisions. One of the first questions I ask when I begin to design systems that will store data is what sort of information a person wants to get out of the tool later. The answer will fundamentally guide the sorts of data that are collected to ensure that the desired information can be returned and, more importantly, be expanded upon as time goes on. This second part is just as crucial as the first.

Earlier today I found myself reverse-engineering a system designed by a vendor in order to solve a rather serious business problem created by the very same system. The problem amounted to an API that was designed to present a great deal of data rather than actionable information. As a result, teachers in the classroom have found it incredibly difficult to deliver their lessons. What struck me as odd about the software is that an API is generally used to present data in a structure, thereby ensuring it is parsed as information. This particular tool, however, appeared to have a consistent structure but wound up being little more than a data dump of keys and values. It was up to the JavaScript that read the data to determine the context and convert the data into information. Unfortunately, the implementation resulted in a website that would crash older tablets or present just a partial subset of information, which put the onus of "filling in the gaps" onto the teacher who had neither the time nor the resources to do anything of the sort.

So, being the corporate fool, I quietly waded into the mess and started reverse-engineering this system in order to extract the data to populate a database of my own design, then structure the data in a logical manner for the business need, then present it to the teacher in a format they can use. Given that this is a system designed to show textbooks, the lack of structure and clarity in the vendor's system has me questioning whether they understood the actual problem the business needed to solve in the first place.

In the space of six hours, I managed to reverse out the entire system and copy the bulk of the data from the vendor's system into my own, then build a preliminary API structure to return to a browser. Tomorrow's task will be to take the information and turn it into a textbook with the same formatting and features, plus a bunch of other details that should have existed from the first day the system went live. Two days of work on my part and a brand new system can replace nearly two years of development from a high-priced vendor. This sort of turnaround on problems is probably why a lot of senior managers at the day job allow me to break rules from time to time.

While I can generally turn around and solve problems like this through sheer force of will [1], how can others avoid making the mistake of leaving data bloated and without form?

It comes down to understanding what a person wants out of the system, that early question I ask before writing the first line of code.

For this example, the goal of the project was to have an API return enough information to dynamically construct a textbook. Leaving the front-end code out of this, what would an effective structure be for a textbook, or a group of textbooks? Let's break down what sort of data makes up the information that is a textbook.

At a minimum we would need:

  • title ⇢ the title of the book
  • chapters ⇢ the sections of the book, allowing for a table of contents to be built
  • pages ⇢ the pages associated with the book, and possibly a chapter object
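As a rough illustration of the three elements above, here is a minimal Python sketch. The class names, field names, and sample values are assumptions made up for this example; they are not taken from the vendor's system or from my replacement.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Page:
    number: int                    # position of the page within the book
    content: str                   # page body; the storage format is an assumption
    chapter: Optional[int] = None  # chapter number this page belongs to, if any


@dataclass
class Chapter:
    number: int
    title: str
    first_page: int                # lets a table of contents point at a page


@dataclass
class Textbook:
    title: str
    chapters: list[Chapter] = field(default_factory=list)
    pages: list[Page] = field(default_factory=list)

    def table_of_contents(self) -> list[tuple[str, int]]:
        """Return (chapter title, first page) pairs in chapter order."""
        ordered = sorted(self.chapters, key=lambda c: c.number)
        return [(c.title, c.first_page) for c in ordered]

    def to_json(self) -> str:
        """Serialise the whole book as a JSON string."""
        return json.dumps(asdict(self))


# Hypothetical sample data, purely for illustration.
book = Textbook(
    title="Example Textbook",
    chapters=[Chapter(2, "Daily Routines", 3), Chapter(1, "Introductions", 1)],
    pages=[Page(1, "Hello!", 1), Page(2, "Nice to meet you.", 1), Page(3, "I wake up at six.", 2)],
)
print(book.table_of_contents())
```

Everything the front end needs to build a table of contents and render pages is present, and nothing else is.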

There is a whole host of metadata that could be included, such as a cover image, authors, publisher, ISBN numbers, MSRP, inventory on hand, search keywords, access permissions, and the like. The sky is really the limit when it comes to metadata, but the receiving software needn't be overloaded with data it never reads. If an API is going to return structured data, most of it should be used. If a complete dataset is only sometimes required, then an API filter should allow an application to request a limited amount of data or the whole shebang. What's nice about going this route is that websites that call the API will not be receiving large amounts of data only to discard or uselessly store it. The less data there is to transfer, the faster everything can operate.
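One way such a filter might look is sketched below in Python. The `fields` parameter, the `*` wildcard, and the default field list are all assumptions for the sake of the example, not part of any real API discussed here.

```python
from typing import Optional

# Fields returned when the caller does not ask for anything specific (assumed).
DEFAULT_FIELDS = ("title", "chapters", "pages")


def select_fields(record: dict, fields: Optional[str] = None) -> dict:
    """Return only the requested top-level keys of `record`.

    `fields` mimics a query-string value such as "title,isbn".
    None or an empty string means the defaults; "*" means everything.
    Names that do not exist in the record are silently ignored.
    """
    if fields is None or not fields.strip():
        wanted = DEFAULT_FIELDS
    elif fields.strip() == "*":
        return dict(record)
    else:
        wanted = tuple(name.strip() for name in fields.split(",") if name.strip())
    return {name: record[name] for name in wanted if name in record}


# Hypothetical record holding both the core data and some optional metadata.
record = {
    "title": "Example Textbook",
    "chapters": [],
    "pages": [],
    "isbn": "000-0-00-000000-0",
    "publisher": "Example Press",
    "keywords": ["example"],
}

print(select_fields(record, "title,isbn"))
```

A caller that only needs a book list can ask for the title and ISBN, while the reader view asks for the defaults, and nobody pays for metadata they never read.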

The original API decided to include everything about a digital textbook, including elements that would never be read by the front-end code. Details relating to the source system with index keys and when the chapter or page was last edited in that tertiary system. Details outlining the amount of storage space remaining on the API server, which is of no value unless the client is regularly uploading. Details that appeared to be just random numbers thrown into an array. Details that included the address and contact information for the publisher of the book … which was attached to every page object, resulting in 477 sets of duplicated publisher information for one common textbook. The entire package was 6.68 MB to download, which took an average of 4.1 seconds.

Not cool.
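To get a feel for what per-page duplication costs, here is a small Python sketch that stores the publisher once at the book level instead of on every page. The key names and the publisher details are invented for the example; only the count of 477 pages comes from the case above, and the byte counts it prints say nothing about the real payload.

```python
import json


def hoist_publisher(book: dict) -> dict:
    """Return a copy of `book` with publisher details stored once at the top level.

    The move only happens when every copy found on the pages is identical;
    otherwise the pages are copied unchanged.
    """
    pages = book.get("pages", [])
    publishers = [page["publisher"] for page in pages if "publisher" in page]
    slim = {key: value for key, value in book.items() if key != "pages"}
    if publishers and all(p == publishers[0] for p in publishers):
        slim["publisher"] = publishers[0]
        slim["pages"] = [
            {key: value for key, value in page.items() if key != "publisher"}
            for page in pages
        ]
    else:
        slim["pages"] = [dict(page) for page in pages]
    return slim


# Hypothetical publisher record, repeated on each of 477 pages.
publisher = {"name": "Example Press", "address": "1 Example Street", "phone": "000-0000"}
bloated = {
    "title": "Example Textbook",
    "pages": [
        {"number": n, "content": "...", "publisher": dict(publisher)}
        for n in range(1, 478)
    ],
}

slim = hoist_publisher(bloated)
before = len(json.dumps(bloated))
after = len(json.dumps(slim))
print(f"{before} bytes before, {after} bytes after")
```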

My solution, which is probably not the best solution, stripped a lot of this information out. I put the title, chapters, and pages into their own objects and ensured the basic metadata was in place to show ISBN numbers and similar details. The entire package now weighs in at 682 KB and can be downloaded in under a quarter second. With some compression on the server, the JSON object can be reduced even further and expanded at the browser. The next step is to replicate the front end with less code and more functionality to aid the teacher in the classroom.
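The compression step can be sketched with Python's standard `gzip` module. In practice a web server usually does this itself by sending `Content-Encoding: gzip`, and the browser expands the response transparently; the helper names and sample payload below are assumptions for illustration only.

```python
import gzip
import json


def compress_json(payload: dict) -> bytes:
    """Serialise `payload` compactly and gzip the resulting UTF-8 bytes."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decompress_json(blob: bytes) -> dict:
    """Reverse compress_json: gunzip the bytes and parse the JSON."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical payload; repetitive page text compresses well.
payload = {
    "title": "Example Textbook",
    "pages": [
        {"number": n, "content": "The quick brown fox jumps over the lazy dog. " * 10}
        for n in range(1, 101)
    ],
}

raw_size = len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))
blob = compress_json(payload)
print(f"{raw_size} bytes raw, {len(blob)} bytes gzipped")
```

How much is saved depends entirely on the content, so the numbers printed here should not be read as a prediction for a real textbook.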

How did this happen?

The people who made the current system are not stupid. I've worked with them on a number of occasions in the past and know the main developers are doing the best they can within the bounds of the client-vendor relationship. One of the problems that I've seen time and again, though, is that people often fail to ask about the ultimate goal of any system. This one started out with a colleague saying "We need a digital textbook system" and then answering a hundred questions around the idea. Looking at the early notes from the project [2], not once did the question of "What does the teacher see?" get asked. Heck, from the meeting notes, that question wasn't asked until 7 months into the project, well after the database and API were designed!

I'll admit that I tend to look at business problems from the point of view of the person who'll be stuck using the things I create rather than the managers authorizing my wage. This often means that I may not create something that leaders ask for and instead provide the solution their people want, which involves quickly turning data into information and getting the heck out of the way. Being an internal resource means I have a lot more flexibility and access to these people than a vendor might, which gives me an unfair advantage. Fortunately it's one that the right people have appreciated a few times in the past.

When it comes time to solve a business problem, one of the very first questions needs to be "what do you want out of the solution?" Everything else is just window dressing.

  1. Sheer force of will … and a quarter-century of experience writing software. I've made every mistake in the book, plus a bunch that have never been documented. It's important to remember past mistakes and their solutions so that future endeavours can be more successful from the start.

  2. Everything is recorded in JIRA … which is both a good and sad thing. Good because documentation is key. Sad because someone had to put all of this stuff into JIRA.