Why Developers Are Often Frustrated

Today was supposed to be an "easy" day at work. On the agenda were four items:

  • respond to {person}'s email regarding progress report scoring matrix
  • export system usage data from LMS for {person}
  • load CSV[1] data from vendor into migration SQL Server instance
  • transform data from CSV into rational information

The first two items would require less than half an hour, the third was expected to take 15 minutes, and the last one would consume the remaining time in the day, as that work can only be semi-automated: a person still needs to sanity-check the output and make revisions where necessary.

Unfortunately, that third item refused to cooperate. No matter which tool I used to import the data, the contents could not be read consistently without illogical errors being thrown. I could open the CSVs in Notepad on the server just fine, but nothing else could read the data, including SQL Server Management Studio!

This was going to require some pre-processing.

Pulling the data onto my development machine, I took a look at the file and discovered, much to my horror, that the problem boiled down to the encoding of the file. The vast majority of data I work with is encoded as UTF-8, as a great deal of my day involves using non-Latin character sets. The files I needed to inject into a database were encoded as EUCJP-WIN, which is just about the worst possible format for storing anything of value. Before I could continue, all of the CSV files would need to be re-encoded, sent back to the server, and then imported.
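
A minimal sketch of that re-encoding step, assuming Python is used for the conversion; the folder names are placeholders, and "euc_jp" is Python's nearest built-in codec to the EUCJP-WIN variant:

    # Hypothetical sketch: batch re-encode vendor CSVs from EUC-JP to UTF-8.
    # Folder names are placeholders; "euc_jp" is assumed as the closest codec to EUCJP-WIN.
    from pathlib import Path

    SOURCE_ENCODING = "euc_jp"
    TARGET_ENCODING = "utf-8"

    Path("converted").mkdir(exist_ok=True)
    for source in Path("vendor_csvs").glob("*.csv"):
        text = source.read_text(encoding=SOURCE_ENCODING)
        (Path("converted") / source.name).write_text(text, encoding=TARGET_ENCODING)
        print(f"Re-encoded {source.name} ({len(text):,} characters)")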

Annoying, but doable. I had my machine convert the 9 files to UTF-8 and sent the data[2] back to the SQL Server for insertion. The first file went in just fine. The remaining eight, however, refused. I looked at the files in Notepad on the server and saw that some of the column names contained upper-ASCII characters. Annoying, but workable. I changed the column names, saved the file, and tried the import again.

No dice.

This time there were some rows that had more columns than should have existed, meaning that a value somewhere contained commas and the field was not properly escaped. Given that these files contained hundreds of thousands of rows each, it did not make sense to use Notepad to try to deal with the problem. Instead, I would attempt to solve this programmatically. The sun was going to set soon and Nozomi needed her afternoon walk. What should have been a 15-minute task was now into its fourth hour.

I hammered out some quick code to read through the file line by line, comparing the number of columns to the definition on Row 0, and found that all 8 files had several hundred records with unescaped commas. While I could resolve this by hand, it would take the rest of the day and would be more an exercise in patience training than anything else. The solution I opted for was "more code".
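
The check itself is simple enough that a rough sketch fits in a dozen lines; the file name below is a placeholder, and a plain comma split is assumed on purpose, since that is exactly how unescaped commas betray themselves as extra columns:

    # Hypothetical sketch of the line-by-line column count check described above.
    def find_malformed_rows(path, encoding="utf-8"):
        with open(path, encoding=encoding) as handle:
            expected = len(handle.readline().rstrip("\n").split(","))
            bad_rows = [
                line_number
                for line_number, line in enumerate(handle, start=1)
                if len(line.rstrip("\n").split(",")) != expected
            ]
        return expected, bad_rows

    columns, problems = find_malformed_rows("vendor_file.csv")
    print(f"{len(problems)} rows do not match the {columns}-column header")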

Visually examining some of the bad data, I noticed that the problem only happened with data that belonged to a column with one of three different names. Fortunately, the 8 files did not have more than one of these columns defined, meaning a programmatic solution could very quickly be worked out.

  1. Split the values into an array by using the commas as separators
  2. Moving from left to right, copy the good values into a new row until reaching the offending column
  3. Moving from right to left, copy the good values into the same row until reaching the offending column
  4. Concatenate the values that were not copied, separating them with a comma
  5. Write the new string into the row, properly escaped so that it could be imported into SQL Server

Annoying, but doable. The computer is doing all the work with the data validation and correction. I'm simply providing the instructions on how to do it.
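
A rough sketch of those five steps, assuming Python and a placeholder for the offending column name (the real column names varied by file):

    # Hypothetical sketch of the five-step repair described above.
    # "offending_column" stands in for whichever of the three column names a file used.
    def repair_row(raw_line, header, offending_column):
        values = raw_line.rstrip("\n").split(",")
        expected = len(header)
        if len(values) <= expected:
            return values                                      # nothing to repair

        target = header.index(offending_column)
        extras = len(values) - expected                        # stray commas found

        left = values[:target]                                 # step 2: copy from the left
        right = values[target + extras + 1:]                   # step 3: copy from the right
        middle = ",".join(values[target:target + extras + 1])  # step 4: re-join the split value
        middle = '"' + middle.replace('"', '""') + '"'         # step 5: quote and escape for import
        return left + [middle] + right

The split-from-both-ends trick only works because, as noted above, each file had at most one of the troublesome columns.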

Once done, I copied the data back to the server and tried to import. Errors again. This time there was a problem with data types. A table was auto-created from the CSV, and the system chose the wrong data type for a column that appeared to contain only integers at first but started looking like AB48910 from row 48,913.

Lovely.

Going back to the code, I wrote some additional logic that would read the file into memory, construct a list of column names, then run through all of the rows in the file and examine each value's data type and its length or maximum value, depending on the type the machine thought it was. If an Integer column suddenly turned out to contain strings, the definition would be updated accordingly. The output of this code was an auto-generated SQL Table creation script, minus the Indexes[3].
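
A hedged sketch of that inference pass might look like the following; the table name, file name, and the INT-or-NVARCHAR simplification are all assumptions made for illustration:

    # Hypothetical sketch of the type-inference pass described above.
    # Real code would handle more types; this only distinguishes INT from NVARCHAR.
    import csv

    def infer_create_table(path, table_name, encoding="utf-8"):
        with open(path, encoding=encoding, newline="") as handle:
            reader = csv.reader(handle)
            columns = next(reader)
            kinds = ["INT"] * len(columns)            # start optimistic
            lengths = [1] * len(columns)
            for row in reader:
                for index, value in enumerate(row[:len(columns)]):
                    lengths[index] = max(lengths[index], len(value))
                    if kinds[index] == "INT" and value and not value.lstrip("-").isdigit():
                        kinds[index] = "NVARCHAR"     # demote once a non-integer appears

        definitions = []
        for name, kind, length in zip(columns, kinds, lengths):
            sql_type = "INT" if kind == "INT" else f"NVARCHAR({length})"
            definitions.append(f"    [{name}] {sql_type}")
        return f"CREATE TABLE [{table_name}] (\n" + ",\n".join(definitions) + "\n);"

    print(infer_create_table("vendor_file.csv", "VendorImport"))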

Looking at the output, I saw something strange. One of the columns that should have contained only numbers in the range of 58,000 to 114,000 was given a data type of NVARCHAR(25). Why?

Writing yet more code, I listed the non-numeric values in that column and found this: １１２０６６

Full-width ASCII characters where half-width, proper ASCII characters should have been for the numbers. FUN! So the next round of updates to the code would identify values like this and convert them to their proper lower-ASCII equivalents everywhere. Maybe now it would be accepted and imported into the database?
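
One way to do that in Python is a narrow translation table that only touches full-width digits and Latin letters; this is a sketch of the idea rather than the exact code used:

    # Hypothetical sketch: fold full-width digits and Latin letters back to ASCII.
    # Full-width forms live at U+FF01-FF5E, exactly 0xFEE0 above their ASCII twins;
    # the filter keeps the mapping to letters and digits only.
    FULL_TO_HALF = {
        full: full - 0xFEE0
        for full in range(0xFF01, 0xFF5F)
        if chr(full - 0xFEE0).isalnum()
    }

    def to_half_width(value):
        return value.translate(FULL_TO_HALF)

    print(to_half_width("１１２０６６"))     # -> 112066
    print(to_half_width("ＡＢ４８９１０"))    # -> AB48910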

No, of course not. There was more dirty data in the CSVs. This time in the form of URL-encoded values in a dozen or so rows where plain Kanji should have been ... like in the other 100,000+ rows from the same file.
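
Fixing that was at least a small addition; a sketch of the idea with Python's standard library, using an illustrative sample string rather than the real data:

    # Hypothetical sketch: decode percent-encoded values back to plain text.
    from urllib.parse import unquote

    def decode_if_url_encoded(value):
        # Only touch values that actually look percent-encoded.
        return unquote(value) if "%" in value else value

    print(decode_if_url_encoded("%E6%97%A5%E6%9C%AC%E8%AA%9E"))  # -> 日本語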

Six hours later, a solution was complete. I had written a small application for the sake of 8 files that would:

  • generate SQL scripts to drop and create tables complete with accurate data definitions and sensible-length strings when working with NVARCHAR values
  • ensure that 100% of the data going into the database was properly encoded as UTF-8
  • ensure that numbers and the letters A through Z were all half-width, lower-ASCII values
  • ensure that malformed rows were corrected in the event of having too many — or too few — columns of information
  • generate insert statements for the data, grouped by 50 rows, and save the statements to .sql files (a sketch of this batching follows the list)
  • write a report after each process outlining what was wrong with the CSV so that a bill could be sent back to the vendor for wasting my time
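
For the batching mentioned above, a hedged sketch of how the insert statements might be grouped; the table name, column list, and quoting rules are simplified placeholders, and since T-SQL caps a single VALUES clause at 1,000 rows, 50 leaves plenty of headroom:

    # Hypothetical sketch of the batched INSERT generation (50 rows per statement).
    # Every value is treated as an NVARCHAR literal for simplicity.
    def write_insert_statements(table, columns, rows, batch_size=50):
        def quote(value):
            return "N'" + value.replace("'", "''") + "'"   # escape for T-SQL literals

        column_list = ", ".join(f"[{name}]" for name in columns)
        statements = []
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            values = ",\n".join(
                "(" + ", ".join(quote(value) for value in row) + ")" for row in batch
            )
            statements.append(f"INSERT INTO [{table}] ({column_list})\nVALUES\n{values};")
        return "\n\n".join(statements)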

With all of this done, I was finally ready to have the data loaded into the SQL Server database. I copied the files over, loaded them into SSMS[4], pressed [F5] to run the statements, and held my breath.

Success.

So what should have been a 15-minute job wound up taking just over 7.5 hours in total because the vendor who supplied the data[5] seems to have a serious case of GIGO when it comes to customer data. Thankfully, their services are being terminated later this month. May I never have to turn their infuriating garbage into my consistent data ever again.


  1. Comma-Separated Value
  2. All 1.87GB of it
  3. Indexes would be added later manually, because like heck I'm farming that out for 8 tables when a person would be so much better at the job than a couple lines of code
  4. SQL Server Management Studio — a tool for working with SQL Server databases
  5. after we had to wait 12 business days for a data dump