An Accidental Search Engine

How is it that some of the most well-received work I've produced at the day job this past year has been the result of what can only be described as an accident? This is a question I ponder from time to time when listening to feedback from schools, reading emails, or skimming through the public chat places to see what sort of tools people struggle with or would like to see improved. As with everything software related, there's always plenty of room for improvement and being receptive to people's concerns is important. However, this year a number of items that I've created to quickly solve one problem or another have generated a disproportionate amount of positive feedback. Much of this is within a digital textbook platform named Mimosa, and the latest popular feature pulls from work I had put into the 10Cv5 platform to solve a similar problem: building an effective search algorithm.

This likely goes without saying, but textbooks contain a lot of text. A single book can consist of student pages, teacher's pages, audio and video scripts, workbooks, handouts, and supplementary resources, not to mention the metadata about the book itself, such as the names of the publisher, authors, contributors, rights holders, and the like. There are also ISBNs, references to other materials, online activities, audio files and videos, and images to think about. This year, while converting many of the resources used in class from the source PDF to something a little more flexible, I discovered that a typical, recently published textbook used in the classroom takes up about 550MB of storage space. Almost all of this is tied up in audio files and high-resolution scans of the student book, but there is also a large amount of text-based data that gets read into the system and organised for rapid collation and dissemination. This is not a large amount of data by any stretch of the imagination[1], though it does introduce some challenges when trying to quickly search through the data for a spontaneous in-classroom activity.

At the moment there are about 107 complete textbooks in the Mimosa delivery system, plus a bunch that have not yet been fully converted. The goal is to have 180 books in the system by the summer of next year for teachers around the world[2], with more to follow as we continue to adapt the older, legacy materials and bring in new items. These are not crazy numbers, yet with 180 textbooks we can expect there to be several million words across 20+ languages stored in the database. Generally, when databases start to fill up with this quantity of data, searches begin to take much, much longer. Mimosa, however, does not appear to have an appreciable performance difference whether there are 180 books in the system or 1500+. This is because the search mechanism is an almost carbon-copy of the one used in 10Cv5.

When I started designing search tools at work the approach was generally the same. Tables would be flat, and queries would make liberal use of LIKE '%phrase%' to search through tens of thousands of records for something resembling the provided phrase. This was an inefficient and computationally expensive search method, so this design pattern was quickly abandoned. Something smarter had to be created.
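To make the cost of that old pattern concrete, here is a rough sketch rather than anything taken from the actual systems. It assumes SQLite (only because it ships with Python; the real database is not named here), and the `pages` table, its columns, and the helper names are all made up for illustration. Because the pattern starts with a wildcard, an index on the text column cannot narrow the search, so the database has to walk every row.

```python
import sqlite3

def make_flat_table(pages):
    """Build a hypothetical flat table: one row per page, full text in one column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)")
    # Even with an index on the text, a leading-wildcard LIKE cannot seek into it.
    conn.execute("CREATE INDEX idx_pages_body ON pages (body)")
    conn.executemany("INSERT INTO pages (id, body) VALUES (?, ?)", pages)
    return conn

def like_search(conn, phrase):
    """The old pattern: a substring match tested against every row."""
    rows = conn.execute(
        "SELECT id FROM pages WHERE body LIKE ? ORDER BY id",
        (f"%{phrase}%",),
    )
    return [row[0] for row in rows]

def query_plan(conn, phrase):
    """Return the human-readable steps SQLite plans to take for the LIKE query."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM pages WHERE body LIKE ?",
        (f"%{phrase}%",),
    )
    return [row[3] for row in rows]  # the fourth column holds the description

conn = make_flat_table([
    (1, "The glass design engineer arrived early."),
    (2, "Listen to the audio track and repeat."),
    (3, "She works with stained glass."),
])
print(like_search(conn, "glass"))  # ids of the pages containing "glass"
print(query_plan(conn, "glass"))   # the plan reports a scan, not an index seek
```

With three rows this is instant. With tens of thousands of long text records, every search pays for reading all of them.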

Over the intervening years, new methods of improving search results were tested out, but it is the most current implementation that has resolved many of the performance issues that other systems could not overcome.

How It Works

Textbooks can easily contain tens of thousands of words. However, when counting the number of unique words, there are perhaps 1,000. So, much like 10Cv5, when a textbook is added or edited, the distinct words that make up a page are extracted and added to a separate table in the database. Going this route, 180 books at about one thousand words each would result in a data table with 180,000 words pointing to the page(s) the word belongs to. Searches are then conducted on this separate table.
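The word table can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Mimosa or 10Cv5 code: it uses SQLite, a made-up `word_index` table, and a naive `\w+` tokenizer that would need real thought for 20+ languages. The lookup at the end fetches every search word in one query using IN.

```python
import re
import sqlite3

def distinct_words(text):
    """Lower-case the text and return its set of distinct words (naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))

def build_word_index(pages):
    """Store one row per (word, page) pair in a hypothetical word_index table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_index ("
        " word TEXT NOT NULL,"
        " page_id INTEGER NOT NULL,"
        " PRIMARY KEY (word, page_id))"
    )
    for page_id, text in pages:
        conn.executemany(
            "INSERT INTO word_index (word, page_id) VALUES (?, ?)",
            [(word, page_id) for word in distinct_words(text)],
        )
    return conn

def find_pages(conn, terms):
    """Return {page_id: number of search words found}, using a single IN query."""
    words = sorted(distinct_words(" ".join(terms)))
    if not words:
        return {}
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        "SELECT page_id, COUNT(*) FROM word_index "
        f"WHERE word IN ({placeholders}) GROUP BY page_id",
        words,
    )
    return dict(rows.fetchall())

index = build_word_index([
    (1, "The glass design engineer arrived early."),
    (2, "Listen to the audio track and repeat."),
    (3, "She works with stained glass."),
])
hits = find_pages(index, ["glass", "design", "engineer"])
print(sorted(hits.items()))  # [(1, 3), (3, 1)]
```

The primary key on (word, page_id) lets the database seek straight to each word instead of reading page text, which is why the lookup stays cheap as more books are added.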

This isn't where the system stops, though, because this is just part of the solution. In addition to returning posts that contain specific words, it's important to assign weight to the search results as well. I generally do it like this:

Search Term: glass design engineer

  1. Query the word table for posts that contain glass or design or engineer[3].
  2. Items that contain one of the words get 1 point. Two words get 2 points. Three words get 3 points.
  3. Items that contain glass design or glass engineer or design engineer receive an additional 2 points, because the words are in the same order as the search string.
  4. Items that contain all three words in the same order get an additional 5 points.
  5. Items that belong to the language -- or languages -- that I use get an additional 2 points.
  6. Items from books newer than 5 years ago get an extra 5 points.
  7. Posts are then sorted based on their score and the first X results are returned to the person using the system.
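The scoring steps above can be sketched roughly as follows. This is an interpretation, not the production code: the `Item` fields, the function names, and the tokenizer are invented for the example, and it assumes that "contains glass design" means the two words sit next to each other in the text, that the pair bonus is awarded once no matter how many pairs match, and that "newer than 5 years" is judged by publication year.

```python
import re
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Item:
    item_id: int
    text: str
    language: str
    year: int  # publication year of the book the item belongs to

def tokenize(text):
    """Split text into lower-case words (naive; real multilingual text needs more)."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(tokens, phrase):
    """True if the words in `phrase` appear consecutively, in order, in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def score(item, terms, user_languages, current_year):
    tokens = tokenize(item.text)
    present = set(tokens)
    points = sum(1 for term in terms if term in present)  # step 2: 1 point per word
    if points == 0:
        return 0  # step 1 would never have returned this item
    # Step 3: any two search words adjacent in the text, in search-string order.
    if any(contains_phrase(tokens, [a, b]) for a, b in combinations(terms, 2)):
        points += 2
    # Step 4: the full search string appears in order.
    if len(terms) > 2 and contains_phrase(tokens, terms):
        points += 5
    if item.language in user_languages:   # step 5: language bonus
        points += 2
    if current_year - item.year < 5:      # step 6: recent-book bonus
        points += 5
    return points

def search(items, query, user_languages, current_year, limit=10):
    """Step 7: score everything, drop non-matches, return the top `limit` results."""
    terms = tokenize(query)
    scored = [(score(it, terms, user_languages, current_year), it.item_id) for it in items]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return ranked[:limit]

items = [
    Item(1, "The glass design engineer arrived early.", "en", 2022),
    Item(2, "An engineer inspected the glass.", "en", 2010),
    Item(3, "Écoutez la piste audio.", "fr", 2023),
    Item(4, "Design a poster about recycling glass.", "fr", 2023),
]
results = search(items, "glass design engineer", {"en"}, current_year=2024)
print(results)  # [(17, 1), (7, 4), (4, 2)] as (score, item_id) pairs
```

Item 1 collects every bonus (3 + 2 + 5 + 2 + 5 = 17), while item 4 outranks item 2 only because the recency bonus outweighs the language bonus, which shows how much the weights shape the ordering.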

This is not a perfect mechanism, but it's pretty accurate most of the time. The results span every text-based record, whether it's in a student book, teacher's book, audio or video script, a workbook, or anywhere else. A search for the above returns the audio script for a track in a Level 2 book in under 0.2 seconds. Because it is quick, people use it.

However, one of the recent updates that has people really using it is the addition of metadata searching, which includes reading tags associated with images and other non-text resources. So now rather than just search through textbooks, the search engine has a rudimentary Google Image Search mechanism in place. The feedback has been overwhelmingly positive. Future updates will likely make this more powerful and dynamic.

Several years ago I said that I would love to work on a search engine product and develop part of the ranking algorithm. While my little textbook search feature is nowhere near as complex or capable as something made by Google, Microsoft, Yahoo!, or DuckDuckGo, it has certainly been a worthwhile challenge that goes a long way to solving the problem of letting teachers quickly pull up a resource from any textbook they've used at any time.

Good opportunities seem to come to those who wait.

  1. 550MB is not a lot for a textbook until you realise that a CD can only hold one complete textbook.

  2. Teachers around the world who are employed by my employer. The textbook systems I'm developing are very much an internal project, though I can see a market for this worldwide.

  3. This is done in a single query using IN, not three individual queries.