This morning, while staring at the day-job workload waiting for me in the form of GitHub tickets, I started thinking about a problem on 10Centuries that I have long wanted to solve. It's a topic that has come up again and again over the years, and I think it's almost "solved". The problem, of course, is search.
Search on a website is theoretically pretty simple. People enter some words in a field, those words are checked against the content in the database, and results are returned. In its simplest form, only the exact search term is sought. This means that if I were to ask for all posts containing the words "bright" and "yellow", I would see only this result, as there is an exact match for "bright yellow". But if I were to ask for "yellow" and "bright", nothing would come back. This is clearly suboptimal, so it's better to split the words apart and return every post that contains "bright" or "yellow", ideally scoring the posts in such a way that the above-referenced post is at the top of the list. A lot of effective software uses this weighted search result method to return relevant results, but I wanted to do something different still.
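As a rough sketch of that weighted approach (not the actual 10Centuries code), each post can earn one point per distinct query word it contains, plus a bonus when the words appear together as an exact phrase. The `Post` shape, the `PHRASE_BONUS` weight, and the ASCII-only `tokenize` helper are all assumptions made for illustration.

```typescript
// Hypothetical post shape for illustration; not the 10Centuries schema.
interface Post {
  title: string;
  text: string;
}

interface RankedPost {
  post: Post;
  score: number;
}

// Assumed weight: an exact phrase match should outrank any scattered-word match.
const PHRASE_BONUS = 10;

// Lowercase the text and split on anything that is not an ASCII letter or digit.
// This is a simplification; non-Latin scripts would need a different splitter.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);
}

// One point per distinct query word found in the post, plus a bonus for the exact phrase.
function scorePost(post: Post, query: string): number {
  const queryWords = tokenize(query);
  if (queryWords.length === 0) return 0;

  const textWords = tokenize(post.text);
  const textWordSet = new Set(textWords);
  const distinctQueryWords = Array.from(new Set(queryWords));

  let score = distinctQueryWords.filter((word) => textWordSet.has(word)).length;

  // Compare space-joined tokens so case and punctuation do not block a phrase match.
  const normalizedText = ` ${textWords.join(" ")} `;
  const normalizedQuery = ` ${queryWords.join(" ")} `;
  if (normalizedText.includes(normalizedQuery)) score += PHRASE_BONUS;

  return score;
}

// Keep only posts that matched at least one word, best score first.
function rankPosts(posts: Post[], query: string): RankedPost[] {
  return posts
    .map((post) => ({ post, score: scorePost(post, query) }))
    .filter((ranked) => ranked.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

With this scoring, a query for "yellow bright" still returns every post that contains either word, while a query for "bright yellow" pushes the post with the exact phrase to the top.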
I wanted people to see something instantaneously.
One of the tricky parts of instantaneous results is dealing with network latency, server load, and all sorts of less-than-desirable problems that can make a theoretically semi-decent idea practically untenable. On top of this are the general response times of the service. Most people can type several characters per second, so sending multiple calls to the API just for the illusion of supplying decent results in realtime seems silly. So I decided to go about solving the problem a little differently: a subset of every post is loaded into memory and consulted on demand.
This blog post is number 2,428 in publication order (so long as I haven't back-dated anything since this post went live) and the average size of each post is roughly 618 words. That's about 1.5 million words. A crazy number, one might say, but then I have been blogging for almost a decade. 150,000 words a year is nothing compared to the number of words that have been published on various social networks, forums, and IRC channels over the years. Loading all of these words into a browser would be absolute overkill so, instead, I am loading just a subset of the words that constitute a post. As people type their search query into the box, the browser scans through the data stored in memory, finds matches, scores them, and then updates the results. People with relatively recent hardware will see that the operations are pretty much smooth and responsive. People with hardware as old as this blog ... will unfortunately suffer some stuttering. People searching other 10C-powered sites will likely not notice a hiccup at all.
The browser is working with a subset of the posts, though. What's not included? The content.
For the moment, search will pull from titles, URLs, tags, and author names. Future updates will include the content of the pages and posts. Before that can happen, two things must first take place.
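To make the in-browser scan concrete, here is a minimal sketch that searches an in-memory array of post summaries across those four fields. The `PostSummary` shape, the `FIELD_WEIGHTS` values, and the `searchSummaries` function are hypothetical names chosen for illustration, not the real 10Centuries payload or code.

```typescript
// Hypothetical summary record; field names are assumptions, not the real payload.
interface PostSummary {
  title: string;
  url: string;
  tags: string[];
  author: string;
}

interface SearchHit {
  summary: PostSummary;
  score: number;
}

// Assumed weights: a hit in the title counts for more than a hit in the URL.
const FIELD_WEIGHTS = { title: 4, tags: 3, author: 2, url: 1 };

// Scan every summary held in memory and score it against the typed query.
function searchSummaries(summaries: PostSummary[], query: string): SearchHit[] {
  const terms = query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0);
  if (terms.length === 0) return [];

  const hits: SearchHit[] = [];
  for (const summary of summaries) {
    // Lowercase each field once per summary so the comparison is case-insensitive.
    const title = summary.title.toLowerCase();
    const tags = summary.tags.join(" ").toLowerCase();
    const author = summary.author.toLowerCase();
    const url = summary.url.toLowerCase();

    let score = 0;
    for (const term of terms) {
      // Substring matching lets half-typed words match while the person is still typing.
      if (title.includes(term)) score += FIELD_WEIGHTS.title;
      if (tags.includes(term)) score += FIELD_WEIGHTS.tags;
      if (author.includes(term)) score += FIELD_WEIGHTS.author;
      if (url.includes(term)) score += FIELD_WEIGHTS.url;
    }
    if (score > 0) hits.push({ summary, score });
  }

  // Highest score first.
  return hits.sort((a, b) => b.score - a.score);
}
```

In a page, a function like this would typically run inside the search box's `input` event handler against an array fetched once at load time, so no API call is made per keystroke.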
- I need to see that people are able to use the search in any browser on any platform. This is still in testing.
- I need to create a cached result for every post that contains just a single copy of every word in the article, excluding certain common words in various languages.
Once these two things are done, I can build on the existing search tool to provide much better, more specific results.
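The second item, a cached list holding a single copy of every word in an article, could look something like the sketch below. The stop-word list here is a tiny English-only placeholder and the tokenizer is ASCII-only; both are assumptions, and a real version would need much longer lists covering several languages.

```typescript
// Tiny illustrative stop-word list; an assumption, not the list 10Centuries would use.
const STOP_WORDS: ReadonlySet<string> = new Set([
  "a", "an", "and", "in", "is", "it", "of", "or", "the", "to",
]);

// Reduce an article to one copy of each distinct word, minus the stop words.
// Splitting on non-ASCII-alphanumerics is a simplification for this sketch.
function uniqueWords(
  content: string,
  stopWords: ReadonlySet<string> = STOP_WORDS
): string[] {
  const seen = new Set<string>();
  for (const word of content.toLowerCase().split(/[^a-z0-9]+/)) {
    if (word.length > 0 && !stopWords.has(word)) seen.add(word);
  }
  // Sorted so the cached output is stable from one build to the next.
  return Array.from(seen).sort();
}
```

Because each word is stored only once, the cached list grows with a post's vocabulary rather than its length, which is what keeps the payload small enough to ship to the browser.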
In the meantime, people using the default blog theme on 10Centuries will see an "Archives" link in their navigation bar. Every post will be listed in reverse chronological order, and the search bar up top can be used to quickly find published items. If you don't see this link, it's because the cache for your site has not been refreshed. Simply write a new post (or update an existing one) to force the system to regenerate your website.
This isn't a perfect solution by any stretch of the imagination, but it solves a number of problems that I've been thinking about for quite some time, and it delivers Google-like search speeds in the browser rather than taxing my own servers. Hopefully this same search method will be employed in every theme going forward.