Thinking About "Smarter" Crawlers

This past week has seen a number of much-needed updates to 10C get released, one of which was a direct result of seeing an excessive number of requests coming from unabashed slurping bots. There is no denying that organisations that exist for the sole purpose of earning money from the efforts of others tend to follow each and every URL that is found on our websites, but is this really the best use of everyone's resources? Any crawler that wants to index each and every bit of public content that is hosted on 10C will need to read several million pages worth of text. Even accessing 5 URLs per second would mean that a crawling engine would need about 291 days to access each and every post. This is terribly inefficient for both the content-scraping scum and my infrastructure.

This got me thinking ….

As of this moment, all of the posts on 10C work out to about 261 megabytes of text; compressed, that would be closer to 64 megabytes. Updates are sequentially tracked and can be rolled back to any point at the drop of a hat. Would it not make more sense for crawling engines to have a URL that could generate a complete package of content on demand for them to download? Then, when the machines stop by in the future to look for updates, they could hit the same URL with a query string saying something akin to "give me all updates since last Tuesday at 3 o'clock" and download a much smaller package than the first. Doing this would be a win-win for everyone: web servers could save their resources for actual visitors, and companies that crawl every page online would not need to wait darn near a year to completely archive a platform as sprawling as this one.
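To make the idea a little more concrete, here is a minimal sketch of what such a bulk export could look like. It assumes a hypothetical SQLite database with a posts table holding id, title, body, updated_at, and is_public columns; none of these names reflect how 10C actually stores its data, and a real endpoint would also need things like authentication, rate limiting, and a sane caching policy.

    import gzip
    import json
    import sqlite3

    def export_posts(db_path: str, out_path: str, since: str | None = None) -> int:
        """Write all public posts into a gzip-compressed JSON-lines archive.

        If `since` (an ISO-8601 timestamp) is given, only posts updated after
        that moment are included, so repeat visitors get a small delta instead
        of the full dump.
        """
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row

        query = "SELECT id, title, body, updated_at FROM posts WHERE is_public = 1"
        params: tuple = ()
        if since:
            # Only records touched since the caller's last crawl.
            query += " AND updated_at > ?"
            params = (since,)

        count = 0
        with gzip.open(out_path, "wt", encoding="utf-8") as archive:
            for row in conn.execute(query, params):
                archive.write(json.dumps(dict(row)) + "\n")
                count += 1
        conn.close()
        return count

    # First visit: grab everything.
    #   export_posts("content.db", "full-archive.jsonl.gz")
    # Later visits: pass the timestamp of the previous run for a much smaller package.
    #   export_posts("content.db", "delta.jsonl.gz", since="2017-05-02T15:00:00Z")

The interesting part is that the delta request costs the server a single indexed query and one streamed file, rather than millions of individual page loads.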

Bandwidth consumption would go down. Page load times would drop. Server hosting requirements would decrease ever so slightly. All for the sake of an open mechanism that allows marketing companies to get what they want.

Then again, why would we want to make this easy for marketing organisations? No … I'm happy to have the recent updates that identify the bots and block them outright with a nice and clear message. The bandwidth usage has dropped. SQL operations per second have been halved. The server can actually relax a little bit now.
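For what it's worth, the blocking described above does not need to be anything elaborate; checking the User-Agent header and answering with a 403 and a plain explanation goes a long way. The sketch below assumes a generic Python WSGI stack and made-up bot names, and is not the actual rule set 10C applies.

    # Hypothetical list of scraper identifiers; placeholders, not 10C's real blocklist.
    BLOCKED_AGENTS = ("ExampleBot", "ContentScraper")

    def block_scrapers(app):
        """Wrap a WSGI app and answer known scrapers with a clear 403 message."""
        def middleware(environ, start_response):
            agent = environ.get("HTTP_USER_AGENT", "")
            if any(bot.lower() in agent.lower() for bot in BLOCKED_AGENTS):
                body = b"Automated scraping of this site is not permitted.\n"
                start_response("403 Forbidden", [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ])
                return [body]
            # Everyone else is passed through to the site untouched.
            return app(environ, start_response)
        return middleware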