Bots On Notice

What do web-crawling bots do, and why do they do it? It's a simple question with simple answers, but today I found myself asking "Why do they do it so much?". There are about 20,000 requests from almost 100 self-admitted bots hitting my web server every day, which works out to one every 4.32 seconds. Since each request takes less than a second of computational time on average, this shouldn't be much of an issue. However, looking at what many of these tools are for, I don't see why they should have the luxury of crawling around each of the websites hosted on 10C without offering something of value in return.

Bot Hits

AhrefsBot and MJ12Bot both enjoy hitting several thousand pages a day from a multitude of servers, while others limit themselves to a few hundred or a couple dozen. A bit of research into these crawlers shows that most of them are serving the goals of advertising companies in search of trends and "free content". There are some valid bots, such as Feedly, though these are few and far between.
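
Out of curiosity, numbers like these can be pulled straight from the access logs. Here is a minimal sketch in Python, assuming a combined-format log where the user agent is the last quoted field; the log path and the bot keywords are assumptions rather than anything 10C actually uses:

    import re
    from collections import Counter

    # Assumed combined log format: the user agent is the last quoted field on the line.
    UA_PATTERN = re.compile(r'"([^"]*)"\s*$')
    BOT_HINTS = ("bot", "crawler", "spider")  # self-identified crawlers

    def count_bot_hits(log_path):
        """Tally requests per user agent that self-identifies as a bot."""
        hits = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as log:
            for line in log:
                match = UA_PATTERN.search(line)
                if not match:
                    continue
                agent = match.group(1)
                if any(hint in agent.lower() for hint in BOT_HINTS):
                    hits[agent] += 1
        return hits

    if __name__ == "__main__":
        # Hypothetical path; point this at the real access log.
        for agent, count in count_bot_hits("/var/log/nginx/access.log").most_common(20):
            print(f"{count:6d}  {agent}")

Running that against a day's worth of logs gives a quick per-bot breakdown like the one above.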

So, rather than let advertising companies build complete link maps of every site on 10Centuries, I'll put in a little bit of effort to sour the milk. AhrefsBot ignores the rules set out in Robots.txt, as do a number of other content scrapers, so it makes no sense to provide anything of value to them. I considered two options:

  1. Return a blank page
  2. Return a dynamically generated page that contains 10,000 randomly generated links to imaginary places across the web embedded within a giant Lorem Ipsum

Both of these options would be rather easy to implement, though the second would be more interesting to create. I already have a Lorem Ipsum generator that can go as far as 500 paragraphs, and turning every couple of words into a link to (potentially) false URLs across the web would reduce the perceived quality of 10C-hosted content to junk status in a matter of days. The bandwidth requirements wouldn't be an issue for the most part so long as I keep the pages smaller than 50KB when compressed … which is a rather large HTML document, I must admit.
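
For the second option, here is a rough sketch of how the junk generator could work, in Python; the short word list stands in for the real Lorem Ipsum generator, the link-to-word ratio is arbitrary, and every domain it produces is invented:

    import gzip
    import random

    # Stand-in vocabulary; the real Lorem Ipsum generator would supply far more text.
    WORDS = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do "
             "eiusmod tempor incididunt ut labore et dolore magna aliqua").split()
    FAKE_TLDS = ("com", "net", "org", "info")

    def fake_link():
        """Build a plausible-looking but imaginary URL."""
        host = "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=random.randint(6, 12)))
        path = "/".join(random.choices(WORDS, k=random.randint(1, 3)))
        return f"http://{host}.{random.choice(FAKE_TLDS)}/{path}"

    def junk_page(max_compressed_bytes=50 * 1024):
        """Grow paragraphs of word soup with embedded fake links until adding
        one more paragraph would push the gzip-compressed page past the cap."""
        paragraphs = []
        html = "<html><body></body></html>"
        while True:
            words = []
            for _ in range(random.randint(40, 80)):
                word = random.choice(WORDS)
                if random.random() < 0.15:  # every few words becomes a link
                    word = f'<a href="{fake_link()}">{word}</a>'
                words.append(word)
            candidate_paragraphs = paragraphs + ["<p>" + " ".join(words) + "</p>"]
            candidate = "<html><body>" + "\n".join(candidate_paragraphs) + "</body></html>"
            if len(gzip.compress(candidate.encode("utf-8"))) > max_compressed_bytes:
                return html
            paragraphs, html = candidate_paragraphs, candidate

Sizing the page by its compressed length rather than a fixed paragraph count keeps the bandwidth cost predictable no matter how well the word soup happens to compress.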

By building one of these mechanisms directly into the 10C core functions, there should quickly be a drop in the number of undesired bots accessing the site to see what's new. More importantly, less spam traffic will mean better response times for people who are legitimately using the service.
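
To give a sense of where such a check would sit, here is a sketch of the dispatch logic; it is only an illustration of the idea rather than the actual 10C core code, and the bot list is just the two names mentioned above:

    # Illustrative only; the real 10C request pipeline is structured differently.
    UNWANTED_BOTS = ("ahrefsbot", "mj12bot")  # crawlers that ignore Robots.txt

    def respond(user_agent, render_real_page, render_junk_page):
        """Serve junk (or nothing) to known scrapers, the real page to everyone else."""
        agent = (user_agent or "").lower()
        if any(bot in agent for bot in UNWANTED_BOTS):
            # Option 1 would be `return ""`; option 2 serves the generated link soup.
            return render_junk_page()
        return render_real_page()

Because the check runs before any real rendering work, legitimate visitors never see a difference.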