The subject of tracking seems to be in the news a great deal lately as people are understandably nervous about applications that send data back to organisations such as Facebook, Google, Apple, Amazon, and the myriad of advertising networks that clamour to know more about us through apps, web sites, and an ever-growing number of connected appliances. Given that a lot of the online publications that are currently shouting the loudest also have an unruly number of tracking mechanisms on their website, and given that 10C seems to operate in complete isolation from external services, I figured it would be a good idea to outline just how much data my platform is collecting on each and every person who visits this website, subscribes to an RSS feed, or downloads a podcast.
I don't track this in any appreciable way. The server knows how many bytes of data have been sent, but not the name of the file nor who received it, because I don't care.
I don't track this in any direct way. When an RSS service or an RSS reader comes by to grab the most recent XML or JSON file, the User Agent¹ and the IP address of the system that initiated the request are recorded in the `UsageStats` table along with details such as which website was accessed, which RSS feed (because there are many ways to request one), the type of HTTP request, the response code, and how long the whole process took.
Just as with RSS subscriptions, the User Agent and IP address of the system that initiated the request are recorded in the `UsageStats` table along with details such as which website was accessed, which URL, the type of HTTP request, the response code, and how long the whole process took. If a person is signed into the service at this time, then the authentication token ID is also recorded.
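Put together, a single `UsageStats` row only needs a handful of columns. The real schema lives in the 10C repository; the sketch below is a hypothetical reconstruction based solely on the fields described above, and every column name is my own guess rather than the actual one:

```sql
-- Hypothetical sketch of the UsageStats table, inferred from the
-- fields described in the text; the real schema is in the 10C repo.
CREATE TABLE UsageStats (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    site_id       INT UNSIGNED    NOT NULL,  -- which website was accessed
    request_uri   VARCHAR(512)    NOT NULL,  -- which URL or RSS feed
    http_method   VARCHAR(8)      NOT NULL,  -- type of HTTP request
    response_code SMALLINT        NOT NULL,  -- 200, 404, ...
    duration_ms   INT UNSIGNED    NOT NULL,  -- how long the request took
    user_agent    VARCHAR(255)        NULL,  -- self-reported, so untrusted
    ip_address    VARCHAR(45)         NULL,  -- wide enough for IPv6
    token_id      INT UNSIGNED        NULL,  -- set only when signed in
    created_at    DATETIME        NOT NULL,
    PRIMARY KEY (id)
);
```

Notice there is no name, no email address, and no cookie identifier anywhere in the row: the only link back to a person is the optional token ID for signed-in sessions.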
Why Do It?
This data is collected to answer a few fundamental questions:
- how long are people waiting for data?
- is the current hardware sufficient to meet demand?
- where are the bottlenecks²?
It's with this data that I can tell you the average response time for a request is 0.3 seconds start to finish, and that 87 of the sites hosted on 10C represent 99% of the traffic. Is the current hardware sufficient? Yep. Not bad for a five-year-old laptop-turned-server.
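Numbers like these fall out of simple aggregates over the table. Assuming the same hypothetical column names as above (the real ones may differ), the average response time and each site's share of traffic could be read with something like:

```sql
-- Average response time across all recorded requests (hypothetical columns)
SELECT AVG(duration_ms) / 1000.0 AS avg_seconds
  FROM UsageStats;

-- Each site's share of total traffic, busiest first
SELECT site_id,
       COUNT(*) AS requests,
       COUNT(*) * 100.0 / (SELECT COUNT(*) FROM UsageStats) AS pct_of_traffic
  FROM UsageStats
 GROUP BY site_id
 ORDER BY requests DESC;
```

Neither query needs the IP address or User Agent at all, which is rather the point.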
And Then …?
Because the statistics table generally grows by about 350MB a day, it's not something that I want to keep around in a request-by-request format. Aside from mild curiosity to compare performance metrics from the past to the present, there is very little value gained from the numbers. For this reason, statistics are summarised by site on a daily basis and deleted from the system after 30 days. Backups of the database are also kept for 60 days before being discarded as a waste of space. This means that at no point will I have request-by-request statistics older than 91 days³.
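The summarise-then-purge step described above amounts to two statements run on a schedule. This is a sketch against the same hypothetical columns (including an assumed `UsageSummary` table), not the code 10C actually runs:

```sql
-- Roll yesterday's requests up into one row per site
INSERT INTO UsageSummary (site_id, stat_date, requests, avg_duration_ms)
SELECT site_id,
       DATE(created_at),
       COUNT(*),
       AVG(duration_ms)
  FROM UsageStats
 WHERE DATE(created_at) = DATE(NOW() - INTERVAL 1 DAY)
 GROUP BY site_id, DATE(created_at);

-- Drop request-by-request rows older than the 30-day window
DELETE FROM UsageStats
 WHERE created_at < NOW() - INTERVAL 30 DAY;
```

Once the `DELETE` runs, the only copies of the raw rows are in database backups, which themselves age out after 60 days.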
What about the "Popular Post" feature? Where does that data come from?
Yep, this comes from the `UsageStats` table as well, but to say that this summarised data is equivalent to tracking a group of people would be a stretch.
How can I verify this?
The code that powers 10C is open source. The function that records the data into the `UsageStats` table can be found in `/lib/functions.php` on (or around) line 1918 as `recordUsageStat()`. The SQL query can be found in `/sql/system/setUsageStat.sql`. Want to get a copy of your data from `UsageStats` or any other place in the database? Just get in touch and we can make it happen.
As someone who has taken a number of steps to reduce the number of sites and services that can follow me around the web, I understand the importance of collecting just the information that is needed to answer basic system questions and offer general functionality. None of the systems I create will go beyond the amount of statistics collection that is outlined above because, to be completely clear, tracking what people do just isn't that interesting. I'm much more interested in what the system does than the visitors.
¹ User Agents are not to be trusted 100%. They can be anything, and it's incredibly easy to claim to be a valid browser when the connection is in fact an automated process.
² Long-running API requests, etc.
³ 30 days of recent data, plus 60 days is 90, plus today means "Generally nothing older than 90 days, 23 hours, 59 minutes, and 59 seconds."