Logfiles, webstats, Google Analytics and the fight against blog comment spammers continued
This second, sadly wine-free, instalment of the series on how to fight comment spam (read the first here) looks at identifying and fighting spammers in a slightly different way. The method I describe in this posting is not for everyone, but the information may still be of interest to you, especially if you want to understand how many visitors your website has, what Google Analytics does, what server logfiles are and how to interpret these numbers with software such as Webalizer or AWStats.
Like most bloggers, we are curious to know who is reading our blog. Some of our readers we know through the comments they leave, the emails they send, through Twitter or even personal contact - which, I hasten to add, makes them more than just 'readers': they are partners in a conversation. Even so, as a blogger you also want to know about those who just read your blog and do not directly engage with you – maybe to boost your ego ('A hundred people visit my blog every day.') or because you want to know whether you are doing a good job engaging your visitors: do they return? Do they spend much time on the site? What proportion of your readers leave comments? Where are they from?
Basically, there are three ways of finding out about this.
Many blogging environments give you some statistics about your visitors. I will not say very much about this, as those of you using platforms such as WordPress will have easy access to that information (but even if you are one of those, you will still find some useful information here, I hope). The second way of learning more about your visitors is to use an external tracking service. Google Analytics is by far the most important one, a powerful and free web application by Google. Well, it is not entirely free, but we will come to that. A third way is looking at the logfiles of the web server that hosts your blog.
Server logfiles, IP-addresses, user agents and how to understand them
Let's start with the logfiles. Whenever someone (and that someone can be a computer too) looks at something (an image, a text, a PDF) on the internet, a web server has to process that request, and in doing so it writes every single request to a logfile. Such a logfile entry looks as follows:
66.249.72.134 - - [01/Jun/2009:02:03:24 +0200] "GET /blog/london-wine-snobs-go-vino HTTP/1.1" 200 16269 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This gives you quite a bit of information. First of all it tells you that someone wanted to see my blog posting entitled 'London Wine Snobs go for Vino', and that the request came in the early morning of the first of June 2009. Even better, the logfile also tells you who that someone was. One clue is the number at the beginning of the entry, the so-called IP address. Every computer on the internet (or any network, for that matter) has a unique number to identify it. Domain names exist solely for humans, who could not possibly remember that the Wine Rambler is 85.13.136.242. Even your computer has such a number right now. Internet service providers assign these numbers to their customers, so by analysing the IP you can get an idea where a computer that requested a page from your server is located. If you visit http://whatismyipaddress.com/ you may find that your IP address can even identify the city you are based in.
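If you are curious, you can pull these fields out of a logfile yourself with a few lines of code. Here is a minimal Python sketch for the 'combined' log format shown above – the field names and the file name access.log are my own assumptions, so adjust them to your server's setup:

import re

# Rough pattern for the Apache 'combined' log format shown above
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

with open('access.log') as logfile:          # assumed file name
    for line in logfile:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        entry = match.groupdict()
        print(entry['ip'], entry['time'], entry['request'], entry['agent'])

Run against the example entry above, this would print the Googlebot's IP address, the time of the request, the page it asked for and its user agent.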
Locating a visitor through the IP address works better with web servers than with personal computers at home, though, as not every computer keeps the same number all the time. Chances are that you get a different number every single time you connect to the internet – your provider simply assigns whatever IP is available at the time, to make more economical use of the numbers.
Using a type of service called 'whois' you can get an idea where your visitor came from. Entering the above IP into http://www.coolwhois.com/ will tell you that it belongs to a company called Google. This is not a surprise, as Google were actually kind enough to identify themselves – through the so-called 'user agent': Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). It turns out our visitor is a robot from Google, the Googlebot, and it came to index the site for the Google search engine. There are many user agents, and you too have one – learn more at http://wiht.link/wmuseragent
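You do not have to rely on a website for every lookup either. A reverse DNS query already gives you a hint which network an address belongs to – here a minimal Python sketch, using the IP from the log entry above:

import socket

ip = '66.249.72.134'

# Reverse DNS: ask which host name is registered for this address
try:
    hostname, _, _ = socket.gethostbyaddr(ip)
    print(ip, 'resolves to', hostname)   # e.g. a name ending in googlebot.com
except socket.herror:
    print(ip, 'has no reverse DNS entry')

A host name is only a hint, of course, but for well-behaved bots like the Googlebot it usually matches what the user agent claims.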
This brings us to the problem of understanding logfiles. They are accurate in the sense that no one can get a file from your website without requesting it, but analysing them is a bit more difficult. If you want to boast about the high number of visitors, for instance, do you count Googlebot or not? If your friend Sarah visits your blog once a week, how can you tell that it is her when her provider assigns her a different IP address every time? Sure, the user agent will tell you that she uses Firefox 3.6, but millions of people do so.
Pageviews, visits, unique visitors: understanding who is out there
Luckily, terminology exists to help us handle some of these problems: 'pageview', 'visit', 'unique visitor'. A pageview is counted whenever someone views a page on your website; it does not include the images or files, such as PDF documents, that are embedded in or linked from the (HTML) page - in the same way as a page in a book is still just one page, no matter how many images or footnotes it has. A visit is the whole process of someone coming to your website, browsing around for a while and leaving again. Usually, software that analyses logfiles assumes that a visit ends when there is no more activity from the same IP address within 30 minutes. This means that Sarah visiting your website, then having a lunch break and reading a few more posts later would be counted as two visits.
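To make the 30-minute rule a bit more concrete, here is a small sketch of how one could count visits from (IP address, time) pairs pulled out of a logfile. This is my own illustration of the idea, not the exact algorithm Webalizer or AWStats use, and the addresses are just examples:

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

# (IP address, time of request) pairs as they might come out of a logfile
requests = [
    ('198.51.100.7', datetime(2009, 6, 1, 12, 0)),
    ('198.51.100.7', datetime(2009, 6, 1, 12, 10)),   # same visit
    ('198.51.100.7', datetime(2009, 6, 1, 13, 30)),   # new visit after the lunch break
]

def count_visits(requests):
    visits = 0
    last_seen = {}                      # last request time per IP
    for ip, when in sorted(requests, key=lambda r: r[1]):
        if ip not in last_seen or when - last_seen[ip] > TIMEOUT:
            visits += 1                 # more than 30 minutes of silence: a new visit
        last_seen[ip] = when
    return visits

print(count_visits(requests))           # prints 2 - Sarah's lunch break splits her reading into two visits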
This is why people are keen to know about unique visitors. Some web analysis software assumes that an IP address stands for a unique visitor. While this may identify a computer, it does not identify a human being – your iPhone, private and work computers all have different IP addresses. Furthermore, some companies have all their employees share one address, and even the same computer does not keep the same IP all the time. This brings us to the sad truth that logfiles cannot tell you exactly how many unique visitors you have, and neither can the software used to analyse them, such as AWStats or Webalizer.
Google Analytics, free but evil?
Google Analytics uses a different approach that, in some ways, is much more accurate. It is also problematic. To enable GA, you need to register with Google and add a small piece of JavaScript to your website. Whenever someone visits your site, the script reports certain information about the visitor to Google. The good thing about this is that the bots used by search engines, such as the Googlebot from above, but also the bots used by spammers, do not process JavaScript, so they do not get reported. Usually only browsers used by humans execute JavaScript, so the visitors counted by Analytics are humans. Furthermore, the script may also be able to tell that it is the same browser accessing your site a day later, even if the computer does not have the same IP address. So to sum up, Google Analytics identifies humans, while logfiles record all access.
In theory that means we should all use the free Analytics package and then we would know exactly how many real visitors we have. It is, of course, not that easy. First of all, Google Analytics cannot count people who disable JavaScript or specifically block Analytics. Why would they do that? Well, not everyone likes all the data about their daily browsing to be processed by Google, and I cannot blame them. Leaving aside that the JavaScript can make your website appear slower to respond, that in itself can be reason enough not to hand all that data to Google. Sure, the service is free, but then again you pay with the data of your visitors.
In Germany, for instance, Google Analytics can be seen as violating privacy rights, as your visitors do not have the chance to opt out of it. While it has not been legally decided whether bloggers who use Analytics in Germany violate the law, there may be moral reasons not to use it.
Spambots vs. human visitors - the sad truth
On the Wine Rambler, we used Google Analytics for a couple of weeks as part of an experiment – I wanted to understand how much the Google data differs from the data I get from the logfile analysis I do with a piece of software called Webalizer. The conclusion, to cut it short, is that Webalizer reports four times as many visits to this site as GA does. Does that mean that 75% of our visitors block GA or do not use JavaScript? No. It means that a significant part of the visits Webalizer counts comes from machines. Yes, only about a quarter, if not less, of all access to the Wine Rambler over the last few months originated directly from humans.
How can this be? Well, it is really simple. First of all there are all the search engines that crawl our site, and they come pretty much every day, if not more often. Post a link on Twitter, put a new comment up, the bots will come. Then there are RSS feed readers and others. Some companies harvest blogs to find data relating to their clients. Write something about Coca Cola in your blog? They sure will want to know.
And then there are spammers. Actually, most of the difference between the logfile analysis and Google comes from spambots, at least as far as the Wine Rambler is concerned - over 50% of all traffic on our site. How do we know?
The Webalizer software can give you a list of the IP addresses that visited your website. This list is fairly long, but it is actually quite easy to identify spammers. Even if you have very active readers, it is unlikely that they visit your website ten or more times a day. If an IP address does that, it probably is a spammer. So I went through the most active IP addresses and found out that almost all of them belong to spammers.
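If you prefer not to go through Webalizer's list by hand, you can get the same kind of overview straight from the logfile. A minimal sketch, again assuming the access.log format from above, a logfile covering roughly one day, and my own threshold of ten requests:

from collections import Counter

hits_per_ip = Counter()

with open('access.log') as logfile:      # assumed file name
    for line in logfile:
        ip = line.split(' ', 1)[0]       # the IP is the first field of each entry
        hits_per_ip[ip] += 1

# Anything with suspiciously many requests goes on the list to be checked by hand
SUSPICIOUS = 10
for ip, hits in hits_per_ip.most_common(20):
    if hits >= SUSPICIOUS:
        print(ip, hits)

The output is only a list of candidates, of course – the checking described below still has to happen before you block anything.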
Identify spammers through their IP
How do I know this? Again, it is actually very simple. Just use a search engine. If you think the IP address 202.99.29.27 looks suspicious, just search for it. One of the hits that will come up is from Project Honey Pot, and it will tell you: 'The Project Honey Pot system has detected behaviour from the IP address consistent with that of a comment spammer and rule breaker.' There are several initiatives that specialise in identifying spammers, for instance by putting up blogs and analysing the comments they receive. When you search for an IP that looked suspicious in your logs and find that it is identified as a spammer on several such websites, you can be fairly certain the IP is used by a spammer. There is no guarantee though, so make sure not to trust just one website, and also read what it says about the address – some IP addresses show up in their logs but have not actually misbehaved; it could just be a new search engine, and you do not want to confuse those with spammers.
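Some of these services can also be asked programmatically. The sketch below queries the Stop Forum Spam website – the exact endpoint and response format are assumptions on my part, so check their API documentation before relying on it, and treat the answer as one signal among several:

import urllib.request

def looks_like_spammer(ip):
    # Stop Forum Spam returns a small XML document; <appears>yes</appears>
    # means the IP has been reported (endpoint and format assumed - see their docs)
    url = 'http://www.stopforumspam.com/api?ip=' + ip
    with urllib.request.urlopen(url) as response:
        return b'<appears>yes</appears>' in response.read()

print(looks_like_spammer('202.99.29.27'))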
Using this method you will now have identified the most active spambots hitting your website. And now it gets very simple: you just block their access to your site. This reduces the number of spam submissions, and it also helps reduce the load on your website when many spambots hit it at the same time, slowing the site down for human visitors.
Denying spammers access to your site through .htaccess
A word of warning before I tell you how to block these IP addresses. First of all, you should make sure to check an IP carefully before blocking it – you may end up blocking your most active human visitors or search engines. Secondly, check the list of blocked IP addresses once in a while. Just because a spammer used an IP in March does not mean that a human cannot use it in December. It may be a good idea to use a whois service to find out where the IP is located geographically and to which network it belongs. Again, by using a search engine and paying particular attention to reliable websites (such as http://www.stopforumspam.com/ http://spam-ip.com/ http://www.projecthoneypot.org http://www.forumpostersunion.com/) you can identify the most active spammers. Blocking their IP addresses, be warned, is not the route to ultimate victory though – they can always get new ones. Using this approach you will reduce the impact of spambots, but you cannot defeat them.
So how do you block access for a certain IP address? If you have access to your web server it is very simple. The root or main directory of your website will contain a file called .htaccess (if not, just create it). You can edit it with any text editor. Add the following lines to it (depending on your setup there may already be a similar set of rules defined in your .htaccess file, so you may need to modify it instead), where XXX stands for the IP address you want to block:

Order allow,deny
Allow from all
Deny from XXX

As soon as you have done that, the IP is blocked. If you do not have access to the web server, you will have to configure your content management system or blogging software to block access from certain IPs. Using .htaccess is better though, as it denies the IP access at the server level - the request does not even reach your blog.
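If the list of blocked addresses grows, you can let a small script keep the rules up to date. A minimal sketch, assuming the blocked IPs live in a plain text file called blocked_ips.txt, one address per line – both file names are my own invention:

blocked = []
with open('blocked_ips.txt') as ip_file:          # assumed: one IP address per line
    for line in ip_file:
        ip = line.strip()
        if ip:
            blocked.append(ip)

rules = ['Order allow,deny', 'Allow from all']
rules += ['Deny from ' + ip for ip in blocked]

# Write the rules to a separate file; paste or include them in your .htaccess yourself
with open('htaccess_block_rules.txt', 'w') as out:
    out.write('\n'.join(rules) + '\n')

Writing to a separate file rather than straight into .htaccess means a typo in your IP list cannot silently break the live configuration.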
Concluding thoughts
Obviously, if you find that your blog is protected well enough against spam and is hosted by a provider that can ensure the website stays fast enough even when hit by many spambots, you can save yourself the trouble. Even I will probably stop doing it, as it is a manual process that takes time - and it only works after spammers have hit your blog.
Still, I found it very interesting to go through this exercise, as I now have a better idea of what happens to my website and of how to understand and fight comment spam. It also made me more aware of how difficult it is to get a reliable idea of how many readers you actually have, and more suspicious of blogs boasting high visitor numbers without giving a more detailed analysis of how they arrived at them.
In the next instalment of this series I will introduce you to two services that we have just implemented on the Wine Rambler - they promise to take care of the CAPTCHA issues I described in the previous posting and to automate, to a certain extent, the process I have just described.