When I was at university I would help my dad with house painting for some cash. So when I hear the term 'Web Scraping' I think of scraping paint off some kind of virtual Internet wall. I suppose in a way that is what it is; only the paint is data. As the world becomes a more data-driven place, data has become a valued commodity, so acquiring good quality free data can be quite a tricky thing. Enter web scraping: the practice of automatically parsing webpages with a script or a bot to extract and download data. You may not realise that web scraping plays an integral part in how the Internet works. For example, all of Google's search results are collected by Google's automated web scraping spider (Googlebot). In fact, it is estimated that roughly half of all internet traffic comes from bots, according to this 2016 report by Incapsula:
https://www.incapsula.com/blog/bot-traffic-report-2016.html
The first known web spider was called the 'World Wide Web Wanderer' and was created by Matthew Gray in 1993. Matthew built the spider to crawl the internet and collect data on the number of websites, in order to measure the web's size.
http://history-computer.com/Internet/Conquering/Wanderer.html
Fast forward almost 25 years and technology and the internet have become intertwined with almost every facet of modern society, along with a fear that AI algorithms fed on big data will soon reach singularity and take over the world. Obviously a dramatic fantasy, but the data revolution has definitely arrived and will continue to transform our lives. The reality is that while there is more data than you can poke a stick at, useful data is hard to come by. So where can one find quality free data? Of course, you guessed it: the internet. Interestingly, I couldn't find quantitative information on how much internet data feeds companies; the closest thing I could find was a list of different uses:
http://www.entropywebscraping.com/2017/01/01/big-list-web-scraping-uses/
In my opinion this is just the beginning; as we become more and more connected, the methods of harvesting data will continue to evolve rapidly. So what are some ways this can be done right now? Well, there is a list as long as my arm of different frameworks for web scraping. One of the ones I use is a Python library called 'Mechanize'. Mechanize is a headless web browser that can be controlled programmatically. However, as I found out, not all websites like you scraping their data. In a recent project of mine I was trying to scrape horse racing data. I was downloading files from URLs with random(ish) ID numbers, so I thought I would just brute force it and try to download every file ID number. However, I soon began receiving 'HTTP Error 403' responses. A 403 response means the web server has been configured to deny access to the requested resource.
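To make that concrete, here is a minimal sketch of the kind of loop I was running, assuming the Python 'mechanize' library. The URL pattern, ID range and file names below are hypothetical placeholders, not the actual site I was scraping:

```python
import mechanize

# Hypothetical URL pattern -- the real site and ID scheme are placeholders.
BASE_URL = "http://example.com/results?id={}"

br = mechanize.Browser()
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; Linux x86_64)")]

for file_id in range(1000, 1010):
    url = BASE_URL.format(file_id)
    try:
        data = br.open(url).read()
        with open("race_{}.csv".format(file_id), "wb") as f:
            f.write(data)
        print("Downloaded", url)
    except mechanize.HTTPError as e:
        # This is where the 403s started showing up once the server
        # decided it didn't like my brute-force approach.
        print("Failed {}: HTTP {}".format(url, e.code))
```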
A bit of research led me to find that every website (well, at least most websites) has a 'robots.txt' file that basically explains what bot behaviour is and isn't allowed on the site. For example:
http://www.netflix.com/robots.txt
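If you want to check those rules programmatically rather than reading the file by eye, Python's standard library ships a robots.txt parser. A minimal sketch (the paths queried below are just examples):

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2

rp = RobotFileParser()
rp.set_url("http://www.netflix.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch some example paths.
print(rp.can_fetch("*", "http://www.netflix.com/browse"))
print(rp.can_fetch("*", "http://www.netflix.com/"))
```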
I found this interesting post by Ben Frederickson that analyses the robots.txt rules from the top one million websites; in my opinion it provides quite an interesting insight.
http://www.benfrederickson.com/robots-txt-analysis/
So how do servers enforce robots.txt rules? Well, that's the trick: it wouldn't be any good if servers blocked actual people, so if you can make your bot act like a real person then you should be OK to continue scraping. Whether this is legal or not is an entirely different question. The answer, at least under American law, is maybe not, according to this article:
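As a rough taste of what 'acting like a real person' looks like in code, here is a sketch using mechanize: spoof a browser User-Agent, tell mechanize not to honour robots.txt, and pause between requests. The header string and URLs are just examples, and whether you should do this to any particular site is exactly the legal and ethical question above:

```python
import time
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # stop mechanize from honouring robots.txt
br.addheaders = [("User-agent",
                  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0 Safari/537.36")]

for url in ["http://example.com/page1", "http://example.com/page2"]:
    html = br.open(url).read()
    print(len(html), "bytes from", url)
    time.sleep(2)  # pause between requests, like a human reader would
```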
In part two I'll provide some examples of how to get started on your web scraping career. 'Til then, happy New Year!