Very few coding activities expose us to as many exciting technologies at once as writing Web bots does: the web (HTTP/HTML/XML/CSS/JS), data mining, NLP, and security (IPS/WAF evasion). Yet, with the proliferation of automated tools and Google, this once mainstream skill is coming to be seen as either black-hat only or too "old-school". Neither should be the case, and in this series of tutorials I will introduce you, who have never done it before, to web scraping and show how easy and exciting it can be.

By the end of this series, you will have written enough code and gained enough knowledge to build decent web scrapers suited to your own ideas.

This series will cover:
- programming your own web scraper in Python, starting with the simplest ones, just a few lines of code long, and progressing to robust, fault-tolerant scrapers
- creating scrapers first without advanced modules, to understand how they work under the hood
- taking into account in code all the aspects of HTTP, like authentication, cookies, error codes, and redirects
- progressing then to modules dedicated to the task, like Beautiful Soup, Scrapy, and Selenium
- learning how to make our scrapers tolerant to website changes
- controlling the behavior of the scrapers so as not to cause excessive load on the websites
- making them as website-friendly as possible but still functional
- processing the scraped information into a format suitable for further processing
- measures to make the scraper stealthier, to prevent it from being blocked by WAFs and other anti-scraping systems

LEGAL ADVICE: I am not a lawyer, so my only legal advice for you is: do NOT do anything that may require you to look for legal advice. The best way to follow it is to use anything I talk about only on a system you own, which is very easy to have today: virtual machines, your own website.

On the real-life experience side of things, web scraping is NOT illegal by itself. After all, the scraper mostly does what the website owner allows and intends visitors to do, which is requesting website pages. The problems start when a bad scraper shows up at the website's doors. By a ‘bad scraper’ I mean one which behaves far more aggressively than, and differently from, a regular human browsing the site. I will talk about making our scrapers neighborhood-friendly, but still, that is no guarantee that the site owner will be happy with them, so to be on the safe side read my ‘legal advice’ above. In addition, any scraping can be made illegal by a ‘Terms of Service’ agreement enforced by a website which explicitly prohibits using automated tools for whatever reason or in whatever way.

Why use web scrapers (or don’t we have browsers for that)?
Web scrapers, also called web bots, web spiders, screen scrapers, web crawlers, site downloaders, and website mirror downloaders, have existed ever since the Internet became commonly available. HTTrack (https://en.wikipedia.org/wiki/HTTrack), one of the most popular open source scrapers, has been available since 1998. The very first web crawlers were the search engines themselves, which proved so crucial to the usability of the Internet. The reasons to have scrapers today are many, ranging from very legitimate to very illegitimate. Here are a few things people do with them, based on real-life cases.

Legal ones:
- penetration testing: compiling keywords for brute-force dictionaries tailored to a specific company, or scraping sites like pastebin.com for data dumps
- web development, when the developer has no access to the site files but needs to work on them
- running a link verification web bot on your website
- creating price/review comparison digests or other types of content-aggregating websites
- search engine indexing, including custom search engines for internal use
- live statistics or any information monitoring sites (quotes, traffic or weather reports)

Illegal ones:
- gathering personal data like emails/names/phones, even if openly published, to be used later for spam/fraud
- intellectual property theft in the form of publishing scraped content as one's own (my blog yurisk.info posts get ‘stolen’ on such a regular basis that I do not care to do anything about it anymore, not that it is ok)
- click fraud and other ad-related fraud
- A real story I heard about: a black-hat SEO fraudster was scraping content, which included time-limited discounts, and then posting the stolen content in multiple locations on the Internet, this way lowering the search rating of the original sites in Google and at the same time promoting his own affiliate programs and ads, thus making them more attractive.
- Ad click fraud is a variation of the above, where a fraudster automatically scrapes a competitor's website and clicks on its ads, which triggers abuse complaints and gets the ads withdrawn.

But can’t we just use browsers, especially with built-in scripting? Yes and no. Yes, we can browse manually, or script standalone browsers to some degree with iMacros, AutoIt, and other tools, but this has a few disadvantages. Browsers require a user to operate them; even when automated they don’t scale well; they have no ‘brains’ or logic built in, so the user has to decide what is relevant for her and what is not; and the speed of retrieving data is incomparably slower. Web scrapers turn the weaknesses of a browser into their strengths.

What (do scrapers do)?
That is simple: given a website, they crawl all of its pages, following internal and (optionally) external links, and save the content wherever you want.

How do they do it?
They connect over the network to the server hosting the website, request some page(s) via the HTTP protocol, parse them to distinguish between the content and the HTML/XHTML/CSS markup, save the content locally to a file or a database, extract links from the pages, follow them, and do the whole sequence again.

Assumptions: you know the basics of Python or any other programming language. The code will be kept to the essentials and therefore easy to understand: no object-oriented programming, no complex data structures (linked lists/stacks, hey, we are having fun and not surviving a CS course!), and error checking/exception handling only when needed.

All the code will also be posted in a GitHub repository here: https://github.com/yuriskinfo/Web-scraping-with-Python

Software used:
- latest stable version of Python 3.x (I use 3.6.2 but earlier versions should work as well)
- requests module from https://docs.python-requests.org/en/master/
- latest stable version of Beautiful Soup 4.x from https://www.crummy.com/software/BeautifulSoup/
- additional packages as we progress
- yurisk.info, for 2 reasons: I own it (see legal advice above), and it was originally a WordPress blog turned into a static HTML site, which left its markup quite messy (good for learning). As long as you use the code from this series you can also practice on my website, on the condition that you do not scrape it endlessly in a loop and that you introduce a delay of at least 5 seconds in any of your loops.
- I will add additional subdomains for specific purposes as the need arises (for example to emulate dynamic websites, anti-scraping protection).
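
The sequence described under "How do they do it?" (request, save, extract links, follow, repeat) can be sketched in a few lines before we write any real network code. This is only an illustration under stated assumptions, not the code this series will end up with: the names `crawl`, `fetch`, `maxPages` and `LINK_RE` are made up for this sketch, `fetch` is assumed to be any function that takes a URL and returns the page text (or None on failure), and link extraction uses a crude regular expression. The default `delay` of 5 seconds mirrors the condition for practicing on yurisk.info.

```python
import re
import time
from urllib.parse import urldefrag, urljoin, urlparse

# Crude pattern for href values inside <a> tags; good enough for a sketch,
# a real HTML parser is more reliable on messy markup.
LINK_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(startUrl, fetch, delay=5, maxPages=10):
    """Crawl internal links breadth-first and return {url: page text}."""
    site = urlparse(startUrl).netloc
    queue = [startUrl]            # URLs waiting to be requested
    seen = {startUrl}             # URLs already queued, to avoid repeats
    saved = {}
    while queue and len(saved) < maxPages:
        url = queue.pop(0)
        page = fetch(url)         # request the page
        time.sleep(delay)         # pause after every request, successful or not
        if page is None:          # fetch() is assumed to report failure as None
            continue
        saved[url] = page         # save the content (here: in memory)
        for href in LINK_RE.findall(page):            # extract links
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)                    # follow internal links only
    return saved
```

Passing the fetching function in as a parameter keeps the loop independent of how pages are actually retrieved, so it can be tried against a dictionary of pages held in memory (with `delay=0`) before it is ever pointed at a live site.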

Ok, enough with theory, let’s code something. We will start with the first step in any web scraping, that is, connecting to a server and requesting its home page. We have many modules and possibilities here, but to start simple we will do it with the urlopen function from the standard urllib.request module and with the de facto standard requests module. The first one is installed by default, but we have to download the requests module.

My personal tip: for learning purposes I always use Anaconda (from https://www.continuum.io), a free package distribution which is ideal for experimenting with Python code and also makes installing packages just a matter of search/click/install. I highly recommend it, especially if you don’t already have an environment set up for learning.

The usual way will do as well:
`pip install requests`
While we are at it, go ahead and install Beautiful Soup 4 (BS4) as well:
`pip install beautifulsoup4`

Task 1:
Connect to the server using both HTTP and HTTPS, request its home page, and check for errors in doing so.

NOTE: If the code gets formatted badly, look in the GitHub repository I mentioned.

Using the standard urlopen function:

Solution 1.0:
```python
from urllib.request import urlopen
from urllib.error import HTTPError

def getSinglePage(url):
    try:
        rawPage = urlopen(url)
        # read() returns the page body as raw bytes
        print(rawPage.read())
    except HTTPError as ee:
        print(ee.reason)
        print(ee.code)
        return None

urlToGet = r'http://yurisk.info/'
ourPage = getSinglePage(urlToGet)
```

First, we import the functions needed to retrieve a page, also getting access to the errors retrieval can potentially produce. Next, we create a function getSinglePage() to get us a page from the supplied URL; it is better to start separating functionality into functions from the very beginning. This function uses the urlopen() function, which returns an object representing the page retrieved from the server. Then we invoke read() to read the page content and print it to the console. If an HTTP error happens, we catch it as an HTTPError exception (imported from urllib.error) in the except part of the function. We can even learn the exact error by printing its response code (part of the HTTP standard: 200 means OK, 404 is the dreaded Not Found error) with the .code property and its string explanation with the .reason property.

I introduce error processing for a reason: errors in retrieving data happen all the time, and your scraping code has to account for them. Some errors may be ignored, some should be acted upon. If you decide not to process errors, your scraper won’t get far. We will just take note of the errors for now.

Finally, we use the function getSinglePage by supplying it with the URL http://yurisk.info .

You will notice that we got the page as an unformatted stream of bytes. That is the good and the bad of urllib: it is more low-level than the requests library and expects us to make most of the decisions. One of the decisions is about the encoding of the retrieved page. urlopen assumes nothing; you have to tell it explicitly. requests, on the other hand, looks at the server's HTTP response headers and uses the advertised encoding to represent the page.

Let’s see the difference by manually adding the encoding for the page, which in this case is UTF-8, and using HTTPS along the way:

Solution 1.1:
```python
from urllib.request import urlopen
from urllib.error import HTTPError

def getSinglePage(url):
    try:
        rawPage = urlopen(url)
        # decode the raw bytes into text using the page's encoding
        print(rawPage.read().decode('utf-8'))
    except HTTPError as ee:
        print(ee.reason)
        print(ee.code)
        return None

urlToGet = r'https://yurisk.info/'
ourPage = getSinglePage(urlToGet)
```

As you see, the page is formatted much better now. This shows that the advantage of urlopen is in being part of a standard module, while the disadvantage is that it requires more manual work. Switching to HTTPS is just a matter of specifying https in the URL.

We will do the same task with the requests module:

Solution 1.2:
```python
import requests

def getSinglePage(url):
    try:
        rawPage = requests.get(url)
        # requests does not raise on 4xx/5xx responses unless asked to
        rawPage.raise_for_status()
        print(rawPage.text)
    except requests.exceptions.HTTPError:
        print(rawPage.status_code)
        print(rawPage.reason)

urlToGet = r'http://yurisk.info/'
getSinglePage(urlToGet)
```

The differences are:
- we have to specify the HTTP retrieval method, here it is GET
- the content of a page is available as the .text property of the object
- we have access to the same properties of the retrieved page as an object: .reason for a string explanation and .status_code for the numerical response status code (which is more important, as the string explanation is arbitrary and can be changed by the server administrator according to her/his imagination).
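
The encoding decision we made by hand for urlopen can also be read from the response headers, much like requests does. Below is a minimal sketch of that idea, not a definitive implementation: `fetchText`, the `timeout` value of 10 seconds and the UTF-8 `fallback` are assumptions made for this example. It also catches URLError, the parent class of HTTPError, which urllib raises when no usable HTTP answer arrives at all (for example a DNS failure or a refused connection).

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetchText(url, timeout=10, fallback='utf-8'):
    """Return the page at url as text, or None if the request failed."""
    try:
        with urlopen(url, timeout=timeout) as rawPage:
            # Charset advertised in the Content-Type header, if there is one
            charset = rawPage.headers.get_content_charset() or fallback
            # errors='replace' swaps undecodable bytes for a marker character
            return rawPage.read().decode(charset, errors='replace')
    except HTTPError as ee:
        # The server answered, but with an error status (404, 500, ...)
        print(ee.code, ee.reason)
    except URLError as ee:
        # No usable answer: DNS failure, refused connection, unknown scheme
        print(ee.reason)
    return None
```

HTTPError has to be listed before URLError, otherwise the more general handler would catch both. Calling `fetchText('https://yurisk.info/')` should return the page as text, provided the server is reachable and the advertised charset is one Python knows.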

Today we have learned to connect to a server over HTTP/HTTPS and request the page specified in a user-supplied URL. In the next lesson we will continue with our best friend urllib and learn to extract hyperlinks from web pages, then use them to build a small link verification bot.

Resources:
- Code: https://github.com/yuriskinfo/Web-scraping-with-Python
- my site with Web scraping/crawling and other recipes in other languages like C# / C