
Custom Python Script: Webscraping with Mechanize and Beautifulsoup


By: prometheus

August 12, 2016

Hello, fellow Cybrarians! I'm back with another post. With this script, we'll scrape websites without actually interacting with a browser.

Why web scraping? What's in it for me?

Web scraping speeds up the analysis of data and lets you spider a whole website for important links, which may interest you during your recon work or during your security assessment cycle.

Well! That's cool! How would I do that?

I'll explain how to scrape a live website (I'll scrape my university website to fetch my results onto my terminal). The script is implemented in Python, where many libraries are available; I'll use two of them to handle the automation and the analysis of the data: "Mechanize" and "Beautifulsoup".

Getting the Modules

Go to your terminal. If you're running Kali Linux, the Python setuptools are already installed and the easy_install command is already available to you. If it's not available (e.g., on Ubuntu): sudo apt-get install python-setuptools

Once setuptools is installed, type this in the terminal:

easy_install mechanize

easy_install beautifulsoup

Writing the Script

Let's start with the basic operations of the modules. Mechanize is used to automate form actions, such as fetching a page and interacting with its main elements. Beautifulsoup is mainly used for pulling the content of specific HTML tags for analysis of the data.

As I already mentioned, I'll be scraping my university website: first automating the entry of my university seat number, then fetching the resulting web page, and finally scraping the required result out of the fetched page and displaying it in our own format.

Import the modules into the Python script:

import mechanize
from BeautifulSoup import BeautifulSoup

Instantiate the Mechanize object and open a website using Mechanize:

br = mechanize.Browser()  # instantiate the browser object
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:')]  # since there is no physical browser, request using a manual header
br.set_handle_robots(False)  # used to ignore robots.txt
ip = raw_input("enter the university seat number\n")  # take the USN
br.open('')  # my university website

Optional: br.set_proxies(proxy)  # route the request through a proxy
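Mechanize is Python 2-only; for comparison, the same idea of sending a browser-like User-Agent header (because there is no real browser behind the request) can be sketched in Python 3 with the standard library's urllib. The URL below is a placeholder, not my university's:

```python
import urllib.request

# Hypothetical URL standing in for the real results page.
req = urllib.request.Request(
    'http://example.com/results',
    headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US)'},
)
# Nothing is sent yet; urllib stores the header until urlopen(req) is called.
print(req.get_header('User-agent'))
```

Note that urllib normalizes header names, so the stored header is looked up as 'User-agent'.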

Next, you have to find and fill in the form element on the opened web page. You can do this by inspecting the page, finding the form of your choice, and noting the id of the field to be filled.

br.select_form(nr=0)  # select the first form on the page
br['rid'] = ip  # find the form element with id "rid" and set it to the taken input
res = br.submit()  # submit the form details
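If you want to see what select_form and br['rid'] are matching against without Mechanize, the field ids can be pulled straight out of the HTML with Python's standard html.parser. A minimal Python 3 sketch; the form markup here is a made-up stand-in for the real page, only the id "rid" comes from the article:

```python
from html.parser import HTMLParser

# Hypothetical form markup standing in for the real results page.
PAGE = '<form action="/results"><input type="text" id="rid" name="rid" /></form>'

class FormFieldFinder(HTMLParser):
    """Collect the id of every <input> so we know which fields to fill."""
    def __init__(self):
        super().__init__()
        self.field_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attr_map = dict(attrs)  # attrs is a list of (name, value) pairs
            if 'id' in attr_map:
                self.field_ids.append(attr_map['id'])

finder = FormFieldFinder()
finder.feed(PAGE)
print(finder.field_ids)  # -> ['rid']
```

The id printed here is exactly what you would use as the key in br['rid'] = ip.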

 Next, you'll store the target web page and parse it with BeautifulSoup.

html = res.read()  # read and store the resulting HTML page
soup = BeautifulSoup(html)  # pass it to Beautifulsoup for further parsing

Now, we have to find the specific tag of interest, or the specific content of a tag that is important to us. After analyzing the fetched HTML page, I found that my interest was the content of a specific table tag element. The raw snippet is shown below:

<table cellspacing="0" cellpadding="0" width="515" bgcolor="#ffffff" border="0"><tbody>
<tr><td><h4 class="style1"> RESULTS (PROVISIONAL)</h4><h4><br /><br /><br />&nbsp;</h4></td></tr>
<tr><td width="513"><b>K B GOUTHAM MADHWARAJ (4mh12is021) </b><br /><br /><br /><br /><hr />
<table>
<tr><td><b>Semester:</b></td><td><b>8</b></td><td></td><td> &nbsp;&nbsp;&nbsp;&nbsp;<b> Result:&nbsp;&nbsp;FIRST CLASS WITH DISTINCTION </b></td></tr>
</table><hr />
<table>
<tr><td width="250">Subject</td><td width="60" align="center">External </td><td width="60" align="center">Internal</td><td align="center" width="60">Total</td><td align="center" width="60">Result</td></tr><br />
<tr><td width="250"><i>Software Architectures (10IS81)</i></td><td width="60" align="center">63</td><td width="60" align="center">23</td><td width="60" align="center">86</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Information Retrieval (10IS842)</i></td><td width="60" align="center">59</td><td width="60" align="center">19</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>System Modeling and Simulation (10CS82)</i></td><td width="60" align="center">55</td><td width="60" align="center">23</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Project Work (10IS85)</i></td><td width="60" align="center">96</td><td width="60" align="center">98</td><td width="60" align="center">194</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Information and Network Security (10IS835)</i></td><td width="60" align="center">47</td><td width="60" align="center">25</td><td width="60" align="center">72</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Seminar (10IS86)</i></td><td width="60" align="center">0</td><td width="60" align="center">49</td><td width="60" align="center">49</td><td width="60" align="center"><b>P</b></td></tr>
</table><br /><br />
<table><tr><td></td><td></td><td>Total Marks:</td><td> 557 &nbsp;&nbsp;&nbsp; </td></tr></table>
</td></tr>
<tr><td width="513"><ul></ul><p align="right"><b>Sd/-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</b></p><p align="right"><b>REGISTRAR(EVALUATION)</b></p>
<table cellspacing="0" cellpadding="0" border="0"><tbody><td width="513"></td></tr>
<tr><td width="513"></td></tr>
<tr><td width="513"><table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td width="5"><img alt="" border="0" height="1" src="../image/px.gif" width="5" /></td></tr></tbody></table></td></tr>
</tbody></table>

Since my results are in a table tag that has no attributes and is nested within another table tag with a unique width attribute, I'll use that as a reference checkpoint for further parsing.

table = soup.findAll("table", {"width": "515"})[0]  # first occurrence of a table tag with the unique attribute width=515; this subtree holds my name, which needs to be grabbed

innertable = table.findAll("table")[1]  # second occurrence of a table tag (no attributes) under the reference tag; this holds my result

innertable2 = table.findAll("table")[0]  # first occurrence of a table tag under the reference tag; this holds my class and semester
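BeautifulSoup 3's findAll does this indexing for you, but the same row-collecting and header-skipping logic can be sketched with only the standard library's html.parser. A Python 3 illustration over made-up minimal markup (two rows of the marks table from the snippet above, attributes omitted):

```python
from html.parser import HTMLParser

# A tiny stand-in for the marks table shown in the raw snippet above.
TABLE = ('<table>'
         '<tr><td>Subject</td><td>External</td></tr>'
         '<tr><td><i>Software Architectures (10IS81)</i></td><td>63</td></tr>'
         '</table>')

class RowCollector(HTMLParser):
    """Collect each <tr> as a list of its <td> texts."""
    def __init__(self):
        super().__init__()
        self.rows = []       # all completed rows
        self.row = None      # cells of the row being built
        self.in_td = False   # are we inside a <td>?
        self.cell = ''       # text of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag == 'td':
            self.in_td = True
            self.cell = ''

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            self.in_td = False
            self.row.append(self.cell)
        elif tag == 'tr' and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.in_td:  # text inside nested tags like <i> also lands here
            self.cell += data

collector = RowCollector()
collector.feed(TABLE)
print(collector.rows[1])  # -> ['Software Architectures (10IS81)', '63']
```

Indexing rows[1] mirrors the article's findAll('tr')[1:] pattern of skipping the header row.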

Finally, fetch the important details from each respective table to form an aggregated display of the information.

for rowb in table.findAll('tr')[1:]:  # skip the first row of the reference table
    col2 = rowb.findAll('td')[0].findAll('b')[0]  # find the first td tag, then the first b tag inside it
    print "NAME: " + col2.string  # fetch the content of the b tag stored in col2
    break  # only the first matching row is needed

for rowa in innertable2.findAll('tr')[0:]:  # iterate over the rows of innertable2 (the semester table)

    col1 = rowa.findAll('td')  # fetch all td tags of this row
    print "\nSEMESTER: " + col1[1].b.string  # content of the 2nd td tag
    print col1[3].b.string  # content of the 4th td tag (the result class)
    print "\n"

print "format is \n" + "subject\t\texternal\tinternal\ttotal\tresult(p/f)\n"
print "your complete result:\n"
print "USN:" + ip + "\n"

for row in innertable.findAll('tr')[1:]:  # iterate over the rows of the inner table, skipping the header row

    col = row.findAll('td')  # same fetching operation as for the previous table
    subject = col[0].i.string
    external = col[1].string
    internal = col[2].string
    total = col[3].string
    result = col[4].b.string
    record = (subject, external, internal, total, result)  # make a record of all the content
    print "\t".join(record)  # print the record tab-separated
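The record building and tab joining at the end is plain Python, independent of the scraping libraries. A self-contained Python 3 sketch with hypothetical values standing in for the scraped col[...] strings:

```python
# Hypothetical scraped values; the real ones come from the td tags above.
header = ('subject', 'external', 'internal', 'total', 'result(p/f)')
records = [
    ('Software Architectures (10IS81)', '63', '23', '86', 'P'),
    ('Information Retrieval (10IS842)', '59', '19', '78', 'P'),
]

# One tab-separated line per row, header first, matching the article's format string.
lines = ['\t'.join(header)] + ['\t'.join(rec) for rec in records]
print('\n'.join(lines))
```

Joining with '\t' keeps the columns machine-friendly, so the output can be redirected into a TSV file for later analysis.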

Complete working code:

Seat number (trial purpose): 4mh12is021

Screenshot (scraped data): [image: Screenshot from 2016-08-03 23-48-35]