Custom Python Script: Webscraping with Mechanize and Beautifulsoup

August 12, 2016 | Views: 5888

Hello, fellow Cybrarians!

I’m back with another post. With this script, we’re gonna scrape websites without ever opening a browser!

 

Why web scraping? What’s in it for me?

Web scraping speeds up data analysis and lets you spider a whole website for important links, which can be useful during recon or any other work you carry out in your security assessment cycle.

 

Well! That’s cool! How would I do that?

I’ll explain how to scrape a live website (I’ll scrape my university website to fetch my results onto my terminal). Python has many libraries for this. I’ll use two of them, one for automating the requests and one for analyzing the data: “Mechanize” and “BeautifulSoup”.

 

Getting the Modules

Open your terminal. If you’re running Kali Linux, Python setuptools is already installed, so the easy_install command is available to you.

If it’s not available (e.g., on Ubuntu), install it first: sudo apt-get install python-setuptools

Once it’s installed, type this in the terminal:

easy_install mechanize

easy_install beautifulsoup

 

Writing the Script:

Let’s start with the basic operations of the modules. Here, Mechanize automates the form actions, such as fetching the page and interacting with its elements. BeautifulSoup pulls the content out of specific HTML tags so the data can be analyzed.

As I already mentioned, I’ll be scraping my university website. The script first automates entering my university number and fetches the resulting web page. It then scrapes the required result from that page and displays it in our own format.

 

Import the modules into your Python script:

import mechanize
from BeautifulSoup import BeautifulSoup

 

Instantiate the Mechanize object and open a website using Mechanize:

br = mechanize.Browser()  # instantiate the browser object
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]  # no real browser is involved, so set the header manually
br.set_handle_robots(False)  # ignore robots.txt
ip = raw_input("enter the university seat number\n")  # take the USN as input
br.open('http://results.vtu.ac.in/')  # my university website is results.vtu.ac.in

Optional: br.set_proxies(proxy)  # request through a proxy; call it before br.open(), with proxy being a dict such as {"http": "host:port"}

 

Next, you have to find the form on the opened web page and fill it in. You can do this by inspecting the page to find the form you want and the name of the field to be filled.

br.select_form(nr=0)  # select the first form on the page
br['rid'] = ip  # set the form control named "rid" to the input taken earlier
res = br.submit()  # submit the form
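
If you are not sure which field name to use (the script above uses “rid”), you can list the form fields of a page before filling anything in. Mechanize can iterate over a page’s forms with br.forms(); the snippet below is only a minimal stand-in sketch that does the same kind of listing with Python 3’s standard html.parser, so it runs without extra modules. The sample page and the names FormFieldLister and list_form_fields are made up for illustration and are not the real university markup.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect the names of the fields inside every form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one list of field names per <form>, in page order
        self._in_form = False  # True while the parser is inside a <form>

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms.append([])
            self._in_form = True
        elif tag in ("input", "select", "textarea") and self._in_form:
            name = dict(attrs).get("name")
            if name:
                self.forms[-1].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def list_form_fields(html):
    """Return a list of field-name lists, one per form found in html."""
    parser = FormFieldLister()
    parser.feed(html)
    return parser.forms


# Hypothetical page used only for this example.
SAMPLE_PAGE = """
<form action="/results" method="post">
  <input type="text" name="rid">
  <input type="submit" name="submit" value="SUBMIT">
</form>
"""

for index, fields in enumerate(list_form_fields(SAMPLE_PAGE)):
    print(index, fields)
```

The index printed for each form corresponds to the nr argument of select_form, and the names are what you would use as keys when filling in the form.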

 

Next, you’ll store the target web page and parse it with BeautifulSoup.

html = res.read() #read and store the result html page
soup = BeautifulSoup(html) #pass it for further parsing with beautiful soup

 

Now we have to find the tag whose content we want to display. After analyzing the fetched HTML page, I found that what I needed was the content of a specific table element. The raw snippet is shown below:

<table cellspacing="0" cellpadding="0" width="515" bgcolor="#ffffff" border="0">
<tbody>
<tr>
<td>
<h4 class="style1"> RESULTS (PROVISIONAL)</h4>
<h4><br />
<br />
<br />
&nbsp;</h4>
</td></tr><tr>
<td width="513">
<b>K B GOUTHAM MADHWARAJ (4mh12is021) </b><br /><br /><br /><br /><hr /><table><tr><td><b>Semester:</b></td><td><b>8</b></td><td></td><td> &nbsp;&nbsp;&nbsp;&nbsp;<b> Result:&nbsp;&nbsp;FIRST CLASS WITH DISTINCTION </b></td></tr></table><hr /><table><tr><td width="250">Subject</td><td width="60" align="center">External </td><td width="60" align="center">Internal</td><td align="center" width="60">Total</td><td align="center" width="60">Result</td></tr><br /><tr><td width="250"><i>Software Architectures (10IS81)</i></td><td width="60" align="center">63</td><td width="60" align="center">23</td><td width="60" align="center">86</td><td width="60" align="center"><b>P</b></td></tr><tr><td width="250"><i>Information Retrieval (10IS842)</i></td><td width="60" align="center">59</td><td width="60" align="center">19</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr><tr><td width="250"><i>System Modeling and Simulation (10CS82)</i></td><td width="60" align="center">55</td><td width="60" align="center">23</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr><tr><td width="250"><i>Project Work (10IS85)</i></td><td width="60" align="center">96</td><td width="60" align="center">98</td><td width="60" align="center">194</td><td width="60" align="center"><b>P</b></td></tr><tr><td width="250"><i>Information and Network Security (10IS835)</i></td><td width="60" align="center">47</td><td width="60" align="center">25</td><td width="60" align="center">72</td><td width="60" align="center"><b>P</b></td></tr><tr><td width="250"><i>Seminar (10IS86)</i></td><td width="60" align="center">0</td><td width="60" align="center">49</td><td width="60" align="center">49</td><td width="60" align="center"><b>P</b></td></tr></table><br /><br /><table><tr><td></td><td></td><td>Total Marks:</td><td> 557 &nbsp;&nbsp;&nbsp; </td></tr></table> </td></tr>
<tr>
<td width="513">
<ul>
</ul>
<p align="right"><b>Sd/-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</b></p>
<p align="right"><b>REGISTRAR(EVALUATION)</b></p>
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<td width="513">
</td></tr>
<tr>
<td width="513">
</td></tr>
<tr>
<td width="513">
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="5"><img alt="" border="0" height="1" src="../image/px.gif" width="5" /></td></tr></tbody></table></td></tr>
</tbody></table>

 

My results are in a table tag that has no attributes, and it is nested within another table tag that has a unique width attribute. I’ll use that outer table as a reference point for further parsing.

table = soup.findAll("table", {"width": "515"})[0]  # first table tag on the page with the unique attribute width=515; it holds my name, which needs to be grabbed

innertable = table.findAll("table")[1]  # 2nd table tag nested inside the reference table; it holds my subject-wise results

innertable2 = table.findAll("table")[0]  # 1st table tag nested inside the reference table; it holds my semester and class of result
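
To see the “find the unique outer table, then index the tables nested inside it” idea in isolation, here is a small sketch. It is not the BeautifulSoup code from this post: it swaps in Python 3’s standard xml.etree.ElementTree so it runs without extra modules, which only works because the sample below is tiny and well-formed (real pages usually are not, which is why the script uses BeautifulSoup). The sample markup and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the results page.
SAMPLE = """
<html><body>
<table width="515">
  <tr><td><b>STUDENT NAME (usn)</b>
    <table><tr><td><b>Semester:</b></td><td><b>8</b></td></tr></table>
    <table>
      <tr><td>Subject</td><td>Total</td></tr>
      <tr><td><i>Subject A</i></td><td>86</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Reference point: the first table whose width attribute is 515.
outer = root.findall(".//table[@width='515']")[0]

# All tables nested anywhere below the reference table (the outer one is not included).
nested = outer.findall(".//table")
summary_table, marks_table = nested[0], nested[1]

print(len(nested))                                     # number of nested tables
print(summary_table.findall(".//b")[1].text)           # semester value
print(marks_table.findall("tr")[1].find("td/i").text)  # first subject name
```

As in the script, the search for nested tables starts from the reference table rather than from the whole page, so the indexes stay stable even if other tables appear elsewhere on the page.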

 

Finally, fetch the important details from each of these tables and print them together as one summary.

for rowb in table.findAll('tr')[1:]:  # skip the first row (the heading); the next row holds the name
    col2 = rowb.findAll('td')[0].findAll('b')[0]  # first b tag inside the first td tag of the row
    print "NAME: " + col2.string  # print the content of that b tag
    break  # only the first such row is needed

for rowa in innertable2.findAll('tr')[0:]:  # innertable2 has a single row, so nothing is skipped
    col1 = rowa.findAll('td')  # all td tags of the row
    print "\nSEMESTER: " + col1[1].b.string  # content of the 2nd td tag (the semester)
    print col1[3].b.string  # content of the 4th td tag (the overall result)
    print "\n"

print "format is \n" + "subject\t\texternal\tinternal\ttotal\tresult(p/f)\n"
print "your complete result:\n"
print "USN:" + ip + "\n"

for row in innertable.findAll('tr')[1:]:  # every row of the results table, skipping the header row
    col = row.findAll('td')  # fetch the cells the same way as for the previous table
    subject = col[0].i.string
    external = col[1].string
    internal = col[2].string
    total = col[3].string
    result = col[4].b.string
    record = (subject, external, internal, total, result)  # one record per subject
    print "\t".join(record)  # print the record as tab-separated columns
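
One thing to watch in the last loop: joining with a tab only works when every cell actually yields a string, and a tag’s .string can come back as None (for example, for an empty cell), which makes the join fail. The sketch below shows one way to turn table rows into tab-separated records while treating empty cells as empty strings. It is written for Python 3 and uses the standard xml.etree.ElementTree on a small well-formed sample instead of BeautifulSoup, so it runs as-is; the table contents and the helper names cell_text and table_records are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical marks table; the second data row has an empty cell on purpose.
MARKS_TABLE = """
<table>
  <tr><td>Subject</td><td>External</td><td>Internal</td><td>Total</td><td>Result</td></tr>
  <tr><td><i>Subject A (CODE1)</i></td><td>63</td><td>23</td><td>86</td><td><b>P</b></td></tr>
  <tr><td><i>Subject B (CODE2)</i></td><td>59</td><td></td><td>59</td><td><b>P</b></td></tr>
</table>
"""


def cell_text(cell):
    """Return all text inside a cell, including nested tags, or '' if it is empty."""
    return "".join(cell.itertext()).strip()


def table_records(table_xml):
    """Return one tuple of cell texts per data row, skipping the header row."""
    table = ET.fromstring(table_xml)
    rows = table.findall("tr")[1:]
    return [tuple(cell_text(td) for td in row.findall("td")) for row in rows]


for record in table_records(MARKS_TABLE):
    print("\t".join(record))
```

The same idea carries over to the BeautifulSoup version: replace a missing value with an empty string before building the record, and the join no longer depends on every cell being filled.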

 

Complete working code: http://pastebin.com/5hhxHqLZ

University number (for trial purposes): 4mh12is021

Screenshot of the scraped data: [image: Screenshot from 2016-08-03 23-48-35]
