
By: prometheus
August 12, 2016
Custom Python Script: Webscraping with Mechanize and Beautifulsoup

Install the modules:
easy_install mechanize
easy_install beautifulsoup

Writing the Script:
Let's start with the basic operations of the modules. Mechanize automates the browser-side actions, such as fetching a page and interacting with its main elements (forms in particular). BeautifulSoup is mainly used for pulling specific content out of the HTML tags so the data can be analyzed.

As I already mentioned, I'll be scraping my university website. The script first automates entering my university seat number and fetches the resulting web page, then scrapes the required result from that page and displays it in our own format.

Import the modules into the Python script:
import mechanize
from BeautifulSoup import BeautifulSoup

Instantiate the Mechanize object and open a website using Mechanize:
br = mechanize.Browser()  # instantiate
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]  # there is no physical browser, so send a manual header
br.set_handle_robots(False)  # ignore robots.txt
ip = raw_input("enter the university seat number\n")  # take USN input
br.open('https://results.vtu.ac.in/')  # my university website is results.vtu.ac.in

Optional (call it before br.open so the request actually goes through the proxy):
br.set_proxies(proxy)  # request through a proxy; proxy is a dict mapping a URL scheme to a proxy address
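
For readers on Python 3 without Mechanize, the same setup (a manual User-Agent header and an optional proxy) can be sketched with the standard library's urllib.request. This is a swapped-in technique, not the script's own code: the helper names build_request and make_opener and the proxy address are hypothetical, and nothing is fetched here. Unlike Mechanize, urllib does not consult robots.txt, so there is no equivalent of set_handle_robots.

```python
import urllib.request

# Same header value as in the script, with the parenthesis closed.
USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def make_opener(proxies=None):
    """Return an opener; route traffic through `proxies` when a dict is given."""
    handlers = []
    if proxies:
        handlers.append(urllib.request.ProxyHandler(proxies))
    return urllib.request.build_opener(*handlers)


request = build_request("https://results.vtu.ac.in/")
opener = make_opener({"http": "http://127.0.0.1:8080"})  # hypothetical proxy address

# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
# opener.open(request) would perform the actual fetch; it is not called here.
```
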

Next, you have to find the form on the opened web page and fill it in. This can be done by inspecting the page, picking the form of your choice, and noting the name of the field to be filled.
br.select_form(nr=0)  # select the first form on the page
br['rid'] = ip  # set the form field named "rid" to the input taken earlier
res = br.submit()  # submit the form details
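
To make the submit step less of a black box, here is a rough Python 3 sketch of what a form submission amounts to: the field names and values are URL-encoded and sent as the request body. It assumes the form uses POST and posts back to the page's own URL; neither is confirmed for this site, and build_form_request is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields):
    """Encode form fields the way a browser would and wrap them in a POST request."""
    body = urlencode(fields).encode("ascii")
    return Request(action_url, data=body, method="POST")


# "rid" is the field set in the script; the action URL is an assumption.
req = build_form_request("https://results.vtu.ac.in/", {"rid": "4mh12is021"})
print(req.get_method(), req.data)
```

Mechanize reads the real method and action from the form itself, which is why the script never has to spell them out.
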

Next, you'll store the target web page and parse it with BeautifulSoup.
html = res.read()  # read and store the result HTML page
soup = BeautifulSoup(html)  # pass it to BeautifulSoup for further parsing

Now we have to find the specific tag, or the specific content of a tag, that matters to us. After analyzing the fetched HTML page, I found that what I needed was the content of one particular table element. The raw snippet is shown below:
<table cellspacing="0" cellpadding="0" width="515" bgcolor="#ffffff" border="0"><tbody>
<tr><td><h4 class="style1"> RESULTS (PROVISIONAL)</h4><h4><br /><br /><br /> </h4></td></tr>
<tr><td width="513"><b>K B GOUTHAM MADHWARAJ (4mh12is021) </b><br /><br /><br /><br /><hr />
<table><tr><td><b>Semester:</b></td><td><b>8</b></td><td></td><td> <b> Result: FIRST CLASS WITH DISTINCTION </b></td></tr></table>
<hr />
<table>
<tr><td width="250">Subject</td><td width="60" align="center">External </td><td width="60" align="center">Internal</td><td align="center" width="60">Total</td><td align="center" width="60">Result</td></tr><br />
<tr><td width="250"><i>Software Architectures (10IS81)</i></td><td width="60" align="center">63</td><td width="60" align="center">23</td><td width="60" align="center">86</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Information Retrieval (10IS842)</i></td><td width="60" align="center">59</td><td width="60" align="center">19</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>System Modeling and Simulation (10CS82)</i></td><td width="60" align="center">55</td><td width="60" align="center">23</td><td width="60" align="center">78</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Project Work (10IS85)</i></td><td width="60" align="center">96</td><td width="60" align="center">98</td><td width="60" align="center">194</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Information and Network Security (10IS835)</i></td><td width="60" align="center">47</td><td width="60" align="center">25</td><td width="60" align="center">72</td><td width="60" align="center"><b>P</b></td></tr>
<tr><td width="250"><i>Seminar (10IS86)</i></td><td width="60" align="center">0</td><td width="60" align="center">49</td><td width="60" align="center">49</td><td width="60" align="center"><b>P</b></td></tr>
</table><br /><br />
<table><tr><td></td><td></td><td>Total Marks:</td><td> 557 </td></tr></table> </td></tr>
<tr><td width="513"><ul></ul><p align="right"><b>Sd/- </b></p><p align="right"><b>REGISTRAR(EVALUATION)</b></p>
<table cellspacing="0" cellpadding="0" border="0"><tbody><td width="513"></td></tr>
<tr><td width="513"></td></tr>
<tr><td width="513"><table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td width="5"><img alt="" border="0" height="1" src="../image/px.gif" width="5" /></td></tr></tbody></table></td></tr></tbody></table>

My results are in a table tag that has no attributes, but it is nested within another table tag whose width attribute is unique on the page, so I use that outer table as a reference point for further parsing.
table = soup.findAll("table", {"width": "515"})[0]  # first table tag on the page with the unique attribute width="515"; it also holds my name, which needs to be grabbed
innertable = table.findAll("table")[1]  # 2nd table tag (no attributes) inside the reference table; it holds my results
innertable2 = table.findAll("table")[0]  # 1st table tag inside the reference table; it holds my semester and class of result
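
The nested-table lookup can also be illustrated without BeautifulSoup. The following is a minimal Python 3 sketch built on the standard library's html.parser, not the approach used in this script. TableCollector and its "nested" list are hypothetical names, and SAMPLE is a trimmed copy of the snippet above with most attributes and rows removed. Each table records every table opened inside it, in document order, which roughly mirrors what table.findAll("table") returns.

```python
from html.parser import HTMLParser

# Trimmed from the raw snippet above; most attributes and rows removed.
SAMPLE = """
<table width="515"><tr><td><h4>RESULTS (PROVISIONAL)</h4></td></tr>
<tr><td><b>K B GOUTHAM MADHWARAJ (4mh12is021) </b>
<table><tr><td><b>Semester:</b></td><td><b>8</b></td><td></td>
<td><b> Result: FIRST CLASS WITH DISTINCTION </b></td></tr></table>
<table><tr><td>Subject</td><td>External </td><td>Internal</td><td>Total</td><td>Result</td></tr>
<tr><td><i>Software Architectures (10IS81)</i></td><td>63</td><td>23</td><td>86</td><td><b>P</b></td></tr>
</table></td></tr></table>
"""


class TableCollector(HTMLParser):
    """Collect tables as rows of cell text and track which tables nest in which."""

    def __init__(self):
        super().__init__()
        self.tables = []  # every table, ordered by its opening tag
        self._open = []   # tables currently open, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = {"attrs": dict(attrs), "rows": [], "nested": [], "in_cell": False}
            for ancestor in self._open:
                ancestor["nested"].append(table)  # descendant of every open table
            self.tables.append(table)
            self._open.append(table)
        elif not self._open:
            return
        elif tag == "tr":
            self._open[-1]["rows"].append([])
        elif tag == "td" and self._open[-1]["rows"]:
            self._open[-1]["rows"][-1].append("")
            self._open[-1]["in_cell"] = True

    def handle_endtag(self, tag):
        if not self._open:
            return
        if tag == "td":
            self._open[-1]["in_cell"] = False
        elif tag == "table":
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to the current cell of the innermost open table.
        if self._open and self._open[-1]["in_cell"]:
            self._open[-1]["rows"][-1][-1] += data


collector = TableCollector()
collector.feed(SAMPLE)
collector.close()

outer = next(t for t in collector.tables if t["attrs"].get("width") == "515")
semester_table = outer["nested"][0]  # plays the role of innertable2
subject_table = outer["nested"][1]   # plays the role of innertable

name = outer["rows"][1][0].strip()
semester = semester_table["rows"][0][1].strip()
first_subject = [cell.strip() for cell in subject_table["rows"][1]]
print(name, semester, first_subject)
```
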

Finally, fetch the important details of the page from the respective tables to form an aggregated display of the important information.
for rowb in table.findAll('tr')[1:]:  # skip the first tr tag of the table and start at the second
    col2 = rowb.findAll('td')[0].findAll('b')[0]  # find the first td tag and then the first b tag inside it; store it in col2
    print "NAME: " + col2.string  # print the content of the b tag stored in col2
    break  # only the first matching row is needed
for rowa in innertable2.findAll('tr')[0:]:  # iterate over every row of "innertable2" (it has a single row)
    col1 = rowa.findAll('td')  # fetch all td tags of the row
    print "\nSEMESTER: " + col1[1].b.string  # print content of the 2nd td tag inside col1
    print col1[3].b.string  # print content of the 4th td tag
    print "\n"
print "format is \n" + "subject\t\texternal\tinternal\ttotal\tresult(p/f)\n"
print "your complete result:\n"
print "USN:" + ip + "\n"
for row in innertable.findAll('tr')[1:]:  # iterate over each tr tag inside the inner table, skipping the header row
    col = row.findAll('td')  # fetch the cells the same way as for the previous tables
    subject = col[0].i.string
    # print subject + "\n"
    external = col[1].string
    # print external + "\n"
    internal = col[2].string
    # print internal + "\n"
    total = col[3].string
    # print total + "\n"
    result = col[4].b.string
    # print result + "\n"
    record = (subject, external, internal, total, result)  # make a record of all content
    print "\t".join(record)  # join the fields of each record with tabs
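
Once the rows are scraped, a quick sanity check is cheap. The sketch below (Python 3, standard library only) takes the six subject rows from the snippet above as plain strings, converts the marks to integers, and checks them against the page's own figures. Record and to_record are hypothetical names introduced for this illustration.

```python
from collections import namedtuple

Record = namedtuple("Record", "subject external internal total result")

# Cell text exactly as scraped from the subject table shown earlier.
ROWS = [
    ("Software Architectures (10IS81)", "63", "23", "86", "P"),
    ("Information Retrieval (10IS842)", "59", "19", "78", "P"),
    ("System Modeling and Simulation (10CS82)", "55", "23", "78", "P"),
    ("Project Work (10IS85)", "96", "98", "194", "P"),
    ("Information and Network Security (10IS835)", "47", "25", "72", "P"),
    ("Seminar (10IS86)", "0", "49", "49", "P"),
]


def to_record(cells):
    """Strip whitespace and turn the three mark columns into integers."""
    subject, external, internal, total, result = (c.strip() for c in cells)
    return Record(subject, int(external), int(internal), int(total), result)


records = [to_record(row) for row in ROWS]

# Each row's total should equal external + internal marks.
consistent = all(r.external + r.internal == r.total for r in records)
grand_total = sum(r.total for r in records)

for r in records:
    print("\t".join(str(field) for field in r))
print("rows consistent:", consistent)
print("Total Marks:", grand_total)  # 557, matching the figure on the page
```
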

Complete working code: http://pastebin.com/5hhxHqLZ
University number (for trial purposes): 4mh12is021
Screenshot: (scraped data)