Monday, August 18, 2008

More Google App Engine RSS

As it turns out, I did need to scrape the Schlock Mercenary website to gather information after all. I had hoped that the clockwork regularity of its updates would let me generate the RSS feed independently. Thankfully, I was able to request a unique class attribute on the information I wanted, and the exercise gave me an opportunity to mess around with HTML parsing in Python. I initially tried to write a solution using the built-in DOM parser, xml.dom.minidom. However, this rapidly ground to a halt when I discovered that minidom throws an exception whenever the page is not well-formed XML, which most real-world HTML is not, making it too fragile for practical use. After some investigation, I discovered a library with the odd name of Beautiful Soup, which is designed for scraping web sites and gives you a DOM-like tree even from malformed markup. I found the library to be powerful, but the documentation to be a little lacking. The syntax is a little odd when it comes to element attributes whose names are Python reserved keywords, such as class: they cannot be passed as keyword arguments and have to go in an attrs dictionary instead.
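To make the minidom problem concrete, here is a minimal sketch using only the standard library. The two markup strings are made-up samples, not the real Schlock Mercenary page, and try_minidom is a hypothetical helper name. It shows minidom accepting well-formed markup and rejecting a page with a single unclosed tag.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample markup, not taken from the real site.
WELL_FORMED = '<html><body><table class="FOOTNOTE"><tr><td>Note</td></tr></table></body></html>'
# Typical real-world HTML: the <br> is never closed, so this is not well-formed XML.
TAG_SOUP = '<html><body><p>Note<br></p></body></html>'


def try_minidom(markup):
    """Return the parsed document, or None if minidom rejects the markup."""
    try:
        return minidom.parseString(markup)
    except ExpatError:
        # minidom delegates to expat, which refuses anything that is not well-formed.
        return None


doc = try_minidom(WELL_FORMED)
cell = doc.getElementsByTagName('td')[0]
footnote = cell.firstChild.data          # text of the only <td>
rejected = try_minidom(TAG_SOUP) is None  # the unclosed <br> makes parsing fail
```

One stray tag is enough to lose the whole page, which is why a forgiving parser is the better fit for scraping.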

The following code scrapes the contents of the first cell of the table with class='FOOTNOTE' if it exists. If not, it finds the comic image for the given date and scrapes the next table after it that has a width attribute.

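import re
from BeautifulSoup import BeautifulSoup

# The function wrapper and its name are illustrative; result and date
# come from the surrounding request handler.
def scrape_footnote(result, date):
    soup = BeautifulSoup(result.content)
    # Prefer the table explicitly marked with class="FOOTNOTE".
    table = soup.find('table', attrs={'class': 'FOOTNOTE'})
    if table is None:
        # Otherwise locate the comic image for this date and take the
        # next table after it that has a width attribute.
        image = soup.find(src=re.compile('/comics/schlock%s' % date))
        table = image.findNext('table', width=True)
    return " ".join(str(v) for v in table.tr.td.contents)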
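The attrs dictionary in the code above is there because class cannot be written as a keyword argument. The sketch below illustrates why, using only the standard library. The find function here is a hypothetical stand-in that just records its filters; it is not Beautiful Soup's implementation.

```python
import keyword


def find(name, attrs=None, **kwargs):
    """Hypothetical stand-in for a find-style API: merges both ways of passing filters."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters


# 'class' is a reserved word; 'id' is only the name of a builtin function.
class_is_reserved = keyword.iskeyword('class')
id_is_reserved = keyword.iskeyword('id')

# Writing class= as a keyword argument does not even compile.
try:
    compile("find('table', class='FOOTNOTE')", '<example>', 'eval')
    keyword_form_compiles = True
except SyntaxError:
    keyword_form_compiles = False

# Passing a dictionary sidesteps the problem, because 'class' is then just a string key.
via_attrs = find('table', attrs={'class': 'FOOTNOTE'})
# Names that are not reserved words work fine as plain keyword arguments.
via_kwargs = find('table', id='footnote')
```

This is also why width=True can be passed directly in the findNext call, while the class filter has to go through attrs.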
