Monday, August 18, 2008

More Google App Engine RSS

As it turns out, I had a need to scrape the Schlock Mercenary website in order to gather information after all. I was hoping that the clockwork-like regularity of its updates would let me generate the RSS feed independently. Thankfully, I was able to request a unique class attribute on the information I wanted, and it still gave me an opportunity to mess around with HTML parsing in Python. I initially tried to write a solution using the built-in DOM parser, xml.dom.minidom. However, this rapidly ground to a halt when I discovered that the minidom library will generally throw an exception if the web page is not valid XHTML, making it too fragile for practical use. After some investigation, I discovered a library with the odd name of Beautiful Soup, which is designed for scraping web sites and presenting them as a DOM. I found the library to be powerful, but the documentation to be a little lacking. The syntax is a little odd when it comes to element attributes that are Python reserved words (id, class, etc.).

The following function scrapes the contents of the table with class='FOOTNOTE' if it exists. If not, it scrapes the next table with a width attribute that follows a particular image.

import re
from BeautifulSoup import BeautifulSoup

def scrape_footnote(result, date):
    soup = BeautifulSoup(result.content)
    # First choice: the table explicitly marked with class='FOOTNOTE'.
    table = soup.find('table', attrs={'class': 'FOOTNOTE'})
    if table is not None:
        return " ".join(str(v) for v in table.tr.td.contents)
    # Fall back to the first table with a width attribute after the day's comic image.
    table = soup.find(src=re.compile('/comics/schlock%s' % date)).findNext('table', width=True)
    return " ".join(str(v) for v in table.tr.td.contents)

Friday, August 8, 2008

Google App Engine powered RSS feeds

I am a huge Schlock Mercenary fan, and I've been looking for a simple project to teach me a bit about Google App Engine (GAE). I originally thought that this might be a trivial use of GAE, but it solves a problem I have had for years. Schlock releases a new strip every night at 8 PM PDT, but offers no RSS feed, partially due to the extreme regularity of the update schedule. I've been wanting to add an RSS feed for the strip, and after talking with the author, Howard Tayler, over the past week, I've found him surprisingly receptive. He even surprised me with a request to include the images within the feed itself. A mirror of the project can be found here, though the real thing is served through FeedBurner.

http://schlockrss.emmesdee.com
  • New strips are published at 8 PM PDT.
  • Today's strip is a link to the front page before 5 PM PDT.
  • Today's strip links to the archive after 5 PM PDT.
  • All other links go to the archive.
  • The Atom feed does not contain images, but the RSS feed does.
  • The comic consists of 3 JPGs on Sunday.
  • The comic consists of a single PNG or JPG on other days, depending on the amount of shading.
  • Project Wonderful ad integration -- I wanted to add this, but my original approach turned out not to be feasible. The feed would require a new agreement with Project Wonderful, and simply piggybacking off of the existing ad arrangement of the main site would be a violation.
  • AdWords is provided for free once you use FeedBurner.
This turned out to be a great learning experience for GAE, Atom, and RSS. I was disappointed to learn that Atom has some serious limitations when it comes to including images within the feed itself. What made this truly enjoyable was interacting with Howard. He was very receptive to the feed, and he was able to clarify several things about the site as well as make a few requests that turned this from a trivial project into a true learning experience.

I originally created the feed as an Atom feed, since it is more of an open standard, while the RSS format appeared to have some odd ownership quirks. This worked fine at first, until Howard asked me to put the images into the feed itself. After much wrestling with the Atom format, I came to the conclusion that putting arbitrary images into an Atom feed was not going to work. Since RSS and Atom are both easy to implement, the solution was to learn more about RSS and implement an RSS image feed.

The second quirk came about when I realized that I would have to scrape data off of the web site itself in order to generate the image links. Howard publishes most dailies as PNG, but when he puts extra effort into shading, he likes to publish as JPG. Thankfully, this is as simple as requesting the PNG and seeing whether I get a 404, and caching the results gave me a good reason to learn memcache.
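
As a rough sketch of that last piece, assuming the comic image URL pattern from the scraping code above (the exact host and the helper name are my own illustration), the PNG-or-JPG check with memcache looks something like this:

from google.appengine.api import memcache, urlfetch

def comic_image_url(date):
    # Hypothetical helper: figure out whether the day's strip is a PNG or a JPG.
    # The host and URL pattern are assumptions based on the scraping code above.
    key = 'imgurl-%s' % date
    url = memcache.get(key)
    if url is not None:
        return url
    png_url = 'http://www.schlockmercenary.com/comics/schlock%s.png' % date
    # If the PNG comes back as a 404, the strip was published as a JPG instead.
    if urlfetch.fetch(png_url).status_code == 404:
        url = 'http://www.schlockmercenary.com/comics/schlock%s.jpg' % date
    else:
        url = png_url
    memcache.set(key, url)  # cache the answer so each strip is only probed once
    return url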