My first Python script – an HTML grabber

February 22nd, 2005

I've been looking for ways of getting content from other sites and inserting it into my site (properly credited, of course). Doing this with RSS feeds is easy - that's what they're made for - but doing it with any old web page that may not be made of well-formed XHTML, and may even be very badly formed HTML, is much trickier.

I wanted to do something I could use in Plone, so that pretty much meant Python - and I know very little about Python and had never written any before this evening.

A big help came in the form of the weirdly named Beautiful Soup (see www.crummy.com/software/BeautifulSoup/examples.html), although it's almost wholly undocumented, which means working out for yourself how to use it from the few examples out there.

Another big help was urllib (http://docs.python.org/lib/module-urllib.html) and urllib2 (http://python.active-venture.com/lib/module-urllib2.html), which you can see in the code below.

So I wrote a test script (my first Python script) called grabber.py. The script calls a URL that runs a search of my Movable Type weblog, returning summaries of *all* the entries in the blog. The search parameters are generated in the "params = urllib.urlencode(...)" line of code and passed to urllib2.urlopen along with the URL, which sends them as POST data. The script then feeds this large chunk of HTML to Beautiful Soup and uses the map function - "map(lambda x: x.first('a'), soup('h3'))" - to pull out just the first <a>...</a> element within each <h3>...</h3> element.
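To make the parameter step concrete, here is a minimal sketch of what the urlencode call produces and how the request gets sent. It assumes Python 3, where the urllib and urllib2 functions used in grabber.py live in urllib.parse and urllib.request, so it is not the script as I ran it. It only builds the request and never contacts the server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same URL and parameters as grabber.py.
url = "http://localhost/cgi-bin/mt/mt-search.cgi"
params = urlencode({
    "IncludeBlogs": 1,
    "Template": "notes",
    "RegexSearch": 1,
    "search": ".*",
})

# The "*" in the regex is percent-encoded; the "." is left alone.
print(params)  # IncludeBlogs=1&Template=notes&RegexSearch=1&search=.%2A

# Supplying a data argument turns the request into a POST,
# which is also what urllib2.urlopen(url, params) does.
request = Request(url, data=params.encode("ascii"))
print(request.get_method())  # POST
```

Passing the request object to urllib.request.urlopen would then fetch the search results, assuming the Movable Type search script is reachable at that address.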

The map call produces a list of anchor elements. I loop over that list with a variable called anchorelement, append the first 5 anchors to a string called newHTML, and finally write that string out to a file called notesgrab.html.
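The same idea can be sketched without Beautiful Soup, using only html.parser from the Python 3 standard library. This is a stand-in technique, not what grabber.py does, and the names HeadingLinkParser, build_page and the sample HTML are made up for illustration. It also skips any <h3> that has no link, rather than always skipping the first one, on the assumption that the 'None' result comes from a heading without an anchor.

```python
from html import escape
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Collect the first <a> element found inside each <h3> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # one entry per <h3>: an anchor string or None
        self._in_h3 = False
        self._current = None  # pieces of the anchor currently being captured
        self._found = None    # first anchor captured in the current <h3>

    def _finish_anchor(self):
        self._found = "".join(self._current) + "</a>"
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3, self._found = True, None
        elif (tag == "a" and self._in_h3
              and self._found is None and self._current is None):
            href = dict(attrs).get("href") or ""
            self._current = ['<a href="%s">' % escape(href, quote=True)]

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(escape(data))

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._finish_anchor()
        elif tag == "h3" and self._in_h3:
            if self._current is not None:  # unclosed <a>: close it ourselves
                self._finish_anchor()
            self.links.append(self._found)
            self._in_h3 = False


def build_page(html_text, limit=5):
    """Return a small HTML page holding the first `limit` heading links."""
    parser = HeadingLinkParser()
    parser.feed(html_text)
    parser.close()
    anchors = [a for a in parser.links if a is not None][:limit]
    body = "".join("<p>%s</p>\n\n" % a for a in anchors)
    return ("<html>\n<body>\n\n<h1>Grabbed HTML</h1>\n\n"
            + body + "</body>\n</html>")


# Made-up sample: one heading without a link, then seven entries.
sample = "<h3>Search results</h3>" + "".join(
    '<h3><a href="/notes/%d.html">Entry %d</a></h3>' % (i, i)
    for i in range(1, 8)
)
page = build_page(sample)
print(page)
```

Filtering out the empty headings and slicing the list replaces the counter in grabber.py, so there is no need to remember that the counter has to reach 6 before the loop stops.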

You can view this file here: www.itauthor.com/notes/notesgrab.html

grabber.py

import urllib, urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://localhost/cgi-bin/mt/mt-search.cgi'
params = urllib.urlencode({'IncludeBlogs': 1, 'Template': 'notes', 'RegexSearch': 1, 'search': '.*'})

html = urllib2.urlopen(url, params).read()
soup = BeautifulSoup()
soup.feed(html)

newHTML = '<html>\n<body>\n\n'
newHTML += '<h1>Grabbed HTML</h1>\n\n'
counter = 0
for anchorelement in map(lambda x: x.first('a'), soup('h3')):
    strAnchor = str(anchorelement)
    counter += 1
    if counter==1:
        continue #The first time round strAnchor always == 'None' so skip it.
    elif counter>6:
        break #Stop after the first 5 anchors have been added.
    else:
        newHTML += ('<p>' + strAnchor + '</p>\n\n')

newHTML += '</body>\n</html>'
outfile = open('notesgrab.html', 'w')
outfile.write(newHTML)
outfile.close()

To run this on my Linux machine, I enter the command:
python grabber.py

The next thing to do is either to set this up as a cron job to run every half an hour or so, or to use the Zope Management Interface to incorporate this into my Plone site, so that I can use it (or other more useful variants of it) to deliver dynamically generated content.

Comments

  1. Bassinkurser said:

    February 8th, 2011 at 11:49 pm (#)

    nice work...
