My first Python script – an HTML grabber
February 22nd, 2005 1 Comment
I've been looking for ways of getting content from other sites and inserting it into my site (properly credited, of course). Doing this with RSS feeds is easy - that's what they're made for - but doing it with any old web page that may not be made of well-formed XHTML, and may even be very badly formed HTML, is much trickier.
I wanted to do something I could use in Plone, so that pretty much meant Python - and I know very little about Python; before this evening I had never written any.
A big help came in the form of the weirdly named Beautiful Soup (see www.crummy.com/software/BeautifulSoup/examples.html), although it's almost wholly undocumented, which means working out for yourself how to use it from the few examples out there.
Another big help was urllib (http://docs.python.org/lib/module-urllib.html) and urllib2 (http://python.active-venture.com/lib/module-urllib2.html), which you can see in the code below.
So I wrote a test script (my first Python script) called grabber.py. It calls a URL that runs a search of my Movable Type weblog and returns summaries of *all* the entries in the blog. The request includes parameters, which are generated in the "params = urllib.urlencode(...)" line of code. The script then feeds this large chunk of HTML to Beautiful Soup and uses the map function - "map(lambda x: x.first('a'), soup('h3'))" - to pull out just the first <a>...</a> element within each <h3>...</h3> element.
I then loop over the resulting list of anchors (each one is held in turn in a variable called anchorelement), append the first 5 to a string called newHTML, and finally write that string out to a file called notesgrab.html.
You can view this file here: www.itauthor.com/notes/notesgrab.html
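To see what the urlencode step produces, here is a minimal sketch. It assumes Python 3, where the function lives in urllib.parse rather than in the urllib module the script uses; the parameter values are the ones from grabber.py, and the name post_data is made up for illustration.

```python
from urllib.parse import urlencode

# The same search parameters that grabber.py passes to mt-search.cgi.
params = urlencode({'IncludeBlogs': 1, 'Template': 'notes',
                    'RegexSearch': 1, 'search': '.*'})

# Keys and values are joined with '=' and '&'; the '*' is percent-encoded.
print(params)  # IncludeBlogs=1&Template=notes&RegexSearch=1&search=.%2A

# Passing data as the second argument to urlopen sends it as a POST body.
# In Python 3 that body has to be bytes, so it would be encoded first.
post_data = params.encode('ascii')
```

Because the encoded string goes in as urlopen's second argument, it travels in the body of a POST request rather than being tacked on to the end of the URL.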
grabber.py
import urllib, urllib2
from BeautifulSoup import BeautifulSoup

url = 'http://localhost/cgi-bin/mt/mt-search.cgi'
params = urllib.urlencode({'IncludeBlogs': 1, 'Template': 'notes', 'RegexSearch': 1, 'search': '.*'})

html = urllib2.urlopen(url, params).read()
soup = BeautifulSoup()
soup.feed(html)

newHTML = '<html>\n<body>\n\n'
newHTML += '<h1>Grabbed HTML</h1>\n\n'
counter = 0
for anchorelement in map(lambda x: x.first('a'), soup('h3')):
    strAnchor = str(anchorelement)
    counter += 1
    if counter==1: continue #The first time round strAnchor always == 'None' so skip it.
    elif counter>6: break #Stop after the first 5 anchors have been added.
    else: newHTML += ('<p>' + strAnchor + '</p>\n\n')

newHTML += '</body>\n</html>'
outfile = open ( 'notesgrab.html', 'w' )
outfile.write(newHTML)
outfile.close()
To run this on my Linux machine, I enter the command:
python grabber.py
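The "first anchor in each h3, first five only" step can also be sketched without Beautiful Soup. What follows is only an illustration under stated assumptions: it targets Python 3 and uses the standard library's html.parser instead of Beautiful Soup, so it is not the API grabber.py relies on. The names HeadingLinkParser and build_page, and the sample markup, are all hypothetical. It also parses an in-memory string rather than fetching anything.

```python
from html import escape
from html.parser import HTMLParser
from itertools import islice


class HeadingLinkParser(HTMLParser):
    """Collect the first <a> element found inside each <h3> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, link text) pairs
        self._in_h3 = False   # are we inside an <h3> right now?
        self._found = False   # has this <h3> already given us a link?
        self._in_a = False    # are we inside the <a> being captured?
        self._href = ''
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._in_h3, self._found = True, False
        elif tag == 'a' and self._in_h3 and not self._found and not self._in_a:
            self._in_a = True
            self._href = dict(attrs).get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._in_a:
            self.links.append((self._href, ''.join(self._text)))
            self._in_a, self._found = False, True
        elif tag == 'h3':
            # Leaving the heading; an unclosed <a> is simply dropped.
            self._in_h3 = self._in_a = False


def build_page(html, limit=5):
    """Return a small HTML page listing at most `limit` heading links."""
    parser = HeadingLinkParser()
    parser.feed(html)
    parser.close()
    parts = ['<html>\n<body>\n\n', '<h1>Grabbed HTML</h1>\n\n']
    for href, text in islice(parser.links, limit):
        parts.append('<p><a href="%s">%s</a></p>\n\n'
                     % (escape(href), escape(text, quote=False)))
    parts.append('</body>\n</html>')
    return ''.join(parts)


# Made-up search results: one heading without a link, then seven entries.
sample = '<h3>Results</h3>' + ''.join(
    '<h3><a href="/notes/%d.html">Entry %d</a></h3><p>summary' % (i, i)
    for i in range(1, 8))

page = build_page(sample)
```

Headings that contain no link contribute nothing to the list, so there is no 'None' entry to skip, and islice takes the place of the counter that stops after five anchors.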
The next thing to do is either to set this up as a cron job to run every half an hour or so, or to use the Zope Management Interface to incorporate this into my Plone site, so that I can use it (or other more useful variants of it) to deliver dynamically generated content.
February 8th, 2011 at 11:49 pm (#)
nice work...