Checking website links

November 28th, 2003

I was thinking of writing a Perl script that would work as a spider and check links on a website. Then I thought, yes I could do that, but someone must have done it already.

A search on HotScripts.com for link-checking scripts found: checklinks by James Marshall
http://www.jmarshall.com/tools/cl/ (last updated March 26, 2000)

This command-line Perl script parses HTML pages and checks the links (images as well as other HTML pages) and prints the details of any links that return an error.
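
To give a feel for the "parse the page and collect the links" step, here is a minimal sketch in Python rather than Perl. It is not taken from checklinks; the class name LinkCollector and the function extract_links are made up for illustration, and it only looks at a href and img src attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of <a href> and <img src> attributes (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only two tag/attribute pairs are considered in this sketch.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of links and images found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A real checker would then request each collected URL and report the ones that come back with an error; that part is left out here.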

The script only checks pages on the same host as the script; it won't check remotely hosted pages. You run it from the command line. Running the command checklinks.pl shows the usage and options.

Here is a typical command:
checklinks.pl -v -I documentation http://www.whatever.xxx/documentation/index.html > checkresults.txt

This checks all links that have "documentation" somewhere in their path (this is specified with the -I flag), starting with the page http://www.whatever.xxx/documentation/index.html, and puts all the results in a file called checkresults.txt.

Notes:

  1. The -v specifies verbose mode.
  2. Because I've restricted the checking to paths containing "documentation", it won't check a link on a documentation page that points to mystuff/web-pages/images/mypic.gif, because that path doesn't contain the word "documentation". This stops the checking from spreading to non-documentation pages, but it doesn't ensure that all documentation pages are free of broken links.
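
The effect of the restriction can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the code checklinks uses: the function name should_check is hypothetical, and treating the starting page's host as "the same host" is my assumption.

```python
from urllib.parse import urlparse


def should_check(url, start_url, include="documentation"):
    """Decide whether a link falls inside the restricted check (hypothetical helper)."""
    link = urlparse(url)
    start = urlparse(start_url)
    # Assumption: "same host" means the host of the starting page.
    if link.netloc.lower() != start.netloc.lower():
        return False
    # Plain substring match anywhere in the path, as described for the -I flag.
    return include in link.path
```

With the starting page from the command above, a link to /documentation/guide.html on the same host would be checked, while the mypic.gif link from note 2 would be skipped, which is exactly how a broken image outside the documentation tree can go unnoticed.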

This works quite nicely and I could adapt it to suit my own requirements, but when I showed it to a colleague, he said he used Xenu's Link Sleuth to check the links on his site. So I went and checked that out.

Xenu's Link Sleuth (TM) by Tilman Hausherr
http://home.snafu.de/tilman/xenulink.html (version 1.2e, September 28, 2003)

This is a Windows .exe file that lets you use a GUI to specify the starting page for the check and things you want to exclude or include. It's easy to use, but, if you need them, instructions are at http://home.snafu.de/tilman/xenulink_guide.html

When you run a check, the program prints a whole load of fairly useless facts to its main window, listing all the files it has checked. When it finishes checking it asks if you want to see a report. Of course you do - why else did you run the check? If you select Yes the report appears in your default browser as a list of links to the pages containing broken links, and the broken links themselves. This provides an excellent way of seeing exactly what the problem is. The report also contains some statistics (e.g. it tells you how many bad links it finds - the first time I ran it, it found 1989 links) and it lists all the bad links and pages, and all the good pages (which can be a very long list). The report (still all on the same web page) also contains a hierarchical map of the website you are checking.

The first time I ran the check this produced a very, very long list. My first report was a 4.7MB HTML file (not great for emailing!). The program does contain an email facility. Unfortunately, if you select the email option, Xenu tries to send the whole HTML report as part of the email. This won't get through most mail systems (it failed on my local email system and also failed when I tried to send it out to my Hotmail account). Xenu also closes itself down after failing to send the email, which is annoying because you then can't view the report. You have to restart the program, deselect the email option and run the check again.

Having said that, checking is remarkably quick. The 1989-file check I did took under a minute to complete. This is an incredibly useful (and free) program. I'll try it on another system to see if emailing works elsewhere. But even if the mailing facility is broken, it's still a must-have for anyone who maintains a website.
