Accessing normally inaccessible URLs

December 12th, 2004

I was talking to my colleague Stephen Readman about Apache virtual hosting (see www.itauthor.com/notes/archives/2004/12/getting_apache.html), and he came up with the idea of using virtual hosting to browse to URLs that would not normally work outside a closed network.

I've taken his idea and tweaked the implementation of it, as follows.

Scenario:
You work from home and connect to your work network via SSH.
The intranet at work has URLs like:

www.intranet.yourcompany.com
   for the main part of the intranet, and
bugzilla.yourcompany.com
   for the bug tracking system on the intranet.

Neither of these URLs is normally accessible outside the confines of your work network. However, you have set up port forwarding in your SSH connection to allow you to get to the intranet via forwarded ports on your local machine. You have forwarded port 5678 on localhost to the appropriate port on the server at work that serves up www.intranet.yourcompany.com (e.g. port 80 on intraserver) and you have forwarded port 8765 on localhost to the appropriate port on the server at work that serves up bugzilla.yourcompany.com (e.g. port 80 on bugserver).
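
For reference, forwarding like this corresponds to OpenSSH's -L option. Here is a minimal Python sketch that builds such a command line from the ports in the scenario. The login "you@gateway.yourcompany.com" is made up for illustration, and your own SSH client may well be configured some other way.

```python
def ssh_forward_command(login, forwards):
    """Build an OpenSSH command line with one -L option per forwarded port."""
    command = ["ssh"]
    for local_port, remote_host, remote_port in forwards:
        # -L local_port:remote_host:remote_port forwards a local port
        # through the SSH connection to remote_host:remote_port.
        command += ["-L", f"{local_port}:{remote_host}:{remote_port}"]
    command.append(login)
    return command


# Ports and server names from the scenario; the login is hypothetical.
forwards = [(5678, "intraserver", 80), (8765, "bugserver", 80)]
command = ssh_forward_command("you@gateway.yourcompany.com", forwards)
print(" ".join(command))
```

The function only builds the command; it doesn't run it, so you can check the result before pasting it into a terminal.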

This means that, after you have successfully connected to your work network using this preconfigured SSH setup (which establishes the port forwarding for the duration of the login session) you can browse to:

http://localhost:5678/development/index.html

and you will be served up the page at:

http://www.intranet.yourcompany.com/development/index.html

and you can browse to:

http://localhost:8765/query.cgi

and you will be served up the page at:

http://bugzilla.yourcompany.com/query.cgi

This is all very well and good, and (provided all the links on your intranet are relative and not absolute) you can browse around your intranet as you would be able to do while in work.
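
The correspondence between intranet URLs and their localhost equivalents can be sketched as a small Python function. This is just an illustration of the mapping, using the host names and forwarded ports from the scenario; the function name to_local_url is my own invention.

```python
from urllib.parse import urlsplit, urlunsplit

# Forwarded ports from the scenario above.
FORWARDED_PORTS = {
    "www.intranet.yourcompany.com": 5678,
    "bugzilla.yourcompany.com": 8765,
}


def to_local_url(url):
    """Return the localhost equivalent of an intranet URL, or the URL unchanged."""
    parts = urlsplit(url)
    port = FORWARDED_PORTS.get(parts.hostname)
    if port is None:
        return url  # not one of the forwarded intranet hosts
    # Swap the host for localhost:port and keep path, query and fragment.
    return urlunsplit(
        (parts.scheme, f"localhost:{port}", parts.path, parts.query, parts.fragment)
    )


print(to_local_url("http://www.intranet.yourcompany.com/development/index.html"))
print(to_local_url("http://bugzilla.yourcompany.com/query.cgi"))
```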

However, if someone emails you with a link to http://bugzilla.yourcompany.com/show_bug.cgi?id=3001 you can't click the link to go there. You need to copy the "show_bug.cgi?id=3001" bit and paste it into your browser address bar at the end of "http://localhost:8765/". This is annoying, but can be remedied, like this:

Accessing internal-only URLs externally

The only requirement for this process is that you are running Apache web server on a machine you can access and administer (e.g. on the machine from which you are trying to connect, or from another machine on your home network). I connect from a Windows XP machine, running Apache, and this is probably the easiest way to go. If you haven't got Apache on your machine already, it's a free download from http://httpd.apache.org/download.cgi, and installing it is a cinch (see http://httpd.apache.org/docs-2.0/install.html for UNIX/Linux and http://httpd.apache.org/docs-2.0/platform/windows.html for Windows). Whichever computer you choose as the proxy (i.e. either the localhost machine, or another machine on your local network) it should be hidden from the outside world behind a good, strong firewall.

What we're going to do is use Apache's virtual hosts functionality to redirect a page request to the forwarded port on your local machine, so that the request gets forwarded on to the web server at work. What will happen is that you'll request a page in your browser, the domain will be recognised and diverted to the Apache web server that you have configured, and Apache will then redirect the request via the appropriate port on the local machine. To kick this process off, we first need to intercept any page requests that include those normally inaccessible domains (in my example, www.intranet.yourcompany.com and bugzilla.yourcompany.com).

To do this, edit your hosts file to map the remote domains to your local computer. You can edit the hosts file on your local machine, or on your firewall. Editing it on your firewall means that if you have several computers on your local network and you want to be able to connect from any of them you don't need to edit the hosts file on each computer individually. The way name resolution works is that your local computer first tries to resolve a domain name to an IP address locally; if it can't do so it looks further afield (e.g. on the firewall), then further afield again (e.g. on your ISP's system), then on again (out into the wide blue internet). One of the first stops is the hosts file on your local computer.

If you're running a Windows machine, the hosts file is hidden away in a directory such as:
C:\WINDOWS\system32\drivers\etc
and yes it's just called "hosts" (no file name extension).

If you're running UNIX/Linux, the hosts file is a lot more accessible (provided you have root privileges); it's in
/etc

The hosts file is a plain text file, and uses the syntax:

IP-address        hostname

For example, if you want to map www.bbc.co.uk to the default page for the web server running on your local machine, you could add:

127.0.0.1        www.bbc.co.uk

However, I wouldn't advise adding stuff like this. You'll probably forget you added it and then be mystified as to why you can't get to those web sites any more.

If your local machine is called aardvark and has the static IP address 192.168.0.123, you may well already have the following line in your hosts file:

192.168.0.123      aardvark

This ensures that whenever you refer to the domain/machine "aardvark", your computer will go looking for "192.168.0.123".

Add similar entries for the work domains you want to access, pointing them to the IP of the local machine running Apache web server. For example:

192.168.0.123      www.intranet.yourcompany.com
192.168.0.123      bugzilla.yourcompany.com
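
To make the effect of these entries concrete, here is a rough Python sketch of how hosts-file text turns into a name-to-address lookup. It is a simplification I've written for illustration, not how any particular operating system actually reads the file, and the parse_hosts name is hypothetical.

```python
def parse_hosts(text):
    """Map each host name in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip_address, *names = line.split()
        for name in names:
            # Keep the first mapping seen for a name (a choice made
            # for this sketch).
            mapping.setdefault(name.lower(), ip_address)
    return mapping


HOSTS = """\
127.0.0.1          localhost
192.168.0.123      aardvark
192.168.0.123      www.intranet.yourcompany.com
192.168.0.123      bugzilla.yourcompany.com
"""

hosts = parse_hosts(HOSTS)
print(hosts["www.intranet.yourcompany.com"])
print(hosts["bugzilla.yourcompany.com"])
```

Both work domains come back as 192.168.0.123, the same address as aardvark, which is exactly what the virtual host set-up relies on.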

Windows reads the hosts file as and when it's required. Other operating systems may cache the mappings, in which case you need to force a reread of the hosts file. You may need to reboot the machine to achieve this.

Now, if you browse to:

http://www.intranet.yourcompany.com/index.html

you will be served up the page at:

http://192.168.0.123/index.html

This isn't what you want, but we're getting there.

If you browse to:

http://www.intranet.yourcompany.com/index.html

and you do not see the same thing that you see when you browse to:

http://192.168.0.123/index.html

then something is wrong. Chances are the hosts file has not been reread. If Apache is running on a Linux machine, try rebooting the machine to force a reread of the hosts file.
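
If you'd rather check the name resolution directly than compare pages by eye, something along these lines should do it. It's a sketch that assumes Python is available on the machine you browse from and that Python's socket.gethostbyname goes through the same system resolver as your browser; the function names are my own.

```python
import socket


def resolves_to(name, expected_ip, resolve=socket.gethostbyname):
    """Return True if name resolves to expected_ip on this machine."""
    try:
        return resolve(name) == expected_ip
    except OSError:  # includes socket.gaierror: the name did not resolve at all
        return False


def unmapped_names(names, expected_ip, resolve=socket.gethostbyname):
    """Return the names that do not yet point at the Apache machine."""
    return [name for name in names if not resolves_to(name, expected_ip, resolve)]
```

Calling unmapped_names(["www.intranet.yourcompany.com", "bugzilla.yourcompany.com"], "192.168.0.123") should give an empty list once the hosts file entries have been picked up. The resolve parameter is only there so the functions can be tried out with a stand-in resolver.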

You now need to set up Apache to act as a proxy, redirecting the request elsewhere, rather than serving up a web page itself. To do this, you need to edit Apache's configuration file: httpd.conf.

/etc/httpd/conf/httpd.conf on Linux
C:\Apache\conf\httpd.conf on Windows

First, make sure Apache's mod_proxy and mod_rewrite modules are enabled.

To do this, make sure the LoadModule section of the file contains the following lines:

LoadModule rewrite_module modules/mod_rewrite.so
LoadModule proxy_module modules/mod_proxy.so

These will probably be in there already, but you may have to remove the # comment mark at the beginning of each line.
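
If your httpd.conf is long, a quick script can tell you which of the two modules are still commented out. This is a minimal sketch that only looks for uncommented LoadModule lines in some text; the sample configuration below is made up, and in practice you would read the text from your own httpd.conf.

```python
import re

# An uncommented LoadModule line: optional indentation, then the directive,
# the module name and the path to the module file.
LOAD_MODULE = re.compile(r"^[ \t]*LoadModule[ \t]+(\S+)[ \t]+\S+", re.MULTILINE)


def loaded_modules(conf_text):
    """Return the set of module names on uncommented LoadModule lines."""
    return set(LOAD_MODULE.findall(conf_text))


# Made-up sample: mod_proxy is still commented out here.
CONF = """\
LoadModule rewrite_module modules/mod_rewrite.so
#LoadModule proxy_module modules/mod_proxy.so
"""

missing = {"rewrite_module", "proxy_module"} - loaded_modules(CONF)
print(sorted(missing))
```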

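If you are running Apache 1.3, check that the file also contains:

AddModule mod_rewrite.c
AddModule mod_proxy.c

Again, these are probably there already, but may need uncommenting. Apache 2.0 no longer has the AddModule directive, so you can skip this step there. In 2.0 the HTTP part of the proxy lives in a separate module, mod_proxy_http, so its LoadModule line needs to be enabled as well.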

Then, near the end of the file you'll find a section dealing with Virtual Hosts. It may already have one or more
<VirtualHost *>
    ...
</VirtualHost>
elements, or it may have a commented out example.

Add the following as the first VirtualHost element:

<VirtualHost hostname>
    ServerName hostname that you want to display in error messages
    HostnameLookups off
    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^start of URL.*
    RewriteRule ^/(.*) http://localhost:forwarded port number/$1 [L,P]
    ... if required, another RewriteCond, followed by a RewriteRule,
    for each domain you want to redirect

</VirtualHost>

For example:

<VirtualHost aardvark>
    ServerName ' -> VirtualHost: "aardvark" - configured in aardvark\'s httpd.conf file <- '
    HostnameLookups off
    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^www\.intranet\.yourcompany\.com.*
    RewriteRule ^/(.*) http://localhost:5678/$1 [L,P]
    RewriteCond %{HTTP_HOST} ^bugzilla\.yourcompany\.com.*
    RewriteRule ^/(.*) http://localhost:8765/$1 [L,P]
</VirtualHost>

What this example does

  1. <VirtualHost aardvark> tells Apache to apply the contents of this element to any web page request addressed to http://aardvark..., or to any URL whose domain resolves to the same IP address as aardvark.

    For example, if "aardvark", "www.intranet.yourcompany.com" and "bugzilla.yourcompany.com" all resolve to 192.168.0.123, this element will apply equally to all these domains. This is not immediately obvious, so you may want to add a comment above <VirtualHost aardvark> to remind anyone debugging the httpd.conf file that you have used the hosts file to map other domains to the same IP address as "aardvark", meaning that this element will apply to them too.

  2. The value of ServerName is a literal string that is displayed in the browser if something goes awry. This will help in debugging. You could just have repeated the hostname here if you'd wanted to.
  3. RewriteCond ... and RewriteRule ... are pairs of lines that define a rule to be applied to page requests with a specific URL.

    The first RewriteCond, in the example above, examines the contents of the HTTP_HOST variable (the host name the browser asked for) and only applies the RewriteRule if it starts with "www.intranet.yourcompany.com". The rewrite rules use regular expressions, in which the caret symbol (^) signifies that what follows it must come at the very start of the string being examined. Full stops must be escaped with a backslash (i.e. \.) because in regular expressions . means any character. The .* at the end of the line means any character any number of times, so it will match "www.intranet.yourcompany.com" on its own or with a port number on the end, such as "www.intranet.yourcompany.com:80". HTTP_HOST holds only the host part of the request, never the path.

  4. The RewriteRules take the path part of the URL, which starts with the first forward slash after the host name. Everything following this slash is captured in the special variable $1 - this is, again, standard regular expression practice, done using parentheses.

    So, if the URL was "www.intranet.yourcompany.com/anything/you/like.html", $1 would contain "anything/you/like.html".

    The second RewriteRule argument is what the URL will get transformed into. In the example above the resulting URL is http://localhost:5678/ followed by the contents of $1, e.g.:

    http://localhost:5678/anything/you/like.html

    "[L,P]" at the end of the line tells Apache that if it has rewritten the URL, this should be the Last thing it does in this VirtualHost directive, and it should now immediately treat this as a Proxy request and send the URL to the mod_proxy module for processing.

    Using these flags makes ordering of multiple conditions/rules important, as later ones will never be reached if a URL is processed by an earlier condition/rule.
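
If the regular expressions are hard to picture, the logic of the example can be imitated in a few lines of Python. This is only a rough model that I've put together to show the matching and the $1 substitution. Apache's real processing (evaluation order, query strings, the hand-over to mod_proxy) is more involved, and the rewrite function name is my own.

```python
import re

# (host pattern, forwarded local port), in the order the rules appear
# in the VirtualHost example.
RULES = [
    (r"^www\.intranet\.yourcompany\.com.*", 5678),
    (r"^bugzilla\.yourcompany\.com.*", 8765),
]


def rewrite(host, path):
    """Return the proxied URL for a request, or None if no rule applies."""
    for host_pattern, port in RULES:
        if not re.search(host_pattern, host):  # models RewriteCond %{HTTP_HOST}
            continue
        match = re.search(r"^/(.*)", path)  # models the RewriteRule pattern
        if match:
            # group(1) plays the part of $1; returning here models the
            # rule being the last one processed.
            return f"http://localhost:{port}/{match.group(1)}"
    return None


print(rewrite("www.intranet.yourcompany.com", "/anything/you/like.html"))
print(rewrite("bugzilla.yourcompany.com", "/query.cgi"))
print(rewrite("aardvark", "/index.html"))
```

A request for plain "aardvark" falls through both rules and comes back as None, which corresponds to Apache serving the page itself rather than proxying it.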

RewriteRule seems quite complicated until you get the hang of it, but it's well documented here:
http://httpd.apache.org/docs/mod/mod_rewrite.html.

And that's it. All that remains for you to do is save httpd.conf, and force Apache to reread it (e.g. by restarting Apache). You can then browse to something like http://bugzilla.yourcompany.com/query.cgi in your favourite web browser and (provided you have already logged on to the remote network and, thereby, established the required port forwarding) you should see the web page you'd only normally be able to find with this URL if you were using a computer on the remote network.
