This weekend I have mostly been parsing HTTP

A few weeks ago I was lamenting the lack of a reverse proxy that understood Apache's mod_rewrite syntax.

On my server I have a bunch of thttpd processes, each running under its own UID and each listening on a local port. For example I have this running:

thttpd -C /etc/thttpd/sites.enabled/blog.steve.org.uk

This is running under the Unix user "s-blog", with UID 1015. Each site listens on the port matching its UID, so this one is listening on 127.0.0.1:1015:

steve@steve:~$ id s-blog
uid=1015(s-blog) gid=1016(s-blog) groups=1016(s-blog)

steve@steve:~$ lsof -i :1015
COMMAND   PID   USER   FD   TYPE  DEVICE SIZE NODE NAME
thttpd  26293 s-blog    0u  IPv4 1072632       TCP localhost:1015 (LISTEN)

Anyway, in front of these thttpd processes I have Apache. It does only two things:

Expands mod_rewrite rules.
Serves as a proxy to the back-end thttpd processes.

Yes, other webservers could be run in front of these processes, and yes, other webservers have their own rewrite-rule syntax. I'm going to say "Lala lala can't hear you". Why? Because mod_rewrite is the de facto standard. It is assumed, documented, and included with a gazillion projects from Wikipedia to WordPress to ...

So this weekend I decided I'd see what I needed to do to put together a simple reverse proxy that only handled HTTP, and understood Apache's mod_rewrite rules.

I figure there are three main parts to such a beast:

Be a network server: Parse the configuration file, accept connections, and use fork(), libevent, or similar such that you ultimately receive HTTP requests.
Process HTTP requests, rewriting as required: Once you have a parsed HTTP request you need to test it against each rule for the appropriate destination domain, rewriting the request as appropriate.
Proxy: Send your (potentially modified) request to the back-end, and then send the response back to the client.

Thus far I've written code which takes this:

GET /etc/passwd?root=1 HTTP/1.0
Host: foo.example.com:80
Accept-Language: en-us,en-gb;q=0.5
Referer: http://foo.bar.com/
Keep-Alive: 300
Connection: keep-alive

Then turns it into this:

struct http_request
{
  /*
   * Path being requested. "/index.html", etc.
   */
  char *path;

  /*
   * Query string
   */
  char *qs;

  /**
   * Method being requested "GET/POST/PUT/DELETE/HEAD/etc".
   */
  char *method;

  /**
   * A linked list of headers such as "Referer: foo",
   * "Connection: close"
   */
  struct http_headers *headers;
};

There are methods for turning that back into a string (so that you can send it on to the back-end), for finding headers such as "Referer:", and so on.
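To give a feel for what such helpers might look like, here is a rough sketch. It is not the actual code: the layout of struct http_headers is never shown above, so the name/value/next fields are an assumption, and http_find_header() and http_request_to_string() are names invented for illustration. The struct carries no protocol version, so the sketch hard-wires HTTP/1.0.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>

/* Assumed layout: a singly linked list of name/value pairs. */
struct http_headers
{
  char *name;                 /* e.g. "Referer" */
  char *value;                /* e.g. "http://foo.bar.com/" */
  struct http_headers *next;
};

/* Same fields as the struct shown above. */
struct http_request
{
  char *path;
  char *qs;
  char *method;
  struct http_headers *headers;
};

/* Return the value of the named header, or NULL if it is absent.
 * Header names are case-insensitive, hence strcasecmp(). */
const char *http_find_header(const struct http_request *req, const char *name)
{
  const struct http_headers *h;

  for (h = req->headers; h != NULL; h = h->next)
    if (strcasecmp(h->name, name) == 0)
      return h->value;

  return NULL;
}

/* Serialise the request into a malloc()ed string; the caller free()s it. */
char *http_request_to_string(const struct http_request *req)
{
  const struct http_headers *h;
  size_t len, used;
  char *out;
  int has_qs = (req->qs != NULL && req->qs[0] != '\0');

  /* First pass: work out how much space is needed. */
  len = strlen(req->method) + 1 + strlen(req->path)
      + (has_qs ? 1 + strlen(req->qs) : 0)
      + strlen(" HTTP/1.0\r\n");
  for (h = req->headers; h != NULL; h = h->next)
    len += strlen(h->name) + 2 + strlen(h->value) + 2;
  len += 2 + 1;               /* blank line, then the trailing NUL */

  out = malloc(len);
  if (out == NULL)
    return NULL;

  /* Second pass: write the request line, the headers, then a blank line. */
  used = (size_t) snprintf(out, len, "%s %s%s%s HTTP/1.0\r\n",
                           req->method, req->path,
                           has_qs ? "?" : "", has_qs ? req->qs : "");
  for (h = req->headers; h != NULL; h = h->next)
    used += (size_t) snprintf(out + used, len - used, "%s: %s\r\n",
                              h->name, h->value);
  snprintf(out + used, len - used, "\r\n");

  return out;
}
```

Measuring first and writing second keeps the function to a single allocation, at the cost of walking the header list twice.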

The parser is pretty minimal C and takes only a "char *buffer" to operate on. It has survived a lot of malicious input, as proved by a whole bunch of test cases.
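As an illustration of the "operate on a char *buffer" approach, the sketch below splits just the request line in place. It is a simplified stand-in rather than the real parser: struct request_line and parse_request_line() are hypothetical names, headers are ignored, and only targets beginning with "/" are accepted.

```c
#include <string.h>

/* Hypothetical cut-down result type: just the request-line fields. */
struct request_line
{
  char *method;
  char *path;
  char *qs;                   /* NULL when there is no query string */
};

/*
 * Split "GET /path?query HTTP/1.0\r\n..." in place.  Separators in the
 * buffer are overwritten with NULs and the struct ends up pointing into
 * it, so the buffer must be writable and must outlive the struct.
 * Returns 0 on success, -1 on malformed input (the buffer may already
 * have been partly modified by then).
 */
int parse_request_line(char *buffer, struct request_line *out)
{
  char *eol, *sp1, *sp2, *q;

  if (buffer == NULL || out == NULL)
    return -1;

  /* Only look at the first line. */
  eol = strpbrk(buffer, "\r\n");
  if (eol == NULL)
    return -1;
  *eol = '\0';

  /* Expect: METHOD <space> TARGET <space> VERSION */
  sp1 = strchr(buffer, ' ');
  if (sp1 == NULL || sp1 == buffer)
    return -1;
  *sp1 = '\0';

  sp2 = strchr(sp1 + 1, ' ');
  if (sp2 == NULL || sp2 == sp1 + 1)
    return -1;
  *sp2 = '\0';

  if (strncmp(sp2 + 1, "HTTP/", 5) != 0)
    return -1;

  /* This sketch rejects anything that is not an absolute path. */
  if (sp1[1] != '/')
    return -1;

  out->method = buffer;
  out->path = sp1 + 1;

  /* Everything after the first '?' is the query string. */
  q = strchr(out->path, '?');
  if (q != NULL) {
    *q = '\0';
    out->qs = q + 1;
  } else {
    out->qs = NULL;
  }

  return 0;
}
```

Parsing in place means no allocation at all, which keeps the failure modes few when the input is hostile.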

My next job is to code up the mod_rewrite rule-processor to apply a bunch of rules to one of these objects - updating the struct as we go. Assuming that I can get that part written cleanly over the next week or two, I'll be happy and can proceed to write the networking parts of the code - both the initial accepting, and the proxying to the back-end.
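The core of such a rule-processor, applying a single "RewriteRule pattern substitution" pair to a path, might be sketched as below. This is an assumption-laden stand-in, not the planned implementation: it uses POSIX extended regular expressions from regex.h, whereas Apache itself uses PCRE, it only understands $0 to $9 back-references, and it ignores flags and RewriteCond entirely. rewrite_apply() is a made-up name.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10         /* $0 (whole match) through $9 */

/*
 * Apply one pattern/substitution pair to a path.
 * Returns 1 if the rule matched and `out` holds the rewritten path,
 * 0 if the rule did not match, and -1 on a bad pattern or if `out`
 * is too small.
 */
int rewrite_apply(const char *pattern, const char *subst,
                  const char *path, char *out, size_t outlen)
{
  regex_t re;
  regmatch_t m[MAX_GROUPS];
  size_t used = 0;
  const char *p;
  int rc;

  if (outlen == 0)
    return -1;
  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return -1;

  rc = regexec(&re, path, MAX_GROUPS, m, 0);
  regfree(&re);
  if (rc != 0)
    return 0;

  for (p = subst; *p != '\0'; p++) {
    if (p[0] == '$' && p[1] >= '0' && p[1] <= '9') {
      /* Copy the text captured by group N, if it took part in the match. */
      int n = p[1] - '0';

      if (m[n].rm_so != -1) {
        size_t glen = (size_t) (m[n].rm_eo - m[n].rm_so);

        if (used + glen >= outlen)
          return -1;
        memcpy(out + used, path + m[n].rm_so, glen);
        used += glen;
      }
      p++;                    /* skip the digit */
    } else {
      /* Ordinary character: copy it through unchanged. */
      if (used + 1 >= outlen)
        return -1;
      out[used++] = *p;
    }
  }

  out[used] = '\0';
  return 1;
}
```

With the pattern ^/blog/([0-9]+)/?$ and the substitution /cgi-bin/show.cgi?id=$1, the path /blog/42 is rewritten to /cgi-bin/show.cgi?id=42. A real processor would then split the result at the "?" and update both the path and the query string in the struct.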

In terms of configuration I'm going to assume something like:

/etc/proxy/global.conf                    : Global directives.

/etc/proxy/steve.org.uk/
/etc/proxy/steve.org.uk/back-end          : Will contain 127.0.0.1:1019
/etc/proxy/steve.org.uk/rewrite.conf      : Will contain the domain-wide rules

/etc/proxy/blog.steve.org.uk/
/etc/proxy/blog.steve.org.uk/back-end     : Will contain 127.0.0.1:1015
/etc/proxy/blog.steve.org.uk/rewrite.conf : Will contain the domain-wide rules

..

That seems both sane & logical to me.
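Looking up the back-end for a given Host: header then becomes a matter of reading one small file. A minimal sketch under the layout above, assuming each back-end file holds a single "address:port" line; backend_for_host() is a hypothetical name, and the configuration root is passed in as a parameter rather than hard-coded to /etc/proxy:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Read "<root>/<host>/back-end" and split its "address:port" contents.
 * On success the address is copied into `addr`, the port is stored in
 * `*port`, and 0 is returned.  Returns -1 on any error.
 */
int backend_for_host(const char *root, const char *host,
                     char *addr, size_t addrlen, int *port)
{
  char file[512], line[128];
  char *colon, *end;
  FILE *fp;
  long p;
  int n;

  /* The host comes from the client, so refuse anything that could
   * escape the configuration directory. */
  if (host[0] == '\0' || host[0] == '.' || strchr(host, '/') != NULL)
    return -1;

  n = snprintf(file, sizeof file, "%s/%s/back-end", root, host);
  if (n < 0 || (size_t) n >= sizeof file)
    return -1;

  fp = fopen(file, "r");
  if (fp == NULL)
    return -1;
  if (fgets(line, sizeof line, fp) == NULL) {
    fclose(fp);
    return -1;
  }
  fclose(fp);

  /* Strip the trailing newline, then split at the last colon. */
  line[strcspn(line, "\r\n")] = '\0';
  colon = strrchr(line, ':');
  if (colon == NULL || colon == line)
    return -1;
  *colon = '\0';

  p = strtol(colon + 1, &end, 10);
  if (end == colon + 1 || *end != '\0' || p < 1 || p > 65535)
    return -1;
  if (strlen(line) >= addrlen)
    return -1;

  strcpy(addr, line);
  *port = (int) p;
  return 0;
}
```

Called as backend_for_host("/etc/proxy", "blog.steve.org.uk", ...) against the layout above, this would be expected to yield 127.0.0.1 and 1015. A missing directory doubles as the "unknown virtual host" case.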

ObQuote: "I shall control the fate of the world... " - Day Watch.

ObRandom: Coding in C is pleasant again.

Tags: c, http, proxy