A few weeks ago I was lamenting the lack of a reverse proxy that understood Apache's mod_rewrite syntax.
On my server I have a bunch of thttpd processes, each running under its own UID, each listening on a local port. For example I have this running:
thttpd -C /etc/thttpd/sites.enabled/blog.steve.org.uk
This is running under the Unix-user "s-blog", with UID 1015, and thus is listening on 127.0.0.1:1015:
steve@steve:~$ id s-blog
uid=1015(s-blog) gid=1016(s-blog) groups=1016(s-blog)
steve@steve:~$ lsof -i :1015
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
thttpd 26293 s-blog 0u IPv4 1072632 TCP localhost:1015 (LISTEN)
Anyway in front of these thttpd processes I have Apache. It does only two things:
- Expands mod_rewrite rules.
- Serves as a proxy to the back-end thttpd processes.
Yes, other webservers could be run in front of these processes, and yes, other webservers have their own rewrite-rule syntax. I'm going to say "Lala lala can't hear you". Why? Because mod_rewrite is the de facto standard. It is assumed, documented, and included with a gazillion projects from Wikipedia to WordPress to ...
So this weekend I decided I'd see what I needed to do to put together a simple proxy that only handled reverse HTTP proxying, and understood Apache's mod_rewrite rules.
I figure there are three main parts to such a beast:
- Be a network server
Parse the configuration file, accept connections, and use fork(), libevent, or similar such that you ultimately receive HTTP requests.
- Process HTTP Requests, rewriting as required
Once you have a parsed HTTP request you need to test it against each rule for the appropriate destination domain, rewriting the request as appropriate.
- Proxy
Send your (potentially modified) request to the back-end, and then send the response back to the client.
Thus far I've written code which takes this:
GET /etc/passwd?root=1 HTTP/1.0
Host: foo.example.com:80
Accept-Language: en-us,en-gb?q=0.5
Refer: http://foo.bar.com/
Keep-Alive: 300
Connection: keep-alive
Then turns it into this:
struct http_request {
    /*
     * Path being requested. "/index.html", etc.
     */
    char *path;

    /*
     * Query string
     */
    char *qs;

    /**
     * Method being requested "GET/POST/PUT/DELETE/HEAD/etc".
     */
    char *method;

    /**
     * A linked list of headers such as "Referer: foo",
     * "Connection: close"
     */
    struct http_headers *headers;
};
There are functions for turning that back into a string (so that you can send it on to the back-end), finding headers such as "Referer:", and so on.
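To make that concrete, here is a rough sketch of what a header lookup and a "back to a string" helper might look like. The layout of struct http_headers and the function names http_find_header() and http_request_to_string() are assumptions made up for illustration; the post does not show the real ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Hypothetical layout for the header list; the real one is not shown. */
struct http_headers {
    char *name;
    char *value;
    struct http_headers *next;
};

struct http_request {
    char *path;
    char *qs;
    char *method;
    struct http_headers *headers;
};

/* Return the value of the first header called `name`, or NULL.
 * Header names are compared case-insensitively. */
const char *http_find_header(const struct http_request *req, const char *name)
{
    assert(req != NULL && name != NULL);
    for (const struct http_headers *h = req->headers; h != NULL; h = h->next)
        if (strcasecmp(h->name, name) == 0)
            return h->value;
    return NULL;
}

/* Serialise the request as HTTP/1.0 text into buf.
 * Returns the number of bytes the full request needs (excluding the NUL),
 * so a return value >= size means the output was truncated. */
size_t http_request_to_string(const struct http_request *req,
                              char *buf, size_t size)
{
    assert(req != NULL && buf != NULL);
    size_t used = 0;
    int n = snprintf(buf, size, "%s %s%s%s HTTP/1.0\r\n",
                     req->method, req->path,
                     req->qs ? "?" : "", req->qs ? req->qs : "");
    if (n < 0)
        return 0;
    used += (size_t)n;

    for (const struct http_headers *h = req->headers; h != NULL; h = h->next) {
        /* Once the buffer is full, keep counting with a zero-sized write. */
        n = snprintf(used < size ? buf + used : NULL,
                     used < size ? size - used : 0,
                     "%s: %s\r\n", h->name, h->value);
        if (n < 0)
            return 0;
        used += (size_t)n;
    }

    /* Blank line terminating the header block. */
    n = snprintf(used < size ? buf + used : NULL,
                 used < size ? size - used : 0, "\r\n");
    if (n < 0)
        return 0;
    return used + (size_t)n;
}
```

Returning the required length, as snprintf() does, lets the caller notice truncation and retry with a bigger buffer rather than forwarding half a request to the back-end.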
The parser is pretty minimal C and takes only a "char *buffer" to operate on. It has survived a lot of malicious input, as proved by a whole bunch of test cases.
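As an illustration of the first step such a parser has to take, the sketch below splits just the request line into method, path, and query string. It is not the parser described above: parse_request_line() and dup_range() are hypothetical names, and header parsing is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct http_headers;            /* opaque here; not needed for this step */

struct http_request {
    char *path;
    char *qs;
    char *method;
    struct http_headers *headers;
};

/* Copy `len` bytes of `s` into a freshly allocated NUL-terminated string. */
static char *dup_range(const char *s, size_t len)
{
    char *out = malloc(len + 1);
    if (out != NULL) {
        memcpy(out, s, len);
        out[len] = '\0';
    }
    return out;
}

/* Parse "METHOD /path?query HTTP/x.y" from the start of `buffer`.
 * Returns 0 on success, -1 on malformed input or allocation failure.
 * On success the caller owns req->method, req->path and req->qs. */
int parse_request_line(const char *buffer, struct http_request *req)
{
    assert(buffer != NULL && req != NULL);
    req->path = req->qs = req->method = NULL;
    req->headers = NULL;

    /* The method runs up to the first space. */
    size_t mlen = strcspn(buffer, " \r\n");
    if (mlen == 0 || buffer[mlen] != ' ')
        return -1;

    /* The request target runs up to the next space or end of line. */
    const char *target = buffer + mlen + 1;
    size_t tlen = strcspn(target, " \r\n");
    if (tlen == 0 || target[0] != '/')
        return -1;

    /* Split the target at the first '?', if there is one. */
    const char *q = memchr(target, '?', tlen);
    size_t plen = q ? (size_t)(q - target) : tlen;

    req->method = dup_range(buffer, mlen);
    req->path = dup_range(target, plen);
    if (q != NULL)
        req->qs = dup_range(q + 1, tlen - plen - 1);

    if (req->method == NULL || req->path == NULL ||
        (q != NULL && req->qs == NULL)) {
        free(req->method);
        free(req->path);
        free(req->qs);
        req->path = req->qs = req->method = NULL;
        return -1;
    }
    return 0;
}
```

Working on length-bounded spans rather than modifying the buffer in place keeps the input read-only, which makes it easier to throw hostile test cases at.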
My next job is to code up the mod_rewrite rule-processor to apply a bunch of rules to one of these objects - updating the struct as we go. Assuming that I can get that part written cleanly over the next week or two, I'll be happy and can proceed to write the networking parts of the code - both the initial accepting, and the proxying to the back-end.
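The core of such a rule-processor might be sketched as below: match one pattern against the path and expand $N back-references in the substitution. This is only an approximation under stated assumptions. It uses POSIX extended regular expressions, whereas Apache's mod_rewrite uses PCRE, and it ignores RewriteCond and rule flags altogether. The name rewrite_apply() is made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GROUPS 10   /* $0 .. $9 */

/* Apply a single "pattern -> substitution" rule to `path`.
 * Returns a malloc()ed rewritten path if the pattern matched,
 * or NULL if it did not match (or on error). */
char *rewrite_apply(const char *pattern, const char *subst, const char *path)
{
    assert(pattern != NULL && subst != NULL && path != NULL);

    regex_t re;
    regmatch_t m[MAX_GROUPS];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;
    int rc = regexec(&re, path, MAX_GROUPS, m, 0);
    regfree(&re);
    if (rc != 0)
        return NULL;

    /* Generous upper bound: every character of the substitution
     * expands to at most the whole path. */
    size_t cap = strlen(subst) * (strlen(path) + 1) + 1;
    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    for (const char *s = subst; *s != '\0'; s++) {
        if (s[0] == '$' && s[1] >= '0' && s[1] <= '9') {
            int g = s[1] - '0';
            /* Groups that did not take part in the match expand to "". */
            if (m[g].rm_so != -1) {
                size_t len = (size_t)(m[g].rm_eo - m[g].rm_so);
                memcpy(out + o, path + m[g].rm_so, len);
                o += len;
            }
            s++;                /* skip the digit */
        } else {
            out[o++] = *s;
        }
    }
    out[o] = '\0';
    return out;
}
```

A real processor would compile each pattern once when the configuration is loaded rather than on every request, and would walk the rules for the destination domain in order, replacing the path in the struct after each match.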
In terms of configuration I'm going to assume something like:
/etc/proxy/global.conf : Global directives.
/etc/proxy/steve.org.uk/
/etc/proxy/steve.org.uk/back-end : Will contain 127.0.0.1:1019
/etc/proxy/steve.org.uk/rewrite.conf : Will contain the domain-wide rules
/etc/proxy/blog.steve.org.uk/
/etc/proxy/blog.steve.org.uk/back-end : Will contain 127.0.0.1:1015
/etc/proxy/blog.steve.org.uk/rewrite.conf : Will contain the domain-wide rules
..
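With that layout, finding the back-end for a request comes down to building a path from the domain and parsing an "address:port" pair. A minimal sketch follows, assuming the back-end file holds a single line such as 127.0.0.1:1015. The helper names backend_path() and parse_backend() are hypothetical, and actually reading the file is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "/etc/proxy/<domain>/back-end" in buf.
 * Returns 0 on success, -1 if the domain looks unsafe or buf is too small.
 * The domain would come from the Host: header, so refuse anything that
 * could escape the configuration directory. */
int backend_path(const char *domain, char *buf, size_t size)
{
    assert(domain != NULL && buf != NULL);
    if (domain[0] == '\0' || domain[0] == '.' || strchr(domain, '/') != NULL)
        return -1;
    int n = snprintf(buf, size, "/etc/proxy/%s/back-end", domain);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}

/* Parse "127.0.0.1:1015" (optionally followed by whitespace) into
 * host and port. Returns 0 on success, -1 on malformed input. */
int parse_backend(const char *text, char *host, size_t hostlen, int *port)
{
    assert(text != NULL && host != NULL && port != NULL);

    const char *colon = strrchr(text, ':');
    if (colon == NULL || colon == text)
        return -1;

    size_t len = (size_t)(colon - text);
    if (len >= hostlen)
        return -1;

    char *end;
    long p = strtol(colon + 1, &end, 10);
    if (end == colon + 1 || p < 1 || p > 65535)
        return -1;

    /* Tolerate a trailing newline or spaces, but nothing else. */
    while (*end == '\n' || *end == '\r' || *end == ' ' || *end == '\t')
        end++;
    if (*end != '\0')
        return -1;

    memcpy(host, text, len);
    host[len] = '\0';
    *port = (int)p;
    return 0;
}
```

One directory per domain means a lookup is a single file read keyed on the Host: header, with no global table to keep in sync. Note that a Host: value like foo.example.com:80 would need its port suffix stripped before being passed to backend_path().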
That seems both sane & logical to me.
ObQuote: "I shall control the fate of the world... " - Day Watch.
ObRandom: Coding in C is pleasant again.