Entries tagged c

Related tags: http, mod_rewrite, proxy, reverse proxy.

Minor success has been achieved in reimplementing mod_rewrite

Tuesday, 28 December 2010

So yesterday I mentioned the mod_rewrite-compatible proxy server. Today I've spent several hours getting to grips with that.

I've progressed far enough along that the trivial cases are handled, as the following test case shows:

void
TestParamteredMatch (CuTest * tc)
{
    /**
     * We pretend this came in via the network.
     */
    char *request = "GET /login/steve.kemp/secret#password HTTP/1.1\n\n";

    /**
     * This is the mod_rewrite rule we're going to test.
     */
    char *rule    = "RewriteRule "
                    "^/login/(.*)/(.*)/*$ "
                    "/cgi-bin/index.cgi?mode=login;lname=$1;lpass=$2";

    int res = 0;

    /* Parse the HTTP request */
    struct http_request *req = http_request_new (request);
    CuAssertPtrNotNull (tc, req);

    /* Ensure it looks sane. */
    CuAssertStrEquals(tc, "/login/steve.kemp/secret#password", req->path );

    /* Create the rewrite rule */
    struct rewrite_rule *r = rewrite_rule_new (rule);
    CuAssertPtrNotNull (tc, r);

    /* Assert it contains what we think it should. */
    CuAssertStrEquals(tc, "^/login/(.*)/(.*)/*$", r->pattern );

    /* Apply - expect success (==1) */
    res = rewrite_rule_apply( r, req );
    CuAssertIntEquals (tc, 1, res );

    /* Ensure path is updated. */
    CuAssertStrEquals(tc, "/cgi-bin/index.cgi?mode=login;lname=steve.kemp;lpass=secret#password", req->path );

    free_http_request (req);
    free_rewrite_rule (r);
}
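
For anyone wondering what the matching step behind a test like that might involve, here is a minimal sketch using the POSIX regex API. It is not the code under test: sketch_rewrite is a hypothetical helper, it only understands $1 to $9, and it ignores flags, query-string handling and everything else mod_rewrite does.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the real code): match `path` against
 * the POSIX extended regex `pattern` and expand $1..$9 in `subst`.
 *
 * Returns a malloc'd string the caller must free, or NULL if the
 * pattern fails to compile, does not match, or memory runs out.
 */
char *
sketch_rewrite (const char *pattern, const char *subst, const char *path)
{
    regex_t re;
    regmatch_t m[10];
    char *out = NULL;

    assert (pattern != NULL && subst != NULL && path != NULL);

    if (regcomp (&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    if (regexec (&re, path, 10, m, 0) == 0)
    {
        /* Each expansion is a slice of path, so this bound is safe. */
        size_t cap = strlen (subst) * (strlen (path) + 1) + 1;

        out = malloc (cap);
        if (out != NULL)
        {
            size_t o = 0;

            for (const char *s = subst; *s != '\0'; s++)
            {
                if (s[0] == '$' && s[1] >= '1' && s[1] <= '9')
                {
                    const regmatch_t *g = &m[s[1] - '0'];

                    /* Groups that did not take part are left at -1. */
                    if (g->rm_so >= 0)
                    {
                        size_t len = (size_t) (g->rm_eo - g->rm_so);
                        memcpy (out + o, path + g->rm_so, len);
                        o += len;
                    }
                    s++;        /* skip the digit after '$' */
                }
                else
                    out[o++] = *s;
            }
            out[o] = '\0';
        }
    }

    regfree (&re);
    return out;
}
```

Fed the pattern, substitution and path from the test case, this should produce the same rewritten path that the final assertion expects. Compiling the regex on every call is wasteful; a real rule object would presumably compile once and keep the regex_t around.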

So all is good? Sadly not.

I was expecting to handle a linked list of simple rules, but I've now realised that this isn't sufficient. Consider the following two (real) examples:

#
#  If the path is /robots.txt and the hostname isn't repository.steve.org.uk
# then redirect to the master one.
#
RewriteCond %{http_host} !^repository\.steve\.org\.uk
RewriteRule /robots.txt$  http://repository.steve.org.uk/robots.txt [R=permanent,L]

#
#  Request for :  http://foo.repository.steve.org.uk 
#  becomes:  http://repository.steve.org.uk/cgi-bin/hgwebdir.cgi/foo/file/tip
#
RewriteCond %{http_host} .
RewriteCond %{http_host} !^repository.steve.org.uk [NC]
RewriteCond %{http_host} ^([^.]+)\.repository.steve.org.uk [NC]
RewriteRule ^/$ http://repository.steve.org.uk/cgi-bin/hgwebdir.cgi/%1/file/tip [QSA,L]


So rather than having a simple linked list of rules for each domain I need to have a list of rules - each of which might in turn contain sub-rules. In terms of parsing this is harder than I'd like because it means I need to maintain state to marry up the RewriteCond & RewriteRules.
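
One way the "rules with sub-rules" idea could be sketched is below. All of the type and function names are made up for illustration. The only state the parser keeps is the list of RewriteCond lines seen since the last RewriteRule; when a rule arrives it adopts them. Freeing the lists is left out for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types; these names are not from the real code. */
struct sketch_cond
{
    char *text;                 /* e.g. "%{http_host} ." */
    struct sketch_cond *next;
};

struct sketch_rule
{
    char *text;                 /* e.g. "^/$ http://... [QSA,L]" */
    struct sketch_cond *conds;  /* conditions that guard this rule */
    struct sketch_rule *next;
};

/* Parser state: conditions seen since the last RewriteRule. */
struct sketch_parser
{
    struct sketch_cond *pending, *pending_tail;
    struct sketch_rule *head, *tail;
};

static char *
sketch_dup (const char *s)
{
    size_t n = strlen (s) + 1;
    char *copy = malloc (n);
    if (copy != NULL)
        memcpy (copy, s, n);
    return copy;
}

/* Feed one configuration line; returns 0 on success, -1 if out of memory. */
int
sketch_feed (struct sketch_parser *p, const char *line)
{
    assert (p != NULL && line != NULL);

    while (*line == ' ' || *line == '\t')
        line++;

    if (strncmp (line, "RewriteCond ", 12) == 0)
    {
        struct sketch_cond *c = calloc (1, sizeof *c);
        if (c == NULL || (c->text = sketch_dup (line + 12)) == NULL)
        {
            free (c);
            return -1;
        }
        /* Park the condition until its rule turns up. */
        if (p->pending_tail != NULL)
            p->pending_tail->next = c;
        else
            p->pending = c;
        p->pending_tail = c;
    }
    else if (strncmp (line, "RewriteRule ", 12) == 0)
    {
        struct sketch_rule *r = calloc (1, sizeof *r);
        if (r == NULL || (r->text = sketch_dup (line + 12)) == NULL)
        {
            free (r);
            return -1;
        }
        /* The rule adopts every parked condition, then state resets. */
        r->conds = p->pending;
        p->pending = p->pending_tail = NULL;
        if (p->tail != NULL)
            p->tail->next = r;
        else
            p->head = r;
        p->tail = r;
    }
    /* Comments, blank lines and unknown directives are ignored here. */
    return 0;
}
```

Feeding the two examples above line by line into a zeroed struct sketch_parser would leave two rules on the list, the first guarded by one condition and the second by three.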

Still the problem isn't insurmountable and I'm pleased with the progress I've made. Currently I can implement enough of mod_rewrite that I could handle all of my existing sites except the single site I have with the complex rule demonstrated above.

(In all honesty I guess I could simplify my setup by dropping the wildcard hostname handling for the repository.steve.org.uk name, but I do kinda like it, and it makes for simple canonical mercurial repositories.)

ObQuote: - 300


This weekend I have mostly been parsing HTTP

Monday, 27 December 2010

A few weeks ago I was lamenting the lack of a reverse proxy that understood Apache's mod_rewrite syntax.

On my server I have a bunch of thttpd processes, each running under their own UID, each listening on local ports. For example I have this running:

thttpd -C /etc/thttpd/sites.enabled/blog.steve.org.uk

This is running under the Unix-user "s-blog", with UID 1015, and thus is listening on 127.0.0.1:1015:

steve@steve:~$ id s-blog
uid=1015(s-blog) gid=1016(s-blog) groups=1016(s-blog)

steve@steve:~$ lsof -i :1015
COMMAND   PID   USER   FD   TYPE  DEVICE SIZE NODE NAME
thttpd  26293 s-blog    0u  IPv4 1072632       TCP localhost:1015 (LISTEN)

Anyway in front of these thttpd processes I have Apache. It does only two things:

  • Expands mod_rewrite rules.
  • Serves as a proxy to the back-end thttpd processes.

Yes other webservers could be run in front of these processes, and yes other webservers have their own rewrite-rule syntax. I'm going to say "Lala lala can't hear you". Why? Because mod_rewrite is the de facto standard. It is assumed, documented, and included with a gazillion projects from Wikipedia to WordPress to ...

So this weekend I decided I'd see what I needed to do to put together a simple proxy that only worked for reverse HTTP, and understood Apache's mod_rewrite rules.

I figure there are three main parts to such a beast:

Be a network server

Parse the configuration file, accept connections, and use fork(), libevent, or similar, such that you ultimately receive HTTP requests.

Process HTTP Requests, rewriting as required

Once you have a parsed HTTP request you need to test it against each rule for the appropriate destination domain, rewriting the request as appropriate.

Proxy

Send your (potentially modified) request to the back-end, and then send the response back to the client.

Thus far I've written code which takes this:

GET /etc/passwd?root=1 HTTP/1.0
Host: foo.example.com:80
Accept-Language: en-us,en-gb;q=0.5
Referer: http://foo.bar.com/
Keep-Alive: 300
Connection: keep-alive

Then turns it into this:

struct http_request
{
  /*
   * Path being requested. "/index.html", etc.
   */
  char *path;

  /*
   * Query string
   */
  char *qs;

  /**
   * Method being requested "GET/POST/PUT/DELETE/HEAD/etc".
   */
  char *method;

  /**
   * A linked list of headers such as "Referer: foo",
   * "Connection: close"
   */
  struct http_headers *headers;
};

There are methods for turning that back into a string (so that you can send it on to the back-end), finding headers such as "Referer:", and so on.
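
A header lookup of that kind might look roughly like the sketch below. The layout of struct http_headers isn't shown in this post, so the sketch_header type and its fields are assumptions made purely for illustration. Header names are compared without regard to case, as HTTP requires.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical node layout; the real struct http_headers may differ. */
struct sketch_header
{
    const char *name;           /* "Referer" */
    const char *value;          /* "http://foo.bar.com/" */
    struct sketch_header *next;
};

/* Case-insensitive equality for header names. */
static int
sketch_name_eq (const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the value of the first header called `name`, or NULL. */
const char *
sketch_find_header (const struct sketch_header *h, const char *name)
{
    assert (name != NULL);

    for (; h != NULL; h = h->next)
        if (sketch_name_eq (h->name, name))
            return h->value;

    return NULL;
}
```

A linear walk is fine here, since a request rarely carries more than a couple of dozen headers.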

The parser is pretty minimal C and takes only a "char *buffer" to operate on. It has survived a lot of malicious input, as proved by a whole bunch of test cases.
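
To give a flavour of what buffer-only parsing involves, here is a minimal sketch that handles just the request line. It is not the real parser: the sketch_ names are hypothetical, headers are ignored, and splitting the target at the first '?' is an assumption about how path and qs relate.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type, mirroring three fields of http_request. */
struct sketch_request_line
{
    char *method;               /* "GET" */
    char *path;                 /* "/etc/passwd" */
    char *qs;                   /* "root=1", or NULL when there is no '?' */
};

/* Copy `len` bytes from `start` into a fresh NUL-terminated string. */
static char *
sketch_slice (const char *start, size_t len)
{
    char *s = malloc (len + 1);
    if (s != NULL)
    {
        memcpy (s, start, len);
        s[len] = '\0';
    }
    return s;
}

/* Returns 0 on success, -1 on malformed input or allocation failure. */
int
sketch_parse_request_line (const char *buffer,
                           struct sketch_request_line *out)
{
    assert (buffer != NULL && out != NULL);
    out->method = out->path = out->qs = NULL;

    /* Method: everything up to the first space. */
    size_t mlen = strcspn (buffer, " \r\n");
    if (mlen == 0 || buffer[mlen] != ' ')
        return -1;

    /* Target: up to the next space; the protocol version must follow. */
    const char *target = buffer + mlen + 1;
    size_t tlen = strcspn (target, " \r\n");
    if (tlen == 0 || target[tlen] != ' ')
        return -1;

    /* Split the target at the first '?', if any. */
    size_t plen = 0;
    while (plen < tlen && target[plen] != '?')
        plen++;

    out->method = sketch_slice (buffer, mlen);
    out->path = sketch_slice (target, plen);
    if (plen < tlen)
        out->qs = sketch_slice (target + plen + 1, tlen - plen - 1);

    if (out->method == NULL || out->path == NULL
        || (plen < tlen && out->qs == NULL))
    {
        free (out->method);
        free (out->path);
        free (out->qs);
        out->method = out->path = out->qs = NULL;
        return -1;
    }
    return 0;
}
```

Nothing here trusts the input to be well formed: a line with no space, or no version after the target, is rejected rather than read past.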

My next job is to code up the mod_rewrite rule-processor to apply a bunch of rules to one of these objects - updating the struct as we go. Assuming that I can get that part written cleanly over the next week or two then I'll be happy and can proceed to write the networking parts of the code - both the initial accepting, and the proxying to the back-end.

In terms of configuration I'm going to assume something like:

/etc/proxy/global.conf                    : Global directives.

/etc/proxy/steve.org.uk/
/etc/proxy/steve.org.uk/back-end          : Will contain 127.0.0.1:1019
/etc/proxy/steve.org.uk/rewrite.conf      : Will contain the domain-wide rules

/etc/proxy/blog.steve.org.uk/
/etc/proxy/blog.steve.org.uk/back-end     : Will contain 127.0.0.1:1015
/etc/proxy/blog.steve.org.uk/rewrite.conf : Will contain the domain-wide rules

..

That seems both sane & logical to me.
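
As a rough sketch of how that layout could be consumed, the two hypothetical helpers below build the path of a domain's back-end file and parse its "address:port" contents. The function names, the /etc/proxy prefix being hard-coded, and the rejection of odd hostnames are all assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "/etc/proxy/<vhost>/back-end" into `out`.
 * Returns 0 on success, -1 if the name looks unsafe or does not fit.
 */
int
sketch_backend_path (char *out, size_t outlen, const char *vhost)
{
    assert (out != NULL && vhost != NULL);

    /* Refuse empty names and anything that could escape the directory. */
    if (vhost[0] == '\0' || vhost[0] == '.' || strchr (vhost, '/') != NULL)
        return -1;

    int n = snprintf (out, outlen, "/etc/proxy/%s/back-end", vhost);
    return (n < 0 || (size_t) n >= outlen) ? -1 : 0;
}

/*
 * Parse text such as "127.0.0.1:1015" (a trailing newline is allowed).
 * Returns 0 on success, -1 on malformed input.
 */
int
sketch_parse_backend (const char *text, char *host, size_t hostlen,
                      int *port)
{
    assert (text != NULL && host != NULL && port != NULL);

    const char *colon = strrchr (text, ':');
    if (colon == NULL || colon == text)
        return -1;

    size_t hlen = (size_t) (colon - text);
    if (hlen >= hostlen)
        return -1;

    char *end;
    long p = strtol (colon + 1, &end, 10);
    if (end == colon + 1 || (*end != '\0' && *end != '\n')
        || p < 1 || p > 65535)
        return -1;

    memcpy (host, text, hlen);
    host[hlen] = '\0';
    *port = (int) p;
    return 0;
}
```

Keeping one small file per setting means each one can be read with a single fgets() call and handed straight to a parser like this.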

ObQuote: "I shall control the fate of the world... " - Day Watch.

ObRandom: Coding in C is pleasant again.


Room for another reverse-proxy?

Friday, 17 September 2010

Like many people I use Apache's mod_proxy to proxy from *:80 to a bunch of servers running upon 127.0.0.1:XX.

(I've mentioned this too often; but in short I have a bunch of sites all running with thttpd under their own UID).

Why do I use Apache solely as a reverse proxy, instead of pound, varnish, nginx, or lighttpd? After all it is pretty heavy-weight. Well the answer to that is that I have a bunch of mod_rewrite rules.

So I'm wondering, could I drop Apache if I were to hack together a simple network proxy that would listen upon port 80, reading requests, and directing them to local servers? The answer to that is plainly "yes". There are many reverse-proxies around and writing them isn't hard.

So what would be the point? Imagine a reverse-proxy that understood mod_rewrite rules. That would rock.

In short we'd have to define three things:

  • Matching vhost name.
  • Destination to proxy to.
  • (Optionally) the mod_rewrite rules

Given something like this:

LISTEN 1.2.3.4:80

host example.net or host www.example.net
{
   #redirect traffic here
   proxy_to 127.0.0.1 1011

   RewriteRule /about /cgi-bin/index.cgi
}

When the proxy received an incoming request to http://example.net/about it would actually send the request /cgi-bin/index.cgi to the host 127.0.0.1:1011.
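
The vhost-matching half of that could be sketched as a table lookup keyed on the Host: header. Everything named sketch_ below is hypothetical. The comparison ignores case and an optional ":port" suffix, on the assumption that clients may send either form.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical table entry: one per "host ..." name in the config. */
struct sketch_vhost
{
    const char *name;           /* "example.net" */
    const char *backend_host;   /* "127.0.0.1" */
    int backend_port;           /* 1011 */
};

/* Does a Host: header value name this vhost, ignoring case and ":port"? */
static int
sketch_host_matches (const char *header, const char *name)
{
    while (*name != '\0')
    {
        if (tolower ((unsigned char) *header)
            != tolower ((unsigned char) *name))
            return 0;
        header++;
        name++;
    }
    /* The header must end here, or carry on with a port. */
    return *header == '\0' || *header == ':';
}

/* Return the first matching entry, or NULL when nothing matches. */
const struct sketch_vhost *
sketch_find_vhost (const struct sketch_vhost *table, size_t count,
                   const char *host_header)
{
    assert (host_header != NULL);

    for (size_t i = 0; i < count; i++)
        if (sketch_host_matches (host_header, table[i].name))
            return &table[i];

    return NULL;
}
```

With the configuration above, both example.net and www.example.net would be table entries pointing at 127.0.0.1 port 1011, and a request whose Host: header matched neither would need some default handling.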

That seems neither too complex nor too impossible.

The hard part would be emulating mod_rewrite 100%. Especially chained requests. I would be willing to write the trivial version, but I suspect the full emulation would be a job of diminishing returns. Am I right?

ObSubject: You crossed the line first, sir. - "The Dark Knight"

