

Entries posted in January 2008

It's a lot like life

3 January 2008 21:50

Assume for a moment that you have 148 hosts logging, via syslog-ng, to a central host. That host is recording all log entries into a MySQL database. Assume that between them these machines are producing a total of 4698816 lines per day.

(Crazy random numbers pulled from thin air; globviously).

Now the question: How do you process, read, or pay attention to those logs?

Here is what we've done so far:


All the syslog-ng client machines are logging to a central machine, which inserts the records into a database.

This database may be queried using the php-syslog-ng script. Unfortunately this search is relatively slow, and the user-interface is appallingly bad: it allows only searches, with no view of the most recent logs, no auto-refreshing via AJAX, etc.

RSS feeds

To remedy the slowness and poor usability of the PHP front-end to the database, I wrote a quick hack which produces RSS feeds via queries against that same database, accessed via URIs such as:

  • http://example.com/feeds/DriveReady
  • http://example.com/feeds/host/host1

The first query returns an RSS feed of log entries containing the given term. The second shows all recent entries from the machine host1.
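A minimal sketch of the term-based feed, under loud assumptions: the table name `logs`, its columns, and the function `feed_for_term` are all made up for illustration, and an in-memory SQLite database stands in for the MySQL one purely so the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

def feed_for_term(db, term, limit=50):
    """Build an RSS 2.0 document of recent log entries containing `term`."""
    # Hypothetical schema: logs(host, logged_at, msg).
    rows = db.execute(
        "SELECT host, logged_at, msg FROM logs "
        "WHERE msg LIKE ? ORDER BY logged_at DESC LIMIT ?",
        ("%" + term + "%", limit),
    ).fetchall()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Log entries matching " + term
    ET.SubElement(channel, "link").text = "http://example.com/feeds/" + term
    ET.SubElement(channel, "description").text = "Recent syslog entries"
    for host, logged_at, msg in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = host + ": " + msg
        ET.SubElement(item, "description").text = logged_at + " " + msg
    return ET.tostring(rss, encoding="unicode")

# Made-up sample rows, only to exercise the function.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (host TEXT, logged_at TEXT, msg TEXT)")
db.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    ("host1", "2008-01-03 21:49:00", "kernel: hda: DriveReady SeekComplete Error"),
    ("host2", "2008-01-03 21:50:00", "sshd[1]: session opened"),
])
print(feed_for_term(db, "DriveReady"))
```

The per-host feed would be the same query with `WHERE host = ?` in place of the `LIKE` clause.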

That works nicely for a fixed set of patterns, but the problem with this approach, and that of php-syslog-ng in general, is that it will only show you things that you look for - it won't volunteer trends, patterns, or news.

The fundamental problem is a lack of notion in either system of "recent messages worth reading" (on a global or per-machine basis).

To put that into perspective: given a logfile from one host containing, say, 3740 lines, there are only approximately 814 unique lines if you ignore the date + timestamp.

Reducing log entries by that amount (78% decrease) is a significant saving, but even so you wouldn't want to read 22% of our original 4698816 lines of logs, as that is still over a million log entries.
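For anyone wanting to repeat that measurement, here is a rough sketch. It assumes each line starts with a classic syslog timestamp such as "Jan  3 21:50:01"; the names `strip_timestamp` and `unique_ratio`, and the sample lines, are invented for the example.

```python
import re

# Assumption: lines start with a classic syslog timestamp, e.g. "Jan  3 21:50:01 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+")

def strip_timestamp(line):
    """Return the line with its leading date + time removed."""
    return TIMESTAMP.sub("", line.rstrip("\n"))

def unique_ratio(lines):
    """Return (total, unique, percentage decrease), ignoring timestamps."""
    lines = list(lines)
    total = len(lines)
    unique = len({strip_timestamp(line) for line in lines})
    decrease = 100.0 * (total - unique) / total if total else 0.0
    return total, unique, decrease

# Made-up sample: the first two lines differ only in their timestamp.
sample = [
    "Jan  3 21:50:01 host1 cron[123]: job started",
    "Jan  3 21:51:01 host1 cron[123]: job started",
    "Jan  3 21:52:07 host1 sshd[456]: session opened",
]
print(unique_ratio(sample))
```

With the figures above, (3740 - 814) / 3740 comes to roughly 78%, which is where the decrease quoted in the text comes from.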

I guess we could trim the results down further via a pipe through logcheck or similar, but I can't help thinking that still isn't going to give us enough interesting things to view.

To reiterate I would like to see:

  • per-machine anomalies.
  • global anomalies.

To that end I've been working on something, but I'm not too sure yet if it will go anywhere... In brief you take the logfiles and tokenize, then you record the token frequencies as groups within a given host's prior records. Unique pairings == logs you want to see.

(i.e. token frequency analysis on things like "<auth.info> yuling.example.com sshd[28427]: Did not receive identification string from")
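One possible reading of that idea, as a sketch rather than the actual work in progress: treat adjacent tokens as pairs, remember which pairs each host has produced before, and flag any line containing a pair that host has never produced. Masking digits (so every sshd PID looks alike), the class name `HostHistory`, and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(message):
    """Split a message into tokens, masking digits so PIDs and ports don't look unique."""
    return [re.sub(r"\d+", "N", tok) for tok in message.split()]

def pairs(tokens):
    """Adjacent token pairs for one message."""
    return list(zip(tokens, tokens[1:]))

class HostHistory:
    """Per-host counts of token pairs seen in earlier log lines."""

    def __init__(self):
        self.seen = defaultdict(Counter)   # host -> Counter of pairs

    def novel_pairs(self, host, message):
        """Return the pairs never seen before for this host, then record them all."""
        found = pairs(tokenize(message))
        new = [p for p in found if self.seen[host][p] == 0]
        self.seen[host].update(found)
        return new

history = HostHistory()
log = [  # made-up sample lines
    ("host1", "sshd[28427]: Did not receive identification string from"),
    ("host1", "sshd[28431]: Did not receive identification string from"),
    ("host1", "kernel: hda: DriveReady SeekComplete Error"),
]
for host, message in log:
    if history.novel_pairs(host, message):
        print("worth reading:", host, message)
```

Here the second sshd line is suppressed because, once the PID is masked, every pair in it has been seen before on host1. A global view could share one `Counter` across hosts instead of keeping one per host.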

What do other people do? There must be a huge market for this? Even amongst people who don't have more than 20 machines!



I still got the blues for you

11 January 2008 21:50

Been a week since I posted. I've not done much, though I did complete most of the migration of my planet-searching code to gluck.debian.org.

This is now logging to a local SQLite database, and available online.

I've updated the blog software so that I can restrict comments to posts made within the past N days - which has helped with spam.

My other comment-spam system is the use of the crm114 mail filter. I have a separate database now for comments (distinct from the one I use for email), and after training on previous comments all is good.

Other than being a little busy over the past week life is good. Especially when I got to tell a recruitment agent that I didn't consider London to be within "Edinburgh & Surrounding Region". Muppets.

The biggest downside of the week was "discovering" a security problem in Java, which had been reported in 2003 and is still unfixed. Grr. (CVE-2003-1156 for those playing along at home).

Here's the code:

#  Grep for potentially unsafe /tmp usage in shared libraries

# Use mktemp rather than a predictable /tmp/$$ name.
list=$(mktemp)
find /lib /usr/lib -name '*.so' -type f -print > "$list"
while IFS= read -r i; do
    out=$(strings "$i" | grep '/tmp')
    if [ -n "$out" ]; then
        echo "$i"
        echo "$out"
    fi
done < "$list"
rm -f "$list"



Never made it as a wise man

16 January 2008 21:50

Recently I bought a new desktop PC and rotated my existing machines around, leaving me with one spare. Last night I donated that PC to a local friend (coincidentally the same woman who had been our cat-sitter when I went to stay at Megan's parents for Christmas).

I said "It's got Debian on it, that funny operating system that you used when you were looking after our kitten, remember?".

She said "Linux seems to be getting popular these days, I might as well try it out".

But seriously she used it. It had GNOME, it had Firefox, she seemed happy. It isn't rocket science these days to point a technicalish user at a Debian Desktop and expect them to know what to do with it. The only minor complication was the lack of flash/java since it was etch/amd64 - but that's probably a plus with the amount of flash adverts!

In other news I made a new release of asql last night. This now copes with both Apache's common and combined logformats by default. (Yet another tool of mine which is now being used at work, which motivated the change.)

I also started pushing commits to xen-tools last night, so there'll be a new release soon. Mostly I'm bored with that code though; it doesn't need significant updates to work any longer. All the interesting things have been done, so it's probably only a matter of time until I drift off. (To the extent that I've unsubscribed myself from the xen-users & xen-devel mailing lists.)

Virtualisation is increasingly a commodity these days. If xen dies people probably won't care, there are enough other projects out there doing similar things. Being tied to one particular platform will probably come to be regarded as a mistake.

(Talking of which I should try out this lguest thing; I just find it hard to be motivated...)



I hear it every day

17 January 2008 21:50

It bothers me that my Tor usage is less than I'd like because it is just so fiddly.

When it comes to privacy I want to keep things simple. I want to use Tor, but I don't want to use it for things that aren't sane.

In practice that means I want to use Tor for a small amount of browsing:

  • When the host is a.com, b.com, or c.com
  • When the traffic is not over SSL.

To do that I have to install privoxy, and use that with a configuration file like this:

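# don't forward by default.
forward           /      .

# but we do want tor on these three sites
# (assuming Tor's SOCKS listener is on its default 127.0.0.1:9050;
# socks4a so that hostnames are resolved through Tor too):
forward-socks4a   a.com/ 127.0.0.1:9050 .
forward-socks4a   b.com/ 127.0.0.1:9050 .
forward-socks4a   c.com/ 127.0.0.1:9050 .

# don't forward HTTPS, ever; listed last so that it overrides the rules above
forward           :443   .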

I'm using absolutely nothing else in my Privoxy configuration, so it seems like overkill.

I'd love to hear about a simple rule-based proxy-chaining tool; if there is one out there, let me know, lazyweb.

If not, it shouldn't be too hard to write one with the Net::Proxy & Net::Socks module(s), driven by a configuration something like this:

  listen 1234

  hostname one.com
  port != 443
  proxy socks localhost 8050

  hostname two.com
  port != 443
  proxy socks localhost 8050

  hostname foo.com
  port = 80
  proxy localhost 8000
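To show that the rule format above is easy to act on, here is a rough sketch of a parser and matcher, in Python rather than the Perl modules mentioned. Everything about the semantics is an assumption on my part: stanzas are separated by blank lines, hostnames match exactly, the first matching rule wins, a `proxy` line without `socks` means a plain HTTP proxy, and no match means a direct connection. The function names are invented.

```python
import re

def parse_rules(text):
    """Parse blank-line-separated stanzas into (listen_port, [rule dicts])."""
    listen, rules = None, []
    for stanza in re.split(r"\n\s*\n", text.strip()):
        rule = {}
        for line in stanza.splitlines():
            words = line.split()
            if not words:
                continue
            key, args = words[0], words[1:]
            if key == "listen":
                listen = int(args[0])
            elif key == "hostname":
                rule["hostname"] = args[0]
            elif key == "port":
                rule["port_op"], rule["port"] = args[0], int(args[1])
            elif key == "proxy":
                # "proxy socks HOST PORT", or "proxy HOST PORT" (assumed HTTP proxy)
                kind = "socks" if args[0] == "socks" else "http"
                rule["proxy"] = (kind, args[-2], int(args[-1]))
        if "proxy" in rule:
            rules.append(rule)
    return listen, rules

def choose_proxy(rules, hostname, port):
    """Return the proxy of the first matching rule, or None for a direct connection."""
    for rule in rules:
        if rule.get("hostname") not in (None, hostname):
            continue
        if "port" in rule:
            same = (port == rule["port"])
            if same != (rule["port_op"] == "="):
                continue
        return rule["proxy"]
    return None

CONFIG = """
  listen 1234

  hostname one.com
  port != 443
  proxy socks localhost 8050

  hostname foo.com
  port = 80
  proxy localhost 8000
"""

listen_port, rules = parse_rules(CONFIG)
print(listen_port, choose_proxy(rules, "one.com", 80))
```

The actual listening and relaying is left out; this only covers the decision of which upstream, if any, a given host and port should use.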



Painted wings and giant rings make way for other toys.

25 January 2008 21:50

This week has mostly involved me getting my live mail filtering site up and running with a guinea pig or two.

This uses a custom user interface to allow users to manage the filtering settings for an entire domain:

  • spam filtering.
  • virus scanning.
  • greylisting.
  • sender/recipient whitelisting.
  • DNS-based blacklists.

In terms of implementation this is an SMTP proxy which is built upon the qpsmtpd framework. I've got both the user interface and a collection of plugins reading all data from a MySQL database.

The practical upshot is that if you use the service you'll get less spam, and anything that has been rejected will appear in an online browsable quarantine for a period of time, allowing you to view mistakes/rejected mails.

Provided you've got the spam-filtering plugin enabled for your domain, any mail you didn't want may be sent back to be trained as spam.

It scales nicely, doesn't appear to have lost any mail ever in real-world testing, and could be useful.



There's a hell of a lot more to me

27 January 2008 21:50

This weekend has been an interesting mix of activities. Mostly I've been tweaking my mail filtering service; now that it has more users it is more interesting to do that.

The basic process of mail-scanning is pretty simple, but there are some fun things in the mix which make it slightly more fiddly than I'd like.

The basic recipe goes something like this:

  • Accept mail.
  • Validate the mail is addressed to a domain hosted upon the machine.
  • Do the spam filtering / magic (many steps missing here)
    • If the mail should be rejected archive it to a local Maildir folder and bounce it.
    • If the mail should be accepted then forward it to the destination machine.
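The recipe above can be sketched in a few lines. This is an illustration only, in Python rather than as qpsmtpd plugins: `HOSTED_DOMAINS`, `looks_like_spam`, and `handle` are hypothetical names, the spam check is a crude stand-in for the many missing steps, and bouncing and forwarding are left to the caller.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

HOSTED_DOMAINS = {"example.com"}   # hypothetical: domains this machine filters for

def looks_like_spam(msg):
    """Stand-in for the many real filtering steps: a crude subject check."""
    return "viagra" in (msg["Subject"] or "").lower()

def handle(msg, recipient, quarantine_dir, forward):
    """Run one message through the recipe; return 'refused', 'rejected' or 'accepted'."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    if domain not in HOSTED_DOMAINS:
        return "refused"               # not addressed to a domain we host
    if looks_like_spam(msg):
        box = mailbox.Maildir(quarantine_dir, create=True)
        box.add(msg)                   # archive so that mistakes can be undone later
        box.close()
        return "rejected"              # the caller bounces the message
    forward(msg, recipient)            # hand off to the destination machine
    return "accepted"

# Usage with a throwaway quarantine directory and a forwarder that only records.
quarantine = os.path.join(tempfile.mkdtemp(), "quarantine")
forwarded = []
msg = EmailMessage()
msg["Subject"] = "Lunch tomorrow?"
msg.set_content("See you at one.")
print(handle(msg, "bob@example.com", quarantine, lambda m, r: forwarded.append(r)))
```

Archiving before bouncing is the important ordering: the quarantine copy exists even if the bounce itself goes wrong.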

The archiving of all rejected messages is a big win. It means that if there is a mistake in the handling of any mail we could undo it, retraining the spam database etc. It also provides, via a web page/RSS feed, a way for a user to see what a good job the filtering system is doing - by saying "Here's what you would have had ...".

Today I switched the way that the archived mail is displayed via the Web GUI. Previously I used some nasty Maildir parsing code, but now I'm running IMAP upon localhost - so the viewing of messages is a lot more straightforward. (via Net::IMAP::Simple.)

More interestingly, to most readers I'm sure, today I managed to take a new kite out for flying. A cold and windy day, but lots of fun. There was beer, pies, and near-death!

This was also the second weekend I carried out some painting of my front room. At this rate I'll have painted all four walls of the room in less than two months! (The last time I painted a room it took approximately six months to complete. Move furniture & paint one wall. Wait several weeks, then repeat until all walls are complete!)



Never Say Goodbye

31 January 2008 21:50

If you try using some of my software, or any software come to think of it, and it doesn't work or causes you problems, then there is a simple solution.

Tell me. I might not be able to fix it immediately, I might not ever be able to fix it. But chances are I can, and if there is a record it'll help others out in the future regardless.

I've bumped into this in the past: "Oh yes I tried to use that tool you wrote but it didn't work, so I ended up with something else."