On my own personal machines I find logfiles invaluable when installing new services, but otherwise I generally ignore the bulk of the data. (I read one mail from each machine containing a summary of the day's events.)
When you're looking after a large network, having access to logfiles is very useful. It lets you see interesting things - if you take the time to look and if you have an idea of what you want to find out.
So, what do people do with their syslog data? How do you use it?
In many ways the problem of processing the data is that you have two conflicting goals:
- You most likely want to have access to all logfiles recorded.
- You want to pick out "unusual" and "important" data.
While you can easily find unique messages (given a history of all prior entries), it becomes a challenge to allow useful searches given the volume of data.
Consider a network of 100 machines. Syslog data for a single host can easily exceed 1,000,000 lines in a single day. (The total number of lines written beneath /var/log/ on the machine hosting www.debian-administration.org was 542,707 for the previous 24 hours. It is not a particularly busy machine, and the Apache logs were excluded.)
Right now I've got a couple of syslog-ng servers which simply accept all incoming messages from a large network and filter them briefly. The remaining messages are inserted into MySQL via a FIFO. This approach is not very scalable and results in a table with millions of rows, which is not pleasant to search.
I'm in the process of coming up with a replacement system - but at the same time I suspect that any real solution will depend a lot on what is useful to pull out.
On the one hand, keeping only unique messages makes spotting new things easy. On the other hand, if you start filtering out too much you lose detail. For example, if you took 10 lines like this and removed all but one, you would lose important details about the number of attacks you've had:
- Refused connect from 2001:0:53aa:64c:c9a:xx:xx:xx
Obviously you could come up with a database schema that had something like "count,message" and other host-tables which showed you where. The point I'm trying to make is that naive folding can mean you miss the fact that the same user tried to log in to host A, host B, and host C.
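As a rough illustration of the "count,message" idea, here is a minimal Python sketch that folds identical messages while still remembering which hosts they came from. The host names and the message text are made up for the example, and an in-memory dictionary stands in for whatever database tables you would really use.

```python
from collections import defaultdict

def fold(records):
    """Collapse identical messages, keeping a per-host count for each one."""
    folded = defaultdict(lambda: defaultdict(int))
    for host, message in records:
        folded[message][host] += 1
    return folded

# Hypothetical input: (host, message) pairs as a collector might see them.
records = [
    ("hostA", "Failed password for invalid user bob"),
    ("hostB", "Failed password for invalid user bob"),
    ("hostC", "Failed password for invalid user bob"),
    ("hostA", "Failed password for invalid user bob"),
]

for message, hosts in fold(records).items():
    total = sum(hosts.values())
    print(f"{total:>4} {message} (hosts: {', '.join(sorted(hosts))})")
```

Folding this way keeps the single "unique" line you would show by default, but the total and the list of hosts are still there when you want to know how widespread something was.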
I'm rambling now, but I guess the point I'm trying to make is that depending on what you care about your optimisations will differ - and until you've done it you probably don't know what you want or need to keep, or how to organise it.
I hacked up a simple syslog server which accepts messages on port 514 via UDP and writes them to a FIFO, /tmp/sys.log. A Perl client now reads messages from that pipe and writes them to local logfiles - so that I can see the kind of messages that I can filter, collapse, or ignore.
It's interesting to see the spread of severity. Things like NOTICE, INFO, and DEBUG can probably be ignored and just never examined .. but maybe, just maybe, there is the odd daemon that writes interesting things with them..? Fun challenge.
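For anyone wanting to try the same experiment, a receiver along these lines can be sketched in a few lines of Python. This is not the server described above, just an assumption-laden stand-in: it expects the classic BSD syslog framing where each datagram starts with `<PRI>` and PRI is facility * 8 + severity, it assumes the FIFO already exists (e.g. created with mkfifo), and the output line format is invented for the example.

```python
import re
import socket

SEVERITIES = ["EMERG", "ALERT", "CRIT", "ERR", "WARNING", "NOTICE", "INFO", "DEBUG"]
# Informal labels for facility codes 0-23; 12-15 are rarely seen in practice.
FACILITIES = ["kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
              "uucp", "cron", "authpriv", "ftp", "ntp", "audit", "alert", "clock"]
FACILITIES += [f"local{i}" for i in range(8)]

PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.S)

def decode(datagram):
    """Return (facility, severity, rest) or None if the datagram has no valid PRI."""
    text = datagram.decode("utf-8", errors="replace").rstrip("\n")
    match = PRI_RE.match(text)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    if facility >= len(FACILITIES):
        return None
    return FACILITIES[facility], SEVERITIES[severity], match.group(2)

def serve(host="0.0.0.0", port=514, fifo="/tmp/sys.log"):
    """Receive UDP syslog datagrams and write one decoded line per message to the FIFO."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))  # binding to port 514 normally needs root
    with open(fifo, "w") as out:  # blocks until a reader opens the FIFO
        while True:
            data, (addr, _port) = sock.recvfrom(8192)
            decoded = decode(data)
            if decoded is not None:
                facility, severity, rest = decoded
                out.write(f"{addr} {severity} {facility} {rest}\n")
                out.flush()
```

Calling `serve()` starts the loop. Keeping `decode()` separate means the priority handling can be checked without opening a socket.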
Currently I'm producing files like this:
/var/log/skxlog/DEBUG.user.cron.log
/var/log/skxlog/ALERT.user.cron.log
/var/log/skxlog/INFO.authpriv.sshd.log
The intention is to get a reasonably good understanding of which facilities, priorities, and programs are the biggest loggers. After that I can actually decide how to proceed.
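To give an idea of how that tally might work, here is a small Python sketch that builds file names in the SEVERITY.facility.program.log pattern and counts which combinations log the most. The sample entries are invented, and replacing "/" in program names is my own precaution rather than anything the setup above is known to do.

```python
from collections import Counter

def log_path(severity, facility, program, root="/var/log/skxlog"):
    """Build a per-combination logfile name: SEVERITY.facility.program.log."""
    safe_program = program.replace("/", "_")  # keep program names from escaping root
    return f"{root}/{severity}.{facility}.{safe_program}.log"

def biggest_loggers(entries, top=3):
    """Return the most frequent (severity, facility, program) combinations."""
    return Counter(entries).most_common(top)

# Hypothetical sample of decoded messages.
entries = [
    ("INFO", "authpriv", "sshd"),
    ("DEBUG", "user", "cron"),
    ("INFO", "authpriv", "sshd"),
    ("ALERT", "user", "cron"),
]

for (severity, facility, program), count in biggest_loggers(entries):
    print(count, log_path(severity, facility, program))
```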
Remember I said you might just ignore INFO severities? If you do that you miss:
IP=127.0.0.1 Severity:INFO Facility:authpriv DATE:Jul 18 15:06:28 PROGRAM:sshd PID:18770 MSG:Failed password for invalid user rootf from 127.0.0.1 port 53270 ssh2
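One way around that is to drop low severities by default but let through anything matching a list of patterns you care about. The Python sketch below assumes lines in exactly the format shown above (including a PID field, which not every program will supply), and the two patterns are only examples.

```python
import re

LINE_RE = re.compile(
    r"IP=(?P<ip>\S+) Severity:(?P<severity>\S+) Facility:(?P<facility>\S+) "
    r"DATE:(?P<date>.+?) PROGRAM:(?P<program>\S+) PID:(?P<pid>\d+) MSG:(?P<msg>.*)"
)

LOW_SEVERITIES = {"NOTICE", "INFO", "DEBUG"}
# Example patterns only; a real list would grow as you learn what matters.
INTERESTING = [re.compile(r"Failed password"), re.compile(r"Refused connect")]

def keep(line):
    """Decide whether a line deserves attention."""
    match = LINE_RE.match(line)
    if match is None:
        return True  # keep anything unparseable rather than silently dropping it
    if match["severity"] not in LOW_SEVERITIES:
        return True
    return any(pattern.search(match["msg"]) for pattern in INTERESTING)

SAMPLE = ("IP=127.0.0.1 Severity:INFO Facility:authpriv DATE:Jul 18 15:06:28 "
          "PROGRAM:sshd PID:18770 MSG:Failed password for invalid user rootf "
          "from 127.0.0.1 port 53270 ssh2")

print(keep(SAMPLE))
```

The pattern list is the part that depends on what you care about, which is rather the point: you only find out what belongs in it by looking at the data first.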
ObFilm: From Dusk Til Dawn
Tags: data, homework, logging, syslog
9 comments
I like the concept. What do you think, maybe we (the system administrators) need a piece of software that would present that data to us in a usable way:
* web based with interactivity
* 'spam' filters that we can train and correct manually that know what is normal 'white noise' from our systems (don't display, but save for later)
* use decision support system concepts to highlight where things are OK, where things are unclear, and where things are bad
* use data mining approaches to allow an administrator to query, drill down and examine logs in all kinds of ways after a bad event, such as an intrusion
* a way for sysadmins to share generalized recipes of normal and bad stuff with each other easily
On the other hand, this sounds so logical that someone must have already made something like this.