

Entries tagged homework

I didn't make a statement. I asked a question. Would you like me to ask it again?

18 July 2009 21:50

On my own personal machines I find logfiles invaluable when installing new services, but otherwise I generally ignore the bulk of the data. (I read one mail from each machine containing a summary of the day's events.)

When you're looking after a large network having access to logfiles is very useful. It lets you see interesting things - if you take the time to look and if you have an idea of what you want to find out.

So, what do people do with their syslog data? How do you use it?

In many ways the problem of processing the data is that you have two conflicting goals:

  • You most likely want to have access to all logfiles recorded.
  • You want to pick out "unusual" and "important" data.

While you can easily find unique messages (given a history of all prior entries), the sheer volume of data makes it a challenge to support useful searches.

Consider a network of 100 machines. Syslog data for a single host can easily exceed 1,000,000 lines in a single day. (The total number of lines written beneath /var/log/ on the machine hosting www.debian-administration.org was 542,707 for the previous 24 hours. It is not a particularly busy machine, and the Apache logs were excluded.)

Right now I've got a couple of syslog-ng servers which simply accept all incoming messages from a large network and filter them briefly. The remaining messages are inserted into MySQL via a FIFO. This approach is not very scalable, and results in a table with millions of rows which is not pleasant to search.

I'm in the process of coming up with a replacement system - but at the same time I suspect that any real solution will depend a lot on what is useful to pull out.

On the one hand, keeping only unique messages makes spotting new things easy. On the other hand, if you start filtering out too much you lose detail. For example, if you took 10 lines like this and removed all but one, you would lose important details about the number of attacks you've had:

  • Refused connect from 2001:0:53aa:64c:c9a:xx:xx:xx

Obviously you could come up with a database schema that had something like "count,message", plus other per-host tables which showed you where each message came from. The point I'm trying to make is that naive folding can mean you miss the fact that user admin@ tried to log in to host A, host B, and host C.
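To make the "count,message" idea concrete, here is a minimal Python sketch of folding that keeps a per-host tally alongside each message. It is only an illustration under my own assumptions: the names fold and summarise, and the (host, message) input shape, are invented for the example and are not part of any existing system.

```python
from collections import defaultdict


def fold(records):
    """Collapse (host, message) pairs into message -> {host: count}."""
    folded = defaultdict(lambda: defaultdict(int))
    for host, message in records:
        folded[message][host] += 1
    return folded


def summarise(folded):
    """Return (total, message, per-host counts) rows, busiest message first."""
    rows = []
    for message, hosts in folded.items():
        rows.append((sum(hosts.values()), message, dict(hosts)))
    # Sort by descending total, then by message text for a stable order.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows
```

Folding this way still gives one row per distinct message, but the per-host counts mean a login attempt seen on host A, host B, and host C stays visible as three hosts rather than vanishing into a single line.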

I'm rambling now, but I guess the point I'm trying to make is that depending on what you care about your optimisations will differ - and until you've done it you probably don't know what you want or need to keep, or how to organise it.

I hacked up a simple syslog server which accepts messages on port 514 via UDP and writes them to a FIFO, /tmp/sys.log. I'm now using a Perl client to read messages from that pipe and write them to local logfiles - so that I can see the kind of messages that I can filter, collapse, or ignore.
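A receiver of that kind could be sketched roughly as follows. This is not the server described above, just a minimal Python illustration; the function name receive_messages and its parameters are hypothetical, and it assumes an IPv4 UDP socket and one syslog message per datagram.

```python
import socket


def receive_messages(sock, out, limit):
    """Read `limit` datagrams from `sock`, writing one line per message to `out`.

    `sock` is an already-bound IPv4 UDP socket and `out` is any writable
    text stream (for example an open handle on a FIFO).
    """
    for _ in range(limit):
        data, (host, _port) = sock.recvfrom(8192)
        # Syslog payloads are not guaranteed to be valid UTF-8.
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        out.write(f"{host} {text}\n")
    out.flush()
```

To use it for real you would bind a socket.SOCK_DGRAM socket to port 514 (which usually needs root) and pass in a handle opened on the FIFO. Opening a FIFO for writing blocks until something is reading from the other end.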

It's interesting to see the spread of severities. Things like NOTICE, INFO, and DEBUG can probably be ignored and just never examined .. but maybe, just maybe, there is the odd daemon that writes interesting things with them..? Fun challenge.

Currently I'm producing files like this:


The intention is to get a reasonably good understanding of which facilities, priorities, and programs are the biggest loggers. After that I can actually decide how to proceed.
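As a rough illustration of that kind of tally, here is a Python sketch which decodes the priority value at the front of each raw message and counts by facility, severity, and program. It assumes traditional BSD-style lines with the leading priority value intact ("<PRI>Mmm dd hh:mm:ss host program[pid]: ..."), where the value is facility * 8 + severity. The names classify and tally are made up for the example.

```python
import re
from collections import Counter

# Facility and severity names by numeric code, as used by the syslog protocol.
FACILITIES = [
    "kern", "user", "mail", "daemon", "auth", "syslog", "lpr", "news",
    "uucp", "cron", "authpriv", "ftp", "ntp", "audit", "alert", "clock",
    "local0", "local1", "local2", "local3",
    "local4", "local5", "local6", "local7",
]
SEVERITIES = [
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]

# <PRI>, timestamp, hostname, then the program name with an optional [pid].
LINE = re.compile(
    r"^<(\d{1,3})>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ "
    r"([^\s:\[]+)(?:\[\d+\])?:"
)


def classify(line):
    """Return (facility, severity, program), or None if the line doesn't parse."""
    match = LINE.match(line)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    if facility >= len(FACILITIES):
        return None
    return FACILITIES[facility], SEVERITIES[severity], match.group(2)


def tally(lines):
    """Count parsed lines by (facility, severity, program)."""
    counts = Counter()
    for line in lines:
        key = classify(line)
        if key is not None:
            counts[key] += 1
    return counts
```

Calling tally(lines).most_common(10) would then list the ten noisiest combinations, which is a reasonable starting point for deciding what to filter.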

Remember I said you might just ignore INFO severities? If you do that you miss:

DATE:Jul 18 15:06:28
MSG:Failed password for invalid user rootf from port 53270 ssh2

ObFilm: From Dusk Til Dawn



Images transitioned, and mysql solicitations.

16 May 2011 21:50

Images Moved

So I've retired my old picture hosting sub-domain, and moved all the files which were hosted by the dynamic system into a large web-root.

This means no more uploads are possible, but each link continues to work. For example:

Happily the system generated "random" links, and it was just a matter of moving each uploaded file into that static location, then removing the CGI application.

The code for the new site has now been made public, although I suspect there will need to be some email-pong if anybody wishes to use it. Comments welcome.

MySQL replacements?

Let's pretend I work for a company which has dealings with many MySQL users.

Let's pretend that, even though it is true, so that I don't have to get into specifics.

Let us pretend that we have many many hundreds of users who are very happy with MySQL, but that we have a few users who have "issues". That might be:

  • mysqld segfaulting every few months, with no real idea why.
    • Transactions are involved. So are stored procedures.
    • MySQL paid support might have been perfect, or it might have led to "yup, it's a bug. good luck rebuilding with this patch. let us know how it turns out kthxbai."
    • Alternatively it might not have been reproducible.
  • Master-Master and Master-Slave setups being "unreliable" such that data inconsistencies arise despite MySQL regarding them as being in sync.
    • Good luck resolving that when you have two almost-identical "mysqldump" outputs which are 6GB each and which cause "diff" to exit with "out of memory" even on a 64GB host.
    • Is it possible to view differences in table-data, via the binary records? That'd be a fun project .. for a masochist.
  • Poor usage of resources.
  • Heavy concurrency caused by poorly developed applications in a load-balanced environment, leading to stalling queries. (WordPress + poor WordPress plugins, I'm looking at you; you're next on my killfile.)
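On the two-huge-dumps problem: one way to avoid running out of memory is to stream both files and compare them line by line, rather than asking diff to align them. The Python sketch below does only that. It is a positional comparison, not a real diff, so it assumes both dumps list rows in the same order, one row per line (something like mysqldump's --skip-extended-insert and --order-by-primary options should get you close). The function name first_differences is made up for the example.

```python
from itertools import zip_longest


def first_differences(path_a, path_b, limit=10):
    """Return up to `limit` (line number, line from a, line from b) mismatches.

    Both files are read as a stream, so only one line from each is held
    at a time. A side that has run out of lines is reported as None.
    """
    differences = []
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        pairs = zip_longest(file_a, file_b)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                differences.append((number, line_a, line_b))
                if len(differences) >= limit:
                    break
    return differences
```

The obvious weakness is that a single row present on only one side shifts everything after it, so every later line is reported as different. It will show you where two dumps first diverge, but not a tidy list of changed rows.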

To compound this problem some of these installations may or may not be running Etch. Let us pretend they are not, just to simplify things. (They mostly aren't these days, but I'm sure I could think of one or two if I tried.)

So, in this hypothetical situation what would you recommend?

I know there are new forks aplenty of MySQL. Drizzle et al. I suspect most of the forks will be short-lived - lots of this stuff is hard and non-sexy. I suspect the long-lived forks are probably concentrating on edge-cases we've not hit (yet), or on sexy exciting things like new storage engines and going nosql like all the cool kids.

Realistically, going down the PostgreSQL road is liable to lead to wholly different sets of problems, and a significant re-engineering of several sites, applications, and tools with no proof of stability.

Without wanting to jump ship entirely, what, if any, are our options?

PS. MySQL I still mostly love you, but my two most recent applications were written to use redis instead. Just a coincidence... I swear. No, put down that axe. Please can't we just talk about it?

ObQuote: "I've calculated your chance of survival, but I don't think you'll like it." - Hitchhikers Film.