
 

The plans you refer to will soon be back in our hands.

22 August 2009 21:50

Many of us use rsync to shuffle data around, either to maintain off-site backups, or to perform random tasks (e.g. uploading a static copy of your generated blog).

I use rsync in many ways myself, but the main thing I use it for is to copy backups across a number of hosts. (Either actual backups, or stores of Maildirs, or similar.)

Imagine you back up your MySQL database to a local system, and you keep five days of history in case of accidental error or deletion. Chances are that you'll have something like this:

/var/backups/mysql/0/
/var/backups/mysql/1/
/var/backups/mysql/2/
/var/backups/mysql/3/
/var/backups/mysql/4/

(Here it should be obvious that you back up to /var/backups/mysql/0/ after rotating the older contents: 3->4, 2->3, 1->2, then 0->1.)
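As a rough sketch of that rotation, assuming a POSIX shell: BACKUP_ROOT and the rotate function are names made up for illustration, and the script runs against a scratch directory rather than /var/backups/mysql.

```shell
#!/bin/sh
# BACKUP_ROOT is a hypothetical scratch directory standing in for
# /var/backups/mysql, so the sketch is safe to run anywhere.
BACKUP_ROOT=$(mktemp -d)

rotate() {
    root=$1
    # Drop the oldest generation, then shift 3->4, 2->3, 1->2, 0->1.
    rm -rf "$root/4"
    for n in 3 2 1 0; do
        if [ -d "$root/$n" ]; then
            mv "$root/$n" "$root/$((n + 1))"
        fi
    done
    # Fresh, empty slot for tonight's backup.
    mkdir -p "$root/0"
}

# Simulate two nights of backups.
rotate "$BACKUP_ROOT"
echo "monday dump" > "$BACKUP_ROOT/0/dump.sql"
rotate "$BACKUP_ROOT"
echo "tuesday dump" > "$BACKUP_ROOT/0/dump.sql"

# Monday's dump has moved from 0/ to 1/, so a remote mirror sees
# a changed file at every path even though the data is the same.
cat "$BACKUP_ROOT/1/dump.sql"
```

Every night each dump moves to a new path, which is exactly what makes the subsequent rsync expensive.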

Now consider what happens when you rsync to an off-site location after that rotation: you copy far more data than you need to, because every directory has different content each day.

To solve this I moved to storing my backups in directories such as this:

/var/backups/mysql/09-03-2009/
/var/backups/mysql/10-03-2009/
/var/backups/mysql/11-03-2009/
..

This probably simplifies the backup process a little too: just back up to $(date +%d-%m-%Y) after removing any directory older than four days.
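A minimal sketch of that step, assuming GNU find and GNU touch are available: BACKUP_ROOT and prune_and_prepare are hypothetical names, and directory age is judged by modification time rather than by parsing the directory name.

```shell
#!/bin/sh
# BACKUP_ROOT is a hypothetical scratch directory standing in for
# /var/backups/mysql.
BACKUP_ROOT=$(mktemp -d)

prune_and_prepare() {
    root=$1
    # -mtime +4 matches directories at least five full days old,
    # i.e. older than four days. -maxdepth 1 keeps find out of the
    # directories it is about to remove.
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +4 -exec rm -rf {} +
    # Create (or reuse) today's directory, named as in the post.
    today="$root/$(date +%d-%m-%Y)"
    mkdir -p "$today"
    echo "$today"
}

# Fake an old backup by back-dating its modification time (GNU touch).
mkdir "$BACKUP_ROOT/01-01-2009"
touch -d "10 days ago" "$BACKUP_ROOT/01-01-2009"

TODAY_DIR=$(prune_and_prepare "$BACKUP_ROOT")

# Only today's directory should remain.
ls "$BACKUP_ROOT"
```

The dump itself would then be written into "$TODAY_DIR"; directories from previous days are never touched again until they are pruned.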

Now imagine you rsync after that. The contents of previous days won't change at all, so you'll end up moving significantly less data around.

This is a deliberately contrived and simple example, but it also applies to common everyday logfiles such as /var/log/syslog, syslog.1, syslog.2.gz etc.

For example on my systems qpsmtpd.log is huge, and my apache access.log files are also very large.

Perhaps food for thought? One of those things that is obvious when you think about it, but doesn't jump out at you unless you schedule rsync to run very frequently and notice that it doesn't work as well as it "should".

ObFilm: Star Wars. The Family Guy version ;)


 

Comments on this entry

martin f. krafft at 10:57 on 22 August 2009
http://madduck.net

http://bugs.debian.org/536432 -- once that's fixed, the next step is to make it default. ;)

Anonymous at 16:57 on 22 August 2009

Another fun realization about rsync and backups:

If you encrypt the files you back up (say, with GPG and one or more public keys), you don't need to bother securing rsync; you can run anonymous rsync. That means you don't need an insecure "backup" SSH key that lacks a passphrase.

I use this to back up the SQL database for one of the sites I run. Dump GPG-encrypted backups to a directory shared via anonymous rsync.

Matt Simmons at 18:38 on 22 August 2009

Two thoughts

@Steve - You'll be much saner in the end if your date format is Ymd rather than d-m-y, because sorting works in the former case, but not in the latter. We've actually got a company-wide date format that we call "DATECHUNK".

All of our operations scripts include one "global" shell script that sets all the necessary environment variables, and one of those variables is
export DATECHUNK=`date +%Y%m%d`
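
To illustrate the sorting point, here is a small sketch using three made-up dates (9 March, 10 February and 11 January 2009) written in both formats. LC_ALL=C is set only to make the sort order deterministic.

```shell
#!/bin/sh
# Hypothetical sample dates: 9 March, 10 February, 11 January 2009.
dmy_sorted=$(printf '%s\n' 09-03-2009 10-02-2009 11-01-2009 | LC_ALL=C sort)
ymd_sorted=$(printf '%s\n' 20090309 20090210 20090111 | LC_ALL=C sort)

# d-m-Y sorts by day of month first, so March lands before January.
echo "d-m-Y order:"
echo "$dmy_sorted"

# Ymd sorts chronologically: January, February, March.
echo "Ymd order:"
echo "$ymd_sorted"
```

With Ymd names, a plain ls or sort lists the backups oldest first, which also makes "delete everything except the newest N" easy to script.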


@Anonymous - It's very possible to have cron-scripted rsyncs using SSH keys with passphrases - http://www.standalone-sysadmin.com/blog/2008/11/host-to-host-security-with-ssh-keys/

You've only got to remember to run ssh-agent when the machine boots, and to edit the string in the crontab. You can even have a Nagios check to see if ssh-agent is running, and alert if it isn't.




Markus Hochholdinger at 18:45 on 22 August 2009

echo '13a14,16
> # daily extension for log files
> dateext
> ' | patch /etc/logrotate.conf

Markus Hochholdinger at 18:51 on 22 August 2009

And another one for the backup, optimized rsync option: "--link-dest=$DIR". This saves you the "cp -al".

Steve Kemp at 20:00 on 22 August 2009

Thanks for pointing out dateext to me - I was already aware of it, but I guess I should have been explicit.

Allowing anonymous rsync rather than using an SSH tunnel is a great solution if you're sure of your endpoints, and you have appropriate ACLs (or rsync passwords in use), and is definitely worthy of a tip entry itself.

Finally, storing dates in YYYY-MM-DD is definitely a good thing in general. For my own needs I tend to only store 5-10 days' worth of archives so sort order isn't a huge deal, but I appreciate it is probably the best approach for the general case.


Wouter Verhelst at 23:00 on 22 August 2009

You may want to reduce rsync'ing data even further by first creating the link farm locally, then creating it remotely too, and only then starting rsync. That way, files that haven't changed since yesterday will be copied from yesterday's directory on the remote server, rather than you having to rsync them over for no particular reason.

And, also, rather than reinventing the wheel, you may be interested in rsnapshot, which does all this and more out of the box.

Anonymous at 02:56 on 23 August 2009

@Steve
"Allowing anonymous rsync rather than using an SSH tunnel is a great solution if you're sure of your endpoints, and you have appropriate ACLs (or rsync passwords in use), and is definitely worthy of a tip entry itself."

Why do you need an ACL or rsync password? It doesn't really matter, unless you think someone might try to DoS the system via rsync, and they can do that plenty of other ways.

@Matt: Yes, I realize you can set up SSH for backups, but anonymous rsync proves quite a bit simpler and more convenient.

Toby at 04:46 on 23 August 2009

Gosh, isn't it just time to use ZFS? (snapshots, replication).

Or at least NILFS2 (snapshots).

Steve Kemp at 14:32 on 23 August 2009

If the encryption is strong then it doesn't matter if people can steal it - so on that basis anonymous rsync is sufficient. But my own paranoia would ensure that I'd still want to have a list of valid/known-good IPs.

That would partly mitigate against DoS attacks, partly against any bugs/exploits in rsync, and partly just let me sleep at night.

@Toby: Even if zfs were available on all my systems it isn't a magic bullet. Having snapshots locally is great for fast rollbacks, but provides no protection against a failing disk - so you still need off-site copies. Which leads you back to rsync, etc.

er:k at 09:46 on 24 August 2009
http://blog.sietch-tabr.com

For MySQL I use a script named automysqlbackup (which you can find on SourceForge) which does what you describe: date-named directories for each daily/weekly/monthly backup, so rsync is faster.
I also use it with rdiff-backup, which optimises disk space using hardlinks. This way I only use rsync for the backup to a remote location.

Rodolphe Quiedeville at 13:47 on 24 August 2009

After trying lots of solutions for rotation I now use this little hack: back up into a folder named after the day of the week. That way you have seven days of history, only one directory changes per day, and there is no need to clean up old data either.

# Day-of-week directory, e.g. Monday; $HOST, $BDIR and $ODIR are set elsewhere
DATE_DIR=$(date +%A)
mkdir -p "$ODIR/$DATE_DIR"
rsync -qav -e ssh "$HOST:$BDIR" "$ODIR/$DATE_DIR"