Amazon S3? Not yet ready for me!
March 15th, 2009
I’ve been thinking for a while about how to keep proper back-ups of all of my data while, at the same time, saving a few bucks. Since the “cloud computing” term is now floating all over the Internet, I thought that a distributed, remote back-up service might do the job for me.
I looked around and found quite a few different services, but most of them offer ridiculously small storage sizes, like 5GB, or force me into using sub-par Web-based user interfaces that make using rsync complicated or unfeasible. I’m looking for services that offer 2TB+ of storage and, so far, the only solution that I find promising is Amazon S3. The problem is price. Keeping 2,048GB of data stored in Amazon S3 would cost me about $300 USD per month, plus a one-time cost for uploading the data. For the price of a whole year, I could buy a QNAP TS-809 filled with 8 x 1.5TB disks.
So, unfortunately for me, multi-terabyte back-up copies to the Internet are still too expensive. Perhaps, in 5 years, technology will drive prices down so that I can afford to keep my back-ups on the Internet.
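To make the arithmetic concrete, here is a rough sketch of the calculation. The two per-GB rates are assumptions chosen for illustration (a flat 15 cents per GB per month for storage and 10 cents per GB for the upload); they are not quoted in the post, and real pricing is tiered and changes over time. Amounts are kept in cents so that plain integer shell arithmetic is enough.

```shell
#!/bin/sh
# Rough cost estimate for keeping 2,048GB in a pay-per-GB storage service.
# Both rates below are ASSUMED for illustration only.
size_gb=2048
storage_cents_per_gb=15   # assumed: cents per GB per month
upload_cents_per_gb=10    # assumed: cents per GB, one-time upload

# Format an integer amount of cents as dollars with two decimals.
cents_to_dollars() {
    printf '%d.%02d' $(($1 / 100)) $(($1 % 100))
}

monthly_cents=$((size_gb * storage_cents_per_gb))
yearly_cents=$((monthly_cents * 12))
upload_cents=$((size_gb * upload_cents_per_gb))

printf 'storage per month: $%s\n' "$(cents_to_dollars "$monthly_cents")"
printf 'storage per year:  $%s\n' "$(cents_to_dollars "$yearly_cents")"
printf 'one-time upload:   $%s\n' "$(cents_to_dollars "$upload_cents")"
```

Under these assumed rates the monthly figure comes out at $307.20, in line with the "about $300" quoted above, and a year of storage comes to $3686.40 before the upload cost.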
Incremental backups with rsync
September 9th, 2005
I have been thinking for a while about implementing incremental, cyclical backups on my home network. The problem with cyclical backups to tape is that they are slow. The problem with cyclical backups to disk is that they consume a great deal of space. I finally opted for cyclical backups to disk since my DDS-3 SCSI tape drive is slow and can’t hold the many gigabytes of data I have, even with hardware/software compression.
I want to periodically branch my main backup tree so that I can keep several backups, ordered from the newest (backup.0) to the oldest (backup.n), where “n” could be the number of days or weeks, depending on the frequency of the backups.
The filesystem should look like this:
|- backup.0
|- backup.1
|- backup.2
.
.
.
\- backup.n
A simple way to reduce disk space usage is by using a UNIX filesystem feature called hard links. The idea is that if a file’s contents do not change between backups, we can save space by having all the identical copies hard-linked together.
Using rsync and cp we can implement this very easily, thanks to the way that rsync works. By default, when not using the --inplace command-line switch, if rsync detects that a destination file is different from its source file, instead of performing direct modifications onto the destination file by opening it, writing to it, then closing it, rsync will create a new file and then move it into place. This has several advantages:
- Users can keep on working with files, even when rsync is synching them underneath. Since rsync always creates a new file instead of performing modifications to the current file, users won’t suffer from the strangeness that involves multiple updates to the same file by multiple users/processes.
- Since rsync creates a new file, when the original destination file is hard-linked across several backup branches, the synching process won’t indirectly sync up those backup branches too. Instead, they will be kept intact, and a new destination file, mirroring its source file, will be created.
We don’t want an update to a file in the backup.0 branch to also update every file hard-linked to it, since that would destroy the incremental semantics.
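The effect can be reproduced without rsync at all. The following is a minimal sketch, not rsync’s actual code: it hard-links one file into two branches, then emulates the default update by writing a new file and renaming it over the old name. The helper inode_of, the temporary directory and the file names are assumptions made up for the demonstration.

```shell
#!/bin/sh
# Print the inode number of a file (first field of `ls -i`).
inode_of() {
    ls -i "$1" | awk '{print $1}'
}

work=$(mktemp -d)
mkdir "$work/backup.0" "$work/backup.1"

# One file, visible in both branches through a hard link.
echo "version 1" > "$work/backup.0/notes.txt"
ln "$work/backup.0/notes.txt" "$work/backup.1/notes.txt"

if [ "$(inode_of "$work/backup.0/notes.txt")" = "$(inode_of "$work/backup.1/notes.txt")" ]; then
    echo "before update: both branches share one inode"
fi

# Emulate the default (non --inplace) update: write a NEW file,
# then rename it over the old name inside backup.0.
echo "version 2" > "$work/backup.0/.notes.txt.tmp"
mv "$work/backup.0/.notes.txt.tmp" "$work/backup.0/notes.txt"

cat "$work/backup.0/notes.txt"   # prints: version 2
cat "$work/backup.1/notes.txt"   # prints: version 1

# Remove the scratch directory when done: rm -rf "$work"
```

The rename only replaces the directory entry in backup.0, so the older branch keeps pointing at the old inode and its contents stay untouched.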
Thus, we can implement a really simple cyclical backup scheme using rsync and hard links.
- Things to run on the server.
We run this periodically:
# rm -fr backup.${n}
# for i in `seq ${n} -1 2`; do mv backup.$((i-1)) backup.${i}; done
# cp -al backup.0 backup.1
This will rotate all the backups, discarding the oldest one (backup.n). Then, the cp command will replicate the main branch (backup.0) into backup.1 by using hard links.
NOTE for FreeBSD users: the cp command that comes with the FreeBSD base system supports neither the -a nor the -l command-line switch. -a means -dpR (recursively copy and preserve attributes), while -l means not to copy, but to create hard links instead.
Fortunately, the FreeBSD ports collection includes a port of the GNU coreutils package, which sports the full GNU cp program, supporting the -a and -l switches:
# cd /usr/ports/sysutils/coreutils
# make all install
To avoid a name clash between the cp command from the FreeBSD base system and the GNU one, the GNU cp command is installed as gcp. So, in the commands listed before, we should replace the invocation of cp with gcp.
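The rotation and the cp/gcp choice can be folded into one small function. This is only a sketch under stated assumptions: the function name rotate_backups, the CP environment variable and the existence checks are additions made for illustration and are not part of the original commands. It assumes a cp that understands -a and -l (GNU cp, or gcp on FreeBSD).

```shell
#!/bin/sh
# rotate_backups DIR N
# Rotates DIR/backup.0 .. DIR/backup.N, discarding the oldest branch and
# re-creating backup.1 as a hard-linked copy of backup.0.
# Set CP=gcp on systems where GNU cp is installed under that name.
rotate_backups() {
    dir=$1
    n=$2
    cp_cmd=${CP:-cp}

    # Refuse n < 1, otherwise we would delete backup.0 itself.
    if [ "$n" -lt 1 ]; then
        echo "rotate_backups: N must be at least 1" >&2
        return 1
    fi

    # Drop the oldest branch.
    rm -rf "$dir/backup.$n"

    # Shift backup.(i-1) to backup.i, from the oldest to the newest.
    i=$n
    while [ "$i" -ge 2 ]; do
        prev=$((i - 1))
        if [ -d "$dir/backup.$prev" ]; then
            mv "$dir/backup.$prev" "$dir/backup.$i"
        fi
        i=$prev
    done

    # Branch the main tree using hard links instead of real copies.
    if [ -d "$dir/backup.0" ]; then
        "$cp_cmd" -al "$dir/backup.0" "$dir/backup.1"
    fi
}

# Example (hypothetical path): keep 7 branches under /data
#   CP=gcp rotate_backups /data 7
```

Skipping branches that do not exist yet lets the same function run from the very first backup, before all n branches have been created.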
- Things to run on the client.
To perform the incremental backup against the server, we can run the following command:
# rsync -a -E Users/ rsync://:/data/backup.0/
It’s very important to keep the timestamps synchronized on both the client and the server so rsync can use them to decide which files have changed and which have not. This is done with the -t command-line switch. Note that the -a (archive) command-line switch to rsync is like specifying -rlptgoD, and thus we don’t have to specify -t.
The -E command-line switch is useful for Mac OS X-based machines and allows synching files that use resource forks, stored in an HFS+ volume, by using the AppleDouble format.
Adding extended attributes support to rsync
September 9th, 2005
The rsync software that comes with Mac OS X 10.4, and newer releases, supports extended attributes (HFS+ resource forks). This means it can sync files from a local HFS+ filesystem to a remote volume which does not support HFS+ resource forks by using AppleDouble encoding.
The AppleDouble encoding splits a native HFS+ file in two parts:
- The data fork, which is the one that holds the real contents of the file, like the contents of a document, the pixels from a bitmap, and so on. It receives the same name as the original HFS+ file.
- The resource fork, which is built from the data held in the resource forks and the Finder data, like the Spotlight comments and so on. It receives a filename which consists of a dot and an underscore prepended to the original HFS+ filename.
Thus, an HFS+ file named MyDocument.webloc, when stored in the AppleDouble format, is split into two files: MyDocument.webloc and ._MyDocument.webloc.
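The naming rule is simple enough to express in a few lines of shell. The function below is a hypothetical helper written for illustration (appledouble_name is not a real tool); it only computes the name of the companion file and does not read or write any resource fork.

```shell
#!/bin/sh
# Print the AppleDouble companion name for a given file path:
# the same directory, with "._" prepended to the file name.
appledouble_name() {
    case $1 in
        */*) printf '%s/._%s\n' "${1%/*}" "${1##*/}" ;;
        *)   printf '._%s\n' "$1" ;;
    esac
}

appledouble_name MyDocument.webloc            # prints: ._MyDocument.webloc
appledouble_name Users/me/MyDocument.webloc   # prints: Users/me/._MyDocument.webloc
```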
By default, the Mac OS X rsync implementation does not enable extended attributes support. It must be explicitly enabled by supplying the -E command-line switch to the rsync command. The problem is, however, that few rsync implementations (I don’t know of any besides the one in Apple’s Mac OS X 10.4) support either this functionality or the command-line switch that activates it.
The solution turned out to be pretty easy. I downloaded the Darwin source code for rsync-20 from the Darwin Projects Directory and extracted the file patches/EA.diff from it. This patch file includes the extended attributes (HFS+ resource forks) functionality I was seeking. The patch, at the time of this writing, is against rsync-2.6.3.
Thus, I only had to grab the sources for rsync-2.6.3, which are also included inside the rsync-20.tar.gz file I had downloaded, extract them, then patch, configure, build and install:
# tar zxvf rsync-20.tar.gz
# cd rsync-20
# tar zxvf rsync-2.6.3.tar.gz
# cd rsync-2.6.3
# patch < ../patches/EA.diff
# CFLAGS="-O -pipe" \
  ./configure --prefix=/usr/local \
  --disable-debug --enable-ea-support \
  --with-rsyncd-conf=/usr/local/etc/rsyncd.conf
# make
# make install
I ran the previous commands on FreeBSD 7.0-CURRENT, hence the /usr/local prefix. Also, note the --enable-ea-support command-line switch supplied to configure. It is required in order to build in the extended attributes support. Leaving it out will produce a normal, EA-disabled rsync.
# rsync --help | grep -- -E
 -E, --extended-attributes copy extended attributes
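The same check can be scripted, for instance before a backup job relies on -E. This is a small sketch, assuming the patched build advertises the switch in its help text exactly as shown above; the function name has_ea_switch is made up for illustration.

```shell
#!/bin/sh
# Succeeds if the help text read from standard input advertises
# the --extended-attributes switch.
has_ea_switch() {
    grep -q -e '--extended-attributes'
}

# Usage (assumes rsync is on the PATH):
#   if rsync --help | has_ea_switch; then
#       echo "this rsync was built with EA support"
#   fi
```

Reading the help text from standard input keeps the function independent of which rsync binary is being inspected.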
That’s all, folks.