tar

From Peyton Hall Documentation


Revision as of 18:04, 16 May 2007


Here's some information about tarballs that people unfamiliar with Unix might find handy.

First, we'll start with some basic information. 'tar', which stands for "Tape ARchive", is a program originally written to store information on a tape. Tape drives tend to write files one after the other, so to back up a directory you would have to back up each file separately. A directory of thousands of files would be tedious to write, both for the user and for the drive! So tar takes all the files and writes them out as one large file. It does no compression on the files; it just packs them all together into one big file. This is handy for tape drives, but is also good for things like emailing files as attachments, or for FTP or HTTP transmission. Because tar files, commonly called 'tarballs', are uncompressed, they are usually compressed separately with 'gzip' or 'compress' (and now people are also using 'bzip2' to compress them). This gives a tarball its extension: '.tar.gz', '.tar.Z' or '.tar.bz2' respectively. If the file spent any time on a Windows machine, or might do so, the extension '.tgz' (short for '.tar.gz') is also common.

Before we talk about making and using tarballs, here are some common 'rules', though they are not always strictly adhered to. A tarball's name, say 'foo.tar.gz', generally tells you that when you unpack it, a directory named 'foo' will be created in the current directory, and all the files in the tarball will go into that directory. This is a somewhat standard approach, though not everyone (and everything) follows it. It's just for neatness, so that you can have a directory full of tarballs, and know that if you untar each one you'll have a directory for each tarball as well. So a tarball named 'ssh-1.2.32.tar.gz' will generally (but not always) create a directory named 'ssh-1.2.32', containing all the files that were in it when the tarball was created.

To create a tarball, you use the 'c' option to tar, like so:

tar cf foo.tar foo

This will take the contents of the directory 'foo', and put them all (with the directory itself) into a file named foo.tar. The 'f' option is to tell tar what device to write the tarball to, and could just as easily be '/dev/tape' to write it out to a tape drive.
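
To try this out without touching real data, here's a small sketch that builds a throwaway 'foo' and tars it up. The directory contents and file names are made up for illustration, and the 't' option (list the contents of a tarball) isn't covered above but is a handy way to check what you just created:

```shell
# Build a scratch directory named foo (hypothetical contents).
work=$(mktemp -d)
cd "$work" || exit 1
mkdir foo
echo "hello" > foo/a.txt
echo "world" > foo/b.txt

# c = create, f = write the tarball to the named file.
tar cf foo.tar foo

# t = list the members of the tarball without extracting anything.
listing=$(tar tf foo.tar)
echo "$listing"
```

The listing should show the directory itself along with the files inside it, which is the 'foo/' prefix the naming rule above relies on.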

Now remember what I said about compression; this tarfile will not be compressed, and may be quite large depending on the contents of the directory 'foo'. So most people will use stdout as the device to write the tarfile to, and pipe the output to gzip (sometimes with the -9 option for maximum compression, if you're not concerned with the amount of time it may take to compress):

tar cf - foo | gzip -9 > foo.tar.gz

This does the same as above, but compresses the file as well. You could just as easily run the first command ('tar cf foo.tar foo') and then type 'gzip -9 foo.tar', but think about the numbers. If the directory 'foo' contains 1MB of files, 'foo.tar' will be about 1MB in size. So the first tar command doubles the amount of space used by these files (1MB for the original, 1MB for the tarball). Running gzip on the tarball then takes up a bit more space, since gzip will not remove the uncompressed file until it has finished compressing it. So if the file is 500KB compressed, the total space used peaks at 2.5MB, and then shrinks down to 1.5MB (original files plus the compressed tarball). If instead you do everything in a pipe, you only ever use the 1.5MB, since nothing is written to disk until it has been compressed (gzip writes the compressed output to disk as it goes, but it will never be larger than the finished file).

GNU Tar also has a compression option, 'z', to automatically pipe the output of tar through gzip, though it uses the standard level of compression. So:

tar zcf foo.tar.gz foo

...will do the same as above, though without the -9 compression option to gzip.
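
If you want to see the numbers for yourself, the following sketch makes an uncompressed tarball, a piped 'gzip -9' tarball and a 'z' tarball from the same directory and prints their sizes. The sample data and the file names 'foo-pipe.tar.gz' and 'foo-z.tar.gz' are invented for the example; real savings depend entirely on what's in your files:

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir foo
# Repetitive text compresses well, which makes the difference obvious.
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i of some fairly repetitive sample text" >> foo/data.txt
    i=$((i + 1))
done

tar cf foo.tar foo                         # uncompressed, for comparison
tar cf - foo | gzip -9 > foo-pipe.tar.gz   # pipe through gzip by hand
tar zcf foo-z.tar.gz foo                   # built-in 'z' option

# Byte count of each tarball.
plain_size=$(wc -c < foo.tar)
pipe_size=$(wc -c < foo-pipe.tar.gz)
z_size=$(wc -c < foo-z.tar.gz)
echo "foo.tar: $plain_size  pipe: $pipe_size  z: $z_size"

# Both compressed tarballs should hold the same members.
pipe_list=$(gzip -dc foo-pipe.tar.gz | tar tf -)
z_list=$(tar ztf foo-z.tar.gz)
```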

Now, how to uncompress a tarball? Well, just as with making them, there are a few ways. First, the standard:

gunzip foo.tar.gz
tar xf foo.tar

This method suffers the same drawback as creating a tarball in two steps above, using more space than is necessary. So you can use 'zcat' instead:

zcat foo.tar.gz | tar xf -

This is much like the second method above, using a pipe to keep down the amount of disk space needed to do all the operations. Again, GNU Tar has the 'z' option built-in to uncompress, like so:

tar zxf foo.tar.gz
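
As a quick sanity check, this sketch unpacks the same tarball with the pipe method and with the 'z' option, then compares both results against the original directory. The directory names 'out-pipe' and 'out-z' are made up, and I've used 'gzip -dc' in place of 'zcat' since it does the same job and behaves the same on more systems:

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p foo/sub
echo "hello" > foo/a.txt
echo "nested" > foo/sub/b.txt
tar cf - foo | gzip > foo.tar.gz

# Method 1: decompress to stdout and let tar read the tarball from stdin.
mkdir out-pipe
gzip -dc foo.tar.gz | (cd out-pipe && tar xf -)

# Method 2: the built-in 'z' option.
mkdir out-z
(cd out-z && tar zxf ../foo.tar.gz)

# diff -r prints nothing and exits 0 when the trees match the original.
diff -r foo out-pipe/foo && diff -r foo out-z/foo && roundtrip=ok
echo "round trip: ${roundtrip:-failed}"
```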

Another handy option when creating or expanding tarballs is the 'v' option, which lists each file as it is archived or expanded. When the tarball itself is being written to stdout, the list goes to stderr, so you can still use a pipe for compression, as in:

tar cvf - foo | gzip > foo.tar.gz
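
To convince yourself that the file list and the tarball really do travel separately, here's a sketch that redirects stderr into a log file while the tarball goes down the pipe. The names 'names.log' and 'members.log' are just placeholders for the example:

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir foo
echo "hello" > foo/a.txt

# The tarball goes down the pipe on stdout; the 'v' listing goes to
# stderr, which is captured in names.log instead of the terminal.
tar cvf - foo 2> names.log | gzip > foo.tar.gz

cat names.log

# The compressed tarball is intact despite the verbose output.
gzip -dc foo.tar.gz | tar tf - > members.log
```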

Lastly, since you now know you can do tars over pipelines, you can do a copy of files using two tars:

tar cvf - foo | ssh remotemachine 'cd /destination ; tar xf -'

Note that you only want to use the 'v' option on one side or the other of the pipe, or it gets messy. I tend to put the option on whichever side is closest to my machine, i.e. on the nearer side of the ssh:

ssh remotemachine 'cd /source ; tar cf - foo' | tar xvf -

This also cuts down on bandwidth and makes the copy go faster, since only the tar data is going down the pipe, not the stderr output of the filenames.
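
If you don't have a second machine handy, the same two-tar trick can be tried locally by putting a subshell where the ssh would be. This is only a sketch; the 'source' and 'destination' directories are stand-ins for /source and /destination above:

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir -p source/foo destination
echo "hello" > source/foo/a.txt
echo "hidden" > source/foo/.config

# Left side packs foo to stdout; right side changes directory and unpacks
# from stdin. Swapping a subshell for ssh gives the remote version.
(cd source && tar cf - foo) | (cd destination && tar xvf -)

# diff -r exits 0 when the copy matches the original.
diff -r source/foo destination/foo && copy=ok
echo "copy: ${copy:-failed}"
```

Because the directory itself is tarred rather than its contents, the dotfile comes along too.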

Three things to be careful about:

  1. Leading / on paths. If you try to do a tar of '/u/user/foo', you'll see the message "tar: Removing leading `/' from member names". This means that when you untar the files, instead of going to '/u/user/foo' they will go to './u/user/foo', starting at the current directory and creating new directories as needed. This works fine if you first cd to /, but if you're already in /u/user you will end up with '/u/user/u/user/foo'. There is an option to leave the leading / on the tarball ('P' in GNU Tar), however you should not do this unless you *really* know what you're doing. GNU Tar also requires the same option to expand a tarball and keep the leading '/' intact, for the same reason (imagine getting a tarball from someone which overwrites /bin/sh, etc).
  2. File structures. As I said above, a general rule is to include the directory in the tarball, so that a backup of all the files in 'foo' starts with 'foo/' followed by the files. If for some reason you just want the files, running the command 'tar zcvf foo.tar.gz foo/*' will do this, *but* if there were any dotfiles in that directory (files that begin with a '.' and are usually hidden in a normal 'ls' command) they will *not* be included, because the shell's '*' does not match them! Chances are, if you want the dotfiles as well, you want to tar the directory, not the files IN the directory.
  3. Just because a file ends with the extension '.tar' doesn't mean it's an uncompressed tarball. Sometimes browsers will strip the trailing '.gz' or '.bz2' extension from a file, and leave you with a compressed tarball ending in '.tar'. When in doubt, run 'file' on the file first to see if it's compressed, or else scratch your head when you see the error:
tar: This does not look like a tar archive
tar: Skipping to next header
tar: 108 garbage bytes ignored at end of archive
tar: Error exit delayed from previous errors
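
The second and third points are easy to reproduce. The sketch below tars a directory both ways to show the missing dotfile, then makes a compressed tarball with a misleading '.tar' name and tests it. All file names are invented for the example, and I've used 'gzip -t' as the check so the example depends only on gzip; running 'file misnamed.tar' as suggested above should tell you the same thing:

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir foo
echo "visible" > foo/a.txt
echo "hidden" > foo/.hidden

# Point 2: the shell expands foo/* before tar runs, and * skips dotfiles.
tar cf star.tar foo/*
tar cf dir.tar foo
star_list=$(tar tf star.tar)
dir_list=$(tar tf dir.tar)
echo "foo/* tarball:"; echo "$star_list"
echo "foo tarball:";   echo "$dir_list"

# Point 3: a gzip-compressed tarball that lost its .gz extension.
tar cf - foo | gzip > misnamed.tar
# gzip -t exits 0 only if its input really is gzip data.
if gzip -t < misnamed.tar 2>/dev/null; then
    kind=gzip
else
    kind=plain
fi
echo "misnamed.tar looks like: $kind"
```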

A bit of tar humor: Stanford University has an FTP repository, full of tarballs. The machine's name is 'labrea'.
