tar

From Peyton Hall Documentation

'tar' (short for Tape ARchive) is a Unix program used to stream files together into a single archive, which may be written to a device such as a tape drive or to an ordinary file for easy transport from one computer to another.


Introduction

Tape drives tend to write files one after the other, so to back up a directory you would have to back up each file separately. A directory of thousands of files would be tedious to write, both for the user and for the drive! So tar takes all the files and writes them out as one large file. It does no compression on the files; it just smashes them all together into one big file. This is handy for tape drives, but is also good for things like emailing smaller files as attachments, or for FTP or HTTP transmission of larger files. Because tar files, commonly called 'tarballs', are uncompressed, they are usually compressed afterwards with 'gzip' or 'compress' (and now people are also using 'bzip2'). This gives a tarball its extension: '.tar.gz', '.tar.Z' or '.tar.bz2' respectively. If the file spent any time on a Windows machine, or might do so, the extension '.tgz' is also common.


Guidelines for creating tarballs

Before we talk about making and using tarballs, here are some common 'rules', though they are not strictly adhered to. A tarball's name, say 'foo.tar.gz', generally tells you that when you unpack it, a directory named 'foo' will be created in the current directory, and all the files in the tarball will go into that directory. This is a somewhat standard approach, though not everyone (and everything) follows it. It's just for neatness, so that you can have a directory full of tarballs and know that if you untar each one you'll have a directory for each tarball as well. So a tarball named 'ssh-1.2.32.tar.gz' will generally (but not always) create a directory named 'ssh-1.2.32', containing all the files that were in it when the tarball was created.
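
One way to check whether a tarball follows this convention before unpacking it is to list its members with tar's 't' option (a standard option that this page does not otherwise cover) and look at the first path component. This is only a sketch: the scratch directory and the throwaway 'foo.tar.gz' it builds are assumptions made for the demonstration.

```shell
# Build a small demonstration tarball in a scratch directory.
work=$(mktemp -d)
cd "$work"
mkdir foo
echo hello > foo/README
tar cf - foo | gzip > foo.tar.gz

# List the archive without extracting it, and reduce every member name
# to its first path component. A tidy tarball yields a single name.
top=$(gunzip -c foo.tar.gz | tar tf - | cut -d/ -f1 | sort -u)
echo "top-level entries: $top"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$work"
```

If more than one name comes back, the tarball will scatter files into the current directory, so it is safer to unpack it inside a fresh, empty directory.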


Using 'tar'

The basics

To create a tarball, you use the 'c' option to tar, such as 'tar cf foo.tar foo'. This will take the contents of the directory 'foo', and put them all (with the directory itself) into a file named foo.tar. The 'f' option is to tell tar what device to write the tarball to, and could just as easily be '/dev/tape' to write it out to a tape drive.
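
As a rough illustration of the 'c' and 'f' options, the sketch below builds a tarball from a made-up directory and counts what went into it. The file names are invented for the example, and the 't' (list) option used to inspect the result is a standard tar option not discussed on this page.

```shell
work=$(mktemp -d)
cd "$work"
mkdir foo
echo alpha > foo/a.txt
echo beta > foo/b.txt

# c = create, f = write the archive to the named file (or device).
tar cf foo.tar foo

# The directory entry itself is archived along with its two files.
members=$(tar tf foo.tar | wc -l)
echo "foo.tar holds $members members"

# tar only reads the originals; they are still in place afterwards.
ls foo

cd "$OLDPWD" && rm -rf "$work"
```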

Remember tarfiles are not compressed by default, and may be quite large depending on the contents of the directory 'foo'. So most people will use stdout as the device to write the tarfile to, and pipe the output to gzip (sometimes with the -9 option for maximum compression, if you're not concerned with the amount of time it may take to compress): 'tar cf - foo | gzip -9 > foo.tar.gz'. This does the same as before, but compresses the output file as well. You could just as easily type 'gzip -9 foo.tar' after running the tar command, but think about the numbers:

  • If the directory 'foo' contains 1MB of files, 'foo.tar' will be about 1MB in size.
  • Therefore, the first tar command would double the amount of space used by these files (1MB for the original, 1MB for the tarball).
  • Running gzip on the tar file then temporarily uses even more space, since gzip will not remove the original file until the compressed copy is complete.
  • If the file is 500KB compressed, the total space used is 2.5MB for everything, which then shrinks down to 1.5MB (original files plus the compressed tarball).
  • If instead you do everything in a pipe, you only use the 1.5MB, since nothing is written to disk until compressed (gzip will start writing the compressed output to disk, but it will not be any larger than the finished file).
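
The comparison above can be tried out with something like the following sketch. It assumes a scratch directory and fills 'foo' with 1MB of zeros, which compress far better than real files would, so the compressed sizes it prints will not match the 500KB figure used in the list.

```shell
work=$(mktemp -d)
cd "$work"
mkdir foo
# 1MB of sample data (1024 blocks of 1024 bytes).
dd if=/dev/zero of=foo/data bs=1024 count=1024 2>/dev/null

# Two steps: foo.tar sits on disk at full size until gzip has finished.
tar cf foo.tar foo
tar_bytes=$(wc -c < foo.tar)
gzip -9 foo.tar                      # replaces foo.tar with foo.tar.gz
mv foo.tar.gz two-step.tar.gz

# One pipeline: the uncompressed archive is never written to disk.
tar cf - foo | gzip -9 > piped.tar.gz

echo "uncompressed tar: $tar_bytes bytes"
echo "two-step result:  $(wc -c < two-step.tar.gz) bytes"
echo "piped result:     $(wc -c < piped.tar.gz) bytes"

# Either way, the finished tarballs should list the same members.
a=$(gunzip -c two-step.tar.gz | tar tf -)
b=$(gunzip -c piped.tar.gz | tar tf -)
[ "$a" = "$b" ] && echo "both tarballs list the same members"

cd "$OLDPWD" && rm -rf "$work"
```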


Compression

GNU Tar also has a compression option, 'z', to automatically pipe the output of tar through gzip, using the standard level of compression: 'tar zcf foo.tar.gz foo' will do the same as above, though without the -9 compression option to gzip.

NOTE:

You should *always* keep tarballs compressed unless you're extracting the data from them, especially if they reside in your home directory. Uncompressed tar files, usually accompanied by their contents in the same directory, take up twice as much space as the files themselves. It has been known to happen, on occasion, that someone will go through and 'bzip2 -9' any uncompressed tarballs left lying around, as well as sending nasty notes about them.


Uncompressing

As with making tarballs, there are a few different ways to uncompress them. First, the standard 'gunzip foo.tar.gz ; tar xf foo.tar', which suffers the same drawback as creating a tarball in two steps above: it uses more space than is necessary. You can instead use 'zcat foo.tar.gz | tar xf -', which is much like the second method above, using a pipe to keep down the amount of disk space needed for the whole operation. Again, GNU Tar has the 'z' option built in to uncompress, so 'tar zxf foo.tar.gz' will put things into one command.
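
The three approaches can be compared side by side with a sketch like the one below. It assumes a tar that supports the 'z' option (GNU Tar does) and uses 'gunzip -c' in place of 'zcat', since on some systems zcat only handles '.Z' files. The directory names 'two-step', 'piped' and 'builtin' are invented for the example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir foo
echo "some data" > foo/data.txt
tar zcf foo.tar.gz foo

# Method 1: two steps; foo.tar briefly exists uncompressed on disk.
mkdir two-step
cp foo.tar.gz two-step/
(cd two-step && gunzip foo.tar.gz && tar xf foo.tar)

# Method 2: a pipe; nothing uncompressed is written except the files.
mkdir piped
(cd piped && gunzip -c ../foo.tar.gz | tar xf -)

# Method 3: let tar run the decompression itself.
mkdir builtin
(cd builtin && tar zxf ../foo.tar.gz)

# All three should have produced identical copies of the file.
cmp foo/data.txt two-step/foo/data.txt &&
cmp foo/data.txt piped/foo/data.txt &&
cmp foo/data.txt builtin/foo/data.txt &&
result=identical
echo "result: ${result:-different}"

cd "$OLDPWD" && rm -rf "$work"
```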


Other options

A handy option when creating or expanding tarballs is the 'v' option, which lists each file as it is processed. When the tarball itself is being written to stdout, the list goes to stderr instead, so you can still use a pipe for compression: 'tar cvf - foo | gzip > foo.tar.gz'
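
To see that the file list and the archive really do travel on separate streams, the sketch below redirects stderr into a log file while the archive goes down the pipe. The log name 'names.log' is made up for the example.

```shell
work=$(mktemp -d)
cd "$work"
mkdir foo
echo one > foo/one.txt

# The archive goes down the pipe on stdout; the 'v' listing is captured
# separately from stderr.
tar cvf - foo 2> names.log | gzip > foo.tar.gz

listed=$(cat names.log)
echo "names reported by tar:"
echo "$listed"

# If the listing had leaked into the pipe, the archive would be corrupt.
gzip -t foo.tar.gz && intact=yes
echo "archive intact: ${intact:-no}"

cd "$OLDPWD" && rm -rf "$work"
```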


Tricks

Since you now know you can run tar over pipelines, you can copy files using two tars: tar cvf - foo | ssh remotemachine 'cd /destination ; tar xf -'. Note that you only want to use the 'v' option on one side of the pipe or the other, or it gets messy; I tend to put the option on whichever side is closest to my machine, i.e. on the nearer side of the ssh: ssh remotemachine 'cd /source ; tar cf - foo' | tar xvf -. This also cuts down on bandwidth and makes the copy go faster, since only the tar data travels over the connection, not the list of filenames.
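
The same two-tar copy can be tried without a remote machine. In this sketch both ends of the pipe run locally, a subshell stands in for the ssh command, and the 'source' and 'destination' directories are scratch paths invented for the example. It also uses '&&' rather than ';' after each cd, a small precaution so that tar never runs in the wrong directory if the cd fails.

```shell
work=$(mktemp -d)
mkdir -p "$work/source/foo/sub" "$work/destination"
echo payload > "$work/source/foo/sub/file.txt"

# Left side: archive 'foo' to stdout.
# Right side: unpack from stdin inside the destination directory,
# listing each file with 'v' on this side only.
(cd "$work/source" && tar cf - foo) | (cd "$work/destination" && tar xvf -)

copied=$(cat "$work/destination/foo/sub/file.txt")
echo "copied content: $copied"

rm -rf "$work"
```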


Gotchas

  • Leading / on paths.
    If you try to tar up '/u/user/foo', you'll see the message "tar: Removing leading `/' from member names". This is a warning rather than an error: it means that when you untar the files, instead of going to '/u/user/foo' they will go to 'u/user/foo' relative to the current directory, with new directories created as needed. This works fine if you first cd to /, but if you're already in /u/user you will end up with '/u/user/u/user/foo'. There is an option to leave the leading / in the tarball ('P' in GNU Tar), however you should not use it unless you *really* know what you're doing. The same option is required to expand a tarball and keep the leading '/' intact, for the same reason (imagine getting a tarball from someone which overwrites /bin/sh, etc).
  • File structures.
    As stated above, a general rule is to include the directory in the tarball, so that a backup of all the files in 'foo' starts with 'foo/' followed by the files. If for some reason you just want the files, running the command 'tar zcvf foo.tar.gz foo/*' will do this, *but* any dotfiles in that directory (files that begin with a '.' and are usually hidden from a normal 'ls') will *not* be included, because the shell's '*' does not match them. Chances are, if you want the dotfiles as well, you want to tar the directory, not the files IN the directory.
  • Just because a file ends with the extension '.tar' doesn't mean it's an uncompressed tarball.
    Sometimes browsers will strip the trailing '.gz' or '.bz2' extension from a file, and leave you with a compressed tarball ending in '.tar'. When in doubt, run 'file' on the file first to see if it's compressed, or else scratch your head when you see the error:
tar: This does not look like a tar archive
tar: Skipping to next header
tar: 108 garbage bytes ignored at end of archive
tar: Error exit delayed from previous errors
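
All three gotchas can be reproduced safely in a scratch directory with a sketch along these lines. The file names are invented, 'file' is only run if it happens to be installed, and 'gzip -t' is used as a fallback test for compressed data. The dotfile check assumes a shell with default globbing, where '*' does not match names starting with '.'.

```shell
work=$(mktemp -d)
cd "$work"
mkdir foo
echo visible > foo/visible.txt
echo hidden > foo/.hidden

# Gotcha 1: an absolute path is stored without its leading '/'.
tar cf abs.tar "$work/foo" 2> warning.log
cat warning.log                                # tar's notice about the '/'
absolute=$(tar tf abs.tar | grep -c '^/')      # names still starting with '/'

# Gotcha 2: the shell expands foo/* and skips dotfiles, so tar never
# hears about them; archiving the directory picks them up.
tar cf glob.tar foo/*
tar cf dir.tar foo
in_glob=$(tar tf glob.tar | grep -c '\.hidden')
in_dir=$(tar tf dir.tar | grep -c '\.hidden')

# Gotcha 3: a gzip-compressed tarball whose name claims otherwise.
tar cf - foo | gzip > misnamed.tar
command -v file > /dev/null && file misnamed.tar
if gzip -t < misnamed.tar 2> /dev/null; then
    kind=gzip                                  # gzip -t accepts only gzip data
else
    kind=plain
fi

echo "member names starting with '/': $absolute"
echo "dotfile entries: $in_glob in glob.tar, $in_dir in dir.tar"
echo "misnamed.tar is really: $kind"

cd "$OLDPWD" && rm -rf "$work"
```

The dotfile counts should come out as 0 for the glob archive and 1 for the directory archive, which is the whole argument for tarring the directory rather than its contents.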


A bit of humor

Stanford University has an FTP repository, full of tarballs. The machine's name is 'labrea'.


See also