1. Overview

In this tutorial, we’ll see how to compress and decompress files and directories using bzip2, a high-quality standard data compression tool.

bzip2 1.0.8, the latest stable release, comes preinstalled in Debian 12, Fedora 39, openSUSE Tumbleweed, Gentoo, Arch Linux, and other Linux distributions derived from them. Since they all ship the same version, we can expect identical behavior for manual and scripted use across these distributions.

2. Why Use bzip2?

One of the main reasons for choosing bzip2 is its high efficiency in compressing text files. This makes it particularly suitable for applications where the volume of text data is large and where storage space reduction is critical. In addition, bzip2 offers a good balance between compression and decompression speed, making it useful for archives that need to be decompressed frequently.

It’s also worth noting that Clonezilla, a disk cloning tool, offers several compression options for disk images, including bzip2. In the Clonezilla documentation, bzip2 is mentioned as one of the compression options that, although slower, can result in significantly smaller file sizes than other options such as gzip or lzop.

So, while bzip2 is particularly effective for text files, its compression technology can be equally beneficial for files that aren’t strictly text, such as disk images. This is because bzip2 works well with data that has repetitions of long byte sequences, a feature that we can also find on disks.

3. Compressing Files and Directories

Let’s use this Bash script to generate four sample text files to compress:

#!/bin/bash
mkdir -p test/subdir
wget "https://loripsum.net/api/10/verylong/plaintext" -O test/file1.txt
wget "https://loripsum.net/api/10/verylong/plaintext" -O test/file2.txt
wget "https://loripsum.net/api/10/verylong/plaintext" -O test/subdir/subfile1.txt
base64 /dev/urandom | head -c 1000000 > test/file3.txt

Using the free loripsum.net API, we created file1.txt, file2.txt, and subfile1.txt with pseudo-natural human language text, and in these cases, we can expect to achieve the best level of compression. Let’s look at the beginning of the contents of file1.txt:

$ cat ./test/file1.txt 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Et tamen quid attinet luxuriosis ullam
exceptionem dari aut fingere aliquos, qui, cum luxuriose viverent, a summo philosopho non
[...]

In contrast, file3.txt contains 1 MB of random data converted into readable characters. In this case, we can expect less effective compression. Incidentally, if the data were binary and truly random, there would be no compression:

$ cat ./test/file3.txt 
ZnZ2aKHho1q2ZJLkSBaHL7mTQj6EHH2NWNF6bnahJRH+rzGde0wzZngpOfSBrr9FgF6oqirkA9xh
5iBUrZkvLPuOdwrtZfTWXng9C4XQqGN5OIOcdiYS4MlJG5CzeaMDC2ERRfrNJOy14fU39392S5zh
[...]
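We can check this claim with raw random bytes, skipping the base64 step: bzip2 finds nothing to compress, and the output typically ends up slightly larger than the input due to format overhead. A minimal sketch, with the file name random.bin being our own choice:

```shell
# Create 100 kB of raw random bytes (no base64 encoding this time)
head -c 100000 /dev/urandom > random.bin

# Compress while keeping the original for comparison
bzip2 --keep random.bin

# Compare sizes: random.bin.bz2 is not smaller than random.bin
stat -c '%s %n' random.bin random.bin.bz2
```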

Now, we’re ready to experiment with bzip2 compression.

3.1. Compress a Single File

We can specify one or more files as input, remembering that bzip2 processes them separately. Each compressed file will have the same name as the input file, but with the extension .bz2 added. To prevent deletion of the input files, we can use the --keep option:

$ bzip2 --keep ./test/file1.txt ./test/file3.txt

Using awk and stat, let's evaluate the degree of compression achieved. The calculated percentage indicates the size reduction relative to the uncompressed file:

$ awk '{s[NR]=$0} END {print "Compression: " 100*(1-s[2]/s[1]) "%"}' <(stat -c %s ./test/file1.txt ./test/file1.txt.bz2)
Compression: 62.097%

$ awk '{s[NR]=$0} END {print "Compression: " 100*(1-s[2]/s[1]) "%"}' <(stat -c %s ./test/file3.txt ./test/file3.txt.bz2)
Compression: 24.1822%

The result is as expected, i.e., file1.txt containing human language has a significantly higher compression level than file3.txt containing a set of text characters in random order.
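Since we'll repeat this measurement, we can wrap the awk pipeline in a small Bash function. The name ratio is our own choice, not a standard tool:

```shell
# Hypothetical helper: prints how much smaller FILE.bz2 is than FILE
ratio() {
  awk '{s[NR]=$0} END {printf "Compression: %.2f%%\n", 100*(1-s[2]/s[1])}' \
    <(stat -c %s "$1" "$1.bz2")
}
```

With this in place, ratio ./test/file1.txt prints the same figure as the earlier one-liner.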

According to the bzip2 man page, the default compression level, -9, is already the maximum.
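The levels -1 through -9 set the block size from 100 kB to 900 kB, and larger blocks usually compress at least as well. We can compare the two extremes on a generated file; the file names here are our own choice:

```shell
# Generate roughly 1.3 MB of compressible text
seq 1 200000 > numbers.txt

# Compress to stdout with the smallest and largest block sizes,
# leaving the input file untouched
bzip2 -1 -c numbers.txt > numbers-1.bz2
bzip2 -9 -c numbers.txt > numbers-9.bz2

# Compare the resulting sizes
stat -c '%s %n' numbers.txt numbers-1.bz2 numbers-9.bz2
```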

3.2. Compress Multiple Files

bzip2 only compresses individual files and doesn’t have the ability to bundle multiple files into a single container. To handle multiple files, we must first use the tar command, which is included by default in almost all Linux distributions. It groups the files into a single .tar archive, known as a tarball. Then, we can compress it using bzip2:

$ tar -cvf archive_name.tar test/file1.txt test/file2.txt test/subdir
$ bzip2 archive_name.tar

This way, we got the file archive_name.tar.bz2 with a double extension; the single-extension equivalents .tbz2, .tb2, .tbz, and .tz2 are also in common use. We can combine the previous two commands into one:

$ tar -cjvf archive_name.tar.bz2 test/file1.txt test/file2.txt test/subdir

The result is identical. Let’s look at each option:

  • -c → Creates a new archive
  • -j → Uses bzip2 compression
  • -v → Verbose mode that provides visible output of which files are included
  • -f → Specifies the name of the tarball, which must follow this option

We need to specify the files and directories to include in the tarball at the end of the command.
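Once created, we can check what actually went into the tarball with -t, which lists the contents without extracting anything. A self-contained sketch with throwaway files of our own choosing:

```shell
# Build a small bzip2-compressed tarball to inspect
mkdir -p demo/subdir
echo 'hello' > demo/a.txt
echo 'world' > demo/subdir/b.txt
tar -cjf demo.tar.bz2 demo

# -t lists the archive's contents; -j handles the bzip2 layer
tar -tjf demo.tar.bz2
```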

3.3. Compress a Directory and Its Subdirectories

In this case, what we saw earlier about using tar with the -j option applies. When tar takes a directory as input, it includes all its subdirectories, since recursion is the default. To exclude certain files or directories, we can use the --exclude option:

$ tar -cjvf testfiles.tar.bz2 --exclude='file3.txt' test
test/
test/file2.txt
test/file1.txt
test/subdir/
test/subdir/subfile1.txt

By default, tar includes hidden files and preserves symbolic and hard links. It crosses mount points unless we pass the --one-file-system option. For more information, the official tar manual documents these features with examples.

4. Decompressing Files and Directories

In the following examples, we’ll use the previously created file3.txt.bz2 and testfiles.tar.bz2 as an example. As a reminder, the first contains a single bzip2 compressed file, the second a bzip2 compressed tar archive.

Before we go any further, let's keep in mind that bzip2 and bunzip2 are really the same program: running bunzip2 is equivalent to running bzip2 --decompress.
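We can see the equivalence with a throwaway file: compressing with bzip2 and then decompressing with bzip2 -d round-trips the content, exactly as bunzip2 would:

```shell
echo 'some text' > note.txt
bzip2 note.txt         # creates note.txt.bz2 and removes note.txt
bzip2 -d note.txt.bz2  # same effect as: bunzip2 note.txt.bz2
cat note.txt           # prints: some text
```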

4.1. Extract a Single File From an Archive

It's really easy. As before, we can use bunzip2 with the --keep option to avoid deleting the input file:

$ bunzip2 --keep file3.txt.bz2 
$ file file3.txt
file3.txt: ASCII text

This is how we extracted file3.txt.

Let’s note that overwriting existing files is disabled by default. That’s why if we had run this command with file3.txt already present, we’d have gotten the error Output file file3.txt already exists.
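When overwriting is actually what we want, the --force (or -f) option lifts this protection. A sketch with a throwaway file name of our own choosing:

```shell
echo 'v1' > data.txt
bzip2 --keep data.txt        # data.txt and data.txt.bz2 now coexist
echo 'v2' > data.txt         # modify the uncompressed copy

bunzip2 data.txt.bz2 || true # refused: data.txt already exists
bunzip2 --force data.txt.bz2 # overwrites data.txt with the archived v1
cat data.txt                 # prints: v1
```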

4.2. Extract Multiple Files

Let’s say we want to extract only test/file1.txt and test/file2.txt from testfiles.tar.bz2:

$ tar -xvjf testfiles.tar.bz2 test/file1.txt test/file2.txt
test/file2.txt
test/file1.txt

The tar options are almost the same as those used for compression before. The only difference is -x, which tells tar to extract the files.

Unlike bunzip2, tar overwrites the target files if they exist, so we have to be more careful.

4.3. Extract the Whole Archive

To extract the entire contents of a .tar.bz2 archive, we can use the tar command without having to specify individual files:

$ tar -xvjf testfiles.tar.bz2
test/
test/file2.txt
test/file1.txt
test/subdir/
test/subdir/subfile1.txt

In this particular case, since all the files are in the test directory and its subdirectory, we get exactly the same result by specifying to extract the test directory:

$ tar -xvjf testfiles.tar.bz2 test

As a final note, we can protect existing data during the extraction. The --keep-old-files option makes tar warn when it encounters existing files and exit with an error status. On the other hand, --skip-old-files provides a cleaner, quieter operation: it silently skips existing files, printing no warnings at all unless we've enabled verbose output with the -v option.
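A quick way to see --skip-old-files in action, using throwaway names of our own choosing:

```shell
# Archive a file, then modify it after archiving
mkdir -p proj
echo 'archived' > proj/keep.txt
tar -cjf proj.tar.bz2 proj
echo 'local changes' > proj/keep.txt

# Extraction leaves the already-existing file untouched
tar -xjf proj.tar.bz2 --skip-old-files
cat proj/keep.txt   # prints: local changes
```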

5. Conclusion

In this article, we’ve explored how to efficiently compress and decompress files and directories using bzip2, with a focus on using the tar command to bundle multiple files into a single archive.

We also looked at the different default behaviors of bzip2 and tar with respect to automatically deleting input files and overwriting output files.

These tools are essential for optimizing storage, especially when dealing with large amounts of data.