1. Overview
Compressing a file saves disk space and network bandwidth when transferring it over the internet. In this tutorial, we’ll look at some of the best compression methods available in Linux.
2. Introduction to Lossless Compression
Lossless compression is a class of data compression that reduces the size of a file without losing any data, unlike lossy compression. The central idea of lossless compression is to exploit the statistical redundancy of a file’s content. Specifically, lossless compression reduces the data size by replacing long, recurring symbols with shorter representations.
2.1. Dictionary-Based Algorithm
Out of the many different methods, the dictionary-based algorithm is the most popular approach to lossless compression. A dictionary-based algorithm works by first building up a dictionary that maps short pointers to the symbols in the file. Then, for each symbol in the file’s content, the compression algorithm replaces it with its pointer, which is typically smaller in size.
For example, if the word “Linux” keeps repeating in the file, the lossless compression method can build a map that associates the character “1” with the word “Linux”. Then, we represent every occurrence of “Linux” in the original file with the character “1”. As the character “1” requires fewer bytes to represent, the compressed file will be smaller.
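To make the idea concrete, below is a minimal Python sketch of dictionary-based substitution. The one-entry dictionary is hand-picked for illustration; real dictionary coders such as LZ77 build theirs adaptively from the data itself:

# A toy illustration of dictionary-based substitution, not a real codec.
# We map a short token to a frequently repeated word and substitute it.
DICTIONARY = {"\x01": "Linux"}  # hand-picked token -> symbol mapping

def toy_compress(text: str) -> str:
    for token, word in DICTIONARY.items():
        text = text.replace(word, token)
    return text

def toy_decompress(text: str) -> str:
    for token, word in DICTIONARY.items():
        text = text.replace(token, word)
    return text

original = "Linux is great. Linux is everywhere. We all love Linux."
compressed = toy_compress(original)
assert toy_decompress(compressed) == original
print(len(original), "->", len(compressed))  # 55 -> 43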
LZ77 and LZ78 are popular implementations of dictionary-based algorithms. In fact, all of the algorithms we’ll be introducing in this article have traces of LZ77 and LZ78 in them.
2.2. Compression Performance
Compression performance boils down to two metrics: the compression ratio and the duration of compression. The compression ratio measures how much the compression reduces the size of the original file. We calculate the ratio by dividing the original file size by the compressed size. For example, a 1 GB file compressed to 100 MB has a compression ratio of roughly 10. This is the metric we’ll be trying to maximize throughout the article.
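A quick computation shows where that “roughly 10” comes from once we express both sizes in binary units (1 GB = 1024 MB):

original_mb = 1 * 1024    # 1 GB expressed in MB
compressed_mb = 100
print(original_mb / compressed_mb)  # 10.24, i.e. roughly 10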
3. Problem Statement
How “good” a compression method is depends highly on what we’re after and the characteristics of the files we want to compress. Therefore, it’s important that we first set out a clear problem statement before we begin our experiment.
In this article, we’ll focus on improving the compression ratio for program source code files. This means we want the best reduction in size after compression, with little regard for the compression duration. Besides that, the regular, repetitive occurrence of symbols in program source code files means we can expect a relatively high compression ratio compared to binary files with white-noise-like data.
Although we’re focusing on the compression ratio metric, we’ll also measure the duration of the compression process to paint a more complete picture. This is because compression is a trade-off between compression duration and compression ratio. Generally, the higher the compression ratio we want to achieve, the longer the compression process takes. Therefore, we might reach a point of diminishing returns, where the compression takes so long that the improvement in compression ratio is no longer worth it.
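We can see this trade-off in miniature using the compression levels of Python’s zlib module. This is just a rough sketch: the input file is an assumption (any large text file works), and timings vary by machine:

import time
import zlib

# Any sizable text file will do; repeating it makes the timing measurable.
data = open("/usr/share/dict/words", "rb").read() * 10

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level=level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: ratio {len(data) / len(out):.2f}, {elapsed:.2f}s")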
4. Benchmark Compression Performance Using gzip
For the benchmark, we’ll use the popular gzip program in Linux. The gzip program uses the DEFLATE compression algorithm internally to compress the file. The DEFLATE algorithm combines the LZ77 compression algorithm with Huffman coding to further improve the compression ratio.
The gzip compression algorithm is popular because it achieves a good compression ratio without requiring a long compression time or a lot of computing resources. These characteristics make gzip a good all-around compression method that serves as a great benchmark.
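Since DEFLATE is also exposed through the zlib library, we can sketch what gzip does programmatically in Python. This is a minimal illustration (it omits gzip’s file header and checksum handling), and the input path is just an assumed sample file:

import zlib

# Compress a file with DEFLATE at maximum level (9), the same
# algorithm gzip applies under the hood.
data = open("/usr/include/stdio.h", "rb").read()  # any file works
compressed = zlib.compress(data, level=9)
print(len(data), "->", len(compressed))
assert zlib.decompress(compressed) == data  # lossless round trip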
4.1. Example File for Testing Compression
We’ll be using the tarball of the Linux kernel code as the file for compression testing. This tarball contains a collection of Linux kernel source code files, which makes it a perfect candidate for our experiment.
Let’s first download the Linux kernel source code tarball and save it as linux.tar.gz using wget:
$ wget -qO linux.tar.gz https://github.com/torvalds/linux/archive/refs/tags/v6.4.tar.gz
Since the tarball is already in compressed form, we’ll need to get the original tarball by decompressing it with the gzip -d command:
$ gzip -d -c linux.tar.gz > linux.tar
$ ls -hl linux.tar
-rw-r--r-- 1 user user 1.4G Jul 29 06:43 linux.tar
We can see that the uncompressed linux.tar is 1.4 GB in size. Let’s see how much we can compress the tarball.
4.2. Maximum gzip Compression Ratio
We’ll compress the linux.tar file using the gzip command. Furthermore, we’ll request the maximum compression level gzip supports with the -9 option flag:
$ time gzip -9 -c linux.tar > linux-gzip-9.tar.gz
real 0m48.542s
user 0m48.106s
sys 0m0.416s
$ ls -lh linux-gzip-9.tar.gz
-rw-r--r-- 1 user user 212M Jul 29 06:57 linux-gzip-9.tar.gz
From the output, we can see that it takes close to 50 seconds to compress linux.tar down to 212 MB, giving us a compression ratio of about 6.6, which is not too bad. Let’s see how the next tool fares.
5. Higher Compression Ratio With lrzip
The lrzip program is a compression tool that combines rzip with other popular compression algorithms. The use of rzip offers a better compression ratio than conventional algorithms when the files are large and the system has a large amount of RAM to spare.
Theoretically, rzip further improves the compression ratio for a huge file by matching symbols over large distances. The rzip algorithm uses a history buffer of up to 900 MB to find those long-distance matches. In comparison, the gzip program uses a history buffer of only 32 KB. This means rzip will be able to “remember” far more repeating patterns than gzip. Naturally, when compressing smaller files, the improvement in compression ratio between rzip and gzip might not be as evident.
Of course, the larger history buffer of rzip doesn’t come for free: it requires the system to have more RAM to spare. This is because the history buffer lives in RAM, and the buffer occupies it throughout the entire compression process.
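We can observe the effect of a small history buffer with a quick Python experiment. The sketch below uses zlib, whose DEFLATE window is the same 32 KB that gzip uses: a repeat that sits within the window compresses away, while the same repeat placed over 1 MB back goes unnoticed:

import os
import zlib

block = os.urandom(16 * 1024)    # 16 KB of incompressible bytes
gap = os.urandom(1024 * 1024)    # 1 MB of unrelated data

near = block + block             # repeat 16 KB back: inside the 32 KB window
far = block + gap + block        # repeat over 1 MB back: outside the window

print(len(zlib.compress(near, level=9)))  # ~16 KB: the repeat was found
print(len(zlib.compress(far, level=9)))   # ~1.03 MB: both copies stored raw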
5.1. Installation
To get the lrzip program, we can install the lrzip package using the package manager of our Linux system:
$ apt-get install -y lrzip
Then, we can verify the installation by checking the version of the lrzip command:
$ lrzip --version
lrzip version 0.631
5.2. Two-Stage Pass of the lrzip Command
The lrzip command compresses a file in a two-stage process. First, it passes the file through an rzip-like compression pass to remove long-distance data redundancy. Then, it passes the output to a second compression algorithm.
For a large file with repeating sequences spread all over the file, the first pass is what improves the compression ratio compared to using the second-stage compression algorithm alone.
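To build some intuition for why the first pass helps, here’s a deliberately naive two-stage sketch in Python. The first stage deduplicates repeated fixed-size chunks no matter how far apart they are, as a crude stand-in for rzip (this is not how lrzip is actually implemented), and the second stage is plain DEFLATE:

import os
import struct
import zlib

CHUNK = 16 * 1024  # first-stage chunk size, chosen arbitrarily for this toy

def stage_one(data: bytes) -> bytes:
    """Naive long-distance dedup: replace repeated chunks with references."""
    seen = {}           # chunk bytes -> index of its first occurrence
    out = bytearray()
    for offset in range(0, len(data), CHUNK):
        chunk = data[offset:offset + CHUNK]
        if chunk in seen:
            out += b"\x01" + struct.pack("<I", seen[chunk])  # back-reference
        else:
            seen[chunk] = offset // CHUNK
            out += b"\x00" + chunk                           # literal chunk
    return bytes(out)

# Two identical 64 KB blocks separated by 1 MB of noise: gzip's 32 KB
# window cannot see the repeat, but the chunk-level first stage can.
block = os.urandom(64 * 1024)
data = block + os.urandom(1024 * 1024) + block

print("DEFLATE only:      ", len(zlib.compress(data, level=9)))
print("dedup then DEFLATE:", len(zlib.compress(stage_one(data), level=9)))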
In the subsequent section, we’ll see how the different second-stage algorithms affect the compression ratio on the same linux.tar file.
5.3. Using gzip at the Second Stage
To use the gzip compression method for the second stage, we can specify the -g option to the lrzip command:
$ time lrzip -L 9 -g linux.tar
Output filename is: linux.tar.lrz
linux.tar - Compression Ratio: 7.130. Average Compression Speed: 31.738MB/s.
Total time: 00:00:42.72
real 0m30.976s
user 0m48.616s
sys 0m14.124s
$ ls -lh linux.tar.lrz
-rw-r--r-- 1 user user 190M Jul 29 06:43 linux.tar.lrz
We can see that by using lrzip in combination with the gzip compression method at the second stage, we get a higher compression ratio of 7.13 on the same file. This additional gain in compression ratio is thanks to the use of rzip, which removes long-distance data redundancy.
5.4. LZMA Compression Method
The LZMA compression method generally takes longer than gzip but provides a better compression ratio. By default, lrzip uses the LZMA compression method for the second stage if we don’t specify one.
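Independently of lrzip, we can get a feel for LZMA’s ratio-versus-speed trade-off with Python’s lzma module; a minimal sketch, with the input file again just an assumed sample:

import lzma
import zlib

data = open("/usr/include/stdio.h", "rb").read()  # any text file works
print("DEFLATE level 9:", len(zlib.compress(data, level=9)))
print("LZMA preset 9:  ", len(lzma.compress(data, preset=9)))
# On text, LZMA usually produces the smaller output, at a lower speed.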
Let’s compress the linux.tar using lrzip, with the LZMA compression method as the second stage compression:
$ time lrzip linux.tar
Output filename is: linux.tar.lrz
linux.tar - Compression Ratio: 9.733. Average Compression Speed: 9.454MB/s.
Total time: 00:02:20.90
real 2m20.905s
user 6m4.248s
sys 0m15.788s
$ ls -lh linux.tar.lrz
-rw-r--r-- 1 user user 137M Jul 29 06:43 linux.tar.lrz
From the time command output, we can see that the LZMA compression method is taking much longer than the gzip compression method. However, we get a compression ratio of 9.73 in return. In other words, we’ve managed to reduce the 1.4 GB tarball to 137 MB in size.
5.5. Best Compression Ratio at Slower Execution Time
The ZPAQ compression method offers the highest compression ratio of the methods here, at the cost of a much longer compression time. Specifically, ZPAQ compression can take several times longer than LZMA.
To use the ZPAQ compression method with lrzip, we’ll pass the -z option flag to the lrzip command:
$ time lrzip -z linux.tar
Output filename is: linux.tar.lrz
linux.tar - Compression Ratio: 12.638. Average Compression Speed: 4.565MB/s.
Total time: 00:04:51.30
real 4m51.312s
user 11m56.119s
sys 0m14.244s
$ ls -lh linux.tar.lrz
-rw-r--r-- 1 user user 106M Jul 29 06:43 linux.tar.lrz
As we can see, the total time taken for the command is much longer, at close to five minutes. In return, ZPAQ gives us a compression ratio of 12.64, reducing linux.tar to 106 MB.
6. Conclusion
In this tutorial, we’ve first learned about lossless compression in general. Then, we set a benchmark of compression ratio for the Linux kernel code tarball using the gzip command. Additionally, we’ve measured the time taken to compress the tarball using the gzip command as our benchmark for comparison.
After conducting a series of trials, we’ve learned that lrzip with gzip as its second-stage compression method provides a slightly better compression ratio than gzip alone. Then, the LZMA compression method compresses the file significantly further, giving us a compression ratio of 9.73. Finally, the ZPAQ compression method achieves the highest compression ratio of 12.64. However, its compression duration is more than twice that of LZMA.