1. Overview
Previously, we’ve looked at Zip and 7-Zip in Linux. In this short tutorial, we focus on gzip and gunzip for compressing and uncompressing files from the Linux command line.
2. Why Use gzip?
Zip has two advantages over gzip:
- Zip is cross-platform and available on all computing platforms. gzip is a Linux tool first. It’s also available out-of-the-box on macOS but isn’t readily available on Windows
- Zip can compress multiple files and even entire directory hierarchies. gzip, on the other hand, compresses only a single file. That’s why we typically use it with tar, which packages multiple files/directories into a single archive file
So, if Zip has these advantages, why would we use gzip in Linux? One word: ubiquity. Virtually every Linux distribution ships with tar and gzip preinstalled. Zip, on the other hand, comes as two separate programs in Linux, zip and unzip, and we can’t rely on either being installed.
Granted, we could easily install zip and unzip on most Linux systems with yum or apt. But in the age of running our Spring Boot applications in Docker containers, we want to keep our Docker images small. And that means installing as little additional software as possible.
3. Using gzip and gunzip for Single Files
Let’s use gzip to compress a single file:
gzip -v data.csv
This compresses the file data.csv and replaces it with the file data.csv.gz. The -v option lets gzip display the compression ratio.
gzip has compression levels 1-9, where 9 gives us maximum compression but at the slowest speed. The default compression level is 6 and is a good compromise between speed and compression ratio.
Using a higher compression level significantly increases compression time, but often yields only a slight improvement, if any, in the compression ratio.
Here’s how we compress a file with maximum compression level:
gzip -v9 data.csv
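To get a feel for the trade-off between levels, we can compress the same file at several levels and compare the sizes. The following is a minimal sketch: the sample CSV content and the output file names (data.1.gz and so on) are made up for illustration, and the actual numbers depend entirely on our data.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Generate a sample CSV; the content is invented for this sketch.
seq 1 20000 | awk '{ print $1 "," $1 * 7 ",row-" $1 }' > data.csv
orig_size=$(wc -c < data.csv | tr -d ' ')

# -c writes the compressed data to stdout, so data.csv stays in place
# and we can compress it again at the next level.
for level in 1 6 9; do
    gzip -c -"$level" data.csv > "data.$level.gz"
    size=$(wc -c < "data.$level.gz" | tr -d ' ')
    printf 'level %s: %s bytes (original: %s bytes)\n' \
        "$level" "$size" "$orig_size"
done
```

Because of the -c option, gzip leaves data.csv untouched here, which is what makes repeated runs on the same input possible.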
Now, let’s use gunzip to decompress a single file from a gzip file:
gunzip -v data.csv.gz
This decompresses the file data.csv.gz and replaces it with data.csv.
As with gzip, the -v option shows the compression ratio after the file was uncompressed.
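If we want to convince ourselves that a round trip restores the file exactly, we can keep a copy and compare it afterward. This sketch assumes a gzip version that supports the -k (keep) option, which GNU gzip has offered since version 1.6; the sample file content and the data.csv.orig name are invented for the example.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Small sample file (made-up content) plus a reference copy.
printf 'id,name\n1,alice\n2,bob\n' > data.csv
cp data.csv data.csv.orig

gzip -k data.csv       # -k keeps data.csv next to data.csv.gz (gzip 1.6+ assumed)
gzip -t data.csv.gz    # -t tests the integrity of the compressed file
gzip -l data.csv.gz    # -l lists compressed size, uncompressed size, and ratio

rm data.csv            # remove the kept original so gunzip can recreate it
gunzip data.csv.gz     # restores data.csv and removes data.csv.gz

cmp data.csv data.csv.orig && echo "round trip OK"
```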
4. Using tar with gzip for Multiple Files and Directories
gzip compresses just a single file. That’s why we have to use gzip together with the tar archiving utility to compress multiple files or entire directories. We can archive with tar and compress with gzip in one step:
tar czvf archive.tar.gz *.csv
- We compress all files with a csv extension in the current directory into the compressed archive, archive.tar.gz
- The z option enables compression with gzip
- Because of the v option, tar shows which files are added to the archive
- Unlike gzip, tar doesn’t delete the input files after it creates the archive
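After creating an archive, it's often worth listing its contents without extracting anything. The sketch below does this with the t option; the two CSV files are hypothetical sample inputs.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Two sample input files with made-up content.
printf 'a,b\n1,2\n' > first.csv
printf 'c,d\n3,4\n' > second.csv

tar czf archive.tar.gz *.csv   # create the gzip-compressed archive
tar tzf archive.tar.gz         # t lists the archive members instead of extracting

ls *.csv                       # the input files are still there
```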
As we recall from the previous section, gzip offers various compression levels. Which one does tar pick? That depends on our version of tar, but it’s most likely gzip’s default level of 6.
tar allows setting the compression program through the --use-compress-program option. We can use this option to set the compression level as well. Here, we specify the maximum gzip compression level of 9:
tar cvf archive.tar.gz --use-compress-program='gzip -9' *.csv
Please note that we had to remove the z option here because --use-compress-program already sets the compression program.
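An alternative that doesn't depend on any tar option for choosing the compressor is to let tar write the archive to standard output and pipe it into gzip ourselves. This is only a sketch of that approach; the sample files are invented for the example.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Sample inputs with made-up content.
printf 'a,b\n1,2\n' > first.csv
printf 'c,d\n3,4\n' > second.csv

# "f -" makes tar write the uncompressed archive to stdout;
# gzip -9 then compresses the stream at the maximum level.
tar cf - *.csv | gzip -9 > archive.tar.gz

gzip -t archive.tar.gz     # verify the compressed stream
tar tzf archive.tar.gz     # list the members to confirm the archive is readable
```

The result is an ordinary archive.tar.gz, so we can extract it with tar in the usual way.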
Uncompressing a tar archive with gzip is also a single step:
tar xvf archive.tar.gz
- We decompress the file archive.tar.gz and extract its content into the current directory
- We don’t have to tell tar to uncompress with gunzip — tar does this automatically by inspecting the file and detecting the gzip compression
- Because of the v option, tar shows which files are extracted from the archive
- Unlike gunzip, tar doesn’t delete the archive file after the extraction is complete
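When we'd rather not extract into the current directory, tar's -C option switches to a target directory first. The following sketch extracts into a hypothetical extracted directory and checks that the extracted file matches the original.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a small archive from a sample file (made-up content).
printf 'a,b\n1,2\n' > data.csv
tar czf archive.tar.gz data.csv

# Extract into a separate directory; -C changes into it before extracting.
mkdir extracted
tar xvf archive.tar.gz -C extracted

cmp data.csv extracted/data.csv && echo "extracted copy matches"
ls archive.tar.gz    # the archive itself is still in place
```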
5. Faster Compression and Decompression With pigz
gzip and gunzip, like most Linux tools, only use a single CPU core. So, compressing large files can take a while.
That’s why pigz exists, a “parallel implementation of gzip”. pigz takes advantage of both multiple CPUs and multiple CPU cores for higher compression and decompression speed. pigz is an anagram of gzip and is pronounced “pig-zee”. We can install it with either yum or apt.
pigz is compatible with gzip, and unpigz is compatible with gunzip. As such, pigz produces files that gunzip can decompress and uses the same options as gzip. Likewise, unpigz decompresses files that gzip created and also uses the same options as gunzip.
How much faster is pigz?
To find out, we ran a quick test on a modern computer with six CPU cores and hyperthreading. The test data was an 818 MB CSV file. We used the maximum compression level 9 with both gzip and pigz.
We ran the same steps with both tools. For example, this is how we compressed the file with pigz:
pigz -v9 data.csv
And then, we decompressed this file using unpigz:
unpigz -v data.csv.gz
- The fastest compression with gzip took 112 seconds and reduced the file size by 88.3%, down to 95 MB
- pigz took just 15 seconds to compress the same file and therefore was 7.5 times faster, and the compressed file was 0.3% smaller than the one gzip created
- Decompressing with gunzip took 3 seconds, while unpigz took about a second for the same file, which makes it only about three times faster; note that measuring a duration of just a few seconds isn’t very precise in this setup
So, pigz/unpigz does indeed speed up compressing and decompressing files significantly with multiple CPUs or multiple CPU cores!
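To repeat such a comparison on our own machine, a rough harness along the following lines may help. It's a sketch under several assumptions: the generated sample file is far smaller than the 818 MB file used above, the bench helper and the output file names are invented for this example, and date +%s only has one-second resolution, so the timings are coarse. If pigz isn't installed, the sketch simply skips it.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Sample data (made up); use a real, large file for meaningful timings.
seq 1 200000 | awk '{ print $1 "," $1 * 7 ",row-" $1 }' > data.csv

# bench is a hypothetical helper: $1 is the compressor command.
# It compresses data.csv at level 9 and prints elapsed seconds and output size.
bench() {
    start=$(date +%s)
    "$1" -9 -c data.csv > "data.$1.gz"
    end=$(date +%s)
    size=$(wc -c < "data.$1.gz" | tr -d ' ')
    printf '%s: %s s, %s bytes\n' "$1" "$((end - start))" "$size"
}

bench gzip
if command -v pigz >/dev/null 2>&1; then
    bench pigz
else
    echo "pigz not installed; skipping"
fi
```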
6. Using pigz With tar
To use pigz together with tar, we specify --use-compress-program to compress with pigz:
tar cvf archive.tar.gz --use-compress-program=pigz *.csv
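With GNU tar, --use-compress-program also applies when extracting, because tar then calls the given program with the -d option to decompress. An approach that doesn’t depend on this behavior is to perform two separate steps, decompressing with unpigz first and then extracting the archive: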
unpigz -v archive.tar.gz
tar xvf archive.tar
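The two steps above replace archive.tar.gz with an uncompressed archive.tar on disk. If we'd rather keep the compressed archive and avoid the intermediate file, one possible variation is to stream the decompressed data straight into tar. This sketch falls back to gunzip when unpigz isn't installed; the sample file is invented for the example.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a small gzip-compressed archive to extract (made-up content).
printf 'a,b\n1,2\n' > data.csv
tar czf archive.tar.gz data.csv
rm data.csv

# Prefer unpigz; fall back to gunzip if pigz isn't available.
if command -v unpigz >/dev/null 2>&1; then
    decompress=unpigz
else
    decompress=gunzip
fi

# -c decompresses to stdout and leaves archive.tar.gz untouched;
# "f -" makes tar read the archive from stdin.
"$decompress" -c archive.tar.gz | tar xvf -
```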
7. Conclusion
In this short article, we first saw when we might choose gzip over Zip. We then learned how to compress and decompress single files with gzip/gunzip.
Next, we looked at how we can use tar with gzip to compress and decompress multiple files and directories.
And finally, we discovered how pigz speeds up the compression and decompression on modern computers.