1. Overview
Along with ZIP and the more recent 7-Zip, gzip is one of the most widely used compression formats and tools.
In this short tutorial, we focus on an alternative, xz, for compressing and decompressing files on the Linux command line.
2. Why Use xz?
It’s fairly well known that ZIP is the standard cross-platform archiving tool and format. Similarly, gzip with tar is the standard archiving and compression tool in Linux. So, why use xz at all?
xz usually creates much smaller archives than gzip while accepting largely the same options. Therefore, we can consider xz a near drop-in replacement for gzip. We test the claim of smaller archives later in this tutorial.
The disadvantage of xz is that it doesn’t ship with all Linux distributions. Yet, we can install it with many native package managers such as yum and apt.
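Before installing anything, we can check whether xz is already present. The following is a minimal sketch: the package names xz-utils (Debian, Ubuntu) and xz (RHEL family) are the usual ones, but they're assumptions that may not hold for every distribution:

```shell
#!/bin/sh
# Check whether an xz binary is on the PATH.
if command -v xz >/dev/null 2>&1; then
    xz_status="installed"
    xz --version
else
    xz_status="missing"
    # Only hints are printed here; the script installs nothing itself.
    echo "xz not found. Possible install commands:"
    echo "  sudo apt install xz-utils    # Debian, Ubuntu"
    echo "  sudo yum install xz          # RHEL, CentOS"
fi
echo "xz is $xz_status"
```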
3. Using xz for Single Files
Let’s use xz to compress a single file.
Apart from the program name, the usage is identical to that of gzip:
$ xz -v data.csv
This command compresses the file data.csv and replaces it with the file data.csv.xz. The -v option makes xz display progress information.
xz offers compression levels 0-9, while gzip offers 1-9. The default compression level is 6. However, unlike with gzip, that default isn’t usually a good compromise between speed and compression ratio.
So, let’s compress a file with the low compression level 1:
$ xz -v1 data.csv
Just as gzip has gunzip, xz comes with a separate unxz command, but it’s simply equivalent to xz -d.
Here, we use the -d option to decompress a single file:
$ xz -dv data.csv.xz
This decompresses the file data.csv.xz and replaces it with data.csv. Again, the -v option displays progress information.
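To see the whole round trip without losing the input file, here's a small sketch. It runs in a temporary directory and generates its own hypothetical data.csv; it relies on the standard xz options -k (keep the input), -t (test integrity), -l (list archive information), and -c (write to standard output):

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data: 100,000 numbered CSV rows.
seq 1 100000 | sed 's/$/,sample/' > data.csv
cp data.csv original.csv

# -k keeps data.csv instead of replacing it with data.csv.xz.
xz -k data.csv

# -t checks the integrity of the compressed file without writing output.
xz -t data.csv.xz

# -l shows the compressed size, uncompressed size, and ratio.
xz -l data.csv.xz

# -dc decompresses to standard output, so no file is replaced.
xz -dc data.csv.xz > restored.csv

# Compare the restored copy with the original, byte for byte.
roundtrip="failed"
if cmp -s original.csv restored.csv; then
    roundtrip="ok"
fi
echo "round trip: $roundtrip"
```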
4. Using tar With xz for Multiple Files and Directories
Just like gzip, xz compresses only individual files: it can’t bundle several files or a directory into one archive.
4.1. Compress Many Filesystem Objects
That’s why we usually leverage the tar archiving utility in combination with xz to compress multiple files or entire directories:
$ tar cJvf archive.tar.xz *.csv
Let’s break down this command:
- c: create a new archive
- J: sets the compression algorithm to xz
- v: verbosity makes tar show each added and compressed file
- f archive.tar.xz: resulting archive name
- *.csv: compress all files with a csv extension in the current directory
Notably, unlike xz and gzip, tar doesn’t delete the input files after it creates the archive.
Which xz compression level does tar pick? It depends on the version of tar, but it’s usually the default compression level 6.
Still, tar enables setting the compression program through the --use-compress-program option. Since this option accepts command-line arguments, we can use it to set the compression level. Here, we specify the low compression level 1:
$ tar cvf archive.tar.xz --use-compress-program='xz -1' *.csv
Notably, we remove the J option because --use-compress-program already sets the compression program.
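An equivalent approach that's common in scripts is to let tar write the archive to standard output and pipe it into xz. The sketch below assumes two hypothetical sample files, a.csv and b.csv, created in a temporary directory:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample files.
seq 1 50000 > a.csv
seq 50001 100000 > b.csv

# "f -" makes tar write the uncompressed archive to standard output;
# xz then compresses the stream at level 1.
tar cf - *.csv | xz -1 > archive.tar.xz

# List the archive contents to confirm that both files are inside.
listing=$(tar tf archive.tar.xz)
echo "$listing"
```

The pipe keeps tar and xz options cleanly separated, at the cost of a slightly longer command.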
4.2. Decompress Archive
Decompressing a tar archive with xz is also a single step and identical to gzip (except for the different file extension):
$ tar xvf archive.tar.xz
Again, let’s see what each option does:
- f archive.tar.xz: archive for extraction
- x: extract (decompress)
- v: verbosity makes tar show each extracted file
Again, the archive isn’t deleted after the operation. Notably, we don’t have to tell tar to decompress with xz as tar does this automatically by inspecting the file and detecting the xz compression.
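Before extracting, it's often useful to check what an archive contains and to unpack it into a separate directory. Here's a sketch with hypothetical sample files, using tar's t (list) operation and -C (change directory) option:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample files and archive.
seq 1 1000 > a.csv
seq 1001 2000 > b.csv
tar cJf archive.tar.xz a.csv b.csv

# t lists the contents without extracting anything.
tar tvf archive.tar.xz

# -C extracts into another directory, leaving the current one untouched.
mkdir extracted
tar xvf archive.tar.xz -C extracted

# Verify that the extracted copies match the originals.
extract_check="failed"
if cmp -s a.csv extracted/a.csv && cmp -s b.csv extracted/b.csv; then
    extract_check="ok"
fi
echo "extraction: $extract_check"
```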
5. Faster Compression With Multithreading
Unlike gzip, xz supports multithreading directly, which speeds up compression.
In versions before 5.6, xz uses just a single thread by default. We can specify the number of threads with the -T option. A value of 0 tells xz to use one thread for every available CPU core. That’s generally a good value to use:
$ xz -vT0 data.csv
Alternatively, we can set an explicit number of threads, such as the 3 in this example:
$ xz -vT3 data.csv
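To get a feel for the effect of -T, we can time both modes on the same input. This is only a rough sketch: the sample file is hypothetical and small, and date +%s has a resolution of one second, so realistic measurements need a much larger file:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data: about 15 MB of numbered rows.
seq 1 2000000 > data.csv

# Compress data.csv with the given thread option ($1) into file $2
# and report the elapsed time and the resulting size.
time_xz() {
    start=$(date +%s)
    xz "$1" -c data.csv > "$2"
    end=$(date +%s)
    echo "xz $1: $((end - start)) s, $(wc -c < "$2") bytes"
}

time_xz -T1 single.xz
time_xz -T0 multi.xz

# Both files must decompress to exactly the same data.
sum_original=$(cksum < data.csv)
sum_single=$(xz -dc single.xz | cksum)
sum_multi=$(xz -dc multi.xz | cksum)
```

On small inputs, both runs may take about the same time, since xz parallelizes by splitting the input into blocks. For the same reason, the multithreaded file can be slightly larger than the single-threaded one.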
Unlike unpigz, decompression with xz doesn’t benefit from multithreading by default. For decompression to run in parallel, the file must have been compressed in multithreaded mode, as we did above.
Even then, more than two or three threads don’t usually present much improvement, if any.
6. Using Multithreading With tar
There are two main ways to use multithreading with tar and xz.
6.1. The --use-compress-program Option
Previously, we specified the compression level with the --use-compress-program option. Now, we enable multithreading through the same option by adding the thread count to the xz command-line options.
Here, we again use one thread for every CPU core:
$ tar cvf archive.tar.xz --use-compress-program='xz -1T0' *.csv
While decompression with xz doesn’t benefit from multithreading by default, we can still use the same options:
$ tar xvf archive.tar.xz --use-compress-program='xz -dT3'
Thus, we again use -d with a specific thread count (3).
6.2. Environment Variables
Another way to set the options for xz is to use the XZ_* environment variables. Since xz reads them itself, they also take effect when tar runs xz:
- XZ_DEFAULTS: sets the default options for xz globally
- XZ_OPT: passes options to xz when it's run by another executable
So, in general, we use XZ_DEFAULTS in a .bashrc or similar initialization script, while XZ_OPT generally helps in specific sessions or local scripts.
Let’s see the compression example from earlier with XZ_OPT:
$ XZ_OPT='-T0 -1' tar cJvf archive.tar.xz *.csv
Similarly, we can perform a decompression:
$ XZ_OPT='-d -T0' tar xJvf archive.tar.xz
Notably, we shouldn’t expect much of a speedup when decompressing, whichever way we pass the options, because of how the algorithm works.
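We can check that XZ_OPT really reaches xz by creating the same archive at two different levels and comparing the sizes. This sketch assumes GNU tar, which runs the external xz program for the J option; other tar implementations may compress internally and ignore the variable. The sample file is hypothetical:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data.
seq 1 300000 | sed 's/$/,sample/' > data.csv

# Same tar command, different xz options passed through the environment.
XZ_OPT='-1' tar cJf fast.tar.xz data.csv
XZ_OPT='-9' tar cJf small.tar.xz data.csv

fast_size=$(wc -c < fast.tar.xz | tr -d ' ')
small_size=$(wc -c < small.tar.xz | tr -d ' ')
echo "level 1: $fast_size bytes"
echo "level 9: $small_size bytes"

# Both archives should list the same single member.
fast_listing=$(tar tf fast.tar.xz)
small_listing=$(tar tf small.tar.xz)
```

If both sizes come out identical, the tar in use probably doesn't pass the work to the xz program, and --use-compress-program is the safer choice.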
6.3. Decompression Considerations
Since version 5.4.1, xz provides support for parallel decompression with -T0. Yet, tar archives require a sequential read, so the process might need to read a number of blocks ahead. For this to work, xz expects the archive to have been compressed in multithreaded mode.
Consequently, if multithreaded decompression is a must, we usually turn to algorithms like Zstd.
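Whether a given file can be decompressed in parallel comes down to how many independent blocks it contains, which we can inspect with xz --robot --list. The sketch below uses a hypothetical sample file and sets --block-size=1MiB only to force several blocks on a small input; it assumes that the third tab-separated field of the "file" line is the block count, as the xz manual describes:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data: about 4 MB of numbered rows.
seq 1 600000 > data.csv

# Single-threaded compression: the whole file ends up in one block.
xz -T1 -c data.csv > single.xz

# Multithreaded compression with a small block size, so that even
# this small input is split into several independent blocks.
xz -T2 --block-size=1MiB -c data.csv > multi.xz

# Print the block count of an .xz file from the machine-readable listing.
count_blocks() {
    xz --robot --list "$1" | awk -F'\t' '$1 == "file" { print $3 }'
}

blocks_single=$(count_blocks single.xz)
blocks_multi=$(count_blocks multi.xz)
echo "single.xz: $blocks_single block(s)"
echo "multi.xz:  $blocks_multi block(s)"
```

A file with only one block gives a multithreaded decompressor nothing to split up, no matter how many threads we request.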
7. Testing Archive Sizes With xz
As we already noted, xz usually creates smaller archives than gzip.
To test this claim, we used an 818 MB CSV file on a computer with six CPU cores and hyperthreading. This is the same setup we used to test gzip in Linux.
We compared xz to pigz, a gzip implementation that uses multithreading for faster compression and decompression:
- both archiving tools saturated the CPU: pigz does this by default, xz because of the -T0 option
- at compression level 7 out of 9, pigz compressed the 818 MB CSV file down to 95 MB in 4 seconds: higher compression levels didn’t produce meaningfully smaller archives
- at compression level 1 out of 9, xz compressed the 818 MB CSV file down to 48 MB in 4 seconds: a 49% smaller result than pigz
With compression level 5, xz produced the smallest archive at 29 MB, which is 69% smaller than pigz with the same setup. However, xz took nearly 18 times as long at 70 seconds. Compression levels 6 and beyond hugely increased the compression time for a negligible 1% reduction in archive size.
So, we’ve demonstrated that xz does indeed create much smaller archives than gzip, sometimes at the price of time.
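A comparison like this is easy to repeat on our own data. The following sketch uses plain gzip instead of pigz and a small, hypothetical generated CSV file, so its numbers won't match the ones above; it only illustrates the method:

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data: 500,000 rows with three columns.
seq 1 500000 | awk '{ print $1 "," $1 * 2 "," "row" $1 }' > data.csv

# Compress the same input with each tool and level; -c keeps data.csv.
gzip -7 -c data.csv > data.gz7
xz -1 -c data.csv > data.xz1
xz -5 -c data.csv > data.xz5

# Report the size of a file in bytes.
size_of() {
    wc -c < "$1" | tr -d ' '
}

orig_size=$(size_of data.csv)
gz7_size=$(size_of data.gz7)
xz1_size=$(size_of data.xz1)
xz5_size=$(size_of data.xz5)

printf '%-10s %10s bytes\n' original "$orig_size"
printf '%-10s %10s bytes\n' "gzip -7" "$gz7_size"
printf '%-10s %10s bytes\n' "xz -1" "$xz1_size"
printf '%-10s %10s bytes\n' "xz -5" "$xz5_size"
```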
8. Conclusion
In this short article, we first saw when we might choose xz over ZIP and gzip.
Then, we learned how to compress and decompress single files with xz. Next, we looked at how we can use tar with xz to compress and decompress multiple files and directories.
Finally, we discovered how multithreading speeds up the compression on modern computers.