1. Overview
In this tutorial, we’ll learn how to check whether two gzipped files have the same content.
2. The Gzip File Format
The gzip command compresses files to reduce their size and writes the result in the Gzip file format. Besides the compressed payload bytes, the Gzip file format also stores some metadata in its header and trailer. At a high level, the file format can be divided into four blocks:
[ 10 bytes of headers ][ variable size of optional headers depending on file flags ][ compressed payload block ][ CRC checksum and uncompressed size ]
First, the 10-byte header block holds metadata about the compressed payload, such as the magic number 0x1f 0x8b, the compression method, the file flags, and the modification timestamp of the original file.
Then, depending on the value of the file flags, optional information is stored in the next block, which has a variable size. For example, the original filename is stored in this block if the FLG.FNAME flag is set during compression. The full list of available flags is outside the scope of this article, but interested readers can find it in RFC 1952, which specifies the Gzip file format.
The third block contains the compressed payload, which is the compressed content of the original file. Finally, the trailing 8 bytes consist of the CRC-32 checksum of the uncompressed data, followed by the uncompressed size of the original file modulo 2^32.
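To make the layout more concrete, here’s a minimal sketch that dumps the first bytes of the fixed header of a freshly created archive. The sample.txt and sample.gz names and the temporary directory are placeholders for illustration, and we assume the od utility is available:

```shell
# Work in a throwaway directory so that we don't touch real files
workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/sample.txt"
gzip -c "$workdir/sample.txt" > "$workdir/sample.gz"

# Bytes 1-2: magic number, byte 3: compression method, byte 4: file flags
magic=$(od -An -tx1 -N2 "$workdir/sample.gz" | tr -d ' \n')
method=$(od -An -tx1 -j2 -N1 "$workdir/sample.gz" | tr -d ' \n')
flags=$(od -An -tx1 -j3 -N1 "$workdir/sample.gz" | tr -d ' \n')

echo "magic=$magic method=$method flags=$flags"
```

With GNU gzip, this typically prints magic=1f8b method=08 flags=08. The method value 08 stands for deflate, and the flags value 08 means that the FLG.FNAME bit is set, so the original filename follows the fixed header.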
Understanding the basics of the Gzip file format helps us see why byte-by-byte comparisons of gzipped files fail in some scenarios.
3. Why Can’t We Compare Byte by Byte?
As we’ve seen in the previous section, the gzip file format contains fields that store the modification timestamp and the filename of the original file. When we compare two gzipped files byte by byte, we are also comparing the timestamp and filename of the original files, which is not what we usually want. Most of the time, we consider different gzipped files to be the same as long as they contain the same content.
For example, let’s say we have a background service that continuously generates logs at a fixed interval with different names, log-a.txt and log-b.txt:
$ cat log-a.txt
info: user logged in
$ cat log-b.txt
info: user logged in
$ ls -l log-a.txt
-rw-r--r-- 1 baeldung baeldung 21 Apr 21 03:11 log-a.txt
$ ls -l log-b.txt
-rw-r--r-- 1 baeldung baeldung 21 Apr 21 03:12 log-b.txt
Additionally, we want to back up these logs by compressing them with gzip to save disk space:
$ gzip -c log-a.txt > log-a.gz
$ gzip -c log-b.txt > log-b.gz
To further optimize disk space usage, we want to discard a backup if its content is the same as an existing one. Now, if we simply compare the gzipped files byte by byte, they will always be different because the original log files have different timestamps and filenames:
$ cmp log-a.gz log-b.gz
log-a.gz log-b.gz differ: char 5, line 1
From the output, the cmp command tells us that the files first differ at the 5th byte, which is the first byte of the 4-byte timestamp field. For this particular scenario, however, the two files should be treated as equal because the content of both gzipped files is the same.
To compare only the compressed content, we’ll have to look into different methods.
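To see every byte that differs rather than only the first one, we can ask cmp for a full listing. The following is a minimal sketch that recreates the scenario in a temporary directory; the touch timestamps are arbitrary values chosen for illustration:

```shell
workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/log-a.txt"
printf 'info: user logged in\n' > "$workdir/log-b.txt"

# Give the two files modification times that are one minute apart
touch -t 202004210311 "$workdir/log-a.txt"
touch -t 202004210312 "$workdir/log-b.txt"

gzip -c "$workdir/log-a.txt" > "$workdir/log-a.gz"
gzip -c "$workdir/log-b.txt" > "$workdir/log-b.gz"

# cmp -l prints one line per differing byte: the offset, then both values in octal
diff_offsets=$(cmp -l "$workdir/log-a.gz" "$workdir/log-b.gz" | awk '{print $1}' | tr '\n' ' ')
echo "differing byte offsets: $diff_offsets"
```

With GNU gzip, the reported offsets typically fall within bytes 5 to 8, which hold the timestamp, plus byte 15, where the stored names log-a.txt and log-b.txt differ. The compressed payload and the trailer don’t show up in the listing.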
4. Not Storing the Original Filename and Timestamp
This method requires us to leave the filename and last modified timestamp of the original file out of the header. Without these fields, gzip produces identical output for different files that have the same content, which means that a simple byte-by-byte comparison will work.
To prevent the gzip command from storing the filename and timestamp of the original file, we pass the --no-name option to the gzip command:
$ gzip --no-name -c log-a.txt > log-a-noname.gz
$ gzip --no-name -c log-b.txt > log-b-noname.gz
Then, we can run the cmp command on these gzipped files to do the comparison:
$ cmp log-a-noname.gz log-b-noname.gz
$ echo $?
0
As expected, the cmp command returns an exit code of 0, indicating that both of the files have the same bytes.
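We can also confirm that the header no longer carries the filename or the timestamp. This is a minimal sketch, assuming GNU gzip and the od utility, with placeholder file names in a temporary directory:

```shell
workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/log-a.txt"
gzip --no-name -c "$workdir/log-a.txt" > "$workdir/log-a-noname.gz"

# Byte 4 is the flags byte, and bytes 5-8 hold the timestamp
flags=$(od -An -tx1 -j3 -N1 "$workdir/log-a-noname.gz" | tr -d ' \n')
mtime=$(od -An -tx1 -j4 -N4 "$workdir/log-a-noname.gz" | tr -d ' \n')

echo "flags=$flags mtime=$mtime"
```

With GNU gzip, we’d expect flags=00 mtime=00000000: the FLG.FNAME bit is cleared, and the timestamp field is zeroed.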
The downside of this method is that it requires us to give up the information about the original file, which makes it impossible to restore the original filename and timestamp during decompression. Furthermore, this method doesn’t work if we don’t control how the files are compressed.
5. The zcmp Command
The zcmp command is built on top of the cmp command. Specifically, the zcmp command first uncompresses the files. Then, it passes the content of the files to the cmp command for comparison.
In other words, it works as if we’ve first uncompressed both files into their original form and then run the cmp command on them. This prevents discrepancies in the header information of the gzip file format from interfering with the comparison result.
Let’s compare the log-a.gz and log-b.gz files using the zcmp command:
$ zcmp log-a.gz log-b.gz
$ echo $?
0
First, we run zcmp on log-a.gz and log-b.gz. Then, we print the exit code of the command, which is 0, indicating that both files have the same content. Since zcmp compares the files in their original form, differences in the metadata of the two gzipped files don’t affect the result of the comparison.
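Conceptually, this behaves much like decompressing both files to a stream and handing the two streams to cmp. The sketch below illustrates the idea with a named pipe; it isn’t how zcmp is actually implemented, and the b.pipe name and the temporary directory are placeholders:

```shell
workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/log-a.txt"
printf 'info: user logged in\n' > "$workdir/log-b.txt"
gzip -c "$workdir/log-a.txt" > "$workdir/log-a.gz"
gzip -c "$workdir/log-b.txt" > "$workdir/log-b.gz"

# Stream the second archive through a named pipe in the background
mkfifo "$workdir/b.pipe"
gzip -dc "$workdir/log-b.gz" > "$workdir/b.pipe" &

# Stream the first archive on stdin ("-") and compare the two streams silently
if gzip -dc "$workdir/log-a.gz" | cmp -s - "$workdir/b.pipe"; then
    result="same content"
else
    result="different content"
fi
wait
echo "$result"
```

For our two log files, this prints same content, which matches the exit code of 0 we get from zcmp.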
The downside of this method is that both files must be fully decompressed for every comparison, which takes time for large files and, depending on the implementation, may require temporary disk space. In that case, we can check for equality using the fingerprint of the gzipped files.
6. Equality by Gzip Files Fingerprint
Each gzipped file ends with 8 bytes of metadata. This trailing block stores the CRC checksum and the size of the file in its uncompressed form. Since the CRC checksum is derived from the content of the original file, we can couple it with the uncompressed size and use the pair as the gzipped file’s fingerprint. Two gzipped files with different fingerprints certainly hold different content. Two files with the same fingerprint very likely hold the same content, although this isn’t guaranteed, because different contents can occasionally share a 32-bit checksum.
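Because the trailer sits at a fixed position, we can also read it straight from the end of the file. The following is a minimal sketch, assuming the tail, od, and awk utilities are available; both fields are stored in little-endian byte order, so we reverse the bytes of each field before printing them:

```shell
workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/log-a.txt"
gzip -c "$workdir/log-a.txt" > "$workdir/log-a.gz"

# The trailer is the last 8 bytes: the CRC-32 first, then the uncompressed size
trailer=$(tail -c 8 "$workdir/log-a.gz" | od -An -tx1)

# Reverse each 4-byte group to get the usual most-significant-byte-first notation
crc=$(echo "$trailer" | awk '{print $4 $3 $2 $1}')
size_hex=$(echo "$trailer" | awk '{print $8 $7 $6 $5}')
size=$(printf '%d' "0x$size_hex")

echo "crc=$crc size=$size"
```

For a 21-byte input, the size field reads 21, and the crc field holds the checksum as eight hexadecimal digits.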
6.1. Getting the CRC Checksum and Uncompressed Size
We can print the CRC checksum and the uncompressed size of gzipped files using the gzip command with -v and -l flags:
$ gzip -v -l log-a.gz log-b.gz
method  crc     date  time  compressed uncompressed  ratio uncompressed_name
defla bf32ab4c Apr 21 03:15         64           21  -9.5% log-a
defla bf32ab4c Apr 21 03:17         65           21  -9.5% log-b
                                   129           42 -107.1% (totals)
First, the -l option lists the details of the gzip files. Then, with the -v option, the gzip command further enriches the details with fields like method, crc, and date time. From the output, we can see that the CRC checksum for both files is bf32ab4c, and the uncompressed size is 21.
6.2. Generating and Comparing the Fingerprint
To present them in a format that can be compared, we pipe the output to the awk command. Then, using the awk command, we print only the 2nd and 7th columns:
$ gzip -v -l log-a.gz | awk '(NR>1){print $2, $7}'
bf32ab4c 21
The command first gets the CRC checksum and uncompressed size using the gzip command with the -v and -l options. Then, the output is piped to the awk command. In the awk script, the (NR>1) condition skips the first row, which is the header row. For the remaining row, we print the 2nd and 7th columns to get the fingerprint of the gzipped file.
We can then use this output in a comparison:
$ LOG_A_FP=$(gzip -v -l log-a.gz | awk '(NR>1){print $2, $7}')
$ LOG_B_FP=$(gzip -v -l log-b.gz | awk '(NR>1){print $2, $7}')
$ if [ "$LOG_A_FP" = "$LOG_B_FP" ]; then echo "dropped due to same content"; fi
dropped due to same content
The script above generates the fingerprint of log-a.gz and log-b.gz and stores them in the LOG_A_FP and LOG_B_FP variables, respectively. Then, we print the line “dropped due to same content” if the fingerprint is the same.
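Since a matching fingerprint doesn’t strictly prove that the contents are equal, we can combine the cheap fingerprint check with a full comparison. The same_gzip_content function below is a hypothetical helper, not part of gzip. As a sketch, it reads the fingerprint directly from the last 8 bytes of each file instead of parsing the gzip -v -l output, and it falls back to zcmp only when the fingerprints match:

```shell
# Hypothetical helper: succeeds when two gzipped files hold the same content
same_gzip_content() {
    # Fingerprint: the last 8 bytes (CRC checksum and uncompressed size) as hex
    fp_a=$(tail -c 8 "$1" | od -An -tx1)
    fp_b=$(tail -c 8 "$2" | od -An -tx1)

    # Different fingerprints mean different content, so no decompression is needed
    [ "$fp_a" = "$fp_b" ] || return 1

    # Same fingerprint: confirm with a full comparison to rule out a checksum collision
    zcmp -s "$1" "$2"
}

workdir=$(mktemp -d)
printf 'info: user logged in\n' > "$workdir/log-a.txt"
printf 'info: user logged in\n' > "$workdir/log-b.txt"
gzip -c "$workdir/log-a.txt" > "$workdir/log-a.gz"
gzip -c "$workdir/log-b.txt" > "$workdir/log-b.gz"

if same_gzip_content "$workdir/log-a.gz" "$workdir/log-b.gz"; then
    echo "dropped due to same content"
fi
```

This way, files that differ are usually rejected without any decompression, and the more expensive zcmp run only happens for files that are likely duplicates.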
7. Conclusion
In this tutorial, we’ve looked at the basics of the Gzip file format. Then, we’ve explained how the header information complicates a byte-by-byte comparison of gzipped files. Furthermore, we’ve shown how to leave the filename and timestamp out of the header at compression time.
Besides that, we’ve demonstrated the zcmp command, which compares gzipped files by their uncompressed content. Finally, we’ve also shown how to compare the fingerprints of gzipped files.