1. Overview
In Linux, a tarball refers to the tar archive, which is a common way to consolidate several files into a single file with the extension “.tar.” It’s a convenient method for organizing and distributing large numbers of files, such as in the case of backups. In addition to reducing the number of files to manage to a single archive, it also preserves the original folder structure hierarchy upon extraction.
In certain situations, it becomes necessary to compare different tarballs. For instance, when creating a backup tarball for a large folder, we can compare the current backup with the previous one. Then, we can retain the new version only if there are any differences between them. This approach helps conserve disk space and ensures a more manageable number of files.
In this tutorial, we’ll look at the different ways we can compare the content of two tarballs. Linux offers several methods for comparing tarballs, and the choice of the “better” method depends on the specific use case. Let’s delve into each of these different comparison methods in detail.
2. Comparing File by File
One way we can compare two tarballs is to compare the content of the files on both sides. Specifically, we’ll extract both of the tarballs. Then, for each of the files in one tarball, we locate the same file in the other tarball and compare their content.
2.1. Considerations
This type of comparison is the most comprehensive, as it compares the exact content of the files to look for the differences. Furthermore, the line-by-line comparison allows the tools to generate a diff report that provides richer diff details.
However, one downside of this approach is that it’s resource-consuming. Specifically, when we want to compare tarballs that are large in their unarchived form, extracting everything first and then comparing might not be feasible. Additionally, the process might take a long time for a large archive, as it needs to scan through the content of the files line by line.
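The extract-then-compare procedure described above can be approximated with standard tools. The following is a minimal sketch, not how any particular tool implements it: the function name compare_tar_contents is made up for illustration, and it assumes both archives fit on disk in their extracted form.

```shell
# compare_tar_contents A.tar B.tar   (hypothetical helper name)
# Extracts both archives into temporary directories and runs a recursive
# diff over them. Exit status is 0 when the extracted trees match and
# 1 when they differ; 2 signals a setup or extraction failure.
compare_tar_contents() (
    tmp=$(mktemp -d) || exit 2
    # Remove the scratch directory when this subshell exits.
    trap 'rm -rf "$tmp"' EXIT
    mkdir "$tmp/a" "$tmp/b" || exit 2
    tar -xf "$1" -C "$tmp/a" || exit 2
    tar -xf "$2" -C "$tmp/b" || exit 2
    cd "$tmp" || exit 2
    # diff -r reports changed lines as well as files present on one side only.
    diff -r a b
)
```

For example, running compare_tar_contents old.tar new.tar prints the differing lines, with paths under a/ belonging to the first archive and paths under b/ to the second. Like any file-by-file comparison, it pays the full cost of extracting both archives before comparing anything.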
2.2. pkgdiff
One tool in Linux that implements this approach is the pkgdiff command. It takes two archives as input, compares the content of their files, and writes any differences it finds to an HTML report.
We can install the pkgdiff package on our system using the package manager:
$ sudo apt-get install -y pkgdiff
$ pkgdiff --version
Package Changes Analyzer (PkgDiff) 1.7.2
...
Let’s run pkgdiff on linux-65-rc4.tar and linux-65-rc3.tar to find out the difference between the archives:
$ time pkgdiff linux-65-rc4.tar linux-65-rc3.tar
reading packages ...
comparing packages ...
creating report ...
result: CHANGED (0.01%)
report: pkgdiff_reports/linux/65-rc4_to_65-rc3/changes_report.html
real 2m32.376s
user 1m1.036s
sys 1m37.473s
The command took about two and a half minutes to report a 0.01% difference between the two tarballs. It also generated a comprehensive report detailing the differences at pkgdiff_reports/linux/65-rc4_to_65-rc3/changes_report.html.
The report contains a high-level summary of the differences it found. Besides that, it breaks down the differences by file type. For each file type, columns show the number of files that were added, removed, and changed.
At the bottom of the report, there’s a list of the files that pkgdiff finds in both archives. When a file has changed, the report links to a page that shows the line-by-line diff for that file.
This visual diff highlights the exact difference and makes it easy to understand the difference between the files in the different tarballs.
3. Comparing the Checksum of Files
The file-by-file comparison provides the most comprehensive result regarding the differences. However, sometimes we don’t need that level of detail. Some use cases require just a simple boolean answer when we ask if the tarballs are different.
For example, during backup, we’ll usually discard the new backup if there’s no change in the content. In such a scenario, where we don’t need to know the exact difference between the archives, comparing by checksum is a more efficient way to achieve our goal.
3.1. Considerations
The main benefit of the checksum comparison method is that it’s usually quicker than the pkgdiff command. This is because computing checksums for files is quicker than comparing two files line by line.
In return, we sacrifice the details of the difference. In other words, a checksum comparison does not tell us which lines of the files are different, unlike the pkgdiff command.
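To make the idea concrete, here's a rough sketch of a checksum-based comparison built from standard tools. It's an illustration under stated assumptions, not the algorithm of any specific program: the function names tar_content_digest and tarballs_match are hypothetical, it relies on the GNU versions of find, sort, and xargs, and it only considers regular files (symlinks, empty directories, permissions, and timestamps are ignored).

```shell
# tar_content_digest ARCHIVE.tar   (hypothetical helper name)
# Prints a single SHA-256 digest that summarizes the paths and contents
# of all regular files in the archive.
tar_content_digest() (
    tmp=$(mktemp -d) || exit 2
    trap 'rm -rf "$tmp"' EXIT
    tar -xf "$1" -C "$tmp" || exit 2
    cd "$tmp" || exit 2
    # Hash every file in a stable order, then hash the resulting list.
    find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum |
        sha256sum | cut -d' ' -f1
)

# tarballs_match A.tar B.tar   (hypothetical helper name)
# Exit status 0 when both digests are equal, 1 when they differ,
# 2 when either archive can't be extracted.
tarballs_match() (
    a=$(tar_content_digest "$1") || exit 2
    b=$(tar_content_digest "$2") || exit 2
    [ "$a" = "$b" ]
)
```

Calling tarballs_match old.tar new.tar answers only the yes-or-no question. Since the digest covers file paths and contents but not tar metadata, two archives created at different times from identical files still match.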
3.2. gtarsum
The gtarsum program takes tar archives as input and computes a concatenation of the checksums of all the files inside each archive. Then, we can compare the checksums that gtarsum produces for the two tarballs to see if they’re identical.
To obtain the gtarsum binary, we can download and extract it using a one-liner command:
$ wget -qO- https://github.com/VonC/gtarsum/releases/download/v0.2.1/gtarsum_0.2.1_Linux_x86_64.tar.gz | tar xvz
The command above will create a gtarsum binary in our current directory. Then, we can compare the checksums of linux-65-rc4.tar and linux-65-rc3.tar using gtarsum:
$ time ./gtarsum linux-65-rc3.tar linux-65-rc4.tar
real 0m34.958s
user 1m16.284s
sys 0m4.928s
$ echo $?
1
The command above computes the checksum of the archives in parallel. At the end of the computation, the program does a comparison between the checksums. If they’re different, the program returns an exit code of 1.
As we can see, the comparison takes roughly 35 seconds, which is much faster than pkgdiff. In exchange for the faster execution time, though, we lose the diff detail that pkgdiff provides.
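Because the result is reported through the exit code, the backup scenario from the start of this section can be scripted. The sketch below is illustrative only: keep_if_changed is a made-up helper, and it assumes the comparison command exits with 0 when the archives match, which the output above implies for gtarsum but doesn't show directly.

```shell
# keep_if_changed OLD NEW CMD [ARGS...]   (hypothetical helper name)
# Runs "CMD ARGS... OLD NEW" and deletes NEW only when the command
# exits with status 0, which is assumed to mean "no differences".
keep_if_changed() {
    old=$1
    new=$2
    shift 2
    if "$@" "$old" "$new" >/dev/null 2>&1; then
        rm -f -- "$new"
        echo "unchanged: removed $new"
    else
        # Any non-zero status, including an error, keeps the new file.
        echo "changed or comparison failed: kept $new"
    fi
}
```

With the binary from above, a call could look like keep_if_changed backup-old.tar backup-new.tar ./gtarsum, where the two backup filenames are placeholders. Treating every non-zero status as "keep" errs on the side of never deleting a backup by mistake.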
4. Comparing Filenames Only
If we’re merely looking for the difference in file lists between the tarballs, we can use an even faster method. That is, we get the list of the filenames in both of the archives and then compare them.
The advantage of this approach is that it’s the fastest way to find differences in the filename lists of the tarballs. However, the downside is that it doesn’t check the content of the files in the two tarballs. Let’s look at a concrete example.
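The listing-based comparison can be sketched with tar -t, sort, and comm. This is a minimal illustration rather than the implementation of any existing tool; the function name tar_name_diff and the "-"/"+" prefixes are choices made for this example.

```shell
# tar_name_diff A.tar B.tar   (hypothetical helper name)
# Prints member names found in only one archive, prefixed with "- "
# (only in A) or "+ " (only in B). Exit status is 0 when the two name
# lists are identical. File contents are never read.
tar_name_diff() (
    tmp=$(mktemp -d) || exit 2
    trap 'rm -rf "$tmp"' EXIT
    # tar -t lists member names without extracting anything.
    tar -tf "$1" | LC_ALL=C sort > "$tmp/a"
    tar -tf "$2" | LC_ALL=C sort > "$tmp/b"
    # comm -23: lines only in the first list; comm -13: only in the second.
    LC_ALL=C comm -23 "$tmp/a" "$tmp/b" | sed 's/^/- /'
    LC_ALL=C comm -13 "$tmp/a" "$tmp/b" | sed 's/^/+ /'
    cmp -s "$tmp/a" "$tmp/b"
)
```

Since only the archive listings are read, nothing is extracted. The same limitation applies as for any name-only comparison: a file whose content changed while its name stayed the same produces no output.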
4.1. tardiff
In Linux, tardiff is a command-line tool that inspects the lists of files in the archives. Then, it shows us the difference in the file lists between the archives without considering the content of the files.
To obtain the tardiff command-line tool, we can install it using our package manager:
$ sudo apt-get install -y tardiff
Let’s run the tardiff command on the Linux source code tarballs:
$ time tardiff linux-65-rc3.tar linux-65-rc4.tar
- scripts/coccinelle/api/debugfs
- scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
real 0m16.456s
user 0m3.762s
sys 0m16.450s
Using the tardiff command, we manage to complete the comparison in just 16 seconds. This is the fastest method for comparing the tarballs we’ve seen thus far.
4.2. tardiff Does Not Check the Content of Files
One important point to note when using the tardiff command is that it only checks for differences in the filename lists. Specifically, if a file is present in both archives but its content has changed, the tardiff command won’t flag it as a difference.
To visualize the caveat, let’s create an archive archive1.tar with the following structure:
$ tree testfolder
testfolder
`-- testfile
0 directories, 1 file
$ cat testfolder/testfile
ab
$ tar -cf archive1.tar testfolder
Then, we create another archive2.tar with the same structure, changing only the content of the testfile:
$ echo "abc" > testfolder/testfile
$ tar -cf archive2.tar testfolder
Now, let’s run the tardiff against archive1.tar and archive2.tar:
$ tardiff archive1.tar archive2.tar
$ echo $?
0
The command returns with an exit code 0 and empty standard output. In other words, tardiff concludes that archive1.tar and archive2.tar are the same. This is because tardiff only compares the filenames between the tarballs, not the content of the individual files.
5. Conclusion
In this article, we’ve learned a few methods for comparing tarballs in Linux. For the most straightforward and comprehensive comparison, pkgdiff is an excellent tool. It checks the content of the files, finds the differences line by line, and visualizes them in an HTML report. The downside of pkgdiff is that it’s the slowest of the methods we’ve shown.
For a quicker alternative, the gtarsum program compares the archives by computing the checksum of each file. Although it’s quicker when comparing the same set of archives, we don’t get the same level of information as with pkgdiff.
Finally, the tardiff command is the quickest of all but, at the same time, it’s the most lenient. Since it does not check for the content of the files, tardiff should be used with discretion.