1. Introduction
In day-to-day Linux use, we may want to check whether any file in a directory has changed. Or we may want to confirm that a directory’s content is the same as that of another directory in a different location, disk, or system.
In this tutorial, we’re going to learn how to calculate an MD5 checksum of an entire directory tree in Linux. We’ll calculate a single hash value of all the contents of the directory to make comparisons.
2. Getting the List of All Files in a Directory Tree
To find out the collective hash of all the files in a directory tree, we should first obtain a list of these files. We’ll use the find command for this task.
Let’s run the tree command to look at our sample directory structure:
$ tree
.
├── file1.png
├── folder1
│   ├── file2.jpg
│   └── folder3
│       └── file3.txt
└── folder2
    └── file4.sh
As we can see, we have files in multiple subdirectories. We can now use the find command with the -type f test to list all the regular files in our directory and its subdirectories, excluding directories and symbolic links:
$ find . -type f
./folder2/file4.sh
./folder1/folder3/file3.txt
./folder1/file2.jpg
./file1.png
With a single command, we now have a list of every regular file in the directory and its subdirectories.
3. Sorting Using sort and the “Locale Problem”
Now that we can get a list with all of our files, our next steps are:
- Run the md5sum command on every file in that list
- Create a string that contains the list of file paths along with their hashes
- And finally, run md5sum on this string we just created to obtain a single hash value
This way, if anything in our directory changes, whether it is a file’s content, a file’s path, or the set of files itself, the final hash also changes. But we have a problem with this approach.
The find command does not sort its output by default. For efficiency purposes, the find command just prints the individual results it gets as it traverses the filesystem. So the order can change between different systems, locations, or even different runs. As a result of this, the hash value will change, even if the two directories are exactly the same.
We can solve this problem by sorting our search results using the sort command:
$ find . -type f | sort
./file1.png
./folder1/file2.jpg
./folder1/folder3/file3.txt
./folder2/file4.sh
But we still have something missing.
The sort operation is more complex than it seems. Letters, numbers, dates, and how they should be sorted can change from locale to locale. This can change our results for directories residing on two systems with different locale settings. We can solve this problem by overriding our locale using the LC_ALL environment variable:
$ find . -type f | LC_ALL=C sort
./file1.png
./folder1/file2.jpg
./folder1/folder3/file3.txt
./folder2/file4.sh
By using the standard C locale for our sort operation, we eliminate the problems with sorting.
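To make the effect of the C locale concrete, here is a minimal sketch that sorts four sample lines differing only in letter case. The sample data is made up for illustration. Other locales may interleave upper and lower case instead, depending on what is installed on the system.

```shell
# Four made-up sample lines that differ only in letter case.
sample=$(printf 'b\nA\na\nB\n')

# In the C locale, sort compares raw byte values, so the uppercase
# ASCII letters (65-90) come before the lowercase ones (97-122).
c_sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints:
# A
# B
# a
# b
```

Because this byte-wise order does not depend on the user’s language settings, two machines should produce the same ordering for the same list of paths.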
4. Putting It All Together
We can use the -exec parameter of the find command to execute the md5sum command on each file that has been found:
$ find . -type f -exec md5sum {} +
7d2186aaeed78b24f00f782f2346e5f9  ./folder2/file4.sh
d41d8cd98f00b204e9800998ecf8427e  ./folder1/folder3/file3.txt
c6aa7ce9967680b77ea7e72d96949303  ./folder1/file2.jpg
46ffe26d56fe5164570ad43cc79b59d3  ./file1.png
We used curly braces ({}) to specify where the file names will be passed to the md5sum command as arguments. Also, we added the plus sign (+) at the end so that find passes as many file names as possible to each md5sum invocation (md5sum file1 file2 file3 ...), usually just one, instead of starting a separate md5sum process for each file, which is what terminating the command with \; would do. Note that choosing one method or the other won’t change the format of the output, though.
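As a quick check of that last point, the following sketch runs both forms on a scratch directory and compares the sorted results. The directory and file names are made up for illustration, and the sketch assumes GNU find and md5sum.

```shell
# Hypothetical scratch directory with two small files.
demo=$(mktemp -d)
printf 'one\n' > "$demo/a.txt"
printf 'two\n' > "$demo/b.txt"

# \; starts one md5sum process per file.
per_file=$(cd "$demo" && find . -type f -exec md5sum {} \; | LC_ALL=C sort)

# + batches the file names into as few md5sum processes as possible.
batched=$(cd "$demo" && find . -type f -exec md5sum {} + | LC_ALL=C sort)

# Once sorted, both forms should yield the same lines.
[ "$per_file" = "$batched" ] && echo "identical output"

rm -rf "$demo"
```

The batched form is generally the better choice on large trees, since it avoids the cost of starting a new process for every file.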
After executing the command above, we’ll need to sort the output before calculating the final MD5 hash value:
$ find . -type f -exec md5sum {} + | LC_ALL=C sort
46ffe26d56fe5164570ad43cc79b59d3  ./file1.png
7d2186aaeed78b24f00f782f2346e5f9  ./folder2/file4.sh
c6aa7ce9967680b77ea7e72d96949303  ./folder1/file2.jpg
d41d8cd98f00b204e9800998ecf8427e  ./folder1/folder3/file3.txt
As we can see, now we have the hash values of all the files sorted and ready for the final hash calculation.
Finally, let’s pipe the sorted list into md5sum:
$ find . -type f -exec md5sum {} + | LC_ALL=C sort | md5sum
8b0d2ca740c06ea8ab2619f14e75b652  -
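This is our final hash value for the directory. Note that we calculated this value by only taking into account the file contents and file paths. Because the paths are relative to the directory where we ran find, we should run the command from inside each directory we want to compare. We purposely ignored other file attributes such as modification date, permissions, or owner by not including them in our list.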
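To compare two directories, we could wrap the pipeline in a small helper. The sketch below is one possible way to do it; the function name dir_md5 and the scratch directories are made up for illustration and are not part of any standard tool.

```shell
# dir_md5 is a hypothetical helper name, not a standard command.
# It prints one MD5 hash covering the relative paths and contents
# of every regular file under the directory given as $1.
dir_md5() (
    # The function body is a subshell, so this cd does not affect the caller.
    cd "$1" || exit 1
    find . -type f -exec md5sum {} + | LC_ALL=C sort | md5sum | cut -d ' ' -f 1
)

# Build two scratch trees with identical content in different locations.
src=$(mktemp -d)
copy=$(mktemp -d)
mkdir "$src/folder1" "$copy/folder1"
printf 'hello\n' > "$src/folder1/file.txt"
printf 'hello\n' > "$copy/folder1/file.txt"

if [ "$(dir_md5 "$src")" = "$(dir_md5 "$copy")" ]; then
    echo "directories match"
else
    echo "directories differ"
fi

rm -rf "$src" "$copy"
```

Changing into the directory first keeps the listed paths relative (./folder1/file.txt), so the two trees can hash to the same value even though they live in different locations.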
5. Taking File Attributes Into Account
Suppose we also want to include modification dates of each file when calculating our hash value.
Let’s start by defining a function that we’re going to run on each filename:
summary() {
    # Print the file's modification time, then md5sum's "hash  path" line
    echo "$(stat -c '%y' "$1") $(md5sum "$1")"
}
We can use this function to print the modification date, checksum, and name of a file. Since find runs each command in a new process, we need to export the function so that the child bash shell can see it:
$ export -f summary
Lastly, let’s put it all together by running our function on each file that is found:
$ find . -type f -exec bash -c 'summary "$0"' {} \; | LC_ALL=C sort | md5sum
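To see the difference between the two approaches, the sketch below changes only a file’s timestamp and compares both hashes before and after. It is a variant rather than the exact command above: the stat and md5sum calls are inlined in an sh -c script so that no exported function is needed. The helper names content_hash and mtime_hash are made up for illustration, and GNU stat and touch are assumed.

```shell
# Content-only hash: file paths and contents, no attributes.
content_hash() (
    cd "$1" || exit 1
    find . -type f -exec md5sum {} + | LC_ALL=C sort | md5sum
)

# Hash that also covers each file's modification time.
mtime_hash() (
    cd "$1" || exit 1
    find . -type f -exec sh -c 'echo "$(stat -c "%y" "$1") $(md5sum "$1")"' sh {} \; |
        LC_ALL=C sort | md5sum
)

# Hypothetical scratch directory with one file and a fixed timestamp.
demo=$(mktemp -d)
printf 'data\n' > "$demo/file.txt"
touch -d '2020-01-01 00:00:00' "$demo/file.txt"
content_before=$(content_hash "$demo")
mtime_before=$(mtime_hash "$demo")

# Change only the timestamp; the content stays the same.
touch -d '2021-01-01 00:00:00' "$demo/file.txt"
content_after=$(content_hash "$demo")
mtime_after=$(mtime_hash "$demo")

[ "$content_before" = "$content_after" ] && echo "content hash unchanged"
[ "$mtime_before" != "$mtime_after" ] && echo "attribute-aware hash changed"

rm -rf "$demo"
```

This also shows the trade-off: a copy made without preserving timestamps would no longer match under the attribute-aware hash, even if every byte of content is identical.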
6. Conclusion
In this tutorial, we learned how to calculate the checksum of an entire directory tree in Linux. We used standard Linux command-line tools like md5sum, find, and sort, and combined them to achieve our goal.