Find the Largest File in a Directory Recursively

1. Overview

Identifying large files within a directory structure is an essential part of managing disk space and keeping a system running smoothly.

In this tutorial, we’ll look at several commands and scripts that we can use to find the largest file in a directory recursively.

2. Problem Statement

Within a single directory level, finding the largest file is straightforward with the ls command. Specifically, the -l option produces a long listing that includes each file’s size, and the -S option sorts that listing by size in descending order. Because the long listing starts with a total line, the largest file is on the second row of the output.

For example, here’s how we can find the largest file in the /usr/bin directory on our system:

$ ls -lS /usr/bin | head -n 2
total 610340
-rwxr-xr-x 1 root root      99933288 Jul 21 22:35 dockerd
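
If only the file name is needed, a small variation avoids the total line altogether, since ls prints that line only in long format. The following is a minimal sketch on a throwaway directory; the demo_dir variable and the file names are made up for illustration:

```shell
# Build a scratch directory with three files of known sizes (illustrative names).
demo_dir=$(mktemp -d)
head -c 10  /dev/zero > "$demo_dir/small.bin"
head -c 500 /dev/zero > "$demo_dir/big.bin"
head -c 100 /dev/zero > "$demo_dir/medium.bin"

# Without -l there is no "total" line, so the first row is the largest entry.
largest_name=$(ls -S "$demo_dir" | head -n 1)
echo "$largest_name"

rm -rf "$demo_dir"
```

Note that ls lists subdirectories as well and ranks them by the size of the directory entry itself, not by their contents, so this variation is only meaningful for a flat directory.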

However, the task is harder when we want the largest file anywhere below a directory, because we have to compare file sizes across different subdirectories.

One option is to traverse the subdirectories manually, note the largest file in each, and then compare those candidates. This approach is time-consuming and error-prone for a complex directory hierarchy.

Fortunately, Linux offers tools such as the Bash shell’s globstar option and the find command that can recursively retrieve all the file paths under a directory. Once we have the list, we can get each file’s size in bytes and sort the sizes to obtain the largest file.

Let’s look at the different methods, each with a concrete example.

3. Bash Shell’s globstar and stat

In the Bash shell, enabling the globstar option makes the ** pattern expand to every path under the current directory, including paths nested in subdirectories. We can loop through these paths and keep track of the largest file in a variable.

Here’s one possible way to implement the idea as a shell script:

$ cat find_largest_file.sh
shopt -s globstar
largest_file_size=0
for filepath in **; do
  if [[ -f "$filepath" && ! -L "$filepath" ]]; then
    size=$( stat -c %s -- "$filepath" )
    if (( size > largest_file_size )); then
      largest_file_size=$size
      largest_file_path=$filepath
    fi
  fi
done
echo "$largest_file_size $largest_file_path"

Let’s walk through the script above to understand the important parts.

First, we enable the globstar option in our Bash shell. When it’s enabled, the ** pattern matches all files and directories under the current directory recursively.

Then, we use a for loop to go through each path. Before checking the size, we confirm that the path is a regular file using the -f condition. Because -f follows symbolic links, we also skip symlinks using the -L condition. Once we’re sure that the path points to an ordinary file, we get its size in bytes using the stat command.

Next, we check whether the file size is larger than the largest one seen so far, which the largest_file_size variable tracks. If so, we record the file size and the file path in the largest_file_size and largest_file_path variables, respectively. Finally, the script prints a line of output that tells us the size as well as the path of the largest file.
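
To see why the script needs the -f check, it helps to look at what the ** pattern actually expands to. The sketch below builds a small scratch tree (all names are illustrative) and prints the expansion from an explicit bash process, since globstar is specific to Bash:

```shell
# Scratch tree: one top-level file, one hidden file, and two nested files.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub/deep"
touch "$demo_dir/a.txt" "$demo_dir/.hidden" \
      "$demo_dir/sub/b.txt" "$demo_dir/sub/deep/c.txt"

# Print one matched path per line, relative to the scratch directory.
matches=$(cd "$demo_dir" && bash -c 'shopt -s globstar; printf "%s\n" **')
echo "$matches"

rm -rf "$demo_dir"
```

The expansion contains the directories sub and sub/deep alongside the three visible files, which is why the loop filters with -f. It doesn’t contain .hidden, because glob patterns skip names that start with a dot unless the dotglob option is also enabled.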

Since the script searches the current directory, we copy it to the directory we want to inspect and run it from there. For example, we can run this script on the /usr/lib directory to find the largest file in that directory:

$ cp find_largest_file.sh /usr/lib/find_largest_file.sh
$ cd /usr/lib
$ bash ./find_largest_file.sh
92107656 x86_64-linux-gnu/libwireshark.so.13.0.3

The first command uses the cp tool to copy the script over to the /usr/lib directory, the second changes into that directory, and the third runs the script with Bash. The single line of output tells us that x86_64-linux-gnu/libwireshark.so.13.0.3 is the largest file in the /usr/lib directory.

Notably, the script above only works with Bash version 4.0 and above, as it relies on the globstar option. It also uses the GNU syntax of the stat command.
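
Copying the script into the target directory isn’t strictly necessary. As a sketch, a variant can take the directory as an argument and change into it before expanding the pattern. The file name largest_in.sh, the scratch tree, and the added dotglob and nullglob options are assumptions of this example, not part of the original script:

```shell
# Write the variant to a scratch location (the file name is illustrative).
work_dir=$(mktemp -d)
cat > "$work_dir/largest_in.sh" <<'EOF'
#!/usr/bin/env bash
# Usage: largest_in.sh [DIR]  (defaults to the current directory)
cd -- "${1:-.}" || exit 1
shopt -s globstar dotglob nullglob   # recurse, include dotfiles, tolerate empty dirs
largest_size=-1
largest_path=
for path in **; do
  if [[ -f $path && ! -L $path ]]; then
    size=$(stat -c %s -- "$path")    # GNU stat syntax, as in the original script
    if (( size > largest_size )); then
      largest_size=$size
      largest_path=$path
    fi
  fi
done
# Print nothing (and exit non-zero) when no regular file was found.
(( largest_size >= 0 )) && printf '%s %s\n' "$largest_size" "$largest_path"
EOF

# Try it on a small tree with known sizes.
mkdir -p "$work_dir/tree/sub"
head -c 100  /dev/zero > "$work_dir/tree/top.bin"
head -c 2000 /dev/zero > "$work_dir/tree/sub/nested.bin"
head -c 50   /dev/zero > "$work_dir/tree/.hidden.bin"
result=$(bash "$work_dir/largest_in.sh" "$work_dir/tree")
echo "$result"

rm -rf "$work_dir"
```

Starting largest_size at -1 lets the variant report a tree that contains only empty files, and the reported path is relative to the directory passed as the argument.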

4. Using the find Command

The find command-line tool in Linux allows us to recursively match files and execute commands on individual files in a directory.

The idea is similar to the previous method, except that the find command, rather than the Bash globstar option, produces the list of file paths.

4.1. With GNU find

The GNU find command extends the POSIX standard’s find specification with several more options. One of these extensions is the -printf option, which lets us print the size of each file along with its path. Then, we can pipe the output to the sort command to order the files by size.

For example, we can find the largest file from the /usr/lib directory recursively using a one-liner:

$ find /usr/lib -type f -printf "%s\t%p\n" | sort -n | tail -1
92107656        /usr/lib/x86_64-linux-gnu/libwireshark.so.13.0.3

The command above first uses the find command to match paths that are regular files using the -type f filter. Additionally, we specify -printf "%s\t%p\n" to print the file size and file path separated by a tab character.

Then, we pipe the output to the sort command. We request a numeric sort using the -n option instead of the default alphabetical sort. Because the sort command sorts in ascending order by default, the largest file ends up in the last row of the output, which we extract using the tail command.
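
The same pipeline extends naturally to listing the few largest files instead of only one. The sketch below reverses the sort with -r and keeps the first two rows; it assumes GNU find for -printf, and the scratch tree and file names are illustrative:

```shell
# Scratch tree with three files of known sizes (illustrative names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
head -c 100  /dev/zero > "$demo_dir/a.bin"
head -c 3000 /dev/zero > "$demo_dir/sub/b.bin"
head -c 2000 /dev/zero > "$demo_dir/sub/c.bin"

# Largest first, then keep the top two rows.
top_two=$(find "$demo_dir" -type f -printf '%s\t%p\n' | sort -rn | head -n 2)
echo "$top_two"

rm -rf "$demo_dir"
```

Here, the output lists sub/b.bin with 3000 bytes followed by sub/c.bin with 2000 bytes. Changing the number passed to head adjusts how many files are reported.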

4.2. Using find With ls

On systems that don’t ship GNU find, the -printf option isn’t available. Therefore, instead of the -printf option, we’ll need to use the -exec option to run the ls command on each file path to get its size.

Specifically, we’ll substitute the -printf option with -exec ls -lS:

$ find /usr/lib -type f -exec ls -lS {} \; | sort -k5 -n | tail -1
-rwxr-xr-x    1 root     root         26496 Nov  1  2022 /usr/lib/engines-1.1/padlock.so

From the output, we can see that the largest file we have in this system is padlock.so within the /usr/lib/engines-1.1 directory.

Because we use -exec ls -lS to get the file size, the size of each file is now in the fifth column of the output. Therefore, we pass the -k5 option together with -n to the sort command to make it sort numerically starting at the fifth column. Here, the -S option of ls has no effect on the result because find runs ls once per file, so the ordering comes entirely from sort. As before, we get the largest file from the last row of the output using the tail command.
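
Starting one ls process per file can be slow on large trees. POSIX also defines the {} + terminator for -exec, which passes many paths to each ls invocation. The following is a sketch under the assumption that the local find supports this form (minimal builds may not); the scratch tree and file names are illustrative:

```shell
# Scratch tree with two files of known sizes (illustrative names).
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
head -c 100  /dev/zero > "$demo_dir/small.bin"
head -c 5000 /dev/zero > "$demo_dir/sub/big.bin"

# "{} +" batches paths into few ls calls; -k5,5n sorts on the size column only.
largest=$(find "$demo_dir" -type f -exec ls -l {} + | sort -k5,5n | tail -n 1)
echo "$largest"

rm -rf "$demo_dir"
```

Restricting the key to -k5,5 keeps the comparison on the size column alone. Like the original command, this relies on the column layout of ls -l, so it can misbehave if an owner or group name contains spaces.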

5. Conclusion

In this article, we looked at the problem of finding the largest file in a directory recursively.

Then, we learned that with the Bash shell, we can use the globstar option to get all the file paths and find the largest file among them.

Besides that, we also looked at the find command. Specifically, we saw that the GNU find command can conveniently print the size of each file with the -printf option for sorting.

Finally, for a find command without the GNU extensions, we can use the -exec option with the ls -lS command to print the file size for sorting.
