1. Overview
Listing files sorted by the number of lines they contain has several practical use cases. For instance, developers working on a project with many files can use it to identify lengthy source files and then prioritize splitting bulky data or code files into smaller ones for easier maintenance.
Additionally, they can build statistics based on file size, such as the cycle time of code reviews relative to the amount of source code in a file.
In this tutorial, we’ll learn different approaches to listing files sorted by the number of lines they contain.
2. Scenario Setup
To simulate our use case, we need a directory structure with a few sample files with multiple lines of content. So, let’s start by creating the sample_files directory along with a child_directory under it:
$ mkdir -p sample_files/child_directory
Now, we can switch our current directory and add a few sample files, namely, file1.txt, file2.txt, file3.txt, file4.txt, file5.txt, and child_file1.txt with 5, 10, 3, 8, 6, and 1 line(s) of content, respectively:
$ cd sample_files
$ for i in {1..5}; do echo "This is line $i of file1" >> file1.txt; done
$ for i in {1..10}; do echo "This is line $i of file2" >> file2.txt; done
$ for i in {1..3}; do echo "This is line $i of file3" >> file3.txt; done
$ for i in {1..8}; do echo "This is line $i of file4" >> file4.txt; done
$ for i in {1..6}; do echo "This is line $i of file5" >> file5.txt; done
$ echo "This is line 1 of child_file1" >> child_directory/child_file1.txt
Lastly, let’s use the exa command to visualize the resulting directory structure in a tree-like manner:
$ exa --tree
.
├── child_directory
│  └── child_file1.txt
├── file1.txt
├── file2.txt
├── file3.txt
├── file4.txt
└── file5.txt
Fantastic! We have everything we need to simulate and solve our use case.
3. Using wc and sort
When counting the number of lines in files, wc is the preferred command-line utility in Linux. Further, it accepts multiple files as command-line arguments:
$ wc [OPTION]... [FILE]...
So, let’s go ahead and use the -l option to count the number of lines in each file present in the sample_files directory:
$ wc -l *.txt
5 file1.txt
10 file2.txt
3 file3.txt
8 file4.txt
6 file5.txt
32 total
It’s important to note that the output shows the number of lines in the first column. Additionally, we used the *.txt wildcard pattern to include only the text files.
Now, we can pipe the output from the wc command to the sort command with its -n option for sorting the output numerically based on the first column:
$ wc -l *.txt | sort -n
3 file3.txt
5 file1.txt
6 file5.txt
8 file4.txt
10 file2.txt
32 total
Great! We’re almost there, as we can see that the list of files under the sample_files directory is sorted by the number of lines they contain. However, the child_file1.txt file under child_directory is missing from the output.
So, we must revise our wildcard pattern to include files within the immediate sub-directories:
$ wc -l *.txt ./*/*.txt | sort -n
1 ./child_directory/child_file1.txt
3 file3.txt
5 file1.txt
6 file5.txt
8 file4.txt
10 file2.txt
33 total
Fantastic! It’s exactly what we expected.
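Incidentally, if our Bash version supports the globstar shell option, a single recursive pattern can replace the two explicit levels in the wildcard. Here's a minimal sketch, using a hypothetical temporary directory with hypothetical sample files:

```shell
#!/bin/bash
# Sketch: recursive matching with globstar (requires Bash >= 4); paths are hypothetical
tmp=$(mktemp -d)
mkdir -p "$tmp/child_directory"
printf 'a\nb\nc\n' > "$tmp/file3.txt"
printf 'x\n'       > "$tmp/child_directory/child_file1.txt"

cd "$tmp"
shopt -s globstar            # make ** match files at any depth
wc -l **/*.txt | sort -n
```

With globstar enabled, **/*.txt expands to matching .txt files in the current directory and all sub-directories, so we don't have to enumerate each directory level in the pattern.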
Lastly, if we don’t want to see the total number of lines within all files, then we can get rid of it using the head command:
$ wc -l *.txt ./*/*.txt | sort -n | head -n -1
1 ./child_directory/child_file1.txt
3 file3.txt
5 file1.txt
6 file5.txt
8 file4.txt
10 file2.txt
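As a side note, head -n -1 relies on GNU head; on systems where negative line counts aren't supported, such as macOS, sed '$d' drops the last line just as well. A quick sketch with hypothetical sample files:

```shell
#!/bin/bash
# Sketch: portable removal of the trailing "total" line using sed instead of head
tmp=$(mktemp -d)
printf 'x\n'       > "$tmp/file1.txt"
printf 'a\nb\nc\n' > "$tmp/file3.txt"

cd "$tmp"
# In sed, '$' addresses the last line and 'd' deletes it
wc -l *.txt | sort -n | sed '$d'
```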
4. Using find, wc, and sort
In this section, we’ll use the find command along with wc and sort to discover all the files within a directory and then list them by the number of lines they contain.
4.1. With -exec Option
First, let’s start by seeing how we can discover all the files under the sample_files directory recursively:
$ find . -type f
./file4.txt
./file2.txt
./file5.txt
./file1.txt
./child_directory/child_file1.txt
./file3.txt
We must note that the output includes the child_file1.txt file from one of the sub-directories.
Next, let’s go ahead and use the -exec option to execute the wc -l command and then sort the output by the number of lines:
$ find . -type f -exec wc -l {} + | sort -n
1 ./child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
33 total
It’s important to note that we’ve added the + operator at the end so that the wc command executes only once for all the matching files. Further, we can see that the last line of output contains the total number of lines for all the files.
Like earlier, we can use the head command to get rid of the last line from the output:
$ find . -type f -exec wc -l {} + | sort -n | head -n -1
1 ./child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
Alternatively, we can execute the wc command once for each matching file without using the + operator:
$ find . -type f -exec wc -l {} \; | sort -n
1 ./child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
Great! The output is as expected. Further, we must note that both \; and + indicate the end of the -exec command arguments.
Lastly, we must note that running wc once per file carries a performance penalty. On the other hand, since no combined total line is produced, it saves us from applying further filters, such as using head to remove the last line. This trade-off is acceptable when the number of files isn't too high.
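To see the difference in invocation count between the two terminators, we can substitute echo for wc; a small sketch in a hypothetical temporary directory:

```shell
#!/bin/bash
# Sketch: comparing how often -exec runs its command with '+' vs ';'
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt"

# '+' batches all matches into a single invocation: echo prints one line
find "$tmp" -type f -exec echo {} +

# ';' runs the command once per match: echo prints three lines
find "$tmp" -type f -exec echo {} \;
```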
4.2. With xargs
We can also use the xargs command with find to execute the wc -l command for all the matching files:
$ find . -type f -print0 | xargs -0 wc -l | sort -n
1 ./child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
33 total
Fantastic! It looks like we nailed this one.
We used the -print0 option with find and the -0 option with xargs to send and receive filenames delimited by a null character, which keeps filenames containing spaces or newlines intact.
5. Using awk
In this section, let’s see how we can use the awk utility together with find and sort to solve our use case.
5.1. With find and sort
Let’s start by writing our awk script in its entirety to replicate the behavior of the wc -l command:
$ cat script.awk
FNR==1 {
    if (NR > 1) {
        print lines" "last
    }
    lines=0
    last=FILENAME
}
{
    lines++
}
END {
    print lines" "last
}
Now, let’s break this down to understand the nitty-gritty of the logic. Firstly, it’s important to note that FNR and NR are built-in variables in awk, denoting the number of records read from the current input file and from all input files so far, respectively.
Secondly, we track the line count of the file currently being read in the lines variable, and its name in the last variable. We print them only upon reading the first line of the second or a subsequent input file, that is, when FNR is 1 while NR is greater than 1.
Finally, the END block prints the line count and name of the last input file.
Lastly, let’s see this in action together with the find and sort commands:
$ awk -f script.awk $(find sample_files/ -type f) | sort -n
1 sample_files/child_directory/child_file1.txt
3 sample_files/file3.txt
5 sample_files/file1.txt
6 sample_files/file5.txt
8 sample_files/file4.txt
10 sample_files/file2.txt
Great! It works correctly. However, we should note that the shell word-splits the unquoted $(find ...) output, so this variant assumes filenames without whitespace.
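Incidentally, if GNU Awk (gawk) is available, its ENDFILE block makes the manual bookkeeping unnecessary, since FNR still holds the finished file's line count at that point. Here's a minimal sketch with hypothetical sample files:

```shell
#!/bin/bash
# Sketch: per-file line counts via GNU Awk's ENDFILE block (requires gawk)
tmp=$(mktemp -d)
printf 'x\n'       > "$tmp/file1.txt"
printf 'a\nb\nc\n' > "$tmp/file3.txt"

# ENDFILE fires after each input file; FNR is that file's record count
gawk 'ENDFILE { print FNR" "FILENAME }' "$tmp"/*.txt | sort -n
```

Note that BEGINFILE and ENDFILE are gawk extensions, so this sketch won't work with mawk or other POSIX awk implementations.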
6. Using a for Loop
Another approach to discovering the files is using the for loop in Bash. In this section, let’s implement it in two different ways.
6.1. With Pipe
Let’s go ahead and write a for loop to iterate over the files and count the number of lines for each file in each iteration:
$ for f in */* ./*;
do
    test -f "$f" && echo "$(wc -l < "$f") $f";
done | sort -n
1 child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
We used the test command to ensure we’re invoking wc for files only and not for directories. Additionally, we pipe the output from the for loop to the sort command.
That’s it! Our code is concise and works as expected.
6.2. With Process Substitution
Alternatively, we can use process substitution to pass the output from the for loop as if it’s coming from a file:
$ sort -n <(
for f in ./* ./*/*;
do
    test -f "$f" && wc -l "$f";
done)
1 ./child_directory/child_file1.txt
3 ./file3.txt
5 ./file1.txt
6 ./file5.txt
8 ./file4.txt
10 ./file2.txt
Fantastic! We got this one right as well.
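For filenames containing spaces or newlines, a while loop reading null-delimited names from find is more robust than globbing over unquoted variables. Here's a sketch, again using a hypothetical temporary directory:

```shell
#!/bin/bash
# Sketch: whitespace-safe iteration over find's null-delimited output
tmp=$(mktemp -d)
mkdir -p "$tmp/child_directory"
printf 'a\nb\nc\n' > "$tmp/file3.txt"
printf 'x\n'       > "$tmp/child_directory/child file1.txt"   # note the space

# read -d '' consumes names up to each null byte emitted by -print0
find "$tmp" -type f -print0 |
    while IFS= read -r -d '' f; do
        echo "$(wc -l < "$f") $f"
    done | sort -n
```

Since wc reads each file via stdin redirection here, its output contains only the count, and we append the filename ourselves.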
7. Conclusion
In this article, we learned various ways to list the files sorted by the number of lines they contain. Furthermore, we explored several command-line utilities, such as wc, sort, awk, test, find, and xargs, for solving the problem step-by-step.
Lastly, we also learned how to apply the concepts of loop constructs and process substitution for solving the use case.