1. Overview
In this quick tutorial, we’ll explore how to recursively find statistics on file types in a directory.
2. Introduction to the Problem
As usual, let’s understand the problem through an example. Let’s say we have some files and subdirectories in myDir:
$ tree myDir
myDir
├── 1.txt
├── 2.txt
├── 3.txt
├── root_script.sh
├── subDir1
│   ├── manual.pdf
│   ├── pic1.jpg
│   ├── pic2.jpg
│   ├── pic3.jpg
│   └── subSubDir
│       ├── server_1.log
│       ├── server_2.log
│       ├── server_3.log
│       └── server_4.log
└── subDir2
    ├── archives
    │   ├── backup_lastmonth.tar
    │   ├── backup_lastweek.tar
    │   └── backup.tar
    ├── important.txt
    ├── logs
    │   ├── app.log
    │   └── security.log
    ├── screenshot.png
    ├── TheGameOfThrones3.mp4
    ├── TheGameOfThrones4.mp4
    └── TheGameOfThrones5.mp4
5 directories, 22 files
Our goal is to get a file-type statistics report recursively:
3 jpg
6 log
3 mp4
1 pdf
...
In this tutorial, we’ll address two different approaches to solving the problem.
Next, let’s see them in action.
3. Using find, sed, sort, and uniq Commands
First, let’s list the main tasks to solve the problem:
- list all the files in the myDir directory recursively
- generate the statistics report on file extensions based on the file list above
One idea to solve the problem is to use the “divide and conquer” approach.
Firstly, we can use the find command to get the complete file list under a given directory recursively:
$ find myDir -type f
myDir/root_script.sh
myDir/subDir2/archives/backup_lastmonth.tar
myDir/subDir2/archives/backup_lastweek.tar
myDir/subDir2/archives/backup.tar
myDir/subDir2/logs/security.log
myDir/subDir2/logs/app.log
myDir/subDir2/TheGameOfThrones5.mp4
myDir/subDir2/TheGameOfThrones4.mp4
myDir/subDir2/TheGameOfThrones3.mp4
myDir/subDir2/screenshot.png
myDir/subDir2/important.txt
myDir/subDir1/manual.pdf
myDir/subDir1/pic3.jpg
myDir/subDir1/pic2.jpg
myDir/subDir1/pic1.jpg
myDir/subDir1/subSubDir/server_4.log
myDir/subDir1/subSubDir/server_3.log
myDir/subDir1/subSubDir/server_2.log
myDir/subDir1/subSubDir/server_1.log
myDir/3.txt
myDir/2.txt
myDir/1.txt
Then, to perform a “group by” operation on file extensions, we can extract the extension from each path in the find output using the sed command:
$ find myDir -type f | sed 's/.*\.//'
sh
tar
tar
tar
log
log
mp4
mp4
mp4
png
txt
pdf
jpg
jpg
jpg
log
log
log
log
txt
txt
txt
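The substitution s/.*\.// works because the pattern .*\. is greedy: it matches everything up to and including the last dot in each path, leaving only the extension. We can quickly verify this on a single sample path from the list above:
$ echo 'myDir/subDir1/pic1.jpg' | sed 's/.*\.//'
jpg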
Once we've transformed all file paths into file extensions, we can use the uniq command with the -c option to count duplicate lines. However, uniq only collapses adjacent duplicates, so it works only on sorted input. Therefore, we need to sort the file extensions before we pass them to uniq to generate the final report:
$ find myDir -type f | sed 's/.*\.//' | sort | uniq -c
3 jpg
6 log
3 mp4
1 pdf
1 png
1 sh
3 tar
4 txt
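If we'd rather have the report ordered by count instead of alphabetically by extension, one simple option is to append another sort to the pipeline; the -rn options sort numerically on the leading count, in descending order:
$ find myDir -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn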
As we can see, we’ve solved the problem by chaining find, sed, sort, and uniq through pipes.
This solution spawns four processes to do the job. If our target directory contains a large number of files, the data gets processed multiple times as it moves through the pipeline. Further, we sort the extension list merely to provide suitable input for the uniq command; the sorting step wouldn't be required if we counted the file extensions in some other way.
So next, let’s take a look at another solution with better performance.
4. Using the find and awk Commands
The awk command is a powerful utility for text processing. We can still use the find command to list all files under the myDir directory. Then, we pass find's output to awk and ask awk to do the rest:
$ find myDir -type f | awk -F'.' '{ ++a[$NF] } END{ for(x in a)print a[x], x }'
6 log
3 jpg
1 png
3 mp4
1 pdf
4 txt
3 tar
1 sh
As the output above shows, the command solves the problem. Let’s walk through the awk command quickly to understand how it works:
- -F'.': we use '.' as the field separator so that we can easily extract the file extension (the last field)
- ++a[$NF]: we use the file extension ($NF, the last field) as the key of the associative array a and increment its count
- END{ … }: after all the input has been read, we print the report in the END block
- In the END block, we loop through the associative array a and print each key-value (file extension and count) pair
As we can see, this approach requires only two processes: find and awk, and find's output is processed only once. Moreover, since awk counts the extensions in an associative array, no sorting is performed.
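One caveat applies to both approaches: a file name without a dot, say a hypothetical myDir/Makefile (there is no such file in our example tree), has no extension, so the whole path would be counted as an "extension" of its own. If such files may show up, one simple way to skip them in the awk variant is to require more than one field:
$ find myDir -type f | awk -F'.' 'NF > 1 { ++a[$NF] } END{ for(x in a)print a[x], x }'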
It's worth mentioning that since we let awk take care of all steps to produce the statistics report, we can easily customize the output format, for example:
$ find myDir -type f | awk -F'.' '{++a[$NF]} END{ for(x in a) printf "%5s file : %d\n" , x, a[x]}'
  log file : 6
  jpg file : 3
  png file : 1
  mp4 file : 3
  pdf file : 1
  txt file : 4
  tar file : 3
   sh file : 1
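If we'd like to check the performance difference on our own data, one rough, informal way is to run both pipelines on a large directory tree and time them with Bash's time keyword, which measures the whole pipeline (here we reuse myDir just to show the commands; on such a tiny tree, the difference will hardly be measurable):
$ time find myDir -type f | sed 's/.*\.//' | sort | uniq -c > /dev/null
$ time find myDir -type f | awk -F'.' '{++a[$NF]} END{ for(x in a)print a[x], x }' > /dev/null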
5. Conclusion
In this article, we've learned how to recursively find statistics on file types in a directory. We've walked through two solutions using an example. One way is to chain the find, sed, sort, and uniq commands through pipes. Alternatively, we can pipe the find output to awk and let the powerful awk command do everything on its own.