1. Overview
When analyzing collected data, we often need to count how many times a given text pattern occurs in it.
In this tutorial, we’ll learn how to search and count the text pattern occurrences within the content of files. Then, we’ll look at how to display the number of files that have a specific text pattern within them.
2. Command Pipeline
Throughout this tutorial, we'll use the following pipeline to search for a pattern and count its occurrences in the files under a given directory:
$ find . -name "*.txt" | xargs grep <options> "<pattern>" | wc <options>
As we can see, the above one-liner uses the find, grep, and wc commands. Depending on our goal, we can use different options for each of the three commands.
Let’s have a more detailed look at each part of this pipeline.
2.1. The find Command
First, the find command locates files under a given directory, descending into its subdirectories as well. For example, the first part of our pipeline searches for all files whose names end with .txt, starting from the current directory:
$ find . -name "*.txt"
./file1.txt
./file2.txt
./file3.txt
We can see that in our current directory, there are three text files named file1.txt, file2.txt, and file3.txt.
2.2. The grep Command
The next part of the pipeline runs grep via the xargs command:
$ xargs grep <options> <pattern>
It starts with xargs, which reads the list of filenames that find writes to standard output and passes them to grep as arguments.
In other words, the grep command now searches for the pattern in every .txt file that find located.
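One caveat with this handoff is that xargs splits its input on whitespace by default, so a filename containing a space would be passed to grep as two separate arguments. The following is a minimal sketch of a null-delimited variant, assuming a find and xargs that support -print0 and -0 (GNU and BSD versions do). The scratch directory and the filename "my notes.txt" are hypothetical and only serve the illustration:

```shell
# Scratch directory with one file whose name contains a space (hypothetical names).
workdir=$(mktemp -d)
printf 'hi abc\n' > "$workdir/my notes.txt"
printf 'hi\n' > "$workdir/file3.txt"

# -print0 and -0 separate filenames with NUL bytes instead of whitespace,
# so "my notes.txt" reaches grep as a single argument.
matches=$(find "$workdir" -name "*.txt" -print0 | xargs -0 grep -l abc)
echo "$matches"
```

Only the path of "my notes.txt" is printed, because it's the only file here that contains abc.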
2.3. The wc Command
Finally, the wc command reduces the text output of grep to a single number. Depending on the option provided, it counts lines, words, or characters.
For example, here the ls command returns a list of files in a directory, while the wc command counts them by line:
$ ls | wc -l
3
In our case, the output is 3, as ls prints one line per file when its output goes to a pipe.
3. Dataset
Now that we're familiar with the pipeline, we can count the exact text matches in a given dataset. For that, let’s assemble some text files and perform some preliminary analysis.
To that end, we’ll print the text files using the cat command:
$ cat file1.txt
hi abc
$ cat file2.txt
hi abc abc
abc
$ cat file3.txt
hi
We can see that file1.txt has one occurrence of abc, file2.txt has three occurrences spread over two lines, and file3.txt has none. So, we have four occurrences of the abc pattern in a total of three files.
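To follow along, the dataset can be recreated with a few printf calls. This is a minimal sketch; working in a scratch directory created with mktemp is an assumption and not part of the original setup:

```shell
# Work in a scratch directory so no existing files are overwritten (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three sample files with the contents shown above.
printf 'hi abc\n' > file1.txt
printf 'hi abc abc\nabc\n' > file2.txt
printf 'hi\n' > file3.txt

# List the files to confirm they exist.
ls
```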
4. Find the Number of Text Occurrences
Using the dataset from above, let’s use the following command to count the matches of the abc pattern in our text files:
$ find . -name "*.txt" | xargs grep -oh abc | wc -w
4
Indeed, there are four matches in total, which is the correct result, as we saw earlier.
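The grep command uses the options -o and -h. Option -o prints only the matched text, with each match on its own line, while option -h suppresses the filename prefix that grep otherwise adds when it searches multiple files.
The wc command then uses -w to count the words in the grep output. Since every match of abc is a single word, the word count equals the number of occurrences. For a pattern that can match text containing whitespace, wc -l is the safer choice, because each match occupies exactly one line.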
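It's worth noting why the pipeline doesn't simply use grep -c: that option counts matching lines, not individual matches. The sketch below contrasts the two approaches on the contents of file2.txt, recreated in a scratch directory (an assumption made to keep the example self-contained):

```shell
# Recreate file2.txt from the dataset in a scratch directory (assumption).
workdir=$(mktemp -d)
printf 'hi abc abc\nabc\n' > "$workdir/file2.txt"

# grep -c counts the lines that contain at least one match.
line_count=$(grep -c abc "$workdir/file2.txt")

# grep -o prints every match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
occurrence_count=$(grep -o abc "$workdir/file2.txt" | wc -l | tr -d ' ')

echo "matching lines: $line_count, occurrences: $occurrence_count"
```

Here, the first line of the file holds two matches, so the two counts differ: two matching lines versus three occurrences.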
5. Find the Number of Files With Text Pattern
Likewise, using the same dataset, we can find the number of files that have the text pattern abc in the current directory:
$ find . -name "*.txt" | xargs grep -l abc | wc -l
2
The result is 2 because only two files (file1.txt and file2.txt) out of the three contain the abc pattern.
The grep command uses the -l option, which prints the filename of each file that contains the pattern. Consequently, the wc command uses option -l to count the number of lines (matching files) in the grep output.
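As an alternative to xargs, find can invoke grep itself through -exec with the + terminator, which batches filenames in a similar way. This is a different technique from the pipeline used in this tutorial, shown here only as a sketch. The dataset is recreated in a scratch directory (an assumption) so the example is self-contained:

```shell
# Recreate the sample dataset in a scratch directory (assumption).
workdir=$(mktemp -d)
printf 'hi abc\n' > "$workdir/file1.txt"
printf 'hi abc abc\nabc\n' > "$workdir/file2.txt"
printf 'hi\n' > "$workdir/file3.txt"

# -exec ... {} + hands the filenames directly to grep, without xargs.
# grep -l prints each matching file once, and wc -l counts those lines.
file_count=$(find "$workdir" -name "*.txt" -exec grep -l abc {} + | wc -l | tr -d ' ')
echo "$file_count"
```

Because find passes the filenames to grep directly, names containing spaces need no special handling in this form.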
6. Conclusion
In this article, we learned how to find the total number of text occurrences in files in a directory. Then, we looked at how to count the number of files that contain a specific text pattern. For that, we used a pipeline with the find, grep, and wc commands.