1. Overview
Processing text files is a common operation when we work on the Linux command line. Sometimes, we may encounter text files containing duplicated lines. In this tutorial, we’re going to learn how to count repeated lines in a text file.
2. Introduction to the Problem
To make it easier to explain how to count duplicated lines, let’s create an example text file, input.txt:
$ cat input.txt
I will choose MAC OS.
I will choose Linux.
I will choose MAC OS.
I will choose Linux.
I will choose MAC OS.
I will choose Linux.
I will choose Linux.
I will choose Microsoft Windows.
I will choose Linux.
I will choose Linux.
As the output above shows, input.txt contains duplicated lines. Next, we want to count the occurrences of each line.
In this tutorial, we’ll address two approaches to solve the problem:
- Combining the sort and uniq commands
- Using the awk command
After that, we’re going to compare the two approaches and discuss which one would be a better solution to the problem.
3. Combining the sort Command and the uniq Command
The uniq command has a convenient -c option to count the number of occurrences of each line in its input. This is precisely what we’re looking for.
However, one thing we must keep in mind is that the uniq command with the -c option works only when duplicated lines are adjacent. That is to say, we need to first somehow group the repeated lines together. The sort command can give us a hand with that.
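For example, if we run uniq -c directly on the unsorted file, only adjacent duplicates get merged, so we’d get fragmented counts similar to these:

$ uniq -c input.txt
1 I will choose MAC OS.
1 I will choose Linux.
1 I will choose MAC OS.
1 I will choose Linux.
1 I will choose MAC OS.
2 I will choose Linux.
1 I will choose Microsoft Windows.
2 I will choose Linux.

This clearly isn’t the per-line total we want, so we need to sort first.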
Let’s first sort input.txt and pipe the result to uniq with the -c option:
$ sort input.txt | uniq -c
6 I will choose Linux.
3 I will choose MAC OS.
1 I will choose Microsoft Windows.
As the output shows, the number of occurrences of each line is printed together with the line. The problem is solved.
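As an optional refinement, if we’d like the most frequently repeated lines listed first, we can append a numeric reverse sort to the same pipeline:

$ sort input.txt | uniq -c | sort -nr
6 I will choose Linux.
3 I will choose MAC OS.
1 I will choose Microsoft Windows.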
4. Using the awk Command
Alternatively, we can solve this problem using a pretty simple awk one-liner:
$ awk '{ a[$0]++ } END{ for(x in a) print a[x], x }' input.txt
1 I will choose Microsoft Windows.
6 I will choose Linux.
3 I will choose MAC OS.
We can see in the output above that the awk one-liner solved the problem as well.
Now, let’s understand how the awk code works:
- { a[$0]++ }: We create an associative array (a[KEY]) to record each line and the number of its occurrences. The KEY is a line in the input file, while the value a[KEY] is the number of occurrences of that KEY
- END{ for(x in a) print a[x], x }: After we’ve processed all lines, we use the END block to print out all elements in the array
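If the one-liner looks too dense, we can also keep the same logic in a small standalone awk script. Here’s a commented sketch (count_lines.awk is just a name we’ve picked for illustration):

# count_lines.awk: count how many times each line occurs
{
    # use the whole line ($0) as the array key and increment its counter
    a[$0]++
}

END {
    # after all input is read, print each distinct line with its count
    for (x in a)
        print a[x], x
}

We can then run it with awk -f count_lines.awk input.txt and get the same counts as before.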
5. Comparing the Two Solutions
The solution with the sort and uniq commands is convenient. Similarly, the awk solution is pretty straightforward as well. We may want to ask, which is a better solution?
In this section, let’s compare the two solutions in terms of performance, flexibility, and extensibility.
5.1. Creating a Bigger Input File
Since our input.txt has only ten lines, both approaches are very fast at solving the problem.
To better compare the performance of the two solutions, we’ll generate a bigger input file using a simple shell script, create_input.sh:
#!/bin/bash

# the output file
BIG_FILE="big_input.txt"

# total number of lines
TOTAL=1000000

# an array to store the candidate lines
ARRAY=(
    "I will choose Linux."
    "I will choose Microsoft Windows."
    "I will choose MAC OS."
)

# remove the output file if it already exists
rm -f "$BIG_FILE"

# append one randomly picked line per iteration
while (( TOTAL > 0 )); do
    echo "${ARRAY[$(( RANDOM % 3 ))]}" >> "$BIG_FILE"
    (( TOTAL-- ))
done
In the script above, we save three lines in a Bash array named ARRAY. Then, in the while loop, we randomly pick one line from the array and append it to a file called big_input.txt. Since the script relies on Bash features such as arrays and $RANDOM, we use the #!/bin/bash shebang.
If we execute the script, we’ll get a file with one million lines:
$ wc -l big_input.txt
1000000 big_input.txt
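As a side note, appending lines one at a time in a shell loop is fairly slow. If generation speed matters, a single awk process can build an equivalent file; here’s one possible sketch using awk’s built-in srand() and rand() functions (the exact count of each line will differ from run to run, since the selection is random):

$ awk 'BEGIN { srand(); m[0]="I will choose Linux."; m[1]="I will choose Microsoft Windows."; m[2]="I will choose MAC OS."; for (i = 0; i < 1000000; i++) print m[int(3 * rand())] }' > big_input.txt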
Next, we’ll take this file as input to compare the performance of our two solutions.
5.2. Performance
Let’s apply each solution to this bigger input file, using the time command to measure their execution times.
First, let’s test the sort | uniq command:
$ time (sort big_input.txt | uniq -c)
333814 I will choose Linux.
333577 I will choose MAC OS.
332609 I will choose Microsoft Windows.
real 0m0.766s
user 0m1.995s
sys 0m0.053s
Next, we’ll test the awk command:
$ time awk '{a[$0]++}END{for(x in a)print a[x], x}' big_input.txt
333814 I will choose Linux.
333577 I will choose MAC OS.
332609 I will choose Microsoft Windows.
real 0m0.190s
user 0m0.182s
sys 0m0.001s
The test result above shows clearly that the awk command is much faster (about four times faster on this machine) than the sort | uniq combination. This is because:
- The awk command starts only a single process, but the sort | uniq approach needs two processes
- The awk command goes through the file only once, while the sort | uniq combination processes all the lines in the input twice
- The sort command also has to sort the file, so its complexity is higher than that of the awk command: O(n log n) vs. O(n)
5.3. Flexibility and Extensibility
The uniq -c command is handy. However, the format of its output is fixed. If we want to adjust the output, we have to turn to other text processing utilities. Further, that means more processes in the pipeline and more passes over the data.
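For instance, to produce output like “line [ count: N ]” with this approach, we’d have to bolt yet another command onto the pipeline, something along these lines with sed:

$ sort input.txt | uniq -c | sed -E 's/^ *([0-9]+) (.*)/\2 [ count: \1 ]/'
I will choose Linux. [ count: 6 ]
I will choose MAC OS. [ count: 3 ]
I will choose Microsoft Windows. [ count: 1 ]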
On the other hand, we can freely control the output format with the awk command.
For example, let’s put the count after each line:
$ awk '{ a[$0]++ } END{ for(x in a) printf "%s [ count: %d ]\n", x, a[x] }' input.txt
I will choose Microsoft Windows. [ count: 1 ]
I will choose Linux. [ count: 6 ]
I will choose MAC OS. [ count: 3 ]
Moreover, thanks to the powerful awk language, we can easily extend the awk command to handle more complex requirements.
For example, if we want to output only lines that are duplicated more than three times:
$ awk '{ a[$0]++ } END{ for(x in a) if(a[x]>3) print a[x], x }' input.txt
6 I will choose Linux.
Or if we want to get a more detailed report:
$ awk '{ a[$0]++ } END{ for(x in a) printf "%.2f%% (%d in %d): %s\n",100*a[x]/NR,a[x],NR, x }' input.txt
10.00% (1 in 10): I will choose Microsoft Windows.
60.00% (6 in 10): I will choose Linux.
30.00% (3 in 10): I will choose MAC OS.
6. Conclusion
In this article, we’ve learned two different ways to count duplicated lines in a text file. We’ve also compared the two solutions in terms of performance, flexibility, and extensibility.
The sort | uniq combination is straightforward. However, the awk solution is a better choice, particularly if we need to handle large files.