Counting Occurrences of a Character in a Text File in Linux

1. Overview

In this tutorial, we’ll learn to find the count of a specific character in a text file using Linux commands.

We assume a basic understanding of commonly used Linux commands, including grep, awk, tr, and wc.

Let’s also assume that our input file baeldung.txt has some dummy data in it:

$ cat baeldung.txt 
"I Love Baeldung!!!"
"Baeldung is great!!!"

For the rest of the tutorial, we’ll be using baeldung.txt for demonstration purposes.

2. Using the grep Command

The grep command searches for a given pattern in the input file.

Let’s look at how to get the character count using grep:

$ grep -o 'e' baeldung.txt | wc -l
4

Here, we look for occurrences of the character ‘e’ in the file baeldung.txt. The -o option prints each match on its own output line, instead of printing the whole matching line.

Next, we pass the output of the grep command to the wc command using the pipe operator. The -l option of wc counts the number of lines in its input, and since each line holds exactly one match, the result is the number of occurrences.
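
As a minimal sketch, the pipeline above can be wrapped in a small shell function. The name count_char is hypothetical, and the sample file is recreated in a temporary location so that the snippet runs on its own. Adding -F makes grep treat the character literally, which matters for characters such as ‘.’ that are special in regular expressions:

```shell
# Recreate the tutorial's sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Baeldung!!!"' '"Baeldung is great!!!"' > "$sample"

# count_char CHAR FILE (hypothetical helper name, not a standard command).
# -o prints each match on its own line; -F matches CHAR literally;
# -- stops option parsing so that a CHAR like '-' is not read as an option.
count_char() {
    grep -o -F -- "$1" "$2" | wc -l
}

count_char 'e' "$sample"   # prints 4
count_char '!' "$sample"   # prints 6
count_char '.' "$sample"   # prints 0, because -F stops '.' from matching every character
```

Without -F, the pattern '.' would match every character on every line, so the count would be the total number of characters in the file.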

2.1. Case-Insensitive Searching

The grep command supports the -i option to perform a case-insensitive search:

$ grep -o -i 'l' baeldung.txt | wc -l
3

2.2. Using Multiple Input Files

We can pass multiple input files to the grep command. It then searches all of them, and because wc -l counts every match together, we get the sum of the character counts across the files:

$ cat > dummy.txt
This is dummy text.
$ grep -o -i 'e' baeldung.txt dummy.txt | wc -l
5

Here, we created a new file, dummy.txt, and passed both baeldung.txt and dummy.txt as arguments to the grep command.

Note that the output is a single number: the sum of the character counts from both files, not a separate count for each file.
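
Because the pipeline only reports the combined total, a per-file breakdown needs one grep call per file. The loop below is a sketch under that assumption; per_file_count is a hypothetical helper name, and both files are recreated in a temporary directory:

```shell
# Recreate both input files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' '"I Love Baeldung!!!"' '"Baeldung is great!!!"' > "$dir/baeldung.txt"
printf '%s\n' 'This is dummy text.' > "$dir/dummy.txt"

# per_file_count CHAR FILE... (hypothetical helper name).
# Runs the same case-insensitive grep pipeline once per file
# and prints "name: count" for each one.
per_file_count() {
    char=$1
    shift
    for f in "$@"; do
        n=$(grep -o -i -F -- "$char" "$f" | wc -l)
        printf '%s: %d\n' "$(basename "$f")" "$n"
    done
}

per_file_count 'e' "$dir/baeldung.txt" "$dir/dummy.txt"
# prints:
# baeldung.txt: 4
# dummy.txt: 1
```

The two per-file counts add up to the combined total of 5 shown above.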

3. Using the tr Command

The tr command is a command-line utility for character-based transformations.

We can use a combination of two options, -c and -d, to get the character count:

$ tr -c -d 'l' < baeldung.txt | wc -c
2

Let’s first understand the options used in the above command.

-c: This option takes the complement of the set
-d: This option deletes all the characters that are in the set

A set is a string of characters. In our case, the set is a string with a single character, ‘l’.

When we combine the -c and -d options, tr deletes every character except the ones in the set, including the newline characters.

The resulting string is passed to the wc command using the pipe operator, and the -c option of wc returns its byte count. Since ‘l’ is a single-byte character, the byte count equals the number of occurrences.
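
The two stages can also be run separately to see what tr leaves behind. This is a small sketch that recreates the sample file in a temporary location; the variable names are illustrative:

```shell
# Recreate the tutorial's sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Baeldung!!!"' '"Baeldung is great!!!"' > "$sample"

# Stage 1: delete every character that is NOT 'l' (newlines included),
# so only the matches survive.
kept=$(tr -c -d 'l' < "$sample")
printf 'kept: %s\n' "$kept"      # prints: kept: ll

# Stage 2: count the bytes that survived.
printf '%s' "$kept" | wc -c      # prints 2
```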

3.1. Case-Insensitive Searching

The tr command has no case-insensitive option, but we can get the same effect by adding both the uppercase and lowercase forms of the character to the set:

$ tr -cd 'lL' < baeldung.txt | wc -c
3
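
If the letter is not known in advance, the two-character set can be built on the fly. The following is only a sketch: count_char_ci is a hypothetical helper name, and it is meant for a single ASCII letter, not for characters that tr treats specially inside a set:

```shell
# Recreate the tutorial's sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Baeldung!!!"' '"Baeldung is great!!!"' > "$sample"

# count_char_ci LETTER FILE (hypothetical helper name).
count_char_ci() {
    # Derive both case forms of the letter with tr itself.
    lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    # The set is the two forms side by side, for example "lL".
    tr -c -d "$lower$upper" < "$2" | wc -c
}

count_char_ci 'L' "$sample"   # prints 3
count_char_ci 'b' "$sample"   # prints 2
```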

4. Using the awk Command

awk is a data-driven programming language that takes input data, processes it, and returns the desired output.

Unlike the two approaches that we’ve discussed so far, this one is a little trickier to understand.

Let’s look at the command and understand how it works:

$ awk -F 'e' '{s+=(NF-1)} END {print s}' baeldung.txt 
4

By default, awk splits each line into fields on whitespace. Here, we have changed the field separator to ‘e’ using the -F option, so each line is split at every occurrence of ‘e’.

For our data set, the first line splits into the fields ‘"I Lov’, ‘ Ba’, and ‘ldung!!!"’, and the second line into ‘"Ba’, ‘ldung is gr’, and ‘at!!!"’.

Now, the snippet {s+=(NF-1)} END {print s} takes the number of fields in each line (NF) and subtracts one, because a line with n matches is split into n + 1 fields. This gives the number of occurrences in that line. The per-line counts accumulate in s, and the END block prints the total for the entire file.
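
To make the arithmetic visible, the per-line value of NF - 1 can be printed before it is summed. The sketch below also shows one caveat that the sample file doesn't trigger: an empty line has NF equal to 0, so NF - 1 becomes -1 and lowers the total. The guarded variant is an assumption about how one might handle that case, and the file with a blank line is made up for illustration:

```shell
# Recreate the tutorial's sample file in a temporary location.
sample=$(mktemp)
printf '%s\n' '"I Love Baeldung!!!"' '"Baeldung is great!!!"' > "$sample"

# Per-line count: each line has 3 fields, so NF - 1 is 2.
awk -F 'e' '{ print NF - 1 }' "$sample"                       # prints 2, then 2

# Made-up input with a blank line in the middle; it contains three 'e' characters.
blank=$(mktemp)
printf '%s\n' 'tee' '' 'e' > "$blank"

# Unguarded: the empty line contributes -1, so the total is one too low.
awk -F 'e' '{ s += NF - 1 } END { print s + 0 }' "$blank"     # prints 2

# Guarded: the NF pattern skips lines with no fields;
# s + 0 prints 0 instead of an empty line when nothing was counted.
awk -F 'e' 'NF { s += NF - 1 } END { print s + 0 }' "$blank"  # prints 3
```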

5. Performance Comparison

All three approaches that we’ve discussed so far produce the same result. The difference lies in how each tool processes the data.

For a small string or a small file, the time taken by these commands is almost the same. The real difference shows up when the file is large.

Let’s run all three commands on a 1.1GB file and measure the time taken by each:

$ ls -lah large.txt 
-rw-r--r--. 1 root root 1.1G Jun 12 10:53 large.txt

$ time grep -o 'e' large.txt | wc -l
82256735

real    0m40.733s
user    0m39.649s
sys    0m0.714s

$ time tr -c -d 'e' < large.txt | wc -c
82256735

real    0m2.542s
user    0m1.892s
sys    0m0.433s

$ time awk -Fe '{s+=(NF-1)} END {print s}' large.txt 
82256735

real    0m11.080s
user    0m9.589s
sys    0m0.933s

All three commands report the same count. In this run, the tr command was the fastest at about 2.5 seconds, followed by awk at about 11 seconds and grep at about 41 seconds.
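
To repeat the comparison on another machine, a test file can be generated and the three pipelines timed against it. The sketch below assumes bash, where time is a shell keyword that times a whole pipeline, and a made-up file of 200,000 repeated sample lines. The absolute timings will differ from the figures above and depend on the hardware and tool versions:

```shell
#!/usr/bin/env bash
# Build a test file: yes repeats its argument forever,
# head keeps the first 200,000 lines (4 'e' characters per line).
big=$(mktemp)
yes '"I Love Baeldung!!!" "Baeldung is great!!!"' | head -n 200000 > "$big"

# Each pipeline should print the same count (800000);
# the timing report from time goes to stderr.
time grep -o 'e' "$big" | wc -l
time tr -c -d 'e' < "$big" | wc -c
time awk -F 'e' '{ s += NF - 1 } END { print s }' "$big"
```

Raising the line count passed to head makes the gap between the tools easier to see.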

6. Conclusion

In this tutorial, we learned different approaches in Linux to count the occurrences of a character in a text file. We also covered a few variations, such as case-insensitive search and using multiple input files.

Finally, we found that, on a large file, the tr command is the fastest of the three, followed by awk and then grep.
