1. Overview
In Linux, many powerful tools are available for text analysis and manipulation. One task that can be accomplished using the command-line interface (CLI) is finding the n most frequent words in a file. This process can be useful in data or log analysis, text mining, and natural language processing.
In this tutorial, we’ll explore how this can be done using Linux commands.
2. The n Most Frequent Words in a File
Suppose we have a text file named example.txt:
$ cat example.txt
One, two, two. Three three three,
four four four four.
The file contains several words. Some words are capitalized, and some are followed by a punctuation mark such as a comma or a period. Also, some words appear more frequently than others.
Suppose we wish to find this file’s three most frequent words, ignoring capitalization and punctuation marks. We’ll present a step-by-step approach to achieving this task.
2.1. Separate Words by Newlines
The first step in finding the most frequent words in a file is to separate consecutive words by a newline character. This way, each word appears on a separate line. This can be done using the tr command, which replaces characters in a text stream.
In this case, we use the tr command to replace all non-alphabetic characters with newlines, effectively separating the words:
$ cat example.txt | tr -cs '[:alpha:]' '\n'
One
two
two
Three
three
three
four
four
four
four
The POSIX character class [:alpha:] represents alphabetic characters. We use the -c option with tr to specify that the complement of this character class should be replaced. That is, we replace non-alphabetic characters with a newline character, \n. The -s option squeezes each resulting run of repeated newline characters into a single newline.
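To see what the -s option adds, we can compare the output with and without it on a short sample string. Without squeezing, the comma and the following space each become a newline, leaving a blank line between the words:
$ echo 'One, two' | tr -c '[:alpha:]' '\n'
One

two
$ echo 'One, two' | tr -cs '[:alpha:]' '\n'
One
two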
2.2. Convert to Lowercase
The next step is to convert each word to lowercase. We may again use tr for this purpose:
$ cat example.txt | tr -cs '[:alpha:]' '\n' | tr 'A-Z' 'a-z'
one
two
two
three
three
three
four
four
four
four
This replaces all non-alphabetic characters with newlines and then converts all uppercase characters to lowercase.
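Incidentally, an equivalent way to express the case conversion is via the POSIX character classes [:upper:] and [:lower:], which avoids spelling out the A-Z range explicitly:
$ cat example.txt | tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]'
This produces the same output as before for our sample file.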
2.3. Count Word Frequency
The next step is to count the frequency of each word. We can do so using the sort and uniq commands. We use the sort command to sort the words alphabetically and the uniq command with the -c option to count the number of occurrences of each word:
$ cat example.txt | tr -cs '[:alpha:]' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c
4 four
1 one
3 three
2 two
It’s important to sort the words before piping the result into uniq -c, because uniq -c only counts adjacent duplicate lines. As a result, we obtain a list of words, each preceded by its frequency.
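A contrived example makes the adjacency requirement clear. Without sorting, repeated lines that aren’t next to each other are counted separately:
$ printf 'b\na\nb\n' | uniq -c
1 b
1 a
1 b
$ printf 'b\na\nb\n' | sort | uniq -c
1 a
2 b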
2.4. Sort by Frequency
Finally, we can extract the n most frequent words from the list. This can be done using the sort and head commands. We use the sort command to sort the words numerically by frequency and the head command to extract the top n words:
$ cat example.txt | tr -cs '[:alpha:]' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head -3
4 four
3 three
2 two
This sorts the words by frequency from highest to lowest: the -n option makes sort compare the counts numerically, and the -r option reverses the order. Finally, we extract the top three words via head.
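The numeric sort matters as soon as a count reaches two digits, since a plain lexicographic sort compares lines character by character. A contrived two-line example shows the difference:
$ printf '10 foo\n2 bar\n' | sort -r
2 bar
10 foo
$ printf '10 foo\n2 bar\n' | sort -nr
10 foo
2 bar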
We may also swap the columns to show each word followed by its frequency and then columnate the result for display. We can perform the first task via the awk command and the second via column:
$ cat example.txt | tr -cs '[:alpha:]' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head -3 | awk '{print $2,$1}' | column -t
four 4
three 3
two 2
Here, the awk command prints the second column entry of each line followed by the first column entry, whereas column -t shows the result in tabular format.
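If we run this analysis often, we can wrap the whole pipeline in a small shell function, for instance in a shell startup file. Here is a minimal sketch, where the name top_words and its two positional parameters (the file and the number of words) are our own choice:
top_words() {
    # $1: input file, $2: number of words to report
    tr -cs '[:alpha:]' '\n' < "$1" | tr 'A-Z' 'a-z' \
        | sort | uniq -c | sort -nr | head -n "$2" \
        | awk '{print $2,$1}' | column -t
}
We can then call it with the file name and the desired value of n:
$ top_words example.txt 3
four 4
three 3
two 2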
3. Alternative Expressions
We’ve seen how to separate words with newline characters using the tr command.
However, there are other ways to do so. One example is via sed:
$ sed -E 's/\s+/\n/g; s/[[:punct:]]//g' example.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head -3
4 four
3 three
2 two
Here, we use sed to replace each run of one or more consecutive whitespace characters with a newline character and then delete any punctuation characters, represented by the POSIX character class [:punct:]. Note that the \s shorthand and the \n in the replacement text are GNU sed extensions.
Another way to separate words is via grep:
$ grep -oE '[[:alpha:]]+' example.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head -3
4 four
3 three
2 two
Here, grep matches only runs of one or more alphabetic characters, thus ignoring whitespace and punctuation marks. The -o option prints each match on its own line instead of printing the entire matching line, which is what separates the words by newlines, and the -E option enables extended regular expressions.
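For completeness, the cleanup and counting can also be done in a single awk invocation, leaving sort and head only for the final ranking. This is merely a sketch of the same idea, assuming an awk that supports POSIX character classes, such as gawk: it lowercases each field, strips non-alphabetic characters, and accumulates the counts in an array:
$ awk '{for (i = 1; i <= NF; i++) {w = tolower($i); gsub(/[^[:alpha:]]/, "", w); if (w != "") count[w]++}} END {for (w in count) print count[w], w}' example.txt | sort -nr | head -3
4 four
3 three
2 two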
4. Conclusion
In this article, we’ve seen how we can find the n most frequent words in a file in Linux using a combination of command-line tools such as tr, sort, uniq, and head.
Combining these commands allows us to quickly and easily extract valuable insight from text data.