1. Introduction
In the vast Linux ecosystem, the command line interface (CLI) is a formidable tool for executing many tasks. One common task is matching strings with a fixed number of characters. Whether we’re working with file names, log files, or any other text-based data, knowing how to select strings by their length can be very useful.
In this tutorial, we’ll explore various techniques to identify and manipulate strings based on their character length within a Linux environment. We’ll achieve this by leveraging the power of regular expressions and the versatile capabilities of tools like grep, awk, and sed. These tools allow us to accomplish the task at hand efficiently and effectively.
2. Sample Dataset
To illustrate each technique below, let’s first consider a scenario where we have a file called data.txt containing a list of words:
$ cat data.txt
applé
banana
cârrot
date
cat
@nswer
ért
elephant
étuis
letter
éééééé
We have a list of different words, some with special characters such as accented letters and the @ symbol. We’re including them to see how different commands handle them.
3. Using the grep Command
The Linux grep command is a flexible tool for searching and filtering text in files based on patterns.
It supports many options along with regular expressions for precise searching. To search for strings with a specific number of characters, we can use regular expression quantifiers.
For example, let’s suppose we want to match strings with six characters:
$ grep -E '^.{6}$' data.txt
banana
cârrot
@nswer
letter
éééééé
In this command, the ^ character denotes the start of the line, the .{6} expression matches any character six times, and $ represents the end of the line.
This command displays all lines in data.txt that consist of exactly six characters. We can adjust the number in the regular expression to match lines with a different character count.
The grep command’s interpretation of a character is locale-dependent. In a UTF-8 locale, a multibyte character such as é counts as one character. In a non-Unicode locale such as C, each of its bytes counts separately, so the character counts won’t match what we see on screen.
We can use the echo command to display the LANG variable, which sets the default locale:
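To avoid editing the pattern each time, the length can be passed in as a parameter. The following is a minimal sketch rather than a standard tool: the function name lines_of_length is hypothetical, and it assumes a grep that supports the -E option.

```shell
# Hypothetical helper: print lines that are exactly N characters long.
# Usage: lines_of_length N [FILE...]   (reads standard input when no file is given)
lines_of_length() {
    n=$1
    shift
    # Double quotes let the shell expand $n; \$ keeps the end-of-line anchor literal.
    grep -E "^.{$n}\$" "$@"
}

# Demo on inline ASCII input, so the result doesn't depend on the locale
printf '%s\n' banana date cat letter | lines_of_length 4    # prints: date
```

With this helper, lines_of_length 6 data.txt would run the same search as the command above. The quantifier also accepts a range, so a pattern like '^.{3,5}$' matches lines of three to five characters.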
$ echo $LANG
en_US.UTF-8
On this system, the current locale is set to en_US.UTF-8.
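LANG is only the default, though. Under the POSIX rules, a non-empty LC_ALL overrides every other setting, and LC_CTYPE (the category that governs character handling) overrides LANG. As a rough sketch, the precedence can be written out in plain shell; the function name effective_ctype is hypothetical and only mirrors that lookup order:

```shell
# Hypothetical helper: report which setting decides character handling.
# POSIX precedence: LC_ALL, then LC_CTYPE, then LANG; empty values count as unset.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-(default)}}}"
}

effective_ctype
```

Where the locale utility is installed, running locale without arguments prints the effective value of every category, which serves the same purpose.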
We can alter the output by specifying a different locale for the grep command:
$ LANG=C grep -E '^.{6}$' data.txt
applé
banana
@nswer
étuis
letter
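Here, we set the C locale for this one command and then use grep to match lines with six characters. In the C locale, grep counts bytes rather than characters. Because é occupies two bytes in UTF-8, applé and étuis now match, while cârrot (seven bytes) and éééééé (twelve bytes) no longer do. If the LC_ALL or LC_CTYPE variable is set, it takes precedence over LANG, so we’d need to set LC_ALL=C instead.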
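The byte counts behind this behavior can be checked directly. The sketch below builds the word applé from explicit UTF-8 bytes, so it doesn’t depend on the terminal encoding, and counts them with wc -c; the variable names are only illustrative.

```shell
# "applé" written with octal escapes so the bytes are fixed:
# \303\251 is the UTF-8 encoding of é (two bytes).
word=$(printf 'appl\303\251')

# wc -c counts bytes, regardless of the locale
bytes=$(printf '%s' "$word" | wc -c)
echo "bytes: $((bytes))"    # prints: bytes: 6
```

Five visible characters take up six bytes, which is why a byte-oriented match for six characters accepts this word.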
Alternatively, let’s say the list of words in the data.txt file is separated by white space in this format:
$ cat data.txt
CAT applé banana cârrot date cat @nswer ért elephant étuis letter éééééé
We can use this command to extract the six-character words:
$ grep -E -o '\b\w{6}\b' data.txt
banana
cârrot
letter
éééééé
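The -o option outputs only the matched strings, and the pattern \b\w{N}\b matches a word of N characters between \b word boundaries. Since \w matches only letters, digits, and the underscore, @nswer is no longer reported. Both \b and \w are GNU extensions, so other grep implementations may not support them.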
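If words containing symbols should count as well, one possible approach is to put each whitespace-separated word on its own line first and then reuse the line-based pattern. This is a sketch on inline sample input, not the only way to do it; the variable name six is illustrative.

```shell
# Split on whitespace so each word gets its own line, then keep the
# lines that are exactly six characters long.
six=$(printf 'CAT banana date cat @nswer elephant letter\n' |
    tr -s '[:space:]' '\n' |
    grep -E '^.{6}$')
printf '%s\n' "$six"
# prints:
# banana
# @nswer
# letter
```

The same pipeline can read the file instead, with tr -s '[:space:]' '\n' < data.txt as the first stage.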
4. Using the awk Command
AWK is a general-purpose scripting language that’s very effective at manipulating data and generating reports. awk scripts require no compilation and support variables, numeric functions, string functions, and logical operators.
AWK enables us to write small but effective programs in the form of statements that define patterns to search for in each line.
So, we can use awk to search for words with a specific character count using the length function:
$ awk 'length($0) == 6' data.txt
banana
cârrot
@nswer
letter
éééééé
This command checks all lines in the file and outputs only lines with exactly six characters.
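Because length returns a number, it also works with comparisons other than equality. The sketch below selects lines within a range of lengths; min and max are illustrative variable names passed in with -v. One caveat: gawk counts characters in a multibyte locale, while some implementations (mawk, for example) count bytes, so the demo sticks to ASCII input.

```shell
# Keep lines whose length falls between min and max, inclusive
short=$(printf '%s\n' banana date cat letter elephant |
    awk -v min=3 -v max=4 'length($0) >= min && length($0) <= max')
printf '%s\n' "$short"
# prints:
# date
# cat
```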
Alternatively, like before, we can also use regular expressions to achieve the same result:
$ awk '/^.{6}$/' data.txt
banana
cârrot
@nswer
letter
éééééé
The awk command offers us more flexibility by combining patterns and actions. We can use regular expressions to define complex matching conditions.
For example, let’s match strings with six alphabetic characters:
$ awk '/^[[:alpha:]]{6}$/' data.txt
banana
cârrot
letter
éééééé
The command above searches for lines that consist entirely of six alphabetic characters. We’re using the [[:alpha:]] expression to match alphabetic characters.
We can also specify the locale when using awk. In the C locale, [[:alpha:]] matches only ASCII letters, so the words containing accented characters drop out:
$ LANG=C awk '/^[[:alpha:]]{6}$/' data.txt
banana
letter
Alternatively, if the list of words in the data.txt file is separated by white space, we can match a string with six characters using this command:
$ awk -v n=6 '{for (i=1; i<=NF; i++) if (length($i) == n) print $i}' data.txt
banana
cârrot
@nswer
letter
éééééé
In this case, we’re using the -v option to define a variable n that holds the desired number of characters. The for loop iterates over each of the NF fields on the line, checks whether the field’s length equals n, and prints the field if it does.
5. Using the sed Command
The name sed stands for stream editor. It works like a text editor without an interactive interface: it can search, find and replace, insert, and delete text. Its most common use is substitution, that is, find and replace.
We can use sed to edit files without opening them. This is usually much quicker than opening the file using a common text editor like Nano or Vim.
Let’s use sed to search for words with a specific character count:
$ sed -n '/^.\{6\}$/p' data.txt
banana
cârrot
@nswer
letter
éééééé
Here, we’re using the -n option to suppress the automatic printing of the pattern space, which is helpful when we want to control what’s being output. The two forward slash / characters surround the regular expression: ^ anchors the match at the start of the line, .\{6\} matches any character exactly six times (the braces are escaped because sed uses basic regular expressions by default), and $ anchors the match at the end of the line. Finally, the p command prints each matching line.
As a result, sed scans through the lines of the data.txt file and only prints the lines that contain exactly six characters. Of course, we can adjust the character count.
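The count can also come from a shell variable, and the same address works with other sed commands. In the sketch below, n, kept, and dropped are illustrative names; the second command uses d to delete the matching lines instead of printing them.

```shell
n=6   # desired length; an illustrative shell variable

# Double quotes let the shell expand $n inside the sed script;
# \$ keeps the end-of-line anchor from being read as a shell expansion.
kept=$(printf '%s\n' banana date letter | sed -n "/^.\{$n\}\$/p")
printf '%s\n' "$kept"       # prints banana and letter, one per line

# Inverse: d deletes the matching lines, so everything else is printed
dropped=$(printf '%s\n' banana date letter | sed "/^.\{$n\}\$/d")
printf '%s\n' "$dropped"    # prints: date
```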
Alternatively, if the list of words in the data.txt file is separated by whitespace, we can employ this command:
$ sed -nE 's/\b(\w{6})\b/\n\1\n/gp' data.txt | sed -n '/^......$/p'
banana
cârrot
letter
éééééé
The first sed command puts each six-character word on its own line by adding a newline before and after it. The -n option suppresses the default output and the -E option enables extended regular expressions:
- \b(\w{6})\b matches a word boundary \b, followed by six word characters, and another word boundary
- \n\1\n is a replacement pattern that adds a newline before and after the matched six-character word (\1 represents the matched word itself)
- the g flag applies the substitution to every match on the line, and the p flag prints the line if a substitution was made
The second sed command receives the piped output and displays only lines consisting of six characters.
The regex pattern /^......$/ matches lines that consist of exactly six characters (......) between the start ^ and the end $ of the line.
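One caveat with this two-step pipeline is that the second sed also checks the text left over between the inserted newlines, so a leftover fragment that happens to be six characters long (such as "ab cd " with its trailing space) is printed too. A possible workaround, sketched below, is to split the line into words first and then test each word on its own. The bracket expression [[:alnum:]_] stands in for \w here, and the variable names are illustrative.

```shell
# Input chosen so that the leftover fragment "ab cd " is itself six characters
line='ab cd banana'

# Split into one word per line first, then test each word separately
result=$(printf '%s\n' "$line" |
    tr -s '[:space:]' '\n' |
    sed -n -E '/^[[:alnum:]_]{6}$/p')
printf '%s\n' "$result"    # prints: banana
```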
6. Conclusion
In this article, we’ve explored three different methods of matching strings that contain a specific number of characters in a text file: grep, awk, and sed. Matching by length is useful when filtering file names, log entries, and other line-oriented text.
All the commands work in a similar way, and the method to use mostly depends on our preference or experience. Finally, the interpretation of a character is locale-dependent, and counts might not match if we’re in a non-Unicode locale.