1. Introduction
Text processing is a routine yet crucial task. Whether it’s inspecting log files to troubleshoot an issue, checking configuration files for specific settings, or evaluating large datasets for important details, efficiently searching and extracting relevant lines can be critical. In such situations, we might need to locate lines that contain exactly one of several specified words.
In this tutorial, we’ll learn how to find lines containing one of multiple words exclusively in Linux.
2. Sample Dataset and Toolset
Before moving forward, let’s ensure we have a sample dataset to demonstrate the different approaches:
$ cat datafile.txt
Hey there, baeldung users
New baeldung lessons
Newly joined authors
Articles from baeldung authors
Old users of the new lessons
This sample file (datafile.txt) contains several lines of text for testing the commands used in the next sections.
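To follow along, we can recreate the file ourselves. The snippet below is a minimal sketch, assuming a POSIX shell and that overwriting datafile.txt in the current directory is acceptable:

```shell
# Write the five sample lines to datafile.txt in the current directory.
# The quoted 'EOF' delimiter keeps the shell from expanding anything in the text.
cat > datafile.txt << 'EOF'
Hey there, baeldung users
New baeldung lessons
Newly joined authors
Articles from baeldung authors
Old users of the new lessons
EOF
```

Running cat datafile.txt afterwards should show the same five lines as above.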
Let’s suppose we have two words, baeldung and users, and we need to find all lines containing only one of these two words. Specifically, we are looking for lines that contain baeldung but not users (A and not B) or lines that contain users but not baeldung (B and not A), similar to the XOR operator.
We apply the same rationale using grep, sed, awk, and Perl to find lines containing one word exclusively.
3. Using grep
We can use grep to find lines with only one of two words:
$ grep -E 'baeldung|users' datafile.txt | grep -vE 'baeldung.*users|users.*baeldung'
New baeldung lessons
Articles from baeldung authors
Old users of the new lessons
The first grep selects lines from datafile.txt that contain either word or both, while the second grep drops the lines in which both words appear. As a result, only lines containing exactly one of the two words remain.
Let’s break down the regular expressions used in this command:
- the -E option enables extended regular expression syntax, which lets us write alternation with an unescaped |
- 'baeldung|users' is the search pattern that matches every line containing baeldung, users, or both
- datafile.txt is the input file
- the pipe (|) passes the output of the first grep as input to the second grep
- the -v option inverts the match, so the second grep prints only lines that don’t match its pattern
- 'baeldung.*users|users.*baeldung' is the regular expression that matches lines containing both words, in either order
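The patterns above match substrings, so users would also match inside a longer word such as superusers. If we need whole-word matches, one possible variant is sketched below. It assumes GNU grep for the -w option and the \< and \> word-boundary anchors, and the function name one_whole_word is hypothetical:

```shell
# Hypothetical helper: one_whole_word WORD_A WORD_B FILE
# Prints lines of FILE that contain exactly one of the two words as a whole word.
# Assumes GNU grep (-w, \< and \>) and words without regex metacharacters.
one_whole_word() (
    a=$1 b=$2 file=$3
    # First grep: keep lines where either word occurs as a whole word.
    # Second grep: drop lines where both occur as whole words, in either order.
    grep -wE "$a|$b" "$file" |
        grep -vE "\<$a\>.*\<$b\>|\<$b\>.*\<$a\>"
)
```

For instance, one_whole_word baeldung users datafile.txt would treat a line such as "superusers of baeldung" as containing only baeldung, whereas the substring version would count both words and drop it.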
Similarly, we can apply the same technique to find lines containing exactly one of three or more words:
$ grep -E 'users|authors|baeldung' datafile.txt | grep -vE 'users.*authors|users.*baeldung|authors.*users|authors.*baeldung|baeldung.*users|baeldung.*authors'
New baeldung lessons
Newly joined authors
Old users of the new lessons
This command finds all the lines containing exactly one of the three words (users, authors, and baeldung). As we can see, this method quickly becomes tedious: the exclusion pattern needs one alternative for every ordered pair of words, so three words already require six.
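Rather than typing every pair by hand, we could let the shell build the exclusion pattern. The following is only a sketch of that idea: the function name only_one is hypothetical, and it assumes at least two words that contain no regular expression metacharacters:

```shell
# Hypothetical helper: only_one FILE WORD1 WORD2 [WORD...]
# Prints lines of FILE that contain exactly one of the words (substring match).
# Assumes at least two words, none containing regex metacharacters.
only_one() (
    file=$1; shift
    # Join all words with | to get the "any word" pattern: word1|word2|...
    any=$(IFS='|'; printf '%s' "$*")
    # Build one alternative a.*b for every ordered pair of distinct words.
    pairs=
    for a in "$@"; do
        for b in "$@"; do
            [ "$a" = "$b" ] && continue
            pairs="${pairs:+$pairs|}$a.*$b"
        done
    done
    # Same two-step pipeline as above: select any word, then exclude any pair.
    grep -E "$any" "$file" | grep -vE "$pairs"
)
```

With this helper, only_one datafile.txt users authors baeldung runs the same pipeline as the three-word command above without spelling out the six pairs.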
4. Using sed
sed is a command-line stream editor for filtering and transforming text. By default, sed prints each line of the input after processing it. However, we can suppress this behavior and instruct sed to print only lines matching a specific pattern.
For example, we can print only lines that contain either users or baeldung, but not both:
$ sed -ne '/users/{/baeldung/! p; d;}' -e '/baeldung/p' datafile.txt
New baeldung lessons
Articles from baeldung authors
Old users of the new lessons
Let’s take a closer look at the options used in this command:
- -n suppresses automatic printing of pattern space
- -e specifies a sed script or commands
- /users/ matches lines that contain the pattern users
- {…} block executes the commands within the braces if the line matches the preceding pattern, in our case, users
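- ! negates the preceding address, i.e., /baeldung/, so the command after it runs only for lines that don’t contain baeldung
- p prints the line
- d deletes the pattern space and starts the next cycle, so the line isn’t processed by the rest of the script
- '/baeldung/p' prints lines that contain the word baeldung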
- datafile.txt is the input file
Since d ends the cycle for every line that contains users, the second script comes into play only if the line doesn’t contain the word users.
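To make the control flow easier to follow, the same two-word script can be laid out one command per line with comments. This is a sketch only: the wrapper name xor_users_baeldung is hypothetical, and the two words are hard-coded as in the command above:

```shell
# Hypothetical wrapper around the two-word sed script, one command per line.
# Reads the file given as $1; the words users and baeldung are hard-coded.
xor_users_baeldung() {
    sed -n '
        # lines that contain users:
        /users/ {
            # print only when baeldung is absent
            /baeldung/! p
            # end this cycle so the rule below never sees the line
            d
        }
        # reached only by lines without users
        /baeldung/ p
    ' "$1"
}
```

Calling xor_users_baeldung datafile.txt should print the same three lines as the one-line version.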
Similarly, we can apply the same method for finding lines with only one of multiple words:
$ sed -ne '/users/{/baeldung/! {/authors/! p; d;}}' -e '/baeldung/{/users/! {/authors/! p; d;}}' -e '/authors/{/users/! {/baeldung/! p; d;}}' datafile.txt
This command finds the lines from datafile.txt that contain exclusively one of three words: users, baeldung, and authors.
Furthermore, we can use extended regular expressions to retrieve lines with exclusively one of multiple words:
$ sed -nE '/user/{/author|baeldung/! p;}; /author/{/user|baeldung/! p;}; /baeldung/{/user|author/! p;}' datafile.txt
New baeldung lessons
Newly joined authors
Old users of the new lessons
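Here, the -E option lets each block use alternation (|) to rule out the other two words with a single negated address, which keeps all the conditions in one sed script. Notably, this command matches the substrings user and author rather than users and authors.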
5. Using awk
The awk command-line utility executes programs written in the AWK programming language, designed for text processing and data extraction.
Let’s use awk to find lines that exclusively contain one of three specific words, i.e., users, baeldung, and authors:
$ awk '(/baeldung/+/users/+/authors/)==1' < datafile.txt
New baeldung lessons
Newly joined authors
Old users of the new lessons
In AWK, a pattern enclosed in forward slashes (/) is matched against the current line and evaluates to 1 (true) if it’s found and to 0 (false) otherwise. The plus sign (+) adds these results, so the sum is the number of patterns found in the line. Finally, ==1 checks that this sum is one, ensuring that exactly one of the specified patterns is present in the line. Since the program has no action, awk prints every line for which the condition holds.
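The counting idea extends to any number of words if we pass them in as a list. The sketch below assumes the words contain no whitespace and uses index() for fixed-substring matching instead of regular expressions. The function name exactly_one is hypothetical:

```shell
# Hypothetical helper: exactly_one FILE WORD [WORD...]
# Prints lines of FILE that contain exactly one of the words.
# Assumes the words contain no whitespace; matching is by fixed substring.
exactly_one() (
    file=$1; shift
    awk -v words="$*" '
        # Split the space-separated word list passed in from the shell.
        BEGIN { n = split(words, w, " ") }
        {
            hits = 0
            # Count each word at most once per line, however often it occurs.
            for (i = 1; i <= n; i++)
                if (index($0, w[i]) > 0) hits++
            if (hits == 1) print
        }' "$file"
)
```

For example, exactly_one datafile.txt baeldung users authors applies the same test as the command above, and adding a fourth word only requires another argument.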
Furthermore, we can also use the bitwise xor() function, which gawk provides as an extension:
$ awk 'xor(/baeldung/,/users/,/authors/)' < datafile.txt
New baeldung lessons
Newly joined authors
Old users of the new lessons
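Here, xor() receives a comma-separated list of patterns, each evaluating to 1 or 0 for the current line. However, a chained XOR is true whenever an odd number of its arguments are true, so with three patterns it would also print a line that contains all three words. Our sample file has no such line, which is why the output matches the previous command. Moreover, xor() is a gawk extension rather than part of POSIX awk, so it may not be available in other implementations.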
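The difference between a chained XOR and an exact count only shows up on a line that contains all three words. Because xor() isn’t available in every awk, the sketch below reproduces its result with the sum modulo 2, which is equivalent for values that are 0 or 1. The function names odd_matches and exactly_one_match are hypothetical, and both read standard input:

```shell
# Sketch: parity (what a chained XOR computes) versus an exact count of one.
# Both functions filter standard input; the three words are hard-coded.

# Prints lines where an odd number of the three patterns match (1 or 3).
odd_matches() {
    awk '(($0 ~ /baeldung/) + ($0 ~ /users/) + ($0 ~ /authors/)) % 2 == 1'
}

# Prints lines where exactly one of the three patterns matches.
exactly_one_match() {
    awk '(($0 ~ /baeldung/) + ($0 ~ /users/) + ($0 ~ /authors/)) == 1'
}
```

Piping the single line "baeldung users and authors" into odd_matches prints it, while exactly_one_match prints nothing for the same input.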
6. Using Perl
Perl is a highly capable and feature-rich programming language with built-in support for regular expressions and text manipulation.
We can use the if statement modifier with logical operators to extract lines with exactly one of two words:
$ perl -ne 'print if /user/ && !/baeldung/ || /baeldung/ && !/user/' datafile.txt
New baeldung lessons
Articles from baeldung authors
Old users of the new lessons
This command uses the same idea, i.e., (A and not B) or (B and not A), to find lines with exactly one of the two words. Since && binds more tightly than ||, the condition needs no parentheses.
Let’s understand each option used in this command:
- -n loops over each line of the input file
- -e enables us to provide a Perl script directly in the command line
- print if prints the current line if the specified condition is true
- /user/ matches any line containing the substring user
- && is the logical AND operator
- !/baeldung/ is true for lines that do not contain the substring baeldung
- || is the logical OR operator
- /baeldung/ && !/user/ is a condition that is true for lines containing baeldung but not user
Similarly, let’s use this approach for three words:
$ perl -ne 'print if /user/ && !/author/ && !/baeldung/ || /author/ && !/user/ && !/baeldung/ || /baeldung/ && !/user/ && !/author/' datafile.txt
New baeldung lessons
Newly joined authors
Old users of the new lessons
As before, we can also employ the bitwise XOR operator (^):
$ perl -ne 'print if /user/ ^ /baeldung/ ^ /author/' datafile.txt
New baeldung lessons
Newly joined authors
Old users of the new lessons
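Here, we chain the patterns with the ^ operator. As with any chained XOR, the result is true when an odd number of patterns match, so this command selects lines with exactly one word only as long as no line contains all three, as is the case in our sample file.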
7. Conclusion
In this article, we learned several ways to find lines containing one of multiple words exclusively.
First, we created a sample dataset and discussed the rationale for finding lines with exactly one of multiple words. Then, we used the grep and sed commands to achieve our goal. Next, we explored the awk command, both by counting matches and by using the xor() function. Finally, we used logical conditions and the ^ operator in Perl to accomplish the same task.
Although we can select any method depending on our preferences and needs, counting matches with awk is often the simplest way to find lines containing exactly one of multiple words in Linux, since supporting another word only requires adding one more term.