1. Overview
When we work in the Linux command line, we often use the grep command to search text.
In this tutorial, let’s explore how to search multiple strings using only one grep process.
2. Introduction to the Problem
First of all, to understand the problem clearly, let’s prepare an input file:
$ cat input.txt
About Eric and Kent:
v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v
Kent is a beginner in Java programming.
Eric is an experienced architect.
They work together.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
As we can see, the text file above contains several lines. So, if we want to filter the lines that match a pattern, let’s say “Eric”, we can simply use the command: grep 'Eric' input.txt.
However, the requirement now is to match multiple patterns, for example, Eric and Kent. When we talk about matching multiple patterns, there are two scenarios:
- the “Or” scenario – searching lines that match Eric or Kent
- the “And” scenario – finding lines containing both Eric and Kent, regardless of their occurrence order
Further, we’d like to use one single grep process to solve the problems in the scenarios above.
In this tutorial, we’ll first take “Eric” and “Kent” as two pattern examples and address how to match them in the two mentioned scenarios. Then, we’ll discuss extending our solutions when we need to match more than two strings.
3. The “Or” Scenario
First, let’s look at how to match two patterns in the “or” relationship.
In regular expressions (Regex), we can use the pipe character '|' for the logical “Or”. For example, “A|B” matches “A” or “B”. Therefore, “Eric|Kent” is the pattern to solve our problem.
However, we should note that, by default, grep accepts BRE (Basic Regular Expressions). In BRE, an unescaped '|' matches the pipe character literally, so we need to escape it as '\|' to get alternation. This escaped form is a GNU extension to BRE rather than part of POSIX, so other grep implementations may not support it:
$ grep 'Eric\|Kent' input.txt
About Eric and Kent:
Kent is a beginner in Java programming.
Eric is an experienced architect.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
Alternatively, we can use the -E option to tell grep that the pattern is ERE (Extended Regular Expressions). Then, we no longer need to escape '|'. This makes our code easier to read:
$ grep -E 'Eric|Kent' input.txt
About Eric and Kent:
Kent is a beginner in Java programming.
Eric is an experienced architect.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
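As a side note to the alternation pattern, POSIX grep also accepts several -e options in one invocation, and a line is printed if it matches any of them. The following is only a minimal sketch of that variant; it recreates the sample input.txt so it can run on its own, and it adds -F in the second command on the assumption that we want to match plain strings rather than regular expressions:

```shell
# Recreate the article's sample file so the sketch is self-contained.
cat > input.txt <<'EOF'
About Eric and Kent:
v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v
Kent is a beginner in Java programming.
Eric is an experienced architect.
They work together.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
EOF

# Each -e adds one pattern; a line is printed if it matches any of them.
grep -e 'Eric' -e 'Kent' input.txt

# -F treats every pattern as a fixed string, so no character is special.
grep -F -e 'Eric' -e 'Kent' input.txt
```

Both commands still start a single grep process, and they avoid the question of whether '|' needs escaping.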
Next, let’s move to the “And” scenario.
4. The “And” Scenario
Unlike the “Or” scenario, Regex doesn’t have a special character to apply logical “And”. Usually, we can combine two grep processes to match lines containing both patterns, for example:
$ grep 'Eric' input.txt | grep 'Kent'
About Eric and Kent:
Kent has learned a lot from Eric.
Eric and Kent are good friends.
However, one of our requirements is to use only one grep process. In that case, we can write a pattern that covers both possible orders of the two strings:
$ grep -E 'Eric.*Kent|Kent.*Eric' input.txt
About Eric and Kent:
Kent has learned a lot from Eric.
Eric and Kent are good friends.
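If our grep implementation supports the -P option, such as GNU grep, we can use PCRE (Perl Compatible Regular Expressions). Then, positive lookahead assertions can help us solve the problem. Each lookahead is zero-length, so both are checked from the same position, and the order of the two words in the line doesn’t matter: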
$ grep -P '(?=.*Kent)(?=.*Eric)' input.txt
About Eric and Kent:
Kent has learned a lot from Eric.
Eric and Kent are good friends.
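Since -P isn’t available everywhere, a script may want to probe for it before relying on lookaheads. The sketch below is one possible way to do that, not a definitive recipe: match_both is a hypothetical helper name, and the probe simply assumes that a grep without PCRE support exits with a non-zero status when given -P:

```shell
# Recreate the article's sample file so the sketch is self-contained.
cat > input.txt <<'EOF'
About Eric and Kent:
v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v
Kent is a beginner in Java programming.
Eric is an experienced architect.
They work together.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
EOF

# match_both FILE (hypothetical helper):
# print the lines of FILE that contain both "Eric" and "Kent".
match_both() {
    if printf 'probe\n' | grep -P 'probe' >/dev/null 2>&1; then
        # -P works here: use the lookahead pattern from the article.
        grep -P '(?=.*Kent)(?=.*Eric)' "$1"
    else
        # No PCRE support: fall back to the ERE pattern with both orders.
        grep -E 'Eric.*Kent|Kent.*Eric' "$1"
    fi
}

match_both input.txt
```

Either branch runs exactly one grep process on the file, so the single-process requirement still holds.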
5. Matching More Than Two Words
We’ve seen how to match two strings in the “Or” and “And” scenarios. Now, let’s look at the case where we need to match more than two patterns.
5.1. The “Or” Scenario
Regex allows us to add as many alternatives as we want to the alternation expression, for example, 'A|B|C|D|…'.
So next, let’s match lines containing “Eric” or “Kent” or “and” in the input.txt file:
$ grep -E 'Eric|Kent|and' input.txt
About Eric and Kent:
Kent is a beginner in Java programming.
Eric is an experienced architect.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
5.2. The “And” Scenario
Let’s quickly revisit how we match two strings in the “And” scenario: 'Eric.*Kent|Kent.*Eric'. Here, we put the permutations of two words in the pattern: 'A.*B|B.*A'.
Let’s say we need to add one more string, 'C', to the matching list. Then we have six permutations: 'A.*B.*C|A.*C.*B|B.*A.*C|B.*C.*A|C.*A.*B|C.*B.*A'. Writing this pattern by hand is tedious and error-prone. Moreover, the number of permutations grows factorially: four strings in the “And” scenario give 24 permutations, and five strings lead to 120.
Therefore, this approach isn’t ideal if we want to match more than two strings in the logical “And” relationship.
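To make the cost concrete, here is a sketch of the six-permutation pattern applied to the sample file with the three words “Eric”, “Kent”, and “and”. The variable name pattern is only illustrative, and the block recreates input.txt so it can run on its own:

```shell
# Recreate the article's sample file so the sketch is self-contained.
cat > input.txt <<'EOF'
About Eric and Kent:
v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v
Kent is a beginner in Java programming.
Eric is an experienced architect.
They work together.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
EOF

# All 3! = 6 orders of the three words, joined with ERE alternation.
pattern='Eric.*Kent.*and|Eric.*and.*Kent|Kent.*Eric.*and|Kent.*and.*Eric|and.*Eric.*Kent|and.*Kent.*Eric'

grep -E "$pattern" input.txt
```

On the sample file, this prints the two lines that contain all three words, but the pattern is already hard to review by eye.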
However, if our grep supports the -P option, we can add more lookahead assertions to the pattern to solve the problem. So next, let’s match lines containing “Eric”, “Kent”, and “and” using grep -P:
$ grep -P '(?=.*Kent)(?=.*Eric)(?=.*and)' input.txt
About Eric and Kent:
Eric and Kent are good friends.
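Because the lookahead pattern is just one '(?=.*WORD)' group per word, it can be assembled mechanically. The following sketch shows one way to do so; build_lookahead is a hypothetical helper name, and it assumes the words contain no regex metacharacters:

```shell
# build_lookahead WORD... (hypothetical helper):
# print a PCRE pattern with one positive lookahead per WORD.
# Assumption: the words contain no regex metacharacters.
build_lookahead() {
    # printf reuses its format string for every remaining argument.
    printf '(?=.*%s)' "$@"
    printf '\n'
}

build_lookahead Kent Eric and
```

On a grep that supports -P, the result could then be used as grep -P "$(build_lookahead Kent Eric and)" input.txt.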
Although PCRE is more powerful than BRE and ERE, we should note that not all grep implementations support the -P option. If we cannot use PCRE with our grep, we can still match multiple strings in the “And” scenario by combining multiple grep processes: grep 'A' input | grep 'B' | grep 'C' | …
Alternatively, if we want to start only one process to do the job, we can consider using the awk command. The awk command supports Regex too. Further, awk allows us to write C-like scripts to perform different tasks, and those scripts can use the logical “and” operator.
Finally, let’s see how awk solves the problem:
$ awk '/Kent/ && /Eric/ && /and/' input.txt
About Eric and Kent:
Eric and Kent are good friends.
As the example above shows, we can combine the words we want to match with && for the logical “and” relationship. Therefore, if a line contains all three words, the expression evaluates to true. Then the true value triggers awk's default action: “print the current line”. Thus, we get the expected result.
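The awk approach can also be generalized to any number of words. The sketch below is one possible way to do it, under a few assumptions: match_all is a hypothetical helper name, the words are treated as fixed strings (using awk's index function instead of Regex), and the words are handed to awk as extra arguments that the BEGIN block removes from ARGV:

```shell
# Recreate the article's sample file so the sketch is self-contained.
cat > input.txt <<'EOF'
About Eric and Kent:
v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v ^ v
Kent is a beginner in Java programming.
Eric is an experienced architect.
They work together.
Kent has learned a lot from Eric.
Eric and Kent are good friends.
EOF

# match_all FILE WORD... (hypothetical helper):
# print the lines of FILE that contain every WORD as a fixed string.
match_all() {
    file=$1
    shift
    awk -v n="$#" '
        BEGIN {
            # The first n arguments are the words, not input files.
            # Save them, then blank them so awk skips them as files.
            for (i = 1; i <= n; i++) {
                words[i] = ARGV[i]
                ARGV[i] = ""
            }
        }
        {
            # Skip the line as soon as one word is missing.
            for (i = 1; i <= n; i++)
                if (index($0, words[i]) == 0)
                    next
            print
        }
    ' "$@" "$file"
}

match_all input.txt Kent Eric and
```

As with the one-liner above, this starts a single awk process no matter how many words we pass.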
6. Conclusion
When we talk about matching multiple strings, there are two scenarios, “Or” and “And”. In this article, we’ve discussed how to use only one grep command to match multiple words in the two scenarios.
Apart from that, we’ve learned that awk allows us to match multiple words in the “And” scenario straightforwardly if we want to match more than two words.