1. Overview
Regular expressions (Regex) are widely used in the Linux command line. Many common commands support Regex, such as grep, sed, and awk.
Some of us may have encountered a case where a particular Regex, for instance, a pattern containing \d, doesn’t work with Linux commands, even though the same Regex works well with Java or Python. This can be confusing.
In this tutorial, let’s take a closer look at this sort of problem and explain why it can happen.
2. Introduction to the Problem
As usual, let’s understand the problem through an example. First, let’s create a text file as our input:
$ cat input.txt
Linux is awesome!
This server is running the Linux kernel 5.16.5-arch1-1.
It has many powerful commands.
The input.txt file contains three lines.
We know the Regex [0-9] matches a single digit. So, the command grep '[0-9]' input.txt should match the second line in the input.txt file:
$ grep '[0-9]' input.txt
This server is running the Linux kernel 5.16.5-arch1-1.
Further, we may have learned that “\d is the short form of [0-9].” So, let’s replace the Regex in the grep command with \d and try again:
$ grep '\d' input.txt
It has many powerful commands.
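As the output above shows, it seems that grep doesn’t recognize \d as [0-9]. Instead, it treats \d as a literal letter ‘d’. Therefore, only the last line, the only one that contains the letter ‘d’, is matched. Newer GNU grep releases may additionally print a warning about the stray backslash.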
If we test the same Regex with sed or awk, we can get the same result:
$ sed -n '/\d/p' input.txt
It has many powerful commands.
$ awk '/\d/' input.txt
awk: cmd. line:1: warning: regexp escape sequence `\d' is not a known regexp operator
It has many powerful commands.
Moreover, the awk command explicitly prints a warning saying that \d isn’t a known regexp operator.
However, we can get the expected output if we test the same Regex and the input file in Java, Python, or PHP.
So, why isn’t \d supported by Linux commands? Next, let’s figure it out.
3. BRE, ERE, and PCRE
To answer the question, we should understand the different Regex flavors. There are three commonly used Regex syntaxes — BRE, ERE, and PCRE:
- BRE – Basic Regular Expressions
- ERE – Extended Regular Expressions
- PCRE – Perl Compatible Regular Expressions
BRE came earliest. It has limited features and expressiveness. Then, BRE was extended to ERE. Later, PCRE joined the Regex party with a rich set of powerful features.
We won’t dive into each Regex syntax and make this a complete Regex tutorial. Instead, we’ll discuss some differences between BRE, ERE, and PCRE through some examples.
3.1. BRE
As we’ve mentioned earlier, BRE is the oldest Regex syntax. As its name implies, it supports only pretty basic features. For instance, the following features are not supported by the standard POSIX BRE:
- ‘|’ – alternation
- ‘?’ – 0 or 1
- ‘+’ – 1 or more
- ‘\s’ – shorthand for whitespace
Also, we need to escape “{m,n}” (interval expressions, that is, bounded repetition) and “(…)” (grouping) to give them special meaning, writing them as “\{m,n\}” and “\(…\)”. For example, “[0-9]\{2,4\}” matches two, three, or four digits.
After ERE was introduced, many implementations extended their BRE support. GNU BRE, for instance, supports shorthand such as ‘\s’. It also supports |, ?, and +, but we need to escape them to give them special meaning. For example, the GNU BRE “a\|b” matches a or b.
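To make these escaping rules concrete, here’s a small sketch, assuming GNU grep in its default (BRE) mode. The numbers.txt file and its content are made up for illustration, and the -x option restricts matches to whole lines:

```shell
# Hypothetical sample data: one number per line
printf '%s\n' 5 16 2024 31337 > numbers.txt

# POSIX BRE: the interval needs escaped braces
grep -x '[0-9]\{2,4\}' numbers.txt
# prints:
# 16
# 2024

# Unescaped braces are literal characters in BRE, so no line matches
grep -x '[0-9]{2,4}' numbers.txt || echo 'no match'
# prints: no match

# GNU extension: an escaped \+ means "one or more" in BRE
grep -x '[0-9]\+' numbers.txt
# prints all four lines
```

Note that the last command relies on a GNU extension, so other grep implementations may treat \+ differently.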
3.2. ERE
ERE extends BRE. With ERE, we don’t need to escape |, ?, +, ( ), and { } to give them special meaning. For example, “a|b” matches a or b, and “[0-9]{2,4}” matches two, three, or four digits.
However, if we want to match those characters literally, we need to escape them. For instance, “a\|b” matches the literal string “a|b”.
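The following sketch contrasts these ERE patterns, assuming GNU grep, whose -E option selects ERE. The letters.txt file and the sample strings are invented for the demonstration:

```shell
# Hypothetical sample data: "a", "b", and the literal string "a|b"
printf '%s\n' 'a' 'b' 'a|b' > letters.txt

# ERE: an unescaped | is alternation, so every line containing a or b matches
grep -E 'a|b' letters.txt
# prints:
# a
# b
# a|b

# ERE: an escaped \| is a literal pipe, so only the third line matches
grep -E 'a\|b' letters.txt
# prints: a|b

# ERE: unescaped braces form an interval; -o prints only the matched part
echo 'kernel 5.16.5' | grep -oE '[0-9]{2,4}'
# prints: 16
```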
3.3. PCRE
In the beginning, PCRE was a library that implemented a Regex engine modeled on Perl’s. Later, since Perl had popularized Regex, PCRE became a popular Regex flavor in its own right. Many other utilities and programming languages, for instance, Java, Python, and PHP, have Regex engines that are largely compatible with PCRE.
PCRE’s syntax is much more powerful and flexible than BRE and ERE. Let’s have a look at a few features only available in PCRE:
- Look-around – Positive and negative look-ahead/look-behind
- Non-greedy matching – *?, +?, and {m,}?
- Inline case-sensitive/insensitive matching – (?i) and (?-i)
- Shorthand for matching a digit or non-digit character – \d and \D
Now, we know that we’re using PCRE syntax when we use ‘\d’. Only PCRE-compatible Regex engines can interpret it correctly.
Next, let’s take a look at the Linux commands and which Regex flavors they support.
4. Regex Flavor of grep, sed, and awk
In this section, we’ll take the widely used GNU grep, GNU sed, and GNU awk as examples.
4.1. GNU grep
By default, grep works in GNU BRE matching mode. That is to say, if we don’t pass an option, it only supports BRE syntax. For example, we can match a line containing either “awesome” or “powerful”:
$ grep 'awesome\|powerful' input.txt
Linux is awesome!
It has many powerful commands.
As we’ve seen in the command above, we’ve escaped the ‘|’ character to give it special meaning.
grep allows us to use the -E option to interpret patterns as ERE. Let’s do the same test with the -E option:
$ grep -E 'awesome|powerful' input.txt
Linux is awesome!
It has many powerful commands.
Note that we shouldn’t escape the ‘|’ when we pass the -E option to grep. Otherwise, grep will search the literal ‘|’ character.
GNU grep supports the -P option to interpret patterns as PCRE. Therefore, if we want the grep command to match a PCRE pattern, for instance, “\d”, we should use the -P option:
$ grep -P '\d' input.txt
This server is running the Linux kernel 5.16.5-arch1-1.
As we can see, grep supports “\d”, but we must use the right option.
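The -P option also unlocks the other PCRE-only features listed in Section 3.3. Here’s a brief sketch, assuming a GNU grep build with PCRE support (not every build has it) and using the kernel line from input.txt as sample text. The -o option prints only the matched part:

```shell
line='This server is running the Linux kernel 5.16.5-arch1-1.'

# Look-ahead: digits and dots that are directly followed by "-arch"
echo "$line" | grep -oP '[\d.]+(?=-arch)'
# prints: 5.16.5

# Greedy: .* runs up to the last "is" in the line
echo "$line" | grep -oP 'T.*is'
# prints: This server is

# Non-greedy: .*? stops at the first "is"
echo "$line" | grep -oP 'T.*?is'
# prints: This

# Inline modifier: (?i) makes the rest of the pattern case-insensitive
echo "$line" | grep -oP '(?i)LINUX'
# prints: Linux
```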
4.2. GNU sed and GNU awk
As is the case with grep, sed uses BRE by default. Additionally, we can pass the -r option (or its synonym -E) to tell sed to use GNU ERE for pattern matching:
$ sed -n '/awesome\|powerful/p' input.txt
Linux is awesome!
It has many powerful commands.
$ sed -nr '/awesome|powerful/p' input.txt
Linux is awesome!
It has many powerful commands.
However, sed doesn’t support PCRE. Therefore, sed cannot interpret “\d”.
GNU awk, on the other hand, always uses GNU ERE. Like sed, it doesn’t support PCRE.
Consequently, we cannot use PCRE-unique features with sed and awk.
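Even so, matching a digit with sed and awk stays straightforward if we keep to BRE/ERE syntax. As a sketch, either the bracket expression [0-9] from Section 2 or the POSIX character class [[:digit:]] can stand in for \d. The heredoc below simply recreates the input.txt file from Section 2:

```shell
# Recreate the sample file from Section 2
cat > input.txt <<'EOF'
Linux is awesome!
This server is running the Linux kernel 5.16.5-arch1-1.
It has many powerful commands.
EOF

# sed (BRE): a POSIX character class instead of \d
sed -n '/[[:digit:]]/p' input.txt
# prints: This server is running the Linux kernel 5.16.5-arch1-1.

# awk (ERE): a plain bracket expression instead of \d
awk '/[0-9]/' input.txt
# prints: This server is running the Linux kernel 5.16.5-arch1-1.

# awk uses ERE, so alternation needs no escaping
awk '/awesome|powerful/' input.txt
# prints:
# Linux is awesome!
# It has many powerful commands.
```

Both forms are also understood by grep in its default mode, so they’re a reasonable choice when a pattern has to work across all three commands.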
5. Conclusion
In this tutorial, we first used an example to introduce the question that confused us: Why isn’t the Regex \d supported by Linux commands such as grep and sed?
Then, on the journey of seeking the answer to the question, we’ve discussed the three Regex flavors: BRE, ERE, and PCRE.
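Further, we’ve looked at which Regex flavors common Linux commands such as grep, sed, and awk support. This gave us the answer to the question: \d is PCRE syntax, so it only works with a PCRE-aware tool such as grep with the -P option, while sed and awk are limited to BRE and ERE.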