1. Overview
Regular expressions, or regex, are a powerful tool for pattern matching and text manipulation. Text manipulation often takes many lines of ordinary code, and a well-chosen regular expression can reduce that work to a single pattern.
On the other hand, regular expressions can be complex to write and understand, especially for more advanced patterns. So, in this tutorial, we’ll learn how to find n consecutive characters in text using regular expressions. We’ll start by reviewing the differences between Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Afterward, we’ll use the standard UNIX utilities like grep and egrep to find consecutive characters in the text.
2. Basic Regular Expressions vs. Extended Regular Expressions
Before we dive into the hands-on approach to finding n consecutive characters, we should get familiar with the BRE and ERE variants of regular expressions in the *nix ecosystem. Most of the utilities support both variants. However, some utilities support only one of the two.
BRE and ERE differ in terms of the available set of metacharacters. They also behave a bit differently.
2.1. Basic Regular Expressions (BRE)
The BRE syntax provides a simpler and more restricted set of pattern-matching capabilities compared to ERE. In BRE, characters are typically treated as literal unless they are metacharacters or escaped with a backslash.
Metacharacters have special meanings within patterns. For instance, a dot (.) matches any single character except a newline, and an asterisk (*) matches zero or more occurrences of the preceding character or group. In addition, there’s support for range notation, character classes, quantifiers, and escape characters.
Although simple, it still allows for effective text searching and manipulation. This is evident from tools like grep, sed, and Vi, which default to BRE syntax.
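To make the dot and the asterisk concrete, here's a minimal sketch that runs two BRE patterns over a few made-up words. The word list and the variable names are assumptions for illustration only:

```shell
# Hypothetical sample words, one per line
words=$(printf 'ct\ncat\ncoat\ncart\n')

# A dot stands for exactly one character, so only "cat" fits c.t
dot_matches=$(printf '%s\n' "$words" | grep 'c.t')

# An asterisk repeats the preceding "a" zero or more times,
# so ca*t fits both "ct" and "cat"
star_matches=$(printf '%s\n' "$words" | grep 'ca*t')

echo "c.t  ->" $dot_matches
echo "ca*t ->" $star_matches
```

The asterisk applies only to the “a” right before it, not to the whole “ca” prefix, which is why “coat” and “cart” aren't selected.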
2.2. Extended Regular Expressions (ERE)
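ERE is the expanded version of BRE. It provides additional metacharacters for pattern matching, such as the plus sign (+) for one or more occurrences, the question mark (?) for zero or one occurrence, and the pipe (|) for alternation. It also lets us write grouping parentheses and repetition braces without backslashes. Features like non-capturing groups and look-around assertions aren’t part of ERE. They belong to Perl-compatible regular expressions.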
While ERE offers more features and flexibility, it’s not supported by all Linux utilities. However, tools like grep and sed support ERE with the -E flag. Moreover, tools like awk and egrep use ERE by default.
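As a quick illustration of the operators that ERE adds, the sketch below tries ?, +, and | with grep -E. The sample words and variable names are hypothetical:

```shell
# Hypothetical sample words, one per line
words=$(printf 'color\ncolour\ncolouur\ngray\n')

# ? makes the preceding "u" optional (zero or one occurrence)
optional=$(printf '%s\n' "$words" | grep -E 'colou?r')

# + requires the preceding "u" at least once
plus=$(printf '%s\n' "$words" | grep -E 'colou+r')

# | matches either alternative
alt=$(printf '%s\n' "$words" | grep -E 'gray|color')

echo "colou?r    ->" $optional
echo "colou+r    ->" $plus
echo "gray|color ->" $alt
```

Under these assumptions, the first pattern selects “color” and “colour”, the second selects “colour” and “colouur”, and the third selects “color” and “gray”.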
3. Finding Consecutive Characters in Text
In this section, we’ll use different tools to find consecutive characters in the text. We’ll break down the regular expressions used by the different tools. Additionally, we’ll make use of both BRE and ERE syntax.
For our example, we’ll use the Lorem ipsum placeholder text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Nam sed finibus orci.
Pellentesque vel nulla nec massa semper tincidunt.
Sed eleifenddiam sit amet finibus facilisis.
Integer in urna sit amet lorem cursus suscipit id ut est.
Aliquam lobortis magna nec mi vulputate, et elementum nunc fringilla.
3.1. grep
grep is a utility that we use to search files or an input stream for lines that match a specified pattern. There are two common implementations of grep: BSD grep and GNU grep. They differ slightly, but our examples should work with both.
Let’s use grep with BRE syntax to match the lines with words that have double characters in them:
$ grep '.*\(.\)\1.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
Let’s break this pattern down:
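- .* matches any number of characters before and after the pattern; since grep looks for a match anywhere in a line, we can omit both without changing which lines are selected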
- \(.\) captures any character in a group for later reference
- \1 matches the same character as captured in the previous group
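To see what the group and the back-reference actually match, we can ask grep to print only the matched parts. This is a small sketch that assumes a grep with the -o option, which isn't part of POSIX but is available in both GNU and BSD grep. The variable names are made up for the example:

```shell
# One line from the Lorem ipsum sample
line='Pellentesque vel nulla nec massa semper tincidunt.'

# -o prints each matched part on its own line instead of the whole line
doubles=$(printf '%s\n' "$line" | grep -o '\(.\)\1')
echo "$doubles"

# Count matching lines with and without the surrounding .* parts
with_wildcards=$(printf '%s\n' "$line" | grep -c '.*\(.\)\1.*')
without_wildcards=$(printf '%s\n' "$line" | grep -c '\(.\)\1')
echo "$with_wildcards $without_wildcards"
```

The first command prints “ll” twice (from “Pellentesque” and “nulla”) and then “ss” (from “massa”). Both counts are 1, which suggests that the leading and trailing .* only affect what is matched within the line, not whether the line is selected.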
Similarly, we can alter this pattern to work with a character that we specify:
$ grep '.*m\{2\}.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Here, we’re looking for lines with words that have “mm” in them. In this case, it’s “commodo”. In the pattern, m\{2\} matches the character “m” two times in a row. Because grep selects any line that contains a match, a line with a longer run, such as “mmm”, would be selected as well. We can change the number to specify a different repetition count.
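If we need to vary the character and the count, we could wrap the pattern in a small shell function. The find_run helper below is hypothetical, not a standard utility, and it assumes the character is a plain letter or digit rather than a regex metacharacter:

```shell
# find_run CHAR N (hypothetical helper): print the lines from standard input
# that contain CHAR repeated N times in a row.
# Assumption: CHAR is a plain letter or digit, not a regex metacharacter.
find_run() {
  grep "$1\{$2\}"
}

# Hypothetical sample words, one per line
words=$(printf 'amet\ncommodo\nhmmm\n')

two=$(printf '%s\n' "$words" | find_run m 2)
three=$(printf '%s\n' "$words" | find_run m 3)

echo "m x2 ->" $two
echo "m x3 ->" $three
```

With a count of 2, both “commodo” and “hmmm” are selected, because “hmmm” also contains “mm”. With a count of 3, only “hmmm” remains.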
Moreover, we can also search for runs of any length, where a character is followed by one or more copies of itself, like “mm”, “mmm”, etc.:
$ grep '\(.\)\1\{1,\}' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
Here’s the breakdown:
- \(.\) captures any single character and saves it in a group for later reference
- \1 matches the same character as captured in the previous group
- \{1,\} specifies that the preceding back-reference should occur one or more times consecutively
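The effect of the interval on the back-reference is easier to see when we print only the matched text. The following sketch again assumes a grep that supports the non-POSIX -o option, and the sample string is made up:

```shell
# Hypothetical sample text with a run of three "a" characters
text='baaad'

# A single back-reference matches exactly one pair
pair=$(printf '%s\n' "$text" | grep -o '\(.\)\1')

# With \{1,\} the back-reference repeats and covers the whole run
run=$(printf '%s\n' "$text" | grep -o '\(.\)\1\{1,\}')

echo "pair: $pair"
echo "run:  $run"
```

Here, the first pattern matches “aa”, while the second one extends to “aaa”. Both patterns select the same lines, so the difference only matters when we care about the matched text itself.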
In the examples above, we can see that we’re escaping the parentheses and braces. That’s because, in BRE, unescaped parentheses and braces are literal characters. So, we need to escape them to give them their special meanings of grouping and repetition.
On the other hand, ERE works the other way around. With the -E flag, unescaped parentheses and braces are metacharacters, and a backslash turns them into literals. Strictly speaking, POSIX doesn’t define back-references for ERE, but common implementations such as GNU grep accept them as an extension. Therefore, we can write the same pattern without backslashes:
$ grep -E '(.)\1+' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
Here’s what happens:
- -E enables the ERE support for grep
- (.) in the pattern captures a single character
- \1+ matches one or more occurrences of the previously captured character
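The role of the backslash flips between the two syntaxes, which the following sketch tries to show with braces. The two input lines and the variable names are assumptions for the example:

```shell
# Two hypothetical lines: a word with "oo" and the literal text "o{2}"
lines=$(printf 'book\no{2}\n')

# BRE, unescaped braces: treated as literal characters
bre_literal=$(printf '%s\n' "$lines" | grep 'o{2}')

# BRE, escaped braces: treated as a repetition count
bre_interval=$(printf '%s\n' "$lines" | grep 'o\{2\}')

# ERE, unescaped braces: treated as a repetition count
ere_interval=$(printf '%s\n' "$lines" | grep -E 'o{2}')

echo "BRE o{2}     -> $bre_literal"
echo "BRE o\{2\}   -> $bre_interval"
echo "ERE o{2}     -> $ere_interval"
```

In BRE, the unescaped pattern selects only the line that literally contains “o{2}”, while the escaped BRE pattern and the unescaped ERE pattern both select “book”.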
In the next section, we’ll explore another variant of grep that relies on ERE.
3.2. egrep
egrep is essentially the same as grep, but it enables the use of Extended Regular Expressions by default. For that reason, we don’t have to supply the -E option. So, the examples given for grep work without escaping the curly braces and parentheses. Recent GNU grep releases treat egrep as obsolescent in favor of grep -E, so it may print a warning on some systems.
Let’s see how we can find two consecutive characters in text using egrep:
$ egrep '.*(.)\1.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
As we can see, we don’t need to escape the parentheses.
In the same way, let’s use egrep to find words in which a character occurs three times in a row:
$ egrep '(.)\1{2}' lipsum.txt
The command prints nothing because the file doesn’t contain such words. So, we’ll pipe some text that matches the pattern. We use printf here because not every shell’s echo interprets \n as a newline:
$ printf 'HELLOOO\nWORLD\n' | egrep '(.)\1{2}'
HELLOOO
Similarly, for three or more characters, we can use the following pattern:
$ egrep '(.)\1{2,}' lipsum.txt
Here, we specify that the character should occur at least three times in a row: once for the captured group and two or more times for the back-reference. Inside the curly braces, we can also specify a bounded range, like “{2,10}”.
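To compare an open-ended range with a bounded one, we can print the matched text for each. This sketch uses grep -E, which behaves like egrep, together with the non-POSIX -o option. The sample text is made up:

```shell
# Hypothetical sample text: a run of three "z" and a run of five "y"
text='zzz yyyyy x'

# {2,}: the back-reference repeats two or more times, so whole runs match
open=$(printf '%s\n' "$text" | grep -E -o '(.)\1{2,}')

# {2,3}: the back-reference repeats at most three times,
# so a single match covers at most four characters
bounded=$(printf '%s\n' "$text" | grep -E -o '(.)\1{2,3}')

echo "open:   " $open
echo "bounded:" $bounded
```

The open-ended pattern matches “zzz” and “yyyyy”, while the bounded one matches “zzz” and “yyyy”. The upper bound limits how much a single match consumes. The line is still selected even though it contains a longer run.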
4. Conclusion
In this article, we started by learning the basic differences between the BRE and ERE syntax. Then, we explored the different possibilities to find the words that contain n consecutive characters. For that purpose, we used the grep and egrep utilities.