1. Overview
Regular expressions (regex) provide a powerful tool for pattern matching and manipulation of text in various programming languages, including Bash. In Bash scripting, we often use regex to search for specific patterns within strings or files.
In this tutorial, we’ll explore the concept of returning a regex match in Bash.
2. Sample Task
Let’s suppose we have a file named file.txt. We can show the contents of the file using the cat command:
$ cat file.txt
ABC123DEF
456GHI789
JKLMNOPQR
The lines in our file may or may not contain numeric sequences. Our objective is to extract every numeric sequence and print each one on its own line. Specifically, we want to find every run of one or more digits, regardless of any non-digit characters that surround it.
Let’s explore various methods to achieve this task.
3. Using grep
We can use the grep command to search for and extract specific patterns from the contents of file.txt:
$ grep -Eo '[0-9]+' file.txt
123
456
789
By using the -E option, we enable extended regular expressions (ERE). In ERE, the + symbol acts as a quantifier without needing to be escaped with a backslash. The pattern we’re seeking to match is [0-9]+, which represents a sequence of one or more digits in the range 0-9. The -o option instructs grep to output only the matched parts of each line instead of the entire line.
When the command is executed, grep reads the contents of the input file and searches each line for instances of the specified pattern. It then extracts those matched sequences of digits and outputs them as separate lines.
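When later steps of a script need the matches themselves rather than printed output, the result of grep can be collected into an array. The following is a minimal sketch, assuming Bash 4 or later for mapfile. The sample and matches variable names are illustrative, and the sample data is recreated inline to mirror file.txt:

```shell
#!/usr/bin/env bash
# Sample data mirroring file.txt (recreated inline so the sketch is self-contained)
sample=$'ABC123DEF\n456GHI789\nJKLMNOPQR'

# -o prints each match on its own line; mapfile -t stores one match per array element
mapfile -t matches < <(grep -Eo '[0-9]+' <<< "$sample")

printf 'found %d matches\n' "${#matches[@]}"
printf '%s\n' "${matches[@]}"
```

With the sample data, this should report three matches and then list 123, 456, and 789.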
4. Using [[]] and =~
Another way to perform regex matching is through the [[]] construct in conditional statements:
$ cat pattern_extraction.sh
#!/usr/bin/env bash
text="$(cat file.txt)"
for line in $text; do
while [[ "$line" =~ [0-9]+ ]]; do
echo "${BASH_REMATCH[0]}"
line="${line#*"${BASH_REMATCH[0]}"}"
done
done
The [[]] construct, along with the =~ operator, enables Bash to perform regex matching and return the matched portions. When a string matches the provided regex pattern, Bash stores the matching portions in a special array variable named BASH_REMATCH. The BASH_REMATCH array contains elements where index 0 represents the entire match, and subsequent indices represent captured groups.
First, we use command substitution to capture the output of cat file.txt in the text variable. Then, a for loop iterates over the unquoted $text, which Bash splits on whitespace. Since the lines in our file contain no spaces, each iteration receives one line in the line variable. For each line, a while loop continues as long as the line variable matches the [0-9]+ regular expression, which looks for one or more consecutive digits in the text. Inside the loop, we print the matched digit sequence stored in ${BASH_REMATCH[0]}.
Finally, we update the value of the line variable by removing the matched digit sequence and any characters before it using the ${line#*"${BASH_REMATCH[0]}"} syntax. This ensures that the loop continues to find and process any remaining digit sequences in each line value.
Overall, the script reads the contents of a file, goes through them line by line, searches for numeric sequences within the text, and prints each sequence on a new row until no more matches are found.
Next, we grant execute permission to the script with chmod and run it:
$ chmod u+x pattern_extraction.sh
$ ./pattern_extraction.sh
123
456
789
As a result, we’re able to extract all the numeric sequences appearing in the file.
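Since BASH_REMATCH also exposes capture groups, the same loop can take the remainder of the line from a second group instead of trimming it with parameter expansion. The following is a sketch under a few assumptions: extract_numbers is a hypothetical helper name, the input arrives on standard input, and a while read loop replaces the for loop so that lines containing spaces stay intact:

```shell
#!/usr/bin/env bash
# extract_numbers is a hypothetical helper; it reads lines from standard input
extract_numbers() {
  local line
  local re='([0-9]+)(.*)'   # group 1: first digit run, group 2: everything after it
  # IFS= keeps surrounding whitespace, -r keeps backslashes literal;
  # the || test also handles a final line without a trailing newline
  while IFS= read -r line || [[ -n "$line" ]]; do
    while [[ "$line" =~ $re ]]; do
      printf '%s\n' "${BASH_REMATCH[1]}"
      line="${BASH_REMATCH[2]}"   # continue with the remainder of the line
    done
  done
}

# The second sample line contains spaces on purpose
extract_numbers <<< $'ABC123DEF\n456 GHI 789\nJKLMNOPQR'
```

Storing the pattern in a variable and expanding it unquoted on the right side of =~ avoids having to escape the parentheses inside [[ ]].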
5. Using expr
We can apply the same approach used in our previous script, but this time update the line variable within the while loop using the expr command:
$ cat pattern_extraction.sh
#!/usr/bin/env bash
text="$(cat file.txt)"
for line in $text; do
while [[ "$line" =~ [0-9]+ ]]; do
echo "${BASH_REMATCH[0]}"
line="$(expr "$line" : '[^0-9]*[0-9]\+\(.*\)')"
done
done
The expr utility evaluates expressions. It’s an external command rather than a Bash built-in. Its : operator matches a string against a basic regular expression that is implicitly anchored at the start of the string. In this specific case, the pattern matches up to and including the first numeric sequence and captures the portion of text that follows it.
The [^0-9]* part of the pattern matches any leading non-digit characters, which is necessary because of the anchoring. The [0-9]\+ part of the pattern matches one or more consecutive digits, where \+ is a GNU extension to basic regular expressions. Finally, the .* part of the pattern matches the remaining portion of the text after the digits. We enclose this portion within \( and \) to capture it as a group, and expr prints the captured group. The loop repeats for each match, updating the line variable with the captured remainder.
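The behavior of the : operator can be checked on a single string. The following sketch uses illustrative variable names and writes one or more digits as [0-9][0-9]*, the portable spelling, instead of the GNU-specific \+:

```shell
#!/usr/bin/env bash
line='ABC123DEF'

# Without \( \), expr prints the number of characters matched from the start
matched_len="$(expr "$line" : '[^0-9]*[0-9][0-9]*')"

# With \( \), expr prints the captured text instead
digits="$(expr "$line" : '[^0-9]*\([0-9][0-9]*\)')"
rest="$(expr "$line" : '[^0-9]*[0-9][0-9]*\(.*\)')"

# The match is anchored at the start, so a bare digit pattern matches nothing here.
# expr exits with status 1 when its result is 0 or empty, hence the || true.
unanchored="$(expr "$line" : '[0-9][0-9]*' || true)"

echo "matched_len=$matched_len digits=$digits rest=$rest unanchored=$unanchored"
```

Here, matched_len should be 6 (the length of ABC123), digits should be 123, rest should be DEF, and unanchored should be 0.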
6. Using Perl
Alternatively, we can use Perl to extract the digit sequences:
$ perl -lne 'print $1 while /([0-9]+)/g' file.txt
123
456
789
The -l option removes the trailing newline character when reading input and adds it back when printing output. The -n option creates a loop around the program code specified by the -e option. In this case, the loop reads input from file.txt. The -e option enables providing the Perl code directly on the command line as a string enclosed in quotes.
The /([0-9]+)/g regular expression matches one or more consecutive digits, globally. The g flag ensures that Perl finds all matches on a line, not just the first one. The while loop uses this regular expression to search for matches of one or more digits in each line of the input file. Then, we use print $1 to print the value of the first capturing group, denoted by ([0-9]+), which captures one or more digits. The while loop continues until there are no more matches to print.
7. Using awk
We can also use GNU awk to accomplish the task:
$ awk '{ gsub(/[^0-9]+/,"\n"); print }' file.txt | sed '/^$/d'
123
456
789
For each line of file.txt, we use the gsub() function in awk to globally substitute any sequence of one or more non-digit characters ([^0-9]+) with a newline character (\n). The caret symbol (^) inside the square brackets denotes negation.
However, this process can result in empty lines interspersed between the numeric sequences. Therefore, we use sed to filter out any empty lines from the output. The /^$/ regular expression matches empty lines, while the d command in sed deletes these lines, effectively removing them from the output.
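The empty lines can also be avoided inside awk itself by splitting each line on the non-digit runs and skipping empty pieces. The following is a sketch of that idea, where parts, n, and i are illustrative names and the sample data is recreated inline:

```shell
#!/usr/bin/env bash
sample=$'ABC123DEF\n456GHI789\nJKLMNOPQR'

# split() cuts the line at every run of non-digits; empty pieces are skipped,
# so no separate filtering step is needed
awk '{
  n = split($0, parts, /[^0-9]+/)
  for (i = 1; i <= n; i++)
    if (parts[i] != "") print parts[i]
}' <<< "$sample"
```

This trades the extra sed process for a short loop inside the awk program.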
Alternatively, we can use another approach:
$ cat pattern_extraction.awk
{ while (match($0, /[0-9]+/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + RLENGTH)
}
}
This awk script uses the match() function in a while loop to search for matches of one or more digits in each line, where $0 represents the entire line. The while loop continues as long as there are matches found. Within the loop, we use the substr() function to print the substring that matches the pattern. This substring is extracted from the current line, starting from the position indicated by RSTART, which is an awk variable denoting the index of the beginning of the match. The length of the match is denoted by RLENGTH.
Then, just like with our earlier scripts, we update $0, which holds the current line, by removing the portion of the line that includes the matched substring. This ensures that the loop can continue searching for the next match in the updated line.
Next, we run the awk command, using the -f flag to specify the script, and provide an input file for processing:
$ awk -f pattern_extraction.awk file.txt
123
456
789
The awk command reads each line of file.txt, searches for one or more digits in each line, and prints each matched pattern on a separate line.
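To see what match() stores in RSTART and RLENGTH, we can print both values for the first match on each line. This is a small illustrative sketch with the sample data recreated inline:

```shell
#!/usr/bin/env bash
sample=$'ABC123DEF\n456GHI789\nJKLMNOPQR'

# match() returns 0 when nothing matches, so the third line prints nothing
awk 'match($0, /[0-9]+/) { print $0 ": RSTART=" RSTART ", RLENGTH=" RLENGTH }' <<< "$sample"
```

For ABC123DEF, RSTART should be 4 and RLENGTH 3, since awk string positions start at 1. For 456GHI789, only the first match is reported, with RSTART equal to 1.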
8. Using the Bash Parameter Substitution Feature
Alternatively, we can make use of Bash’s built-in parameter substitution feature to remove non-digit characters:
$ for line in $(cat file.txt); do echo "${line//[^[:digit:]]/$'\n'}"; done | sed '/^$/d'
123
456
789
We use a for loop in Bash to iterate over the contents of file.txt. To do so, we call the cat command inside a command substitution. Bash splits the unquoted result on whitespace, which yields one line per iteration here because the lines contain no spaces.
For each line, we use Bash’s parameter substitution feature to replace all non-digit characters with newlines and then print the modified result. In particular, ${line//[^[:digit:]]/$'\n'} replaces each non-digit character in the line with a newline, written in ANSI-C quoting as $'\n'. The bracket expression [^[:digit:]] matches any single character that isn’t a digit: [:digit:] is the POSIX digit class, and the leading ^ negates the bracket expression. The double slashes // indicate that all occurrences of the pattern should be replaced, not just the first occurrence. Finally, we pipe the result to sed to remove the empty lines introduced by the replacements, including those that come from lines without any numeric sequences.
In summary, the command reads the contents of file.txt and iterates over each line. It replaces all non-digit characters with newlines and removes empty lines. This way, we display only the digits.
However, it’s important to note that simply removing instead of replacing the non-digit characters would work well only when there’s a single numeric sequence appearing in each line. This is because multiple numeric sequences in a line are squeezed into one when non-digit characters are just removed. This also makes the role of sed vital to this solution.
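The difference between removing and replacing the non-digit characters can be seen on the second line of the file. The variable names in this sketch are illustrative:

```shell
#!/usr/bin/env bash
line='456GHI789'

# Deleting the non-digits merges the two sequences into one
squeezed="${line//[^[:digit:]]/}"
echo "$squeezed"

# Replacing each non-digit with a newline keeps them apart,
# at the cost of empty lines that still need filtering
separated="${line//[^[:digit:]]/$'\n'}"
echo "$separated"
```

The first echo should print 456789, while the second prints 456 and 789 with two empty lines between them.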
9. Using sed
Another option is to use the sed command to read each line of file.txt and extract the digit sequences:
$ sed -E 's/[^0-9]+/\n/g' file.txt | sed '/^$/d'
123
456
789
The -E option enables extended regular expressions in sed. We apply a substitution command that matches the pattern [^0-9]+ within the line and replaces it with a newline character (\n). In particular, [^0-9]+ matches a sequence of one or more non-digit characters.
The substitution can introduce empty lines in the result. So, we use sed again in the pipeline to remove those empty lines.
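Both steps can also be combined into a single sed invocation. The following sketch assumes GNU sed, where \n in the replacement inserts a newline, and recreates the sample data inline:

```shell
#!/usr/bin/env bash
sample=$'ABC123DEF\n456GHI789\nJKLMNOPQR'

# 1. turn every run of non-digits into a newline
# 2. and 3. drop a newline at the very start or end of the pattern space
# 4. delete the line if nothing is left
sed -E 's/[^0-9]+/\n/g; s/^\n//; s/\n$//; /^$/d' <<< "$sample"
```

Because each run of non-digits becomes exactly one newline, empty lines can only appear at the start or end of the pattern space, which is why trimming those two positions is enough.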
10. Conclusion
Regex matching is a powerful feature in Bash scripting. It enables searching for specific patterns and returning the matched portions.
In this article, we’ve seen how to use various methods to match a regex pattern, such as by using grep, the [[]] construct and the =~ operator, Bash’s parameter substitution feature, Perl, awk, expr, and sed.