1. Overview
Characters such as commas, tabs, and spaces are standard delimiters in structured text files. Among these delimiters, tabs are commonly used for indentation and for consistent spacing between columns.
In this tutorial, we’ll explore multiple approaches to grep a pattern containing a tab character in Linux.
2. Scenario Setup
Let’s start by looking at the sample data.txt file that contains structured text:
$ cat data.txt
a1 b1 c1
a2 b2 c2 d2
a3 b3 c3 d3 e3 f4
We’ve got multiple occurrences of the tab character (\t) in the file, although some words are separated by a space instead, and the two look alike in the listing. Our objective is to grep for the pattern that contains two words separated by a tab character. Further, we’ll consider a continuous sequence of alphanumeric characters as a word for our use case.
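Because tabs and spaces look the same in a terminal, it can help to recreate the file and make the tabs visible. The following is only a sketch: the exact tab and space layout is an assumption inferred from the matches this scenario expects, and cat -A is the GNU coreutils option that prints each tab as ^I and each line end as $.

```shell
# Recreate data.txt; the tab (\t) and space layout is an assumption
# inferred from the matches this tutorial expects.
printf 'a1\tb1 c1\n'           >  data.txt
printf 'a2\tb2 c2\td2\n'       >> data.txt
printf 'a3 b3\tc3 d3 e3\tf4\n' >> data.txt

# GNU cat -A shows each tab as ^I and marks each line end with $.
cat -A data.txt
```

With this layout, only the word pairs that have ^I between them qualify as matches.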
Now, let’s also see the expected search result:
a1 b1
a2 b2
c2 d2
b3 c3
e3 f4
Great! We understand the use case well enough to work on the solutions.
3. Using grep
The grep command is the de facto standard for searching for patterns in text files. In this section, let’s learn about different options for using grep to solve our use case.
3.1. Issue With Naive Approach
First, let’s define the regular expression for the pattern that we want to search:
[a-zA-Z0-9]+\t+[a-zA-Z0-9]+
We’ve used the [a-zA-Z0-9]+ regular expression to capture each word. Additionally, the two words are separated by a tab character (\t).
Now, let’s use grep with the -E and -o options for using extended regular expressions and showing only matching characters, respectively:
$ grep -o -E '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+' data.txt
Unfortunately, it looks like something went wrong with our approach, as we didn’t get the desired result.
Lastly, let’s try to debug the issue by printing the pattern with the printf command:
$ printf '%s' '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+'
[a-zA-Z0-9]+\t+[a-zA-Z0-9]+
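We can see that, inside single quotes, the shell leaves \t untouched, so grep receives a backslash followed by the letter t rather than a tab. Since grep’s extended regular expressions don’t define \t as a tab escape, grep couldn’t find the pattern in the text file.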
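One way to double-check this is to count bytes. In the small sketch below, the single-quoted \t is two bytes long, whereas printf expands \t in its format string into a single tab byte. The variable names literal_len and tab_len are our own and serve only as an illustration.

```shell
# Single quotes keep \t as two literal characters: a backslash and a t.
literal_len=$(printf '%s' '\t' | wc -c)

# printf interprets \t in its format string and emits one tab byte.
tab_len=$(printf '\t' | wc -c)

echo "literal: $literal_len byte(s), tab: $tab_len byte(s)"
```

The difference in length shows that the pattern we passed to grep never contained a real tab.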
3.2. With $'\t' Quoting
In our use case, we want \t to reach grep as an actual tab character. So, we can use ANSI-C style quoting, $'\t', which shells such as Bash expand to a literal tab before passing the pattern to grep.
Let’s first verify that the pattern is interpreted correctly:
$ printf '%s' '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+'
[a-zA-Z0-9]+ +[a-zA-Z0-9]+
Perfect! The revision worked as expected.
Now, we can use grep with the revised pattern to search in the data.txt file:
$ grep -o -E '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+' data.txt
a1 b1
a2 b2
c2 d2
b3 c3
e3 f4
Great! It looks like we’ve got this one right!
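The $'...' form is a shell feature rather than part of grep, so it may be unavailable in some shells. As a hedged alternative, we could store a real tab in a variable with printf and splice it into a double-quoted pattern. The variable names tab and matches are our own choice, and the input line is an assumed sample in which a tab separates a2 from b2 and c2 from d2.

```shell
# Store one literal tab character in a variable (name chosen for illustration).
tab=$(printf '\t')

# Double quotes let the shell expand ${tab} before grep sees the pattern.
# Assumed sample line: tab between a2/b2 and c2/d2, space between b2/c2.
matches=$(printf 'a2\tb2 c2\td2\n' | grep -o -E "[a-zA-Z0-9]+${tab}+[a-zA-Z0-9]+")

printf '%s\n' "$matches"
```

This keeps the pattern readable while relying only on printf and double-quote expansion.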
3.3. With -P Option
Perl-compatible regular expressions recognize the \t escape as a tab on their own. So, we can use the -P (--perl-regexp) option available with GNU grep to solve our use case:
$ grep -o -P '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+' data.txt
a1 b1
a2 b2
c2 d2
b3 c3
e3 f4
Fantastic! It works as expected without any additional quoting for the tab character sequence.
4. Using egrep
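The egrep command-line utility inherently supports extended regular expressions. As a result, we can use egrep instead of grep -E to search using the same pattern. However, egrep is deprecated in favor of grep -E, and recent GNU grep releases print a warning when we invoke it.
Let’s put egrep into action with the regular expression that uses the $'\t' representation for the tab character: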
$ egrep -o '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+' data.txt
a1 b1
a2 b2
c2 d2
b3 c3
e3 f4
Like earlier, we got the correct results. Further, we used the -o option to show only the substring that matches the pattern.
5. Using awk
The Awk programming language is well suited to pattern search and text processing. Moreover, regular expressions are one of its core features.
We can use the match() function to match a string with a pattern. On a successful match, awk sets the RSTART and RLENGTH variables internally to the index of the first matching character and length of the matching substring, respectively.
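To see these variables in action, here is a minimal sketch that runs match() once on a single record and prints RSTART and RLENGTH. The input line is an assumed sample with a tab between a2 and b2, and the variable name result is ours.

```shell
# Assumed sample record: tab between a2/b2 and c2/d2, space between b2/c2.
result=$(printf 'a2\tb2 c2\td2\n' |
  awk '{ if (match($0, /[a-zA-Z0-9]+\t+[a-zA-Z0-9]+/)) print RSTART, RLENGTH }')

echo "$result"   # prints: 1 5
```

The first match starts at index 1 and spans five characters: the two-character word a2, one tab, and the two-character word b2.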
Now, let’s write the search.awk script to search and print all occurrences of two words separated by a tab:
$ cat search.awk
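{
    # Keep matching until no tab-separated word pair is left in the record
    while (match($0, /[a-zA-Z0-9]+\t+[a-zA-Z0-9]+/)) {
        print substr($0, RSTART, RLENGTH);
        # Drop everything up to the end of the match before searching again
        $0 = substr($0, RSTART + RLENGTH);
    }
}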
We used the substr() function to extract the matching substring from the current record ($0). Additionally, we dropped everything up to the end of the match from $0 and continued the search iteratively.
Lastly, let’s run the search.awk script and see it in action:
$ awk -f search.awk data.txt
a1 b1
a2 b2
c2 d2
b3 c3
e3 f4
We can see all the expected matches. It looks like we nailed this one!
6. Conclusion
In this article, we learned how to grep a pattern containing a tab character. Furthermore, we learned about different options with grep, such as -E, -o, and -P, while solving the use case.
Lastly, we wrote an awk program using the match() and substr() functions to print the substrings that match the pattern.