How to grep a Pattern Containing a Tab Character

1. Overview

Characters such as commas, tabs, and spaces are standard delimiters in structured text files. Among these, tabs are commonly used for indentation and for consistent spacing between columns.

In this tutorial, we’ll explore multiple approaches to grep a pattern containing a tab character in Linux.

2. Scenario Setup

Let’s start by looking at the sample data.txt file that contains structured text:

$ cat data.txt
a1    b1 c1
a2    b2 c2    d2
a3 b3    c3 d3 e3    f4

We’ve got multiple occurrences of the tab character (\t) in the file. Our objective is to grep for the pattern that contains two words separated by a tab character. Further, we’ll consider a continuous sequence of alphanumeric characters as a word for our use case.
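Since tabs are easy to lose when copying text from a page, it can help to generate the file instead of pasting it. The following is a minimal sketch, assuming the sample content shown above; it relies on printf expanding \t in its format string:

```shell
# Recreate the sample file with real tab characters
# (printf expands \t in its format string).
printf 'a1\tb1 c1\n'           >  data.txt
printf 'a2\tb2 c2\td2\n'       >> data.txt
printf 'a3 b3\tc3 d3 e3\tf4\n' >> data.txt

# Make the tabs visible: sed's l command prints each tab as \t
# and marks the end of every line with $.
sed -n 'l' data.txt
```

In the sed output, each tab should show up as \t, which makes it easy to tell tabs apart from ordinary spaces.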

Now, let’s also see the expected search result:

a1    b1
a2    b2
c2    d2
b3    c3
e3    f4

Great! We understand the use case well enough to work on the solutions.

3. Using grep

The grep command is the de facto standard for searching for patterns in text files. In this section, let’s learn about different options for using grep to solve our use case.

3.1. Issue With Naive Approach

First, let’s define the regular expression for the pattern that we want to search:

[a-zA-Z0-9]+\t+[a-zA-Z0-9]+

We’ve used the [a-zA-Z0-9]+ regular expression to capture each word. Additionally, the two words are separated by one or more tab characters (\t+).

Now, let’s use grep with the -E and -o options for using extended regular expressions and showing only matching characters, respectively:

$ grep -o -E '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+' data.txt

Unfortunately, it looks like something went wrong with our approach, as we didn’t get the desired result.

Lastly, let’s try to debug the issue by printing the pattern with the printf command:

$ printf '%s' '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+'
[a-zA-Z0-9]+\t+[a-zA-Z0-9]+

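We can see that the shell passed \t to grep unchanged, as a backslash followed by the letter t, rather than as a tab character. POSIX extended regular expressions don’t define \t as a tab escape, so grep couldn’t find the pattern in the text file.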
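A quick way to confirm this is to count bytes. The sketch below is only an illustration, and the variable names literal_len and tab_len are made up for it. It compares a single-quoted \t passed as an argument with a \t placed in the printf format string, where printf expands it:

```shell
# Inside single quotes, the shell keeps \t as two characters:
# a backslash and the letter t.
literal_len=$(printf '%s' '\t' | wc -c | tr -d ' ')

# In the format string, printf itself expands \t into one tab character.
tab_len=$(printf '\t' | wc -c | tr -d ' ')

echo "single-quoted: ${literal_len} bytes, real tab: ${tab_len} byte"
```

The first value is 2 and the second is 1, so the single-quoted pattern never contains an actual tab.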

3.2. With $'\t' Quoting

In our use case, we want \t to be interpreted as a tab character within the pattern. So, we can use ANSI-C style quoting, $'\t', in our search pattern. Shells such as Bash, Zsh, and Ksh expand it into an actual tab character before grep runs.

Let’s first verify that the pattern is interpreted correctly:

$ printf '%s' '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+'
[a-zA-Z0-9]+    +[a-zA-Z0-9]+

Perfect! The revision worked as expected.

Now, we can use grep with the revised pattern to search in the data.txt file:

$ grep -o -E '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+' data.txt
a1    b1
a2    b2
c2    d2
b3    c3
e3    f4

Great! It looks like we’ve got this one right!
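For shells that lack $'...' quoting, one possible alternative is to capture a tab in a variable and expand it inside double quotes. This is a sketch rather than part of the original approach; the variable names tab and matches are illustrative, and the input is piped in from printf instead of being read from data.txt:

```shell
# Capture one real tab; command substitution strips only trailing newlines.
tab=$(printf '\t')

# Double quotes let the shell expand ${tab} before grep sees the pattern.
matches=$(printf 'a2\tb2 c2\td2\n' |
    grep -o -E "[a-zA-Z0-9]+${tab}+[a-zA-Z0-9]+")

printf '%s\n' "$matches"
```

This prints the a2/b2 and c2/d2 pairs, each on its own line, just like the $'\t' version does for the second line of data.txt.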

3.3. With -P Option

Perl-compatible regular expressions interpret \t as a tab character. So, we can use the -P (--perl-regexp) option available with GNU grep to solve our use case:

$ grep -o -P '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+' data.txt
a1    b1
a2    b2
c2    d2
b3    c3
e3    f4

Fantastic! It works as expected without any additional quoting for the tab character sequence.
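The -P option isn’t available in every grep implementation or build, so a script that must run on different systems may want a fallback. The following is a rough sketch under that assumption; grep_tab_words is a hypothetical helper name, and the fallback branch uses -E with a real tab stored in a variable:

```shell
# Hypothetical helper: prefer -P when this grep supports it,
# otherwise fall back to -E with a real tab character.
grep_tab_words() {
    if printf 'x\ty\n' | grep -q -P 'x\ty' 2>/dev/null; then
        grep -o -P '[a-zA-Z0-9]+\t+[a-zA-Z0-9]+' "$@"
    else
        tab=$(printf '\t')
        grep -o -E "[a-zA-Z0-9]+${tab}+[a-zA-Z0-9]+" "$@"
    fi
}

# With no file argument, the helper reads from standard input.
result=$(printf 'a3 b3\tc3 d3 e3\tf4\n' | grep_tab_words)
printf '%s\n' "$result"
```

We could also call it as grep_tab_words data.txt, since any arguments are passed straight to grep.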

4. Using egrep

The egrep command-line utility inherently supports extended regular expressions. As a result, we can use egrep in place of grep -E with the same pattern. However, egrep is obsolescent, and recent GNU grep releases print a warning that recommends grep -E instead.

Let’s put egrep into action with the regular expression that uses the $'\t' representation for the tab character:

$ egrep -o '[a-zA-Z0-9]+'$'\t+[a-zA-Z0-9]+' data.txt
a1    b1
a2    b2
c2    d2
b3    c3
e3    f4

Like earlier, we got the correct results. Further, we used the -o option to show only the substring that matches the pattern.

5. Using awk

The Awk programming language is well suited for pattern search and text processing. Moreover, regular expressions are one of its core features.

We can use the match() function to match a string with a pattern. On a successful match, awk sets the RSTART and RLENGTH variables internally to the index of the first matching character and length of the matching substring, respectively.
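To make RSTART and RLENGTH concrete, here is a small sketch that prints both values for the first match on a single input line. The input line and the first_match variable are chosen only for illustration:

```shell
# Print RSTART and RLENGTH for the first tab-separated word pair on the line.
first_match=$(printf 'a2\tb2 c2\td2\n' |
    awk '{ if (match($0, /[a-zA-Z0-9]+\t+[a-zA-Z0-9]+/)) print RSTART, RLENGTH }')

echo "$first_match"
```

The match starts at character 1 and spans 5 characters (two for a2, one tab, two for b2), so this prints 1 5.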

Now, let’s write the search.awk script to search and print all occurrences of two words separated by a tab:

$ cat search.awk
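{
    # Keep matching until the record has no more tab-separated word pairs.
    while (match($0, /[a-zA-Z0-9]+\t+[a-zA-Z0-9]+/)) {
        print substr($0, RSTART, RLENGTH);
        # Drop everything up to the end of the match and search again.
        $0 = substr($0, RSTART + RLENGTH);
    }
}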

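We used the substr() function to extract the matching substring from the current record ($0). Additionally, we removed everything up to the end of the match from $0 and continued the search iteratively. Unlike the extended regular expressions in grep, awk interprets \t inside a regular expression as a tab character, so no special quoting is needed.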

Lastly, let’s run the search.awk script and see it in action:

$ awk -f search.awk data.txt
a1    b1
a2    b2
c2    d2
b3    c3
e3    f4

We can see all the expected matches. It looks like we nailed this one!
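Because awk is a full language, the same loop is easy to extend. As one possible variation, and not part of the script above, the sketch below prefixes each match with its line number (NR) and works on a copy of the record, so $0 itself is never modified. The names line and numbered are illustrative:

```shell
numbered=$(printf 'a1\tb1 c1\na2\tb2 c2\td2\n' | awk '{
    line = $0    # work on a copy so that $0 stays intact
    while (match(line, /[a-zA-Z0-9]+\t+[a-zA-Z0-9]+/)) {
        print NR ": " substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)
    }
}')

printf '%s\n' "$numbered"
```

Working on a copy also avoids the field re-splitting that awk performs whenever $0 is assigned.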

6. Conclusion

In this article, we learned how to grep a pattern containing a tab character. Furthermore, we learned about different options with grep, such as -E, -o, and -P, while solving the use case.

Lastly, we wrote an awk program using match() and substr() functions to print the substrings that match the pattern.
