1. Overview
Text data is ubiquitous in the digital world. Often, we need to filter and process it for various purposes:
- extract information from a log file
- remove unwanted lines from a text document
- perform calculations on a table of numbers
grep and awk are two powerful text processing tools that can help us with such tasks. grep is a command-line utility that searches for lines that match a given pattern in one or more files. On the other hand, awk is a programming language that enables us to manipulate text data based on patterns, fields, and actions.
Importantly, both tools support regular expressions, which are a way of describing complex patterns of characters.
In this tutorial, we’ll learn how to ignore lines matching a specific pattern using grep and awk. We’ll also see how to combine both tools for advanced text processing.
2. grep Command
grep, short for Global Regular Expression Print, is one of the most widely used text processing commands in Unix-like systems. It takes a pattern as an argument and prints all the lines that match that pattern from the input files or standard input.
For instance, let’s say we have a file called fruits.txt:
$ cat fruits.txt
apple
banana
orange
grape
watermelon
Now, we can use the grep command to find all the lines that contain the letter o:
$ grep 'o' fruits.txt
orange
watermelon
Further, regular expressions can specify more complex patterns:
- ^: match the beginning of a line
- $: match the end of a line
Now, let’s find all the lines that start with a or end with e:
$ grep '^a\|e$' fruits.txt
apple
orange
grape
In basic regular expressions, which grep uses by default, the | character is a literal. The backslash in \| gives it the meaning of logical OR, which is a GNU grep extension. So, the pattern matches either a at the beginning or e at the end of a line.
Notably, one way we can avoid using the backslash is by using the -E option.
In addition, the -E option enables extended regular expressions, in which several operators work without a backslash:
- parentheses for grouping
- braces for repetition counts
- the + and ? quantifiers
The older egrep command is equivalent to grep -E.
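To make the difference concrete, here is a minimal sketch that runs the same search in both syntaxes. It recreates fruits.txt in a temporary directory, and the variable names (workdir, basic, extended) are only illustrative. The \| form assumes GNU grep:

```shell
# Recreate the sample file in a temporary directory (illustrative location).
workdir=$(mktemp -d)
printf '%s\n' apple banana orange grape watermelon > "$workdir/fruits.txt"

# Basic syntax: alternation needs a backslash (assumes GNU grep).
basic=$(grep '^a\|e$' "$workdir/fruits.txt")

# Extended syntax: | works as alternation without a backslash.
extended=$(grep -E '^a|e$' "$workdir/fruits.txt")

# Both commands select the same lines: apple, orange, grape.
printf '%s\n' "$extended"

rm -r "$workdir"
```

Since the -E form doesn't depend on the backslash extension, it is usually the safer choice in scripts that may run on different systems.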
3. Filtering Lines With grep
Sometimes, we may want to do the opposite of what grep does by default. Instead of printing the lines that match a pattern, we may want to print the lines that don’t match a pattern.
One way we can achieve this is by using the -v option, which inverts the match:
$ grep -v 'p' fruits.txt
banana
orange
watermelon
In this example, we find all the lines in fruits.txt that don’t contain the letter p.
We can also combine the -v option with other options, such as -E, which enables extended regular expressions:
$ grep -vE '^a|e$' fruits.txt
banana
watermelon
Thus, the output shows the lines in fruits.txt that neither start with a nor end with e.
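As a further sketch, we can exclude several patterns at once by repeating the -e option, and count the remaining lines with -c. The temporary directory and the variable names (workdir, kept, count) are assumptions made for this example:

```shell
# Recreate the sample file in a temporary directory (illustrative location).
workdir=$(mktemp -d)
printf '%s\n' apple banana orange grape watermelon > "$workdir/fruits.txt"

# Each -e adds one pattern; -v drops a line if any of the patterns matches it.
kept=$(grep -v -e 'p' -e 'w' "$workdir/fruits.txt")
printf '%s\n' "$kept"     # banana, orange

# -c prints the number of selected lines instead of the lines themselves.
count=$(grep -vc -e 'p' -e 'w' "$workdir/fruits.txt")
printf '%s\n' "$count"    # 2

rm -r "$workdir"
```

Here, apple and grape are dropped because they contain p, and watermelon is dropped because it contains w.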
4. awk Command
awk is another powerful text-processing tool that can be used as a standalone programming language or as a command-line utility. It was created by Alfred Aho, Peter Weinberger, and Brian Kernighan, and its name comes from the initials of their surnames.
Furthermore, awk works by reading each line of input as a record, splitting it into fields based on a separator (whitespace by default), and applying an action to each record that matches a pattern. The action can print, modify, or calculate values based on the fields.
For instance, let’s say we have a file called grades.txt:
$ cat grades.txt
Alice 90 80 85
Bob 75 70 72
Charlie 95 88 91
David 60 65 63
Eve 82 86 84
Frank 78 74 76
Grace 92 94 93
Henry 66 68 67
Irene 88 90 89
Jack 70 72 71
Each line contains the name of a student and their scores in three exams. We can use the awk command to print the name and the average score of each student:
$ awk '{print $1, ($2 + $3 + $4) / 3}' grades.txt
Alice 85
Bob 72.3333
Charlie 91.3333
David 62.6667
Eve 84
Frank 76
Grace 93
Henry 67
Irene 89
Jack 71
The awk command splits each line into four fields based on whitespace and prints the first field and the average of the second, third, and fourth fields.
Notably, the $ symbol is used to refer to a field by its position. So, $1, $2, $3, and $4 designate the first, second, third, and fourth fields respectively.
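Field splitting isn't limited to whitespace. As a minimal sketch, assuming a hypothetical comma-separated file called grades.csv, we can set the separator with the -F option and control the number formatting with printf:

```shell
# Build a small comma-separated sample (hypothetical file, not part of the tutorial data).
workdir=$(mktemp -d)
printf '%s\n' 'Alice,90,80,85' 'Bob,75,70,72' > "$workdir/grades.csv"

# -F',' makes the comma the field separator.
# printf with %.1f rounds the average to one decimal place.
averages=$(awk -F',' '{ printf "%s %.1f\n", $1, ($2 + $3 + $4) / 3 }' "$workdir/grades.csv")
printf '%s\n' "$averages"
# Alice 85.0
# Bob 72.3

rm -r "$workdir"
```

Unlike print, printf doesn't add a newline by itself, so the format string ends with \n.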
5. Ignoring Lines With awk
We can also use awk to ignore lines that match a specific pattern by using the ! symbol, which means logical NOT:
$ awk '!/p/' fruits.txt
banana
orange
watermelon
In this example, we print all the lines in fruits.txt that don’t contain the letter p.
The /p/ is a regular expression that matches any line that contains the letter p. The ! symbol negates the match, so only the lines that don’t match the pattern are selected. Since we specify no action, awk applies its default one and prints each selected line.
We can also use more complex regular expressions with awk:
$ awk '!/^a|e$/' fruits.txt
banana
watermelon
The output is all the lines in fruits.txt that neither start with a nor end with e.
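The negation can also target a single field instead of the whole line. The following sketch, which uses a shortened copy of grades.txt and illustrative variable names (workdir, names, rows), relies on the !~ operator and on combining a negated pattern with a numeric condition:

```shell
# Shortened copy of the grades data in a temporary directory (illustrative location).
workdir=$(mktemp -d)
printf '%s\n' 'Alice 90 80 85' 'Bob 75 70 72' 'Charlie 95 88 91' 'David 60 65 63' \
  > "$workdir/grades.txt"

# $1 !~ /e$/ is true when the first field does NOT end with e.
names=$(awk '$1 !~ /e$/ { print $1 }' "$workdir/grades.txt")
printf '%s\n' "$names"
# Bob
# David

# Skip lines starting with A and keep only rows whose first score is at least 70.
rows=$(awk '!/^A/ && $2 >= 70' "$workdir/grades.txt")
printf '%s\n' "$rows"
# Bob 75 70 72
# Charlie 95 88 91

rm -r "$workdir"
```

The second command has no action block, so awk prints every line for which the whole condition is true.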
6. Combining grep and awk
Sometimes, we may want to combine grep and awk for more advanced text processing. We can do this by using pipes, which are a way of connecting the output of one command to the input of another command:
$ awk '{avg = ($2 + $3 + $4) / 3; if (avg > 80) print $1, avg}' grades.txt | grep -v 'Alice'
Charlie 91.3333
Eve 84
Grace 93
Irene 89
In this code snippet, the awk command calculates the average score of each student and prints the name and the average score if it’s above 80. Then, the grep command uses the -v option with the name Alice to filter out Alice’s record, if present.
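One caveat is that an unanchored pattern such as Alice also matches longer names that contain it. The sketch below uses a shortened data set with a made-up student called Malice to show the effect, how anchoring the pattern avoids it, and how awk alone can do the same filtering. All variable names are illustrative:

```shell
# Shortened data set with a hypothetical extra student whose name contains "Alice".
workdir=$(mktemp -d)
cat > "$workdir/grades.txt" <<'EOF'
Alice 90 80 85
Bob 75 70 72
Charlie 95 88 91
Malice 90 90 90
EOF

prog='{ avg = ($2 + $3 + $4) / 3; if (avg > 80) print $1, avg }'

# Unanchored pattern: removes Malice as well as Alice.
loose=$(awk "$prog" "$workdir/grades.txt" | grep -v 'Alice')

# Anchored pattern: removes only the record whose name is exactly Alice.
piped=$(awk "$prog" "$workdir/grades.txt" | grep -v '^Alice ')

# awk alone: compare the first field as a string, no grep needed.
single=$(awk '$1 != "Alice" { avg = ($2 + $3 + $4) / 3; if (avg > 80) print $1, avg }' \
  "$workdir/grades.txt")

printf '%s\n' "$piped"
# Charlie 91.3333
# Malice 90

rm -r "$workdir"
```

The single-command variant avoids the extra process, while the pipeline keeps the calculation and the filtering separate, which can be easier to read and change.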
7. Conclusion
In this tutorial, we’ve learned how to ignore lines matching a specific pattern using grep and awk. We’ve also seen how to combine both tools for advanced text processing.
To conclude, grep and awk are two powerful text-processing tools that can help us filter and process text data for various purposes.