The Difference Between grep, sed, and awk

1. Overview

In this article, we’ll go through the command-line tools grep, sed, and awk. In particular, we’ll study the differences in functionality among them.

2. Background

When it comes to text processing in Linux, three tools that come in pretty handy are grep, sed, and awk. Although they are totally different tools, their functionality seems to overlap in simple scenarios. For example, all of them can find a pattern in a file and print the matches to the standard output.

However, if we stretch beyond this simple exercise, we’ll find that grep is only good for simple text matching and printing.

On the other hand, in addition to matching and printing text, sed offers additional text transformation commands like substitution.

Finally, awk, being the most powerful of these tools, is a scripting language that offers a multitude of features that do not exist in the former two.

Before we begin, it is important to know that the purpose of this article is to make the distinction between these three tools clearer. Therefore, the examples we are covering are just a small subset of what is possible with each tool, especially in the case of sed and awk.

3. Text File

To facilitate our discussion, let’s define a text file log.txt:

Timestamp       Category        Message
1598843202      INFO    Booting up system
1598843402      INFO    Booting up critical service: Authorization
1598843502      INFO    System booted successfully
1598853502      INFO    User admin requested access for userlist
1598863888      ERROR   User annonymous attempt to access protected resource without credentials
1598863891      INFO    System health check status: passed
1598863901      ERROR   Requested resource not found
1598864411      INFO    User admin logged out

4. grep

The grep command searches for lines matching a regex pattern and prints those matching lines to the standard output. It is useful when we need a quick way to find out whether a particular pattern exists or not in the given input.

4.1. Basic Syntax

The syntax for grep is as follows:

grep [OPTIONS] PATTERN [FILE...]

PATTERN is a regex pattern defining what we want to find in the content of the files specified by the FILE argument. The optional OPTIONS parameters are flags that modify the behavior of grep.

4.2. Searching for Lines That Match a Regex Pattern

Let’s say we want to extract the ERROR events from log.txt. We can do that with grep:

$ grep "ERROR" log.txt
1598863888    ERROR    User annonymous attempt to access protected resource without credentials
1598863901    ERROR    Requested resource not found

What happens here is that grep will scan through the lines in log.txt and print those lines containing the word ERROR to the standard output.

4.3. Inverting the Match

We can invert the match using the -v flag:

grep -v "INFO" log.txt

When we execute the command above, grep will print every line in log.txt except those matching the pattern INFO.
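To make the effect concrete, here is a minimal sketch that runs the inverted match on a small sample. The three-line file is a hypothetical excerpt of log.txt, and the names sample_dir and non_info are ours. Note that the header line is printed too, because it does not contain INFO:

```shell
# Hypothetical three-line excerpt of log.txt, written to a temporary directory.
sample_dir=$(mktemp -d)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$sample_dir/log.txt"

# -v keeps only the lines that do NOT match the pattern.
non_info=$(grep -v "INFO" "$sample_dir/log.txt")
printf '%s\n' "$non_info"
# Timestamp Category Message
# 1598863901 ERROR Requested resource not found
```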

4.4. Printing Preceding or Succeeding Lines

Sometimes, we may want to print the lines preceding or following each match. To print the five lines after a match, we can use the flag -A:

grep -A 5 ERROR log.txt

On the other hand, to print the five lines before a match, we can use the flag -B:

grep -B 5 ERROR log.txt

Finally, the flag -C allows us to print both the five lines before and the five lines after a match:

grep -C 5 ERROR log.txt
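The three context flags can be compared side by side on a small sample. This is only a sketch: the four-line file is a hypothetical excerpt of log.txt, we use a context of one line instead of five to keep the output short, and the variable names are ours. These flags are available in GNU and BSD grep but are not part of POSIX:

```shell
# Hypothetical four-line excerpt of log.txt in a temporary directory.
ctx_dir=$(mktemp -d)
printf '%s\n' \
  '1598843502 INFO System booted successfully' \
  '1598853502 INFO User admin requested access for userlist' \
  '1598863901 ERROR Requested resource not found' \
  '1598864411 INFO User admin logged out' > "$ctx_dir/log.txt"

after=$(grep -A 1 ERROR "$ctx_dir/log.txt")    # the match plus 1 line after it
before=$(grep -B 1 ERROR "$ctx_dir/log.txt")   # 1 line before the match, then the match
around=$(grep -C 1 ERROR "$ctx_dir/log.txt")   # 1 line on each side of the match

printf '%s\n' "$around"
# 1598853502 INFO User admin requested access for userlist
# 1598863901 ERROR Requested resource not found
# 1598864411 INFO User admin logged out
```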

5. sed

The sed command is a stream editor that works on streams of characters. It’s a more powerful tool than grep as it offers more options for text processing purposes, including the substitute command, which sed is most commonly known for.

5.1. Basic Syntax

The sed command has the following general syntax:

sed [OPTIONS] SCRIPT FILE...

The OPTIONS are optional flags that modify the behavior of sed. Next, the SCRIPT argument is the sed script that will be executed on every line of the files specified by the FILE argument.

5.2. Script Structure

The sed script has the following structure:

[addr]X[options]

Here, addr is the condition that selects which lines the command applies to. It can be a fixed line number or a regex pattern that is tested against the content of each line before processing it.

Next, the X character represents the sed command to execute. For example, the substitute command is denoted by the single character s.

Finally, additional options can be passed to the sed command to specify its behavior.

5.3. Using sed as grep

As a starter, let’s see how we can duplicate the functionality of grep using sed:

sed -n '/ERROR/ p' log.txt

By default, sed will print every line it is scanning to the standard output stream. To disable this automatic printing, we can use the flag -n.

Next, it will run the script that comes after the flag -n and look for the regex pattern ERROR on every line in log.txt. If there is a match, sed will print the line to standard output because we’re using the p command in the script. Finally, we pass log.txt as the name of the file we want sed to work on as the final argument.
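As a quick check of the explanation above, the following sketch compares the two tools on a hypothetical two-line excerpt of log.txt and also shows what happens when the flag -n is left out. The variable names are ours:

```shell
# Hypothetical two-line excerpt of log.txt in a temporary directory.
eq_dir=$(mktemp -d)
printf '%s\n' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$eq_dir/log.txt"

sed_out=$(sed -n '/ERROR/ p' "$eq_dir/log.txt")
grep_out=$(grep 'ERROR' "$eq_dir/log.txt")
[ "$sed_out" = "$grep_out" ] && echo "same output"

# Without -n, the matching line appears twice: once from the automatic
# print and once from the p command.
no_n_out=$(sed '/ERROR/ p' "$eq_dir/log.txt")
printf '%s\n' "$no_n_out"
# 1598843202 INFO Booting up system
# 1598863901 ERROR Requested resource not found
# 1598863901 ERROR Requested resource not found
```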

5.4. Substituting Matched String With Replacement

The substitute command of sed has the following structure:

's/pattern/replacement/'

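When there is a match for pattern on a line, sed will substitute it with replacement. By default, only the first match on each line is replaced; appending the g flag replaces all of them.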

For example, if we want to substitute the word ERROR in our log.txt with the word CRITICAL we can run:

sed 's/ERROR/CRITICAL/' log.txt
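One detail worth seeing in action is how many matches get replaced on a line. The sketch below uses a made-up input line (not taken from log.txt) that contains the word ERROR twice, and compares the plain substitute command with the variant carrying the g flag:

```shell
# A made-up line with two occurrences of ERROR.
line='ERROR: retry failed with ERROR code 7'

# Without a flag, only the first occurrence on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/ERROR/CRITICAL/')
printf '%s\n' "$first_only"
# CRITICAL: retry failed with ERROR code 7

# With the g flag, every occurrence on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/ERROR/CRITICAL/g')
printf '%s\n' "$all_matches"
# CRITICAL: retry failed with CRITICAL code 7
```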

5.5. Modifying Files in Place

If we want sed to persist the change on the file it is operating on, we can use the flag -i along with a suffix. Before making changes in place, sed will create a backup of the file and append the suffix to this backup filename. For instance, when we run:

sed -ibackup 's/ERROR/CRITICAL/' log.txt

log.txt will first be copied to log.txtbackup, and then sed will apply the changes to log.txt in place.
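The following sketch tries this out in a temporary directory so that no real file is touched. The one-line file and the .bak suffix are our own choices for illustration. The flag -i is not part of POSIX, but attaching the suffix directly to it, as shown here, is accepted by both GNU and BSD sed:

```shell
# Work on a throwaway copy in a temporary directory.
work_dir=$(mktemp -d)
printf '%s\n' '1598863901 ERROR Requested resource not found' > "$work_dir/log.txt"

# Edit in place, keeping the original content in log.txt.bak.
sed -i.bak 's/ERROR/CRITICAL/' "$work_dir/log.txt"

edited=$(cat "$work_dir/log.txt")
backup=$(cat "$work_dir/log.txt.bak")
printf '%s\n' "$edited"   # 1598863901 CRITICAL Requested resource not found
printf '%s\n' "$backup"   # 1598863901 ERROR Requested resource not found
```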

5.6. Restricting to a Specific Line Number

We can limit the sed command so it only operates on a specific line number using the addr slot in the script:

sed '3 s/ERROR/CRITICAL/' log.txt

This will run the script only on line 3 of log.txt.

Furthermore, we can specify a range of line numbers:

sed '3,5 s/ERROR/CRITICAL/' log.txt

In this case, sed will run the script on lines 3 to 5 of log.txt.

In addition, we can specify the bound with a regex pattern:

sed -n '3,/ERROR/ p' log.txt

Here, sed will print the lines of log.txt starting from line number 3, and ending when it finds the first line that matches the pattern /ERROR/.
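The addressing forms can be tried on a small sample. In this sketch, the four-line file is a hypothetical excerpt of log.txt, the range starts at line 2 rather than 3 because the sample is shorter, and the variable names are ours:

```shell
# Hypothetical four-line excerpt of log.txt in a temporary directory.
range_dir=$(mktemp -d)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' \
  '1598864411 INFO User admin logged out' > "$range_dir/log.txt"

# A single line number as the address.
single=$(sed -n '2 p' "$range_dir/log.txt")
printf '%s\n' "$single"
# 1598843202 INFO Booting up system

# A range from line 2 up to and including the first later line matching ERROR.
to_error=$(sed -n '2,/ERROR/ p' "$range_dir/log.txt")
printf '%s\n' "$to_error"
# 1598843202 INFO Booting up system
# 1598863901 ERROR Requested resource not found
```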

6. awk

awk is a full-fledged programming language that is comparable to Perl. It not only offers a multitude of built-in functions for string and arithmetic manipulation (and, in some implementations such as gawk, time handling) but also allows users to define their own functions, just like any regular scripting language. Let’s take a look at some examples of how it works.

6.1. Basic Syntax

The awk syntax is of the following form:

awk [options] script file

It will execute the script against every line in the file. Let’s now expand the structure of the script:

'(pattern){action}'

The pattern is a condition, most commonly a regex, that will be tested against every input line. If a line matches the pattern, awk will then execute the statements defined in action on that line. If the pattern is absent, the action will be executed on every line.
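The pattern and action parts can each be left out, and the pattern may also be a comparison on a field rather than a regex. The sketch below illustrates these three forms on a hypothetical three-line excerpt of log.txt; the variable names are ours:

```shell
# Hypothetical three-line excerpt of log.txt in a temporary directory.
pa_dir=$(mktemp -d)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$pa_dir/log.txt"

# Pattern only: the default action prints the whole matching line.
pattern_only=$(awk '/ERROR/' "$pa_dir/log.txt")
printf '%s\n' "$pattern_only"
# 1598863901 ERROR Requested resource not found

# Action only: the action runs on every line, header included.
action_only=$(awk '{ print $2 }' "$pa_dir/log.txt")
printf '%s\n' "$action_only"
# Category
# INFO
# ERROR

# Expression as the pattern: compare the second field with a string.
expr_pattern=$(awk '$2 == "ERROR" { print $1 }' "$pa_dir/log.txt")
printf '%s\n' "$expr_pattern"
# 1598863901
```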

6.2. Replicating grep with awk

As we did with sed, let’s take a look at how we can emulate the functionality of grep using awk:

awk '/ERROR/{print $0}' log.txt

The code above will find the regex pattern ERROR in the log.txt file and print the matching line to the standard output.

6.3. Substituting the Matching String

Similarly, we can use awk’s built-in function gsub to substitute all ERROR occurrences with CRITICAL, much like in the sed example:

awk '{gsub(/ERROR/, "CRITICAL")}{print}' log.txt

The gsub function takes a regex pattern and the replacement string as arguments and, by default, modifies the current line. Then, awk prints the line to the standard output.
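As a small illustration, the sketch below contrasts gsub with its sibling function sub, which replaces only the first match, and prints the number of replacements that gsub returns. The input line is made up and is not taken from log.txt:

```shell
# A made-up line with two occurrences of ERROR.
line='ERROR: retry failed with ERROR code 7'

# gsub replaces every match and returns how many replacements it made.
gsub_out=$(printf '%s\n' "$line" | awk '{ n = gsub(/ERROR/, "CRITICAL"); print n ": " $0 }')
printf '%s\n' "$gsub_out"
# 2: CRITICAL: retry failed with CRITICAL code 7

# sub replaces only the first match on the line.
sub_out=$(printf '%s\n' "$line" | awk '{ sub(/ERROR/, "CRITICAL"); print }')
printf '%s\n' "$sub_out"
# CRITICAL: retry failed with ERROR code 7
```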

6.4. Adding Header and Footer to the Document

In awk, there’s a BEGIN block that will execute before it starts processing any line of the file. On the other hand, there is also an END block that allows us to define what should be run after all the lines have been processed.

Let’s use BEGIN and END blocks to add a header and a footer to our text document:

$ awk 'BEGIN {print "LOG SUMMARY\n--------------"} {print} END {print "--------------\nEND OF LOG SUMMARY"}' log.txt
LOG SUMMARY
--------------
Timestamp    Category    Message
1598843202      INFO    Booting up system
1598843402      INFO    Booting up critical service: Authorization
1598843502      INFO    System booted successfully
1598853502      INFO    User admin requested access for userlist
1598863888      ERROR   User annonymous attempt to access protected resource without credentials
1598863891      INFO    System health check status: passed
1598863901      ERROR   Requested resource not found
1598864411      INFO    User admin logged out
--------------
END OF LOG SUMMARY

6.5. Column Manipulation

awk really shines when processing documents with a row-and-column structure (CSV style). For instance, we can easily print the first and second columns of our log.txt and skip the third one:

awk '{print $1, $2}' log.txt

6.6. Custom Field Separator

By default, awk treats whitespace as the field delimiter. If the text being processed uses a different delimiter (a comma, for example), we can specify it with the flag -F:

awk -F "," '{print $1, $2}' log.txt
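Since our log.txt is not comma-separated, here is a sketch that feeds awk two hypothetical comma-separated lines instead. It also shows that, with a comma as the delimiter, a field may contain spaces:

```shell
# Two hypothetical comma-separated records (not the format of log.txt).
csv_lines=$(printf '%s\n' \
  '1598843202,INFO,Booting up system' \
  '1598863901,ERROR,Requested resource not found')

# Split on commas and print the first two fields.
first_two=$(printf '%s\n' "$csv_lines" | awk -F "," '{print $1, $2}')
printf '%s\n' "$first_two"
# 1598843202 INFO
# 1598863901 ERROR

# The third field keeps its internal spaces because only commas split fields.
messages=$(printf '%s\n' "$csv_lines" | awk -F "," '{print $3}')
printf '%s\n' "$messages"
# Booting up system
# Requested resource not found
```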

6.7. Arithmetic Operation

The ability of awk to carry out arithmetic operations makes it easy to gather numerical information about a text file. For example, let’s calculate the number of ERROR event occurrences in log.txt:

awk '{count[$2]++} END {print count["ERROR"]}' log.txt

In the script above, awk stores the count of each distinct value in the Category column in the array count. Then, the script prints the count for ERROR at the end.
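The same array can report every category at once. The sketch below is one possible variation: it skips the header with NR > 1, loops over the array in the END block, and pipes the result through sort because awk does not guarantee the iteration order of an array. The four-line file is a hypothetical excerpt of log.txt, and the variable names are ours:

```shell
# Hypothetical four-line excerpt of log.txt in a temporary directory.
count_dir=$(mktemp -d)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' \
  '1598864411 INFO User admin logged out' > "$count_dir/log.txt"

# Count each category (skipping the header), then print all counts sorted.
totals=$(awk 'NR > 1 { count[$2]++ } END { for (c in count) print c, count[c] }' \
  "$count_dir/log.txt" | sort)
printf '%s\n' "$totals"
# ERROR 1
# INFO 2
```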

6.8. Numeric Comparison

Being a full-fledged scripting language, awk readily understands decimal values. This makes text processing easy when we need our script to interpret values as a number rather than as a simple string.

For example, let’s say we want to get all the log entries recorded later than the timestamp 1598863888. We can use a greater-than comparison, adding NR > 1 to skip the header line, whose first field is not a number and would otherwise be compared as a string:

$ awk 'NR > 1 { if ($1 > 1598863888) {print $0} }' log.txt
1598863891      INFO    System health check status: passed
1598863901      ERROR   Requested resource not found
1598864411      INFO    User admin logged out

From the output, we can see that the command only prints log lines that are recorded later than the specified timestamp.
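One thing to watch for with such comparisons is the header line. Its first field, Timestamp, does not look like a number, so awk falls back to a string comparison for that line, and a letter sorts after a digit. The sketch below shows the effect on a hypothetical three-line excerpt of log.txt and how an NR > 1 guard avoids it; the variable names are ours:

```shell
# Hypothetical three-line excerpt of log.txt in a temporary directory.
cmp_dir=$(mktemp -d)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$cmp_dir/log.txt"

# Without a guard, the header passes the test through string comparison.
unguarded=$(awk '$1 > 1598863888' "$cmp_dir/log.txt")
printf '%s\n' "$unguarded"
# Timestamp Category Message
# 1598863901 ERROR Requested resource not found

# Skipping the first record leaves only genuine numeric comparisons.
guarded=$(awk 'NR > 1 && $1 > 1598863888' "$cmp_dir/log.txt")
printf '%s\n' "$guarded"
# 1598863901 ERROR Requested resource not found
```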

7. Conclusion

In this article, we started off with a basic introduction to grep, sed, and awk. Then, we showed the usage of grep on simple text scanning and matching. Next, we saw how sed is more useful than grep when we want to transform our text.

Finally, we’ve demonstrated how awk is capable of replicating grep and sed functionality while additionally providing more features for advanced text processing.
