1. Overview
In this article, we’ll go through the command-line tools grep, sed, and awk. In particular, we’ll study the differences in functionality among them.
2. Background
When it comes to text processing in Linux, three tools come in particularly handy: grep, sed, and awk. Although they are entirely different tools, their functionality overlaps in simple scenarios. For example, all three can find a pattern in a file and print the matches to the standard output.
However, if we stretch beyond this simple exercise, we’ll find that grep is only good for simple text matching and printing.
On the other hand, in addition to matching and printing text, sed offers text transformation commands such as substitution.
Finally, awk, being the most powerful of these tools, is a scripting language that offers a multitude of features that do not exist in the former two.
Before we begin, it is important to know that the purpose of this article is to make the distinction between these three tools clearer. Therefore, the examples we are covering are just a small subset of what is possible with each tool, especially in the case of sed and awk.
3. Text File
To facilitate our discussion, let’s define a text file log.txt:
Timestamp Category Message
1598843202 INFO Booting up system
1598843402 INFO Booting up critical service: Authorization
1598843502 INFO System booted successfully
1598853502 INFO User admin requested access for userlist
1598863888 ERROR User annonymous attempt to access protected resource without credentials
1598863891 INFO System health check status: passed
1598863901 ERROR Requested resource not found
1598864411 INFO User admin logged out
4. grep
The grep command searches for lines matching a regex pattern and prints those matching lines to the standard output. It is useful when we need a quick way to find out whether a particular pattern exists or not in the given input.
4.1. Basic Syntax
The syntax for grep is as follows:
grep [OPTIONS] PATTERN [FILE...]
PATTERN is a regex pattern defining what we want to find in the content of the files specified by the FILE argument. The optional OPTIONS are flags that modify the behavior of grep.
4.2. Searching for Lines That Match a Regex Pattern
Let’s say we want to extract the ERROR events from log.txt. We can do that with grep:
$ grep "ERROR" log.txt
1598863888 ERROR User annonymous attempt to access protected resource without credentials
1598863901 ERROR Requested resource not found
What happens here is that grep will scan through the lines in log.txt and print those lines containing the word ERROR to the standard output.
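Since PATTERN is a regex, the match can also be tied to a position on the line. The following is a minimal sketch rather than part of the original walkthrough: it assumes a shortened stand-in for log.txt, and its last line is invented to show a message that merely mentions ERROR.

```shell
# Hypothetical sample; the third line is made up for illustration
sample=$(mktemp)
printf '%s\n' \
  '1598863891 INFO System health check status: passed' \
  '1598863901 ERROR Requested resource not found' \
  '1598863999 INFO Cleared previous ERROR state' > "$sample"

# Plain pattern: counts every line that contains ERROR anywhere
loose=$(grep -c "ERROR" "$sample")

# Anchored extended regex: digits, a space, then ERROR as the second column
strict=$(grep -E '^[0-9]+ ERROR ' "$sample")

echo "$loose"
echo "$strict"
rm -f "$sample"
```

The anchored form keeps only the line whose Category column is ERROR, which may be closer to what we want when messages can contain the same word.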
4.3. Inverting the Match
We can invert the match using the -v flag:
grep -v "INFO" log.txt
When we execute the command above, grep will print every line in log.txt except those lines matching the pattern INFO.
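One detail that is easy to miss is that the header row does not contain INFO either, so it survives the inverted match. A small sketch, assuming a three-line stand-in for log.txt:

```shell
# Hypothetical shortened copy of log.txt
sample=$(mktemp)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$sample"

# -v keeps every line that does NOT contain INFO, including the header row
inverted=$(grep -v "INFO" "$sample")

echo "$inverted"
rm -f "$sample"
```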
4.4. Printing Preceding or Succeeding Lines
Sometimes, we may want to print the lines that precede or follow each match. To print the five lines after a match, we can use the flag -A:
grep -A 5 ERROR log.txt
On the other hand, to print the five lines before a match, we can use the flag -B:
grep -B 5 ERROR log.txt
Finally, the flag -C allows us to print both the five lines before and the five lines after a match:
grep -C 5 ERROR log.txt
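To make the effect of the three flags easier to compare, here is a sketch that uses one line of context instead of five. It assumes a five-line excerpt of log.txt with a single ERROR entry in the middle.

```shell
# Excerpt of log.txt with one ERROR line surrounded by INFO lines
sample=$(mktemp)
printf '%s\n' \
  '1598843502 INFO System booted successfully' \
  '1598853502 INFO User admin requested access for userlist' \
  '1598863888 ERROR User annonymous attempt to access protected resource without credentials' \
  '1598863891 INFO System health check status: passed' \
  '1598864411 INFO User admin logged out' > "$sample"

after=$(grep -A 1 ERROR "$sample")    # the match plus the line that follows it
before=$(grep -B 1 ERROR "$sample")   # the line before the match plus the match
around=$(grep -C 1 ERROR "$sample")   # one line on each side of the match

printf '%s\n\n%s\n\n%s\n' "$after" "$before" "$around"
rm -f "$sample"
```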
5. sed
The sed command is a stream editor that works on streams of characters. It’s a more powerful tool than grep as it offers more options for text processing purposes, including the substitute command, which sed is most commonly known for.
5.1. Basic Syntax
The sed command has the following general syntax:
sed [OPTIONS] SCRIPT FILE...
The OPTIONS are optional flags that modify the behavior of sed. Next, the SCRIPT argument is the sed script that will be executed on every line of the files specified by the FILE argument.
5.2. Script Structure
The sed script has the following structure:
[addr]X[options]
Here, addr is the condition applied to the lines of the text file. It can be a fixed line number or a regex pattern that is tested against the content of a line before processing it.
Next, the X character represents the sed command to execute. Each command is denoted by a single character; for example, s is the substitute command and p is the print command.
Finally, additional options can be passed to the command to adjust its behavior.
5.3. Using sed as grep
To start, let’s see how we can duplicate the functionality of grep using sed:
sed -n '/ERROR/ p' log.txt
By default, sed will print every line it is scanning to the standard output stream. To disable this automatic printing, we can use the flag -n.
Next, it will run the script that comes after the flag -n and look for the regex pattern ERROR on every line in log.txt. If there is a match, sed will print the line to standard output because we’re using the p command in the script. Finally, we pass log.txt as the name of the file we want sed to work on as the final argument.
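The role of -n becomes clearer when we compare the same script with and without it. This is a minimal sketch on a two-line excerpt of log.txt:

```shell
# Two-line excerpt of log.txt
sample=$(mktemp)
printf '%s\n' \
  '1598863891 INFO System health check status: passed' \
  '1598863901 ERROR Requested resource not found' > "$sample"

# Without -n: every line is printed automatically, and p prints the match again
without_n=$(sed '/ERROR/ p' "$sample")

# With -n: only the lines printed explicitly by p remain
with_n=$(sed -n '/ERROR/ p' "$sample")

printf '%s\n\n%s\n' "$without_n" "$with_n"
rm -f "$sample"
```

Without the flag, the ERROR line shows up twice, which is rarely what we want when emulating grep.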
5.4. Substituting Matched String With Replacement
The substitute command of sed has the following structure:
's/pattern/replacement/'
When there is a match on a line for pattern, sed will substitute the first occurrence on that line with replacement.
For example, if we want to substitute the word ERROR in our log.txt with the word CRITICAL, we can run:
sed 's/ERROR/CRITICAL/' log.txt
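By default, the substitute command replaces only the first match on each line; appending the g flag replaces all of them. The sketch below illustrates the difference on a made-up line (it is not taken from log.txt) that contains the word twice:

```shell
# Hypothetical input line with two occurrences of ERROR
line='ERROR: retry failed with ERROR code 7'

# No flag: only the first occurrence on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/ERROR/CRITICAL/')

# g flag: every occurrence on the line is replaced
all=$(printf '%s\n' "$line" | sed 's/ERROR/CRITICAL/g')

echo "$first_only"
echo "$all"
```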
5.5. Modifying Files in Place
If we want sed to persist the change on the file it is operating on, we can use the flag -i along with a suffix. Before making changes in place, sed will create a backup of the file and append the suffix to this backup filename. For instance, when we run:
sed -ibackup 's/ERROR/CRITICAL/' log.txt
a copy of the original log.txt will be saved as log.txtbackup before sed applies the changes in place.
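To try this safely, we can run the command in a scratch directory. The sketch below assumes a one-line log.txt created under a temporary directory; the workdir variable is introduced only for illustration.

```shell
# Work in a throwaway directory so no real file is modified
workdir=$(mktemp -d)
printf '%s\n' '1598863901 ERROR Requested resource not found' > "$workdir/log.txt"

# The suffix is attached directly to -i
sed -ibackup 's/ERROR/CRITICAL/' "$workdir/log.txt"

edited=$(cat "$workdir/log.txt")          # rewritten in place
backup=$(cat "$workdir/log.txtbackup")    # copy of the original content

echo "$edited"
echo "$backup"
rm -rf "$workdir"
```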
5.6. Restricting to a Specific Line Number
We can limit the sed command so it only operates on a specific line number using the addr slot in the script:
sed '3 s/ERROR/CRITICAL/' log.txt
This will run the script only on line 3 of log.txt.
Furthermore, we can specify a range of line numbers:
sed '3,5 s/ERROR/CRITICAL/' log.txt
In this case, sed will run the script on lines 3 to 5 of log.txt.
In addition, we can specify the bound with a regex pattern:
sed -n '3,/ERROR/ p' log.txt
Here, sed will print the lines of log.txt starting from line number 3, and ending when it finds the first line that matches the pattern /ERROR/.
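The two kinds of bounds can be compared side by side. This sketch assumes the first five lines of a shortened log.txt, with the ERROR entry on line 4:

```shell
# Shortened copy of log.txt: header, two INFO lines, one ERROR line, one INFO line
sample=$(mktemp)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598843402 INFO Booting up critical service: Authorization' \
  '1598863888 ERROR User annonymous attempt to access protected resource without credentials' \
  '1598863891 INFO System health check status: passed' > "$sample"

# Numeric range: lines 2 through 3
by_number=$(sed -n '2,3 p' "$sample")

# Mixed range: from line 3 up to and including the first line matching ERROR
by_regex=$(sed -n '3,/ERROR/ p' "$sample")

printf '%s\n\n%s\n' "$by_number" "$by_regex"
rm -f "$sample"
```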
6. awk
awk is a full-fledged programming language that is comparable to Perl. It not only offers a multitude of built-in functions for string, arithmetic, and time manipulation but also allows users to define their own functions, just like any regular scripting language. Let’s take a look at some examples of how it works.
6.1. Basic Syntax
The awk syntax is of the following form:
awk [options] script file
It will execute the script against every line in the file. Let’s now expand the structure of the script:
'(pattern){action}'
The pattern is a condition, most commonly a regex enclosed in slashes, that is tested against every input line. If a line matches the pattern, awk will then execute the statements defined in action on that line. If the pattern is absent, the action will be executed on every line.
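Both halves of the script are optional. As a small sketch on a two-line excerpt of log.txt, the first command below uses a comparison as the pattern and omits the action, in which case awk prints the matching line; the second omits the pattern so that the action runs on every line.

```shell
# Two-line excerpt of log.txt
sample=$(mktemp)
printf '%s\n' \
  '1598863891 INFO System health check status: passed' \
  '1598863901 ERROR Requested resource not found' > "$sample"

# Pattern only: lines whose second field equals ERROR are printed as-is
errors=$(awk '$2 == "ERROR"' "$sample")

# Action only: the second field of every line is printed
categories=$(awk '{print $2}' "$sample")

printf '%s\n\n%s\n' "$errors" "$categories"
rm -f "$sample"
```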
6.2. Replicating grep with awk
As we did with sed, let’s take a look at how we can emulate grep’s functionality using awk:
awk '/ERROR/{print $0}' log.txt
The code above will find the regex pattern ERROR in the log.txt file and print the matching line to the standard output.
6.3. Substituting the Matching String
Similarly, we can use awk’s built-in function gsub to substitute all ERROR occurrences with CRITICAL, just like in the sed example:
awk '{gsub(/ERROR/, "CRITICAL")}{print}' log.txt
The function gsub takes a regex pattern and a replacement string as arguments. Then, awk prints the line to the standard output.
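gsub also accepts an optional third argument naming the target of the substitution, which lets us restrict the change to a single column. The following sketch uses a made-up line in which ERROR appears both as the category and inside the message:

```shell
# Hypothetical line: ERROR is the category and also appears in the message
line='1598863901 ERROR Saw ERROR twice'

# Target $2 only: the message text is left untouched
category_only=$(printf '%s\n' "$line" | awk '{gsub(/ERROR/, "CRITICAL", $2)} {print}')

# No target: gsub works on the whole line ($0)
whole_line=$(printf '%s\n' "$line" | awk '{gsub(/ERROR/, "CRITICAL")} {print}')

echo "$category_only"
echo "$whole_line"
```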
6.4. Adding Header and Footer to the Document
In awk, there’s a BEGIN block that will execute before it starts processing any line of the file. On the other hand, there is also an END block that allows us to define what should be run after all the lines have been processed.
Let’s use BEGIN and END blocks to add a header and a footer to our text document:
$ awk 'BEGIN {print "LOG SUMMARY\n--------------"} {print} END {print "--------------\nEND OF LOG SUMMARY"}' log.txt
LOG SUMMARY
--------------
Timestamp Category Message
1598843202 INFO Booting up system
1598843402 INFO Booting up critical service: Authorization
1598843502 INFO System booted successfully
1598853502 INFO User admin requested access for userlist
1598863888 ERROR User annonymous attempt to access protected resource without credentials
1598863891 INFO System health check status: passed
1598863901 ERROR Requested resource not found
1598864411 INFO User admin logged out
--------------
END OF LOG SUMMARY
6.5. Column Manipulation
Processing documents with a row-and-column structure (CSV style) is where awk really shines. For instance, we can easily print the first and second columns of our log.txt and skip the rest:
awk '{print $1, $2}' log.txt
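The command above prints the header row as well. If that is not desired, one possible approach is to add the condition NR > 1, where the built-in variable NR holds the current line number. A sketch on a three-line stand-in for log.txt:

```shell
# Shortened copy of log.txt including the header row
sample=$(mktemp)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' > "$sample"

# NR > 1 skips the first line; $1 and $2 are the first two columns
columns=$(awk 'NR > 1 {print $1, $2}' "$sample")

echo "$columns"
rm -f "$sample"
```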
6.6. Custom Field Separator
By default, awk treats whitespace as the field delimiter. If the text being processed uses a different delimiter (a comma, for example), we can specify it with the flag -F:
awk -F "," '{print $1, $2}' log.txt
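Because log.txt itself contains no commas, the sketch below uses a hypothetical comma-separated version of one log line. It also sets the output separator OFS, since print joins its arguments with a space by default.

```shell
# Hypothetical CSV rendering of a log line
csv_line='1598843202,INFO,Booting up system'

# -F sets the input delimiter; the output is still joined with a space
spaced=$(printf '%s\n' "$csv_line" | awk -F "," '{print $1, $2}')

# -v OFS="," makes print join the fields with a comma as well
comma=$(printf '%s\n' "$csv_line" | awk -F "," -v OFS="," '{print $1, $2}')

echo "$spaced"
echo "$comma"
```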
6.7. Arithmetic Operation
The ability of awk to carry out arithmetic operations makes it easy to gather numerical information about a text file. For example, let’s calculate the number of ERROR event occurrences in log.txt:
awk '{count[$2]++} END {print count["ERROR"]}' log.txt
In the script above, awk uses the value of the Category column as a key in the array count and increments the matching entry for every line. Then, the END block prints the count stored under the key ERROR.
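Two small variations may be useful here, shown as a sketch on a shortened log.txt. Adding 0 forces a numeric result, so a category that never appears is reported as 0 rather than as an empty line, and a for loop over the array prints every category. WARN is a hypothetical category that does not occur in our file, and the output is piped through sort because awk does not guarantee the iteration order.

```shell
# Shortened copy of log.txt: header, two INFO lines, one ERROR line
sample=$(mktemp)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598843202 INFO Booting up system' \
  '1598863901 ERROR Requested resource not found' \
  '1598864411 INFO User admin logged out' > "$sample"

# Count one category; NR > 1 keeps the header out of the tally
error_count=$(awk 'NR > 1 {count[$2]++} END {print count["ERROR"] + 0}' "$sample")

# A category with no entries yields 0 thanks to the "+ 0"
warn_count=$(awk 'NR > 1 {count[$2]++} END {print count["WARN"] + 0}' "$sample")

# Print every category with its count, sorted for a stable order
summary=$(awk 'NR > 1 {count[$2]++} END {for (c in count) print c, count[c]}' "$sample" | sort)

echo "$error_count"
echo "$warn_count"
echo "$summary"
rm -f "$sample"
```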
6.8. Numeric Comparison
Being a full-fledged scripting language, awk readily understands decimal values. This makes text processing easy when we need our script to interpret values as a number rather than as a simple string.
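For example, to get all the log entries recorded after the timestamp 1598863888, we can use the greater-than operator. The NR > 1 condition skips the header row, whose first field is not a number and would otherwise be compared as a string:
$ awk 'NR > 1 { if ($1 > 1598863888 ) {print $0} }' log.txt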
1598863891 INFO System health check status: passed
1598863901 ERROR Requested resource not found
1598864411 INFO User admin logged out
From the output, we can see that the command only prints log lines that are recorded later than the specified timestamp.
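One caveat with this kind of comparison is that awk falls back to a string comparison when a field does not look like a number. The header word Timestamp is such a field, and since letters sort after digits, it passes the test unless the header row is skipped. A sketch on a three-line stand-in for log.txt, with LC_ALL=C set as an assumption to keep the string ordering predictable:

```shell
# Shortened copy of log.txt including the header row
sample=$(mktemp)
printf '%s\n' \
  'Timestamp Category Message' \
  '1598863888 ERROR User annonymous attempt to access protected resource without credentials' \
  '1598863891 INFO System health check status: passed' > "$sample"

# No guard: "Timestamp" is compared as a string and also satisfies the condition
unguarded=$(LC_ALL=C awk '$1 > 1598863888' "$sample")

# Guarded: NR > 1 skips the header, so only real timestamps are compared
guarded=$(LC_ALL=C awk 'NR > 1 && $1 > 1598863888' "$sample")

printf '%s\n\n%s\n' "$unguarded" "$guarded"
rm -f "$sample"
```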
7. Conclusion
In this article, we started off with a basic introduction to grep, sed, and awk. Then, we showed the usage of grep on simple text scanning and matching. Next, we saw how sed is more useful than grep when we want to transform our text.
Finally, we’ve demonstrated how awk is capable of replicating grep and sed functionality while additionally providing more features for advanced text processing.