1. Overview
When we work on the Linux command line, we can handle common line-based text searches with a handy utility: the grep command.
However, sometimes, our target data is in a block between two patterns. In this tutorial, we’re going to discuss how to extract data blocks between two patterns.
2. Introduction to the Problem
First of all, let’s see an example input file. It’ll help us understand the problem quickly:
kent$ cat input.txt
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
As the output above shows, the input file contains lines beginning with “[ Block #x ] ...”. These data blocks always sit between two patterns: “DATA BEGIN” and “DATA END”.
Our goal is to walk through the input file and extract all the data blocks between the two patterns.
Apart from printing the data blocks, in the real world, we may have various requirements regarding their boundaries, which are the lines matching the two patterns:
- Including both boundaries
- Including the “DATA BEGIN” line only
- Including the “DATA END” line only
- Excluding both boundaries
In this tutorial, we’re going to cover all the above scenarios, and we’ll address how to solve the problem using GNU sed and GNU awk.
3. Using the sed Command
The sed command is a common command-line text processing utility. It supports address ranges.
For example, sed '/Pattern1/, /Pattern2/{ commands }' … applies the commands to every line in the range. Here, the range starts at the first line matching /Pattern1/ and ends at the next line matching /Pattern2/, with both lines included. If no later line matches /Pattern2/, the range runs to the end of the input.
sed’s address ranges can help us solve our problem. Next, let’s take a closer look at the solutions.
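As a small illustration of the address-range syntax, the sketch below applies a substitution only to the lines inside a range. The START and STOP markers and the temporary file are made up for this example:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'outside' 'START' 'inside' 'STOP' 'outside' > "$tmp"

# Prefix every line in the range /START/ .. /STOP/ with "> ".
# Lines outside the range pass through unchanged.
range_out=$(sed '/START/,/STOP/s/^/> /' "$tmp")
printf '%s\n' "$range_out"

rm -f "$tmp"
```

Only the three lines from START through STOP receive the prefix, which shows that both matching lines belong to the range.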
3.1. Printing the Data Blocks Including Both Boundaries
First, let’s have a look at the sed command solving the problem:
kent$ sed -n '/DATA BEGIN/, /DATA END/p' input.txt
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
As we can see, the output is what we’re expecting. The command looks pretty straightforward.
But let’s quickly understand the -n option and the p command usage since we will use this combination to solve problems in other scenarios.
The sed command will, by default, print the pattern space at the end of each cycle.
However, in this example, we only want to ask sed to print the lines we need. Therefore, we’ve used the -n option to prevent the sed command from printing the pattern space. Instead, we’ll control the output using the p command.
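To see what the -n option changes, we can run the same range command with and without it. This is a minimal sketch on a made-up five-line input:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'a' 'DATA BEGIN' 'b' 'DATA END' 'c' > "$tmp"

# Without -n: every line is auto-printed, and p prints the range lines a second time.
without_n=$(sed '/DATA BEGIN/,/DATA END/p' "$tmp")

# With -n: auto-printing is off, so only the lines printed by p appear.
with_n=$(sed -n '/DATA BEGIN/,/DATA END/p' "$tmp")

printf -- '--- without -n ---\n%s\n' "$without_n"
printf -- '--- with -n ---\n%s\n' "$with_n"

rm -f "$tmp"
```

Without -n, the lines inside the range show up twice and the lines outside it show up once. With -n, only the range lines remain.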
3.2. Printing the Data Blocks Including the “BEGIN” Boundary Only
Now, we have a new requirement: including the “BEGIN” boundary only. In other words, we must suppress the “END” boundary output.
Therefore, we can do one more check on the lines in the address range and skip printing the line matching the “END” pattern:
kent$ sed -n '/DATA BEGIN/, /DATA END/{ /DATA END/!p }' input.txt
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
As the output above shows, we’ve solved the problem.
3.3. Printing the Data Blocks Including the “END” Boundary Only
Solving this problem won’t be a challenge now, since it’s quite similar to the one we’ve just solved. All we need to do is change the pattern inside sed’s { … } block:
kent$ sed -n '/DATA BEGIN/, /DATA END/{ /DATA BEGIN/!p }' input.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
3.4. Printing the Data Blocks Excluding Both Boundaries
Finally, let’s address how to print the data block only without the boundary lines.
We may think we can easily satisfy this requirement by joining two further checks with a logical AND, something like sed -n '/BEGIN/, /END/{ ( /BEGIN/! AND /END/! ) { p } }' …
However, sed has no logical operators, so we cannot join two addresses with an AND operation. Instead, we can nest the two checks, which has the same effect as an AND operation.
Next, let’s see how it is done:
kent$ sed -n '/DATA BEGIN/, /DATA END/{ /DATA BEGIN/! { /DATA END/! p } }' input.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
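As an alternative to nesting, we could also write the two checks as a flat list: delete each boundary line inside the range, then print whatever is left. This is only a sketch of a different phrasing, shown on a made-up input rather than on input.txt:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'skip' 'DATA BEGIN' 'row 1' 'row 2' 'DATA END' 'skip' > "$tmp"

# Inside the range, d drops a boundary line and ends the current cycle,
# so p is only reached for the data lines.
flat_out=$(sed -n '/DATA BEGIN/,/DATA END/{
/DATA BEGIN/d
/DATA END/d
p
}' "$tmp")
printf '%s\n' "$flat_out"

rm -f "$tmp"
```

Because d ends the cycle immediately, a boundary line never reaches the p command, which gives the same effect as the nested negations.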
Thanks to sed’s address ranges, we can solve our problem in all four scenarios using the sed command.
4. Using the awk Command
The awk command is a powerful command-line text processing tool as well.
If we review the sed solutions, we’ll realize that although sed can solve the problem, its limited programming features make it hard to write the commands in a natural way, particularly as the requirements get more complicated.
Unlike the sed command, the awk command supports a scripting language with a “C-like” syntax. We can build our awk command or script using many programming language features we’re familiar with, such as variables, logical operations, and functions.
Next, let’s see how to solve our problem using the awk command.
4.1. Printing the Data Blocks Including Both Boundaries
Similar to sed, the awk command supports range patterns too. Therefore, we can solve the problem in the same way:
kent$ awk '/DATA BEGIN/, /DATA END/' input.txt
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
In the awk command above, we didn’t explicitly write print. This is because a pattern without an action triggers the default action whenever it evaluates to true: print the current line.
Only the lines within the range pattern evaluate to true. Therefore, we’ve got the expected data in the output.
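The default action can be checked on its own with a tiny, made-up input. The sketch below compares a pattern-only rule with the same rule that spells out the print action:

```shell
# Hypothetical three-line input, used only for this illustration.
lines=$(printf '%s\n' 'one' 'two' 'three')

# A pattern without an action prints the matching record ...
implicit=$(printf '%s\n' "$lines" | awk 'NR == 2')

# ... which behaves like writing the action explicitly.
explicit=$(printf '%s\n' "$lines" | awk 'NR == 2 { print $0 }')

printf 'implicit: %s\nexplicit: %s\n' "$implicit" "$explicit"
```

Both commands select the second line only.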
Moreover, if a variable holds a non-zero number, the awk command evaluates this variable as true as well.
Therefore, we can use a variable to turn printing on and off under certain conditions. In this way, we can control the boundary output more straightforwardly:
kent$ awk '/DATA BEGIN/{ f = 1 } f; /DATA END/{ f = 0 }' input.txt
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
This time, we don’t use a range pattern. Instead, we use a variable f as a switch for awk’s printing.
We turn it on when a line matches the “BEGIN” of a data block: /DATA BEGIN/{ f = 1 }, and print the “BEGIN” boundary with “f;”.
Since the switch f has been turned on, we will print all the following lines until the “END” line comes.
When the “END” line arrives, we first print it since the value of the variable f is still 1. Then, we turn off the switch: /DATA END/{f = 0} to prevent outputting the following lines.
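To make the order of the rules visible, we can replace the bare f with a rule that prints the value of the switch next to each line. This is a tracing sketch on a made-up input, not one of the solutions:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'skip' 'DATA BEGIN' 'row' 'DATA END' 'skip' > "$tmp"

# Same rule order as above, but the middle rule reports the switch value
# instead of printing conditionally. (f + 0) shows 0 while f is still unset.
trace=$(awk '/DATA BEGIN/{ f = 1 } { print "f=" (f + 0) " | " $0 } /DATA END/{ f = 0 }' "$tmp")
printf '%s\n' "$trace"

rm -f "$tmp"
```

The trace reports f=1 for the “DATA END” line, because the rule that resets the switch runs only after the line has been handled.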
We can use this “printer switch” idea to solve the problem in other scenarios.
Next, let’s see them in detail.
4.2. Printing the Data Blocks Including the “BEGIN” Boundary Only
We can slightly change the awk command in the previous section to let it print the target data blocks and the “BEGIN” boundary lines only:
kent$ awk '/DATA BEGIN/{ f = 1 } /DATA END/{ f = 0 } f' input.txt
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
Let’s compare this awk command to the one that prints the data including both boundary lines:
... '/DATA BEGIN/{ f = 1 } f; /DATA END/{ f = 0 }' ... <--- Including both boundaries
... '/DATA BEGIN/{ f = 1 } /DATA END/{ f = 0 } f ' ... <--- Including the BEGIN boundary only
The only change we’ve made is to move the f after the “END” pattern check.
If the “END” boundary line comes, we turn off the switch. After that, we check the switch and print the output. That is, the “END” boundary lines won’t be printed.
4.3. Printing the Data Blocks Including the “END” Boundary Only
Following the same idea, if we move the f before the “BEGIN” pattern check, the “BEGIN” boundary lines won’t appear in the output:
kent$ awk 'f; /DATA BEGIN/{ f = 1 } /DATA END/{ f = 0 }' input.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
The command isn’t hard to understand. However, let’s quickly explain why we can use the variable f before assigning a value to it.
In awk, a variable that hasn’t been assigned yet holds an empty string, which is also treated as the number 0. Either way, the variable evaluates to false, so the default action (print) isn’t triggered.
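We can confirm this behavior with a short BEGIN-only program that never assigns f. This is a minimal sketch that needs no input file:

```shell
unset_demo=$(awk 'BEGIN {
    # f has never been assigned in this program.
    if (!f)      print "f is false"
    if (f == 0)  print "f compares equal to 0"
    if (f == "") print "f compares equal to the empty string"
    print "length: " length(f)
}')
printf '%s\n' "$unset_demo"
```

All three conditions hold, and the length of f is 0.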
4.4. Printing the Data Blocks Excluding Both Boundaries
Now, let’s have a look at how to exclude all boundary lines in the output:
kent$ awk '/DATA BEGIN/{ f = 1; next } /DATA END/{ f = 0 } f' input.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
This time, we cannot solve the problem merely by tuning the position of the variable f.
As the example shows, the tricky part is that when the “BEGIN” pattern matches, we turn the output on and execute the next statement immediately: /DATA BEGIN/{ f = 1; next }.
The next statement stops processing the current line and reads the next line from the input. Therefore, we only turn on the switch but don’t print the “BEGIN” boundary.
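Since awk supports logical operators, another way to exclude both boundaries is to combine a range pattern with an explicit AND check. This is a sketch of an alternative on a made-up input, not a replacement for the switch approach:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'skip' 'DATA BEGIN' 'row 1' 'row 2' 'DATA END' 'skip' > "$tmp"

# The range pattern selects each block including its boundaries.
# The && condition then filters the two boundary lines out.
and_out=$(awk '/DATA BEGIN/,/DATA END/ { if (!/DATA BEGIN/ && !/DATA END/) print }' "$tmp")
printf '%s\n' "$and_out"

rm -f "$tmp"
```

This is the direct AND of the two checks that we had to express through nesting in the sed solution.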
5. A Corner Case
We’ve learned to extract data lines between two patterns using awk and sed. In our input file, the “BEGIN” and the “END” patterns are well paired.
However, in the real world, the input file can be incomplete. Let’s see another example:
kent$ cat input2.txt
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
DATA BEGIN
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
DATA END
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
DATA BEGIN
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
DATA END
XXXX we want to skip this line XXXX
XXXX we want to skip this line XXXX
DATA BEGIN
[ Block #3 ] ... Incomplete data
[ Block #3 ] ... Incomplete data
In the input2.txt file, the last data block has only a “BEGIN” pattern. If we apply the sed and awk solutions to this file, the incomplete data lines will appear in the output as well:
kent$ awk '/DATA BEGIN/{ f = 1; next } /DATA END/{ f = 0 } f' input2.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
[ Block #3 ] ... Incomplete data
[ Block #3 ] ... Incomplete data
kent$ sed -n '/DATA BEGIN/, /DATA END/{ /DATA BEGIN/! { /DATA END/! p } }' input2.txt
... the same output as the awk command...
Depending on the requirement, we may want to print only the complete data blocks and discard the incomplete data.
Next, let’s see how to handle this corner case using sed and awk.
5.1. Using the awk Command
First, let’s look at the working solution, and then we’ll discuss how it works:
kent$ awk 'f { if (/DATA END/){
               printf "%s", buf; f = 0; buf=""
           } else
               buf = buf $0 ORS
         }
         /DATA BEGIN/ { f = 1 }' input2.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
In the command, we’re still using the f variable as the output switch. However, an extra if… else… branch comes into the picture. Let’s understand how it works.
Before we dive into the awk code, let’s consider what our main problem is.
The awk command processes lines in an input file sequentially.
Therefore, the difficulty is that when we see the “BEGIN” line, we have no idea whether the coming data block is complete. That is to say, we cannot decide if we should print a line in the data block until we reach the “END” line.
To solve this problem, we can first store the data lines in a variable, say buf, instead of printing them. Only when the “END” line arrives do we print the stored value and reset the buf variable.
Now, let’s take a closer look at how the code works:
- f { … } : We’re still using the f variable as the flag to indicate if a line is in our target data block. If f is True, we’ll process the logic within { … }
- if (/DATA END/){printf "%s", buf; f = 0; buf=""} : If the current line is the “END” boundary, the block is complete. Therefore, we print the value in buf, turn off the printer switch, and reset the buf variable
- else buf = buf $0 ORS : However, if the current line isn’t the “END” boundary, we append the current line to the buf variable, followed by the output record separator ORS (a newline by default)
- /DATA BEGIN/ { f = 1 } : This line is not new to us. We turn on the switch f if the “BEGIN” boundary line comes
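One case the command above doesn’t treat specially is a second “BEGIN” line arriving before any “END” line: the unfinished lines stay in buf and are printed with the next complete block. If we’d rather discard them, one possible variation is to reset buf on every “BEGIN” line. The sketch below is an extension under that assumption, shown on a made-up input:

```shell
# Hypothetical sample file: an unfinished block, a complete block,
# and another unfinished block at the end.
tmp=$(mktemp)
printf '%s\n' \
    'DATA BEGIN' 'orphan row' \
    'DATA BEGIN' 'row 1' 'row 2' 'DATA END' \
    'DATA BEGIN' 'unfinished row' > "$tmp"

complete=$(awk '
    /DATA BEGIN/ { f = 1; buf = ""; next }   # new block: discard anything unfinished
    /DATA END/   { if (f) printf "%s", buf; f = 0; buf = ""; next }
    f            { buf = buf $0 ORS }        # buffer data lines until END confirms the block
' "$tmp")
printf '%s\n' "$complete"

rm -f "$tmp"
```

Here, only “row 1” and “row 2” are printed. Whether an unterminated block should be dropped like this depends on the requirement.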
5.2. Using the sed Command
Unfortunately, sed doesn’t support variables, but we can still solve the problem by working with the pattern space and the hold space:
kent$ sed -n '/DATA BEGIN/,/DATA END/{/DATA END/{s/.*//;x;s/^\n//;p;d};/DATA BEGIN/!H }' input2.txt
[ Block #1 ] ... 1992-08-08 08:08:08
[ Block #1 ] ... DATA #1 IN BLOCK
[ Block #1 ] ... 2018-03-06 15:33:23
[ Block #2 ] ... 2021-02-01 00:01:00
[ Block #2 ] ... DATA #2 IN BLOCK
[ Block #2 ] ... 2021-02-02 01:00:00
The command may not look straightforward at first, but it isn’t hard to understand.
Next, let’s walk through it quickly:
- /DATA BEGIN/,/DATA END/{ … } : sed’s range address isn’t new to us. The logic within { … } will be processed if a line is in the range
- /DATA END/{s/.*//;x;s/^\n//;p;d}; : If the current line is the “END” boundary, we clear the pattern space (s/.*//;), exchange the contents of the pattern and hold spaces (x;), remove the leading newline that the first H command left behind (s/^\n//;), print the content (p;), and delete the pattern space to start the next cycle (d). After the exchange, the hold space is empty and ready for the next block
- /DATA BEGIN/!H : If the current line isn’t the “BEGIN” boundary, it’s a normal data line in our target data block. For such lines, we append them to the hold space (H)
As we can see, the sed command uses the hold space as a variable to hold the data lines. Basically, it implements the same idea as we’ve done with the awk command.
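For readability, the same commands can also be written as a multi-line script with one command per line and comments in between. The sketch below runs it on a made-up input that ends with an unfinished block:

```shell
# Hypothetical sample file, created only for this illustration.
tmp=$(mktemp)
printf '%s\n' 'skip' 'DATA BEGIN' 'row 1' 'row 2' 'DATA END' \
    'DATA BEGIN' 'unfinished row' > "$tmp"

held=$(sed -n '
/DATA BEGIN/,/DATA END/{
    /DATA END/{
        # Empty the pattern space, then swap it with the hold space.
        s/.*//
        x
        # H stored each line with a leading newline; drop the first one.
        s/^\n//
        p
        d
    }
    # Any other non-BEGIN line in the range is appended to the hold space.
    /DATA BEGIN/!H
}' "$tmp")
printf '%s\n' "$held"

rm -f "$tmp"
```

The unfinished row is appended to the hold space but never printed, because no “END” line arrives to trigger the exchange.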
6. Conclusion
In this article, we’ve learned how to extract data lines between two patterns. We’ve addressed four different scenarios of boundary handling through examples.
Also, we’ve discussed how to handle the case where a data block is incomplete.