1. Overview
sed is a versatile utility for working with text data in Linux.
In this tutorial, we’ll learn how to use GNU sed to remove multi-line text blocks in different scenarios.
2. Deleting a Text Block Between Two Regexes
In this section, we’ll learn how to delete text blocks between two regular expressions.
2.1. Understanding the Scenario
Let’s start by looking at the sample.txt file containing a few comments between two tags:
$ cat sample.txt
sample_1
sample_2
<!-- BEGIN -->
comment_1
comment_2
comment_3
<!-- END -->
sample_3
sample_4
Here, we want to delete the entire comment block, including the two tag lines.
Let’s use the delete (d) function with an address range that starts at the first tag and ends at the second:
$ sed '/<!-- BEGIN -->/,/<!-- END -->/d' sample.txt
sample_1
sample_2
sample_3
sample_4
Great! We’ve got the expected results.
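If we want to keep the two tag lines and remove only the content between them, one option is to nest two negated matches inside the range. The following is a minimal sketch that assumes GNU sed and recreates the sample file in a temporary directory; the workdir and kept_tags names are only for illustration:

```shell
# Recreate the sample file from this section in a temporary directory (illustrative path).
workdir=$(mktemp -d)
printf '%s\n' 'sample_1' 'sample_2' '<!-- BEGIN -->' \
    'comment_1' 'comment_2' 'comment_3' '<!-- END -->' \
    'sample_3' 'sample_4' > "$workdir/sample.txt"

# Within the BEGIN..END range, delete only the lines that match neither tag.
kept_tags=$(sed '/<!-- BEGIN -->/,/<!-- END -->/{/<!-- BEGIN -->/!{/<!-- END -->/!d;};}' "$workdir/sample.txt")
printf '%s\n' "$kept_tags"

rm -rf "$workdir"
```

With this variant, the three comment lines disappear while both tag lines stay in the output.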
2.2. Deleting a Paragraph Starting With a Known Prefix
Let’s start by inspecting a sample text file containing multiple paragraphs:
$ curl --silent https://filesamples.com/samples/document/txt/sample2.txt
Aeque <para1_remaining_content>
Non <para2_remaining_content>
Qui <para3_remaining_content>
Eademne, <para4_remaining_content>
Quem <para5_remaining_content>
Indicant pueri, <para6_remaining_content>
Note that we’ve used the <paraN_remaining_content> placeholder for the rest of each paragraph so that we can focus on the prefix alone. Moreover, each paragraph ends with an empty line. As a result, we can reuse the two-regex approach: the first regex matches the paragraph’s known prefix, and the second matches the empty line that ends it.
For instance, let’s see how we can delete the paragraph starting with the prefix “Quem”:
$ curl --silent https://filesamples.com/samples/document/txt/sample2.txt | sed -E -e '/^Quem/,/^$/d'
# original content excluding paragraph-5
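Since the output above depends on a remote file, here’s a small offline sketch of the same range applied to made-up paragraphs (the text and the variable names are assumptions for illustration). It also shows that the empty line closing the paragraph is deleted along with it:

```shell
# Three made-up paragraphs, each followed by an empty line except the last.
paragraphs=$(printf '%s\n' 'Aeque first line' 'Aeque second line' '' \
    'Quem first line' 'Quem second line' '' \
    'Qui only line')

# Delete from the line starting with "Quem" through the next empty line.
trimmed=$(printf '%s\n' "$paragraphs" | sed -E -e '/^Quem/,/^$/d')
printf '%s\n' "$trimmed"
```

If the matching paragraph were the last one and no empty line followed it, the range would simply run to the end of the input.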
2.3. Deleting Code Blocks Without Nesting
We can apply the same concept for deleting code blocks that don’t have nested blocks within them.
Let’s take a look at the if-block within the sample1.c file:
$ curl --silent https://filesamples.com/samples/code/c/sample1.c
// code before if block
if (fptr == NULL)
{
printf("Error!");
exit(1);
}
// code after if block
To delete the if-block, we can use the if literal as the first regex and the } character as the second regex:
$ curl --silent https://filesamples.com/samples/code/c/sample1.c | sed -E -e '/if/,/}/d'
// code before if block
// code after if block
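It works as expected for this code style. However, /if/ matches any line that contains the text if, and /}/ matches the first line with a closing brace, so for a different code style, we might need to tighten the regexes or change the approach altogether.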
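To illustrate how tighter regexes can help, the sketch below assumes a code style where the if keyword starts the line (after optional indentation) and the closing brace sits in the first column. The inline C snippet mirrors the sample above instead of fetching it from the network, and the variable names are only illustrative:

```shell
# Inline copy of the sample; note that both comment lines contain the word "if".
c_source=$(printf '%s\n' '// code before if block' 'if (fptr == NULL)' '{' \
    '    printf("Error!");' '    exit(1);' '}' '// code after if block')

# Start: "if" at the start of a line followed by "(". End: "}" in the first column.
without_if=$(printf '%s\n' "$c_source" | sed -E -e '/^[[:space:]]*if[[:space:]]*\(/,/^\}/d')
printf '%s\n' "$without_if"
```

Here, the two comment lines survive even though they contain the word if, which wouldn’t be the case with the bare /if/ address.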
3. Deleting Nested Code Blocks
In this section, we’ll learn why deleting the text between two regexes doesn’t work when code blocks are nested. Furthermore, we’ll write a sed script to solve this use case.
First, let’s take a look at the sample.json file that has nested blocks:
$ cat sample.json
{
"L1": {
"L2": {
"L3": {
"L4": {
"name": "roy",
"age": 23
}
}
}
}
}
Now, let’s see if we can delete the entire JSON object associated with the L3 key using the two-regex range approach:
$ cat sample.json | sed -E -e '/L3/,/}/d'
{
"L1": {
"L2": {
}
}
}
}
Unfortunately, the result is incorrect. Worse, the resultant text is not a valid JSON object because sed used the first closing brace it encountered, but that one belongs to the L4 JSON object.
Moving on, let’s write the delete_nested_codeblock.sed script to delete the nested JSON object corresponding to the L3 key:
$ cat delete_nested_codeblock.sed
:loop
$!N
$ b end
b loop
:end
s/[\t ]*"L3":[\t ]*\{\n[\t ]*"L4":[ \t]*\{[^{}]*\}[\t ]*\n[\t ]*}\n//
p
Let’s break this down to understand the logic. First, we use the N function in a loop to gather the entire file content in the pattern space. Then, once we reach the last line, we break out of the loop and use a multi-line regular expression that describes the target block to replace it with empty text. Lastly, the p function prints the result, which is necessary because we run the script with the -n option.
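To see the slurping loop in isolation, we can run it on a three-line input and join the lines with commas. This is a simplified variant of the loop above (it branches back with $!b instead of using a separate end label), and it assumes GNU sed:

```shell
# Collect all input lines in the pattern space, then replace each embedded newline with a comma.
joined=$(printf '%s\n' 'a' 'b' 'c' | sed -n -e ':loop' -e '$!N' -e '$!b loop' -e 's/\n/,/g' -e 'p')
printf '%s\n' "$joined"   # prints a,b,c
```

Because the substitution runs only after the last line has been appended, it sees every newline in the input at once.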
Finally, let’s see the script in action:
$ cat sample.json | sed -E -n -f delete_nested_codeblock.sed
{
"L1": {
"L2": {
}
}
}
Great! We’ve got it right this time.
From this exercise, we must realize that building the correct regular expressions is critical when working with sed. So, it’s recommended that we do our due diligence to check its correctness using online tools such as regexr or regex101.
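As an alternative to the explicit loop, GNU sed offers the -z (--null-data) option, which splits the input on NUL bytes instead of newlines. For a text file without NUL bytes, that places the whole file in the pattern space at once. The sketch below applies an equivalent substitution this way; it assumes GNU sed and two-space indentation in the JSON, and it uses [[:blank:]] in place of [\t ]:

```shell
# Inline copy of the nested JSON sample (the indentation is an assumption).
json=$(printf '%s\n' '{' '  "L1": {' '    "L2": {' '      "L3": {' \
    '        "L4": {' '          "name": "roy",' '          "age": 23' \
    '        }' '      }' '    }' '  }' '}')

# With -z, the whole input is one record, so \n in the regex can match line breaks.
pruned=$(printf '%s\n' "$json" | sed -E -z \
    's/[[:blank:]]*"L3":[[:blank:]]*\{\n[[:blank:]]*"L4":[[:blank:]]*\{[^{}]*\}[[:blank:]]*\n[[:blank:]]*\}\n//')
printf '%s\n' "$pruned"
```

The trade-off is portability: -z is a GNU extension, whereas the N-based loop relies only on standard sed functions.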
4. Length-Based Deletion
In this section, let’s explore a scenario where we want to delete all the text blocks that are shorter than a threshold length of 600 characters.
4.1. Understanding the Algorithm
Let’s start by sketching out an algorithm for removing all the paragraphs from a text file that are shorter than the threshold length:
assert para is empty
while !end_of_file:
    while !is_para_complete(para):
        para.append(next_line)
    if len(para) >= threshold:
        print(para)
    empty(para)
The algorithm is straightforward: we iteratively build paragraphs from the text file and print each one only when its length is at least the threshold value.
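Before moving to sed, it can help to have a plain shell version of the same algorithm to compare results against. The following is a minimal sketch, not the sed implementation itself: filter_paragraphs and print_if_long are hypothetical helper names, paragraphs are assumed to be separated by empty lines, and the threshold of 30 is chosen only to suit the tiny sample input:

```shell
# Hypothetical helper: read paragraphs from stdin and print those with at least $1 characters.
filter_paragraphs() {
    threshold=$1
    para=''
    while IFS= read -r line; do
        if [ -n "$line" ]; then
            # The paragraph isn't complete yet: append the line and its newline.
            para="${para}${line}
"
        else
            # An empty line completes the paragraph.
            print_if_long
            para=''
        fi
    done
    # Handle a final paragraph that isn't followed by an empty line.
    print_if_long
}

# ${#para} counts characters, including the newline after each line.
print_if_long() {
    if [ "${#para}" -ge "$threshold" ]; then
        printf '%s\n' "$para"
    fi
}

long_only=$(printf '%s\n' 'short one' '' \
    'this paragraph is long' 'enough to be kept' '' \
    'tiny' | filter_paragraphs 30)
printf '%s\n' "$long_only"
```

Only the two-line paragraph reaches the threshold here, so it’s the only one printed.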
4.2. sed Script
We’re going to follow our algorithm for writing the delete_shorter_paragraphs.sed script.
First, let’s take a look at the script in its entirety to build a complete picture of the flow:
$ cat delete_shorter_paragraphs.sed
:next
b build_para
:is_para_complete
s/(([^\n]{1}*[\n]{1}){1}*){1}\n/&/
t check_para_length
b next
:check_para_length
h
s/(([^\n]{1}*[\n]{1}){1}+)\n/\1/
s/(.{60}{10})(.{1}*)/\2/
t longer_than_threshold
b shorter_than_threshold
:longer_than_threshold
g
s/(([^\n]{1}*[\n]{1}){1}+)\n/\1/
p
s/.*//
$q
b next
:shorter_than_threshold
s/.*//
$q
b next
:append
N
b is_para_complete
:new
n
b is_para_complete
:build_para
s/^$//
t new
b append
We must note that we’ve created multiple labels. Further, we’ve used branching to control the execution flow of the script through these labels. Next, we’ll look at a subset of these labels and see how they work together.
To start with, let’s look at the append, new, and build_para labels:
:append
N
b is_para_complete
:new
n
b is_para_complete
:build_para
s/^$//
t new
b append
Here, we first check whether the pattern space is empty. If it is, we branch to the new label and use the n function to replace the pattern space with the next input line, which becomes the first line of a new paragraph. Otherwise, we branch to the append label and use the N function to append one more input line to the pattern space.
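The difference between the two functions is easy to see on a four-line input. In this small sketch (assuming GNU sed and the -n option, as in our script), n replaces the pattern space with the next line, whereas N appends the next line after a newline:

```shell
# n: the pattern space is replaced by the next line, so only every second line is printed.
with_n=$(printf '%s\n' 'a' 'b' 'c' 'd' | sed -n -e 'n' -e 'p')

# N: the next line is appended after a newline, which we turn into "+" to make it visible.
with_N=$(printf '%s\n' 'a' 'b' 'c' 'd' | sed -n -e 'N' -e 's/\n/+/p')

printf '%s\n' "$with_n"   # prints b and d on separate lines
printf '%s\n' "$with_N"   # prints a+b and c+d on separate lines
```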
Moving on, let’s look at the check_para_length label, which holds the most critical part of the script:
:check_para_length
h
s/(([^\n]{1}*[\n]{1}){1}+)\n/\1/
s/(.{60}{10})(.{1}*)/\2/
t longer_than_threshold
b shorter_than_threshold
Since sed has no direct way to measure the length of the pattern space, we’ve used a quantifier-based substitution for the length comparison. To begin with, we use the h function to save a copy of the pattern space in the hold space, because the substitutions that follow modify it. The last substitution succeeds only if the pattern space contains at least 600 characters. Subsequently, based on its outcome, the t function branches to the longer_than_threshold label; otherwise, we fall through to the branch to the shorter_than_threshold label.
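Additionally, it’s important to note that we wrote the bound as the stacked quantifier {60}{10}, which is equivalent to {600}. POSIX only requires regex implementations to support interval bounds up to 255 (RE_DUP_MAX), so a single {600} can be rejected by some sed builds. GNU sed generally accepts larger bounds, so it’s worth testing {600} directly on the target system before relying on either form.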
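The same length test can be tried in isolation with a much smaller bound. The sketch below assumes GNU sed; classify_length is a hypothetical helper, and the threshold of 10 characters is chosen only to keep the example short:

```shell
# Hypothetical helper: report whether its argument has at least 10 characters.
classify_length() {
    printf '%s\n' "$1" | sed -E -n \
        -e 's/.{10}/&/' \
        -e 't long' \
        -e 's/.*/shorter/p' \
        -e 'b' \
        -e ':long' \
        -e 's/.*/longer/p'
}

short_result=$(classify_length 'nine char')    # 9 characters
long_result=$(classify_length 'ten chars!')    # 10 characters
printf '%s\n' "$short_result" "$long_result"   # prints shorter, then longer
```

The first substitution replaces the match with itself, so it changes nothing; we use it only because t branches when a substitution has succeeded.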
Finally, let’s see the script in action:
$ curl --silent https://filesamples.com/samples/document/txt/sample2.txt \
| sed -E -n -f delete_shorter_paragraphs.sed
Indicant pueri, in quibus ut in speculis natura cernitur.
...
Great! We’ve nailed it.
5. Conclusion
In this tutorial, we learned how to delete multi-line text blocks using sed. Additionally, we solved the use cases by writing sed scripts that required us to work with multiple lines in the pattern space and make comparisons using regular expressions.