1. Overview

sed is a powerful command-line tool for processing text files in a controlled and customizable manner. It also makes pattern matching with regular expressions remarkably easy.

In this tutorial, we’ll learn how to match a specific pattern “N” times using sed.

2. Single-Line Input Scenario

Let’s start by looking at the content of the players.txt text file:

$ cat players.txt
player1 player2 player3 player4 player5 player6 player7 player8

We can notice that the file contains a list of player ids separated by whitespace.

In the following sections, our initial goal will be to explore different strategies to pair up the players and show each group on a separate line. For this purpose, we’ll need to match the pattern of player ids exactly twice.

3. Using Substitution and Backreference

In this section, we’ll solve our use case of grouping the players in pairs using substitution (s) and backreferencing.

3.1. With Pattern Backreferencing

For simplicity, let’s assume that each player id starts with a letter and contains no spaces. So, we can write a regular expression that identifies a single player id:

[a-zA-Z][^ ]*

Before proceeding further, we can quickly visit regex101.com and validate this regular expression against a few ids available in the players.txt file.
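
As an offline alternative, a quick check along the following lines can list what the expression matches. This is only a sketch: it assumes a grep that supports the -o option (GNU and BSD grep do, but POSIX doesn’t require it), and the sample ids are copied from players.txt:

```shell
# List each substring that the id expression matches, one per line.
# Assumes grep supports -o (a common extension, not required by POSIX).
matches=$(printf 'player1 player2 player3\n' | grep -oE '[a-zA-Z][^ ]*')
printf '%s\n' "$matches"
```

If the expression behaves as intended, each id shows up on its own line.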

Now, we’ve gained confidence in the correctness of our regular expression. So, let’s go ahead and write our sed script using substitution with grouping:

$ sed -E \
-e 's/([a-zA-Z][^ ]*) ([a-zA-Z][^ ]*) /\1 \2\n/g' players.txt
player1 player2
player3 player4
player5 player6
player7 player8

Our script is working as expected.

Furthermore, we must note that \1 and \2 are backreferences to the two ids captured by the groups in the search pattern. The newline (\n) that follows them puts each pair on a separate line.
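
To see that each backreference really refers to its own captured group, we can reorder them in the replacement. The snippet below is an illustrative sketch rather than part of the solution; the four sample ids and the swapped variable name are chosen only for this example:

```shell
# Swap the two ids in every pair by reordering the backreferences.
swapped=$(printf 'player1 player2 player3 player4\n' |
    sed -E 's/([a-zA-Z][^ ]*) ([a-zA-Z][^ ]*)/\2 \1/g')
printf '%s\n' "$swapped"
# prints: player2 player1 player4 player3
```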

3.2. With Pattern Repetition

We can further simplify our approach to matching the pattern twice by specifying the exact number of occurrences to match within the curly braces ({}):

<regex>{k}

So, let’s apply this concept to modify our sed script:

$ sed -E -e 's/([a-zA-Z][^ ]* ){2}/&\n/g' players.txt
player1 player2
player3 player4
player5 player6
player7 player8

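Great! It’s more concise now and gives us the same results. The other difference in our script is that we use & to refer to the entire string that matched our search pattern. Since that match includes the space after the second id, every output line except the last one ends with a trailing space.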
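
Because the repetition count is just a number inside the braces, the same idea extends to any group size. Here’s a minimal sketch under a few assumptions: group_ids is a hypothetical helper name, its first argument is the group size, and GNU sed is used because of \n in the replacement. A second expression strips the space that would otherwise be left before each newline:

```shell
# group_ids N: hypothetical helper that reads space-separated ids on stdin
# and prints them N per line. Assumes GNU sed (\n in the replacement).
group_ids() {
    sed -E -e "s/([a-zA-Z][^ ]* ){$1}/&\n/g" -e 's/ \n/\n/g'
}

# Group six ids three at a time.
triples=$(printf 'player1 player2 player3 player4 player5 player6\n' | group_ids 3)
printf '%s\n' "$triples"
```

Calling group_ids 2 on the content of players.txt should give the same pairs as the script above.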

3.3. With Explicit Newline

The \n escape sequence in the replacement text is a GNU sed extension, so it isn’t guaranteed to work in other sed implementations. The form that POSIX specifies is a backslash followed by an explicit newline (pressing enter). So, when portability matters, or when a project’s conventions require it, we can specify the newline character explicitly in our sed script.

Let’s tweak our first script to see this approach in action:

$ sed -E -e 's/([a-zA-Z][^ ]*) ([a-zA-Z][^ ]*) /\1 \2\
/g' players.txt
player1 player2
player3 player4
player5 player6
player7 player8

The output looks correct. Further, we can notice the explicit newline as part of the replacement text in the substitution (s) command.

Additionally, let’s do the same for our script that used pattern repetition:

$ sed -E -e 's/([a-zA-Z][^ ]* ){2}/&\
/g' players.txt
player1 player2
player3 player4
player5 player6
player7 player8

It gives the desired output, as expected. Although the \n form is easier to read, let’s note that the explicit newline is the safer choice whenever our script has to run on sed implementations other than GNU sed.
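
When the script lives inside a larger shell script, a literal line break in the middle of a command can be easy to miss. One possible workaround, sketched below, is to keep the newline in a shell variable and expand it after a backslash. The variable names newline and pairs are assumptions made for this example:

```shell
# Hold a literal newline in a variable (the quotes span two lines on purpose).
newline='
'
# Inside double quotes, \\ becomes a single backslash, so sed receives
# a backslash followed by a real newline in the replacement text.
pairs=$(printf 'player1 player2 player3 player4\n' |
    sed -E "s/([a-zA-Z][^ ]*) ([a-zA-Z][^ ]*) /\\1 \\2\\${newline}/g")
printf '%s\n' "$pairs"
```

This keeps the sed expression on one line while still passing sed the backslash-newline form.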

3.4. With Character Class

Using character classes, we can improve our regular expression even further. In our scenario, we expect each player id to be an alphanumeric set of characters, so we can use the [[:alnum:]] character class to identify it. Additionally, we can use the [[:blank:]] character class, which matches a space or a tab, for the separator between player ids.

Let’s go ahead and implement this approach in our sed script:

$ sed -E -e 's/([[:alnum:]]+[[:blank:]]+){2}/&\n/g' players.txt
player1 player2
player3 player4
player5 player6
player7 player8

Perfect! Our script is more concise and readable while giving the same results.
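
One practical benefit of [[:blank:]] is that the same expression also handles tab-separated ids. The following sketch assumes GNU sed (for \n in the replacement) and uses made-up tab-separated input. It requires at least one character for each id and each separator so that the expression can never match an empty string:

```shell
# Pair up tab-separated ids; [[:blank:]] matches both spaces and tabs.
# Assumes GNU sed (\n in the replacement).
tab_pairs=$(printf 'player1\tplayer2\tplayer3\tplayer4\n' |
    sed -E 's/([[:alnum:]]+[[:blank:]]+){2}/&\n/g')
printf '%s\n' "$tab_pairs"
```

As with the space-separated version, the separator that follows the second id of a matched pair stays at the end of its line.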

4. Multi-Line Input Scenario

In this section, let’s extend our scenario to multi-line input and solve our use case using the sed utility.

4.1. Understanding the Scenario

First, let’s start by taking a look at the players_multiline.txt text file that contains multi-line input:

$ cat players_multiline.txt
player1 player2 player3
player4
player5
player6 player7 player8

We can notice that lines can have an arbitrary number of player ids.

Like earlier, our goal is to match the pattern for player ids and group these players into individual pairs.

4.2. With tr Utility

We’ve solved the single-line input scenario using a sed script. If we can transform our multi-line input into a single-line input, we can use the same sed script to solve this scenario conveniently.

So, let’s start by using the tr command to convert each newline character (\n) into a single space:

$ cat players_multiline.txt | tr '\n' ' '
player1 player2 player3 player4 player5 player6 player7 player8 

Now, we’ve reduced the problem to a single-line input scenario, so we can directly use the earlier sed script to show the output containing one pair in each output line:

$ cat players_multiline.txt | tr '\n' ' ' | sed -E -e 's/([a-zA-Z][^ ]* ){2}/&\n/g'
player1 player2
player3 player4
player5 player6
player7 player8

Fantastic! Our approach worked as planned.
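
The same pipeline can also read the file through input redirection, which saves the extra cat process. The sketch below recreates the sample input in a temporary file so that it’s self-contained. The variable names input and pairs are chosen for this example, and GNU sed is assumed because of \n in the replacement:

```shell
# Recreate the multi-line sample input in a temporary file.
input=$(mktemp)
printf 'player1 player2 player3\nplayer4\nplayer5\nplayer6 player7 player8\n' > "$input"

# Join the lines with tr, then pair the ids with sed (GNU sed assumed).
pairs=$(tr '\n' ' ' < "$input" | sed -E 's/([a-zA-Z][^ ]* ){2}/&\n/g')
printf '%s\n' "$pairs"

rm -f "$input"
```

Since tr also turns the final newline into a space, every pair, including the last one, is followed by a space here.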

4.3. With N Command

We can further refine our approach by removing the dependency on the tr command. For this purpose, we can use the N command to get the entire file’s content in the pattern space as a single line.

Let’s go ahead and write a sed script to solve our use case using this approach:

$ cat match_multiline.sed
:parse_multi_line
$! {
N
b parse_multi_line
}
s/[[:space:]]*\n[[:space:]]*/ /g
s/([a-zA-Z][^ ]* ){2}/&\n/g
p

We must note that we’ve used the N command to append the next input line to the pattern space, and the branch to parse_multi_line repeats this on every line except the last one ($!). Then, the first substitution (s) command replaces each embedded newline, along with any whitespace around it, with a single space. Lastly, we use our earlier substitution command to match the pattern twice, and the p command prints the result because we run the script with the -n option.

Now, let’s see our script in action:

$ sed -n -E -f match_multiline.sed players_multiline.txt
player1 player2
player3 player4
player5 player6
player7 player8

Great! We’ve got this one right!
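
For quick experiments, the same logic can be passed on the command line as separate -e expressions instead of a script file. This is a sketch and not a replacement for the script: it uses a short label a and the $!N form so that single-line input also works, and it assumes GNU sed for \n in the replacement. The temporary file and the pairs variable exist only for this example:

```shell
# Recreate the multi-line sample input in a temporary file.
input=$(mktemp)
printf 'player1 player2 player3\nplayer4\nplayer5\nplayer6 player7 player8\n' > "$input"

# Collect all lines, join them with single spaces, then pair the ids.
pairs=$(sed -n -E \
    -e ':a' -e '$!N' -e '$!ba' \
    -e 's/[[:space:]]*\n[[:space:]]*/ /g' \
    -e 's/([a-zA-Z][^ ]* ){2}/&\n/g' \
    -e 'p' "$input")
printf '%s\n' "$pairs"

rm -f "$input"
```

Splitting the label and the branch into separate -e expressions avoids relying on how a particular sed parses a label followed by a semicolon.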

5. Using Hold Space

Due to memory constraints, getting the entire file content into the pattern space in a single go for large input files is usually not recommended. So, let’s explore an optimized approach to solve our use case using the hold space.

5.1. Algorithm With Pseudo Code

While our use of hold space is memory-optimized, it comes with increased code complexity. So, let’s start by looking at our approach using pseudo-code, which is easier to follow than a sed script:

pattern_space=first_line
while true
    copy(pattern_space, hold_space)
    if pattern_space.has_prefix(pattern)
        print(matched_prefix)
        pattern_space=""
        swap(pattern_space, hold_space)
        pattern_space.remove_prefix(pattern)
    else
        if is_last_line()
            if pattern_space!=""
                print(pattern_space)
            fi
            quit()
        else
            pattern_space.append("\n").append(next_line)
            remove_leading_newline(pattern_space)
            replace_newline_with_space(pattern_space)
        fi
    fi
done

Now, let’s understand the logic.

First, we start with the initial input line in pattern_space and save a copy of it in hold_space to preserve the current value. If pattern_space starts with the pattern, we print the matched prefix, restore the saved copy, and remove the prefix from it before looping again. Otherwise, we append the next input line and retry, unless we’re already on the last line, in which case we print whatever is left and quit.

Lastly, it’s important to note that we normalize the whitespace in pattern_space after appending a line so that the ids remain separated by single spaces.
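
To make the streaming idea concrete outside of sed, here’s a rough shell sketch of the same strategy. It isn’t a line-by-line translation of the pseudo-code: instead of a pattern space and a hold space, it keeps at most one unpaired id in a variable. The function name pair_up and the variable pending are hypothetical, and the unquoted expansion assumes ids contain no glob characters:

```shell
# pair_up: hypothetical helper that reads ids from stdin line by line and
# prints them two per line, keeping at most one unpaired id in memory.
pair_up() {
    pending=""
    while read -r line || [ -n "$line" ]; do
        # Unquoted on purpose: split the line into ids on whitespace.
        for id in $line; do
            if [ -n "$pending" ]; then
                printf '%s %s\n' "$pending" "$id"
                pending=""
            else
                pending=$id
            fi
        done
    done
    # Like the last-line branch of the pseudo-code: print a leftover id.
    if [ -n "$pending" ]; then
        printf '%s\n' "$pending"
    fi
}

paired=$(printf 'player1 player2 player3\nplayer4\nplayer5\nplayer6 player7 player8\n' | pair_up)
printf '%s\n' "$paired"
```

As in the pseudo-code, only a small amount of pending input is kept around at any time, regardless of the file size.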

5.2. Writing the sed Script

Now, let’s use the algorithm to write the match_using_holdspace.sed script and see it in its entirety:

$ cat match_using_holdspace.sed
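:loop
# save a copy of the pattern space in the hold space
h

# branch to found if the pattern space starts with a pair of ids
/^[[:alnum:]]+ [[:alnum:]]+/ {
b found
}

b fetch_more

:fetch_more
# on the last line, print any leftover text and quit
${/.+/p;q;}
# otherwise, append the next line and join it with a single space
N
s/^\n//
s/[[:space:]]*\n/ /g
b loop

:found
# keep only the leading pair and print it
s/^([[:alnum:]]+[[:space:]]+[[:alnum:]]+)(.*)/\1/
p
# clear the pattern space, swap in the saved copy, and drop the printed pair
s/.*//
x
s/^([[:alnum:]]+[[:space:]]+[[:alnum:]]+[[:space:]]*)(.*)/\2/

b loop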

In our script, we used labels like loop, found, and fetch_more for control flow, together with the b branch command, to translate our while loop and if-else logic. We save the pattern space to the hold space with h, restore it with x, and append the next input line to the pattern space using N. Lastly, we also utilize substitutions for clearing or trimming the pattern space while using backreferences to extract or remove prefixes.

5.3. Script in Action

Now, let’s see our script in action for the multi-line input:

$ sed -n -E -f match_using_holdspace.sed players_multiline.txt
player1 player2
player3 player4
player5 player6
player7 player8

Fantastic! We nailed this one.

Additionally, we can verify that our script is also working fine for the single-line input:

$ sed -n -E -f match_using_holdspace.sed players.txt
player1 player2
player3 player4
player5 player6
player7 player8
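
It can also be useful to check the last-line branch, which handles an id that has no partner. The sketch below writes a condensed copy of the script (with a plain .* for the rest of the line) to a temporary file and feeds it five ids. The input data and the script and result variable names are made up for this example, and GNU sed is assumed:

```shell
# Write a condensed copy of the hold-space script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
:loop
h
/^[[:alnum:]]+ [[:alnum:]]+/ {
b found
}
:fetch_more
${/.+/p;q;}
N
s/^\n//
s/[[:space:]]*\n/ /g
b loop
:found
s/^([[:alnum:]]+[[:space:]]+[[:alnum:]]+)(.*)/\1/
p
s/.*//
x
s/^([[:alnum:]]+[[:space:]]+[[:alnum:]]+[[:space:]]*)(.*)/\2/
b loop
EOF

# Five ids spread over three lines: the last id has no partner.
result=$(printf 'player1 player2 player3\nplayer4\nplayer5\n' | sed -n -E -f "$script")
printf '%s\n' "$result"

rm -f "$script"
```

With an odd number of ids, the unpaired one is expected on a line of its own at the end, printed by the last-line check before the script quits.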

6. Conclusion

In this article, we learned how to match a pattern “N” times using sed. Furthermore, we solved the use case for single-line and multi-line inputs using techniques primarily involving substitution with backreferencing, the N command, and hold space.

