Using Multiple Delimiters in Awk

1. Overview

Awk is a robust text-processing language that we can use to parse and process delimited data. It also lets us specify multiple delimiters for splitting the text.

In this tutorial, we’ll learn how to use multiple delimiters in Awk.

2. Separating the Fields

Awk supports several ways to separate fields using delimiters. Each of them accepts a regular expression, so we can specify multiple delimiters at once. Awk regular expressions follow the Extended Regular Expression (ERE) syntax.

Let’s start by learning each of the approaches.

2.1. Using the -F Option

Let’s say we’ve got the people_emails.txt file that contains names and email addresses:

$ cat people_emails.txt
name,email
P1,p1@example.com
P2,P2@example.com
P3,p3@example.com
P4,p4@example.com

We’ll reuse this file in the following sections, too.

Now, let’s use the -F option to split the fields using a comma (,) as a delimiter and get the first field ($1) from each row:

$ awk -F',' '{print $1}' people_emails.txt
name
P1
P2
P3
P4

Next, let’s see how we can use multiple delimiters, both a comma (,) and an at sign (@), to extract the name, username, and domain for each record:

$ awk -F'[,@]' '
BEGIN {
    print "name|username|domain"
}
NR>1 {
    print $1"|"$2"|"$3
}' people_emails.txt
name|username|domain
P1|p1|example.com
P2|P2|example.com
P3|p3|example.com
P4|p4|example.com

Here, we used the BEGIN block to print the header values separated by a pipe (|). We also used the NR>1 pattern so that the main block skips the header row and prints the name, username, and domain of each remaining record.
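A bracket expression such as [,@] only covers single-character delimiters. When a delimiter is longer than one character, the -F option also accepts alternation with the | operator. Here's a minimal sketch of that idea. The sample record, its "; " and " - " separators, and the variable names are our own assumptions and aren't part of people_emails.txt:

```shell
# Hypothetical record that mixes two multi-character separators: "; " and " - "
record='P1; p1@example.com - admin'

# The alternation (|) lets -F treat either separator as a field boundary
fields=$(printf '%s\n' "$record" | awk -F'; | - ' '{print $1 "|" $2 "|" $3}')
echo "$fields"
```

Running it should print P1|p1@example.com|admin, because the @ sign isn't part of the delimiter pattern this time.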

2.2. Using the FS Variable

Alternatively, we can use the built-in FS variable to specify the delimiter in the BEGIN block:

$ awk 'BEGIN {FS=","} {print $1}' people_emails.txt
name
P1
P2
P3
P4

We got the first field ($1) when separating the records using a comma.

Now, let’s initialize the FS variable in the BEGIN block with multiple delimiters [@,] to get the name, username, and domain for each record:

$ awk '
BEGIN {
    FS="[@,]"; 
    print "name|username|domain"
}
NR>1 {
    print $1"|"$2"|"$3
}' people_emails.txt
name|username|domain
P1|p1|example.com
P2|P2|example.com
P3|p3|example.com
P4|p4|example.com

Great! We got this one right.
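Since we're already setting FS in the BEGIN block, we could also set the output field separator, OFS, in the same place instead of concatenating "|" by hand. The following is only a sketch: it feeds awk two inline rows that we assume mirror people_emails.txt, and the variable name is ours:

```shell
# Two inline rows, assumed to mirror the header and first record of people_emails.txt
piped=$(printf 'name,email\nP1,p1@example.com\n' | awk '
BEGIN {
    FS = "[@,]"   # input: split on either @ or comma
    OFS = "|"     # output: join the printed fields with a pipe
}
NR > 1 {
    print $1, $2, $3
}')
echo "$piped"
```

With the comma-separated form of print, awk inserts OFS between the fields, so this should show P1|p1|example.com.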

2.3. Using the split() Function

Another approach is to use the split() function to split the input record into multiple fields:

split(<record>, <fields_array>, <delimiter>)

Let’s use this to split the records in people_emails.txt using comma (,) as a delimiter and print the first field from the fields array:

$ awk '{split($0, fields, ","); print fields[1];}' people_emails.txt
name
P1
P2
P3
P4

Like earlier, let’s see how we can use multiple delimiters [@,] with the split() function to show the name, username, and domain for each record:

$ awk '
BEGIN {
    print "name|username|domain"
} 
NR>1 {
    split($0, fields, "[@,]"); 
    print fields[1]"|"fields[2]"|"fields[3];
}' people_emails.txt
name|username|domain
P1|p1|example.com
P2|P2|example.com
P3|p3|example.com
P4|p4|example.com

It worked correctly!
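The split() function also returns the number of elements it stored in the array, which is handy when we don't know the field count in advance. As a small sketch, assuming a single inline record shaped like the rows of people_emails.txt, we can loop over every piece:

```shell
# One inline record, assumed to look like a row of people_emails.txt
pieces=$(printf 'P3,p3@example.com\n' | awk '{
    n = split($0, fields, "[@,]")   # n holds the number of pieces
    for (i = 1; i <= n; i++) {
        printf "%d: %s\n", i, fields[i]
    }
}')
echo "$pieces"
```

This should list three numbered pieces: P3, p3, and example.com.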

2.4. Using the match() and substr() Functions

We can also combine the match() and substr() functions to match the record against a delimiter pattern and extract the substring that precedes it:

$ awk '{
    match($0, /,/);
    print substr($0, 1, RSTART - 1);
}' people_emails.txt
name
P1
P2
P3
P4

On a successful match, awk sets the RSTART variable to the index at which the matched substring starts and the RLENGTH variable to its length. So, we can retrieve the first field from the input record ($0) by using substr() to take everything before that index.
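To make these variables more concrete, let's print them for a single record. This is just an illustrative sketch with an inline record that we assume resembles a row of people_emails.txt. It also guards against records without a delimiter, in which case match() returns 0:

```shell
# One inline record, assumed to look like a row of people_emails.txt
info=$(printf 'P4,p4@example.com\n' | awk '{
    if (match($0, /,/)) {
        # RSTART: 1-based index of the match, RLENGTH: length of the match
        print RSTART, RLENGTH, substr($0, 1, RSTART - 1)
    } else {
        # No delimiter in this record: match() returned 0
        print "no delimiter found"
    }
}')
echo "$info"
```

The comma is the third character of the record, so we should see 3 1 P4.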

Next, let’s see how we can apply this concept to use multiple delimiters [,@] for extracting the name, username, and domain values for each record:

$ awk '
BEGIN {
    print("name|username|domain")
}
NR>1 {
    while (match($0, /[,@]/)) {
        printf "%s|", substr($0, 1, RSTART - 1)
        $0 = substr($0, RSTART + RLENGTH)
    }
    print
}' people_emails.txt
name|username|domain
P1|p1|example.com
P2|P2|example.com
P3|p3|example.com
P4|p4|example.com

It looks like we nailed this one!

Note that we used a loop that repeatedly resets the current record ($0) to the substring that remains after the first occurrence of a delimiter. Once no delimiter is left, the final print outputs the last field.
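Overwriting $0 works, but it also discards the original record. If we still need it afterward, one option is to run the same loop on a copy. Here's a sketch of that variation. The rest and line variables are names we made up, and the inline rows are assumed to mirror people_emails.txt:

```shell
# Inline rows, assumed to mirror the header and second record of people_emails.txt
joined=$(printf 'name,email\nP2,P2@example.com\n' | awk '
NR > 1 {
    rest = $0    # work on a copy so $0 and its fields stay intact
    line = ""
    while (match(rest, /[,@]/)) {
        line = line substr(rest, 1, RSTART - 1) "|"
        rest = substr(rest, RSTART + RLENGTH)
    }
    # $0 is still the untouched input record here
    print line rest " (from: " $0 ")"
}')
echo "$joined"
```

This should print P2|P2|example.com (from: P2,P2@example.com), which shows that the original record is still available after the loop.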

3. Advanced Regex Scenario

So far, we’ve seen several approaches to splitting fields with multiple delimiters. Now, let’s solve a slightly more advanced scenario using regular expressions.

For this scenario, let’s imagine we have a fruits_vegetables.txt file:

$ cat fruits_vegetables.txt
apple-1 mango-4 tomato-78
banana-3 orange-10 potato-8
grapes-10 pineapple-220 carrots-9

We need to extract only the names of fruits and vegetables from this file.

So, we’ll need to split each record based on a regular expression containing multiple delimiters:

(-[0-9]+ ?)

We’ve used a group to wrap the whole delimiter pattern. It starts with a hyphen (-), followed by at least one digit in the [0-9] range, and optionally ends with a space.
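Before using the pattern as a delimiter, it can help to see exactly which parts of a record it matches. As a quick sketch, we can wrap every match in angle brackets with gsub(), where & in the replacement stands for the matched text. The inline record is assumed to be the first line of fruits_vegetables.txt:

```shell
# Inline record, assumed to be the first line of fruits_vegetables.txt
marked=$(printf 'apple-1 mango-4 tomato-78\n' | awk '{
    gsub(/-[0-9]+ ?/, "<&>")   # wrap each delimiter match in < >
    print
}')
echo "$marked"
```

The output should be apple<-1 >mango<-4 >tomato<-78>, so each bracketed part is what awk would treat as a delimiter, including the trailing space where there is one.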

Moving ahead, let’s use the -F option to specify the delimiter to extract out the names:

$ awk -F"(-[0-9]+ ?)" '{
print $1,$2,$3;
}' fruits_vegetables.txt
apple mango tomato
banana orange potato
grapes pineapple carrots

Perfect! Our awk script works as expected.

Furthermore, we must note that we could use any of the earlier approaches with the same regular expression to split the record.
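For instance, here's a sketch of the split() approach with the same pattern. The inline record is assumed to be the second line of fruits_vegetables.txt, and the names array is our own choice:

```shell
# Inline record, assumed to be the second line of fruits_vegetables.txt
produce=$(printf 'banana-3 orange-10 potato-8\n' | awk '{
    split($0, names, "-[0-9]+ ?")   # same delimiter pattern, passed to split()
    print names[1], names[2], names[3]
}')
echo "$produce"
```

This should print banana orange potato, matching what the -F option gives us for that record.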

4. Conclusion

In this article, we learned how to use multiple delimiters in Awk to separate fields in an input record. Along the way, we used the -F option, the FS variable, and the split(), match(), and substr() functions to solve the use case.
