1. Overview

awk has built-in support for splitting text data into fields. It treats fields as “first-class citizens” and provides convenient ways to access and manipulate them.

In this tutorial, we’ll explore multiple strategies to split an input record by a character using awk.

2. Scenario Setup

Let’s start by taking a look at comma-separated values in the numbers.txt text file:

$ cat numbers.txt
10,21,33,42
14,20,30
1,3,5
8,45,64,23
111,3,5

We can notice that each record contains numeric values. Our goal is to sum the numbers in each record and show the output in an equation format:

num1 + num2 + num3 ... = sum

For this purpose, we must split each input record by a comma (,) and add the individual values to a running sum.

3. Using Field Separator (FS)

awk splits each input record into fields using the field separator (FS) variable. Let’s see how we can define it in the BEGIN block and get the individual values:

$ awk 'BEGIN {
    FS=",";
}
{
    print $1
}' numbers.txt
10
14
1
8
111

We can notice that $1 refers to the first field value. Similarly, we can use $2, $3, and so on up to $NF to retrieve individual field values, where NF is the number of fields in the current record.
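
As a small illustration of NF and $NF, the sketch below pipes two sample records into awk instead of reading numbers.txt. The shell variable name field_info is only an assumption for this example, and the -F option is a command-line shorthand for setting FS:

```shell
# Print the field count (NF) and the last field ($NF) of each record.
# -F, has the same effect as setting FS="," in a BEGIN block.
field_info=$(printf '10,21,33,42\n14,20,30\n' | awk -F, '{ print NF, $NF }')
echo "$field_info"
```

The first record should report four fields ending in 42, and the second one three fields ending in 30.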

Now, let’s extend our understanding to write a for loop that computes the sum using field values:

$ awk '
BEGIN{
    FS=",";
}
{
    sum=0;
    for (i=1; i<=NF; i++) { 
        printf "%s%s", $i, (i < NF ? " + " : "");
        sum += $i;
    } 
    printf " = %d\n", sum; 
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

We can see that we’ve used a standard approach of initializing the sum variable to 0 and then using an iteration to increment the sum by the current field’s value ($i). Furthermore, we’ve used printf statements to show the output in an equation format.

4. Using Field Pattern (FPAT)

While FS defines the field separator, FPAT is a built-in variable, available in GNU awk (gawk), that defines a regular expression for the fields themselves. In our case, we can define a field with the regular expression [^,]+, which matches a non-empty sequence of characters that doesn’t contain a comma. Let’s use this concept to show the value for the first field ($1):

$ awk 'BEGIN{ FPAT="[^,]+"} { print $1 }' numbers.txt
10
14
1
8
111

It works as expected.
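
The two approaches aren’t always interchangeable, though. The following sketch assumes a hypothetical record, 1,,3, with an empty middle value: FS counts the empty field, whereas the pattern [^,]+ can’t match an empty string and so skips it. The variable names are made up for this example, and the FPAT part only runs when gawk is installed:

```shell
record='1,,3'

# FS approach: the empty value between the two commas is still a field.
fs_count=$(printf '%s\n' "$record" | awk -F, '{ print NF }')
echo "FS:   $fs_count fields"

# FPAT approach (gawk only): only non-empty runs of non-comma characters count.
if command -v gawk >/dev/null 2>&1; then
    fpat_count=$(printf '%s\n' "$record" | gawk 'BEGIN { FPAT = "[^,]+" } { print NF }')
    echo "FPAT: $fpat_count fields"
fi
```

For our numbers.txt file, which has no empty values, both approaches produce the same fields.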

Like earlier, let’s use a for loop to compute the sum for the individual field values while defining the FPAT variable in the BEGIN block:

$ awk '
BEGIN {
    FPAT="[^,]+";
}
{ 
    sum=0; 
    for (i=1; i<=NF; i++) { 
        sum += $i; 
        printf "%s%s", $i, (i < NF ? " + " : "");
    } 
    printf " = %d\n", sum; 
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Fantastic! We’ve got this one right.

5. Using the split() Function

We can use the split() function to split a string into an array of substrings based on a specified separator:

count = split(string, array, separator)

Additionally, split() returns the number of elements it stored in the array, which we capture in the count variable.

Let’s use this to split the numeric values in the numbers.txt file:

$ awk '{ split($0, arr, ","); print arr[1]; }' numbers.txt
10
14
1
8
111

We must remember that split() stores the pieces at array indices starting from 1. So, we used arr[1] to retrieve the first value.
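
To make the return value and the indexing concrete, here’s a minimal sketch that splits a literal string in a BEGIN block, so it needs no input file. The names n and split_info are chosen only for illustration:

```shell
# split() returns the number of elements it stored in the array.
split_info=$(awk 'BEGIN {
    n = split("10,21,33,42", arr, ",")
    print n, arr[1], arr[n]     # element count, first element, last element
}')
echo "$split_info"
```

Since the return value equals the index of the last element, it’s a natural upper bound for a loop over the array.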

Next, let’s write a for loop to compute the sum for the split values iteratively:

$ awk '{
sum=0; 
n = split($0, arr, ","); 
for (i=1; i<=n; i++) { 
    sum += arr[i];
    printf "%s%s", arr[i], (i< n ? " + " :"");
} 
printf " = %d\n", sum; 
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Great! The result looks accurate.

6. Using the substr() Function

In this section, we’ll learn how to use the string extraction approach, primarily with the substr() function, to solve our use case of showing sum equations.

6.1. With the index() Function

We can use the index() function to find the position of the first occurrence of a substring within a given string. If the substring isn’t present, it returns 0:

index(string, substring)

Further, given a starting position (start), we can extract a substring from a given string using the substr() function:

substring = substr(string, start [, length])

We must note that we can trim the substring to a specific length with an optional argument. In its absence, awk includes all characters from the start position to the end of the string.
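
As a quick sketch of both functions on a literal string (the sample value and the substr_info variable are assumptions made for this example), we can compare substr() with and without the optional length:

```shell
substr_info=$(awk 'BEGIN {
    s = "10,21,33"
    print index(s, ",")     # position of the first comma
    print substr(s, 4, 2)   # two characters starting at position 4
    print substr(s, 4)      # everything from position 4 onward
}')
echo "$substr_info"
```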

Now, let’s see how we can use index() and substr() to extract the first numeric value from each record in the numbers.txt text file:

$ awk '{ 
    pos = index($0, ","); 
    print substr($0, 1, pos-1);
}' numbers.txt
10
14
1
8
111

We’ve initialized the pos variable with the position of the first comma using the index() function. Then, we used it to extract a substring of length pos - 1 from the start of the record. Again, we must note that awk uses 1-based indexing for strings.
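
One edge case is worth keeping in mind: when a record contains no comma, index() returns 0, so substr($0, 1, pos-1) yields an empty string. Our sample file has no such record, but a minimal sketch of a guard, using hypothetical input, could look like this:

```shell
first_values=$(printf '10,21\n7\n' | awk '{
    pos = index($0, ",")
    # With no comma (pos is 0), the whole record is the first value.
    print (pos ? substr($0, 1, pos - 1) : $0)
}')
echo "$first_values"
```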

Next, let’s go ahead and write an awk script to solve our use case of showing the sum equations:

$ awk '{ 
    sum=0; 
    pos = index($0, ","); 
    while (pos) { 
        sum += substr($0, 1, pos-1); 
        printf "%d + ", substr($0, 1, pos-1); 
        $0 = substr($0, pos+1);
        pos = index($0, ","); 
    } 
    sum += $0; 
    printf "%d = %d\n", $0, sum; 
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Our script looks correct. Let’s break it down to understand the logic. First, we initialize the sum and pos variables. Then, inside the while loop, we add the value before the comma to sum, shorten $0 to the substring that follows position pos, and reset pos to the position of the next comma. The loop continues as long as pos is non-zero, because awk treats a non-zero number as true and index() returns 0 once no comma is left. Lastly, the text remaining in $0 is the last value, so we add it to sum and print it along with the result.

6.2. With the match() Function

We can use the match() function to search for a specific regular expression (regexp) within a string:

position = match(string, regexp)

On a successful match, it returns the position of the first occurrence of the regexp within the string. Additionally, it sets the built-in variables RSTART and RLENGTH to the starting position and length of the matched substring. However, when there’s no match, it returns 0, which is an invalid position because awk uses 1-based indexing for strings. In that case, RSTART is set to 0 and RLENGTH to -1.
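
The sketch below shows both outcomes on a made-up string: a successful match for a run of digits, followed by a failed match for a character that doesn’t occur. The variable names are chosen only for illustration:

```shell
match_info=$(awk 'BEGIN {
    s = "ab,123,cd"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    pos = match(s, /x/)           # no "x" in s, so there is no match
    print pos, RSTART, RLENGTH
}')
echo "$match_info"
```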

Now, let’s go ahead and use the match() and substr() functions to extract the first numeric value from each record of numbers.txt:

$ awk '{
    match($0, /[^,]*/); 
    print substr($0, RSTART, RLENGTH);
}' numbers.txt
10
14
1
8
111

We must note that we’re using the RSTART and RLENGTH variables set on the execution of the match() function.

Next, let’s go ahead and write an awk script to solve the use case of showing the sum equations:

$ awk '{
    isFirst="true"; 
    sum=0; 
    while (match($0, /[^,]+/)) { 
        num = substr($0, RSTART,RLENGTH); 
        sum += num; 
        if (isFirst != "true") {
            printf " + %d",num;
        } else {
            printf "%d", num; 
            isFirst="false"
        } 
        $0 = substr($0, RSTART + RLENGTH); 
    } 
    printf " = %d\n", sum; 
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Our script looks self-explanatory, as we’ve already seen a similar code flow where we used the index() function to find the positions. However, there’s one additional change where we’re using the isFirst variable to identify if the current number is the first value from the current record. Further, its purpose is limited to showing the + delimiter before the numeric value in the sum equation, except for the first one.

7. Using Substitution

In this section, we’ll learn how to use the string substitution concept for solving the use case of generating sum equations from a text file containing comma-separated numeric values.

7.1. With the sub() Function

We can use the sub() function to search for a regular expression (regexp) and replace it with a replacement string:

sub(regexp, replacement [, target])

By default, awk uses $0 as the target string. So, for default behavior, we can omit the third argument.

Now, let’s see how to use the sub() function to get the first value of each record from the numbers.txt file:

$ awk '{ sub(/,/, " "); print $1; }' numbers.txt
10
14
1
8
111

It’s interesting to note that we replaced the first occurrence of a comma with a space. Further, awk treats whitespace as the default field separator. As a result, the $1 variable gets the numeric value before the first comma in each record.
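
This works because awk splits the record into fields again whenever $0 is modified. A small sketch with a sample record (the resplit_info variable is an assumption for this example) makes the re-splitting visible through NF:

```shell
resplit_info=$(printf '10,21,33\n' | awk '{
    print NF            # the whole record is one field before the substitution
    sub(/,/, " ")       # $0 becomes "10 21,33" and awk splits it again
    print NF
    print $2
}')
echo "$resplit_info"
```

Only the first comma is replaced, so the second field still holds the rest of the record.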

Next, let’s extend this concept for writing our awk script to show the sum equations:

$ awk '{
    sum=0;
    while (sub(/,/, " + "));
    printf "%s = ", $0;
    while (sub(/\+/, " "));
    for (i=1; i<=NF; i++) {
        sum += $i;
    }
    printf "%d\n", sum;
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Great! Our script is giving the correct output.

Lastly, it’s important to understand that we call the sub() function within two while loops. Each loop has an empty body and relies on sub() returning 1 when it makes a replacement and 0 otherwise. The first one replaces every comma with " + ", which helps us print the left-hand side (LHS) of the sum equation. The second one replaces every + with a space, so that awk’s default whitespace field separator leaves only the numeric values as fields for computing the sum.
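
To see the loop condition in isolation, here’s a minimal sketch that runs sub() against a string variable rather than $0 and counts the iterations. The names s, count, and loop_info are hypothetical:

```shell
loop_info=$(awk 'BEGIN {
    s = "1,3,5"
    count = 0
    # sub() returns 1 after a replacement and 0 once no comma is left.
    while (sub(/,/, " + ", s))
        count++
    print count
    print s
}')
echo "$loop_info"
```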

7.2. With the gsub() Function

While the sub() function can only replace the first occurrence of a pattern, we can use the gsub() function to replace all the occurrences in a single go:

gsub(regexp, replacement [, target])

Like sub(), the default behavior of gsub() is to use $0 as the target string. So, it’s optional to specify the third argument.

Now, let’s put this in action by replacing all occurrences of commas with whitespace so that awk uses the default input field separator to identify the individual numerical values:

$ awk '{ gsub(",", " "); print $1}' numbers.txt
10
14
1
8
111

Next, let’s modify our previous awk script by replacing the looped calls to the sub() function with single calls to the gsub() function:

$ awk '{
    sum=0;
    gsub(",", " + ");
    printf "%s = ", $0;
    gsub(/\+/, " ");
    for (i=1; i<=NF; i++) {
        sum += $i;
    }
    printf "%d\n", sum;
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Fantastic! We replaced both loops with single function calls and achieved the same results.

7.3. With the gensub() Function

We can use another variant of the substitution function, gensub(), to solve our use case:

result = gensub(regexp, replacement, how [, target])

It’s a general substitution function available in GNU awk (gawk). Unlike sub() and gsub(), it doesn’t modify the target; instead, it returns the modified string. We use the third argument, how, to specify which occurrence to replace: a number n replaces only the nth match, whereas a string starting with “g” or “G” replaces all matches, mimicking the behavior of the gsub() function.

Now, let’s go ahead and use the gensub() function to split each record and get its first numerical value from the numbers.txt file:

$ awk '{$0 = gensub(",", " ", 1); print $1}' numbers.txt
10
14
1
8
111

We can see that we’ve explicitly instructed the gensub() function to replace only the first comma with a space. Since gensub() returns the result instead of changing the record, we’ve assigned it back to $0.
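
The following sketch highlights two properties of gensub() on a made-up string: a numeric how value picks a specific occurrence, and the original string stays untouched. It calls gawk explicitly and is skipped when gawk isn’t installed; the variable names are assumptions for this example:

```shell
if command -v gawk >/dev/null 2>&1; then
    gensub_info=$(gawk 'BEGIN {
        s = "1,3,5"
        t = gensub(/,/, " + ", 2, s)   # replace only the second comma
        print s                         # the original string is untouched
        print t
    }')
    echo "$gensub_info"
fi
```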

Moving on, let’s apply our understanding to modify our awk script further by replacing the usage of the gsub() function with equivalent calls to the gensub() function:

$ awk '{
    sum=0;
    $0=gensub(/,/, " + ", "g");
    printf "%s = ", $0;
    $0=gensub(/\+/, " ", "g");
    for (i=1; i<=NF; i++) {
        sum += $i;
    }
    printf "%d\n", sum;
}' numbers.txt
10 + 21 + 33 + 42 = 106
14 + 20 + 30 = 64
1 + 3 + 5 = 9
8 + 45 + 64 + 23 = 140
111 + 3 + 5 = 119

Once again, we can see that the result is as expected.

8. Conclusion

In this article, we learned how to use awk to split input records by a character. Furthermore, we explored several concepts, such as field separators, field patterns, substring extraction, and string substitution, to solve the use case of generating sum equations from comma-separated numbers.

Additionally, while solving the use case, we developed insights into several built-in functions such as index(), substr(), match(), split(), sub(), gsub(), and gensub().