1. Overview
The cut command is good at handling column-based text. However, it accepts only a single character as the delimiter.
In this tutorial, we’ll discuss how to handle column-based input data separated by multiple spaces when using cut.
2. Introduction to the Problem
As usual, let’s understand the problem through an example. Let’s first take a look at our input file:
$ cat orders.txt
Order-id   Date   Cost(USD)   Details
1   2022-02-20   200   Orange 100kg
2   2022-02-21   300   Apple 250kg
3   2022-02-22   250   Apple 100kg and Orange 100kg
The orders.txt file contains four columns: Order-id, Date, Cost, and Details. The separator between two columns is three space characters.
Now, let’s say we’d like to extract the Date and the Cost data from the input. That is to say, we need fields 2 and 3. So, let’s first try to get it using the cut command:
$ cut -d"   " -f2,3 orders.txt
cut: the delimiter must be a single character
Try 'cut --help' for more information.
We’ve tried to set three space characters as the field separator in the command above. Unfortunately, the output shows it didn’t work as expected, and the error message is clear: cut only allows one character to be the delimiter.
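Falling back to a single space as the delimiter doesn't help much either. The following is a minimal sketch, assuming one data line from orders.txt piped in with printf (the variable names line and result are only for illustration). Because cut treats every single space as a delimiter, three spaces in a row produce two empty fields between each pair of values:

```shell
# One data line from orders.txt, with three spaces between columns.
line='1   2022-02-20   200   Orange 100kg'

# With -d' ', every space is a delimiter, so fields 2, 3, 5 and 6 are empty
# and the Date and Cost values end up in fields 4 and 7.
result=$(printf '%s\n' "$line" | cut -d' ' -f4,7)
printf '%s\n' "$result"
```

Counting empty fields like this is brittle, since the field numbers change as soon as the padding width changes.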
Next, let’s see how to solve the problem and get our expected data.
3. Squeezing Spaces Using the tr Command
We’ve learned that the cut command only accepts one single character as the field delimiter. So, one idea to solve the problem is converting the “three spaces separated value” format into a “single-space separated value” format.
The tr utility can read a byte stream from standard input (Stdin), translate or delete characters, then write the result to the standard output (Stdout). Further, tr can squeeze repeating characters with the -s option.
We can “squeeze” the continuous space characters to turn the three spaces into one single space:
$ tr -s " " <orders.txt
Order-id Date Cost(USD) Details
1 2022-02-20 200 Orange 100kg
2 2022-02-21 300 Apple 250kg
3 2022-02-22 250 Apple 100kg and Orange 100kg
As the command above shows, each run of three consecutive space characters has been replaced with a single space.
It’s worth mentioning that we’ve redirected the orders.txt file to Stdin in the command, as tr cannot directly read a file – it only reads input from Stdin.
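The squeeze isn't tied to runs of exactly three characters. As a small sketch on a made-up input line (not part of orders.txt), tr -s reduces a run of any length to a single occurrence, so unevenly padded columns are normalized as well:

```shell
# Runs of two, three, and five spaces are each squeezed to one space.
squeezed=$(printf 'a  b   c     d\n' | tr -s ' ')
printf '%s\n' "$squeezed"
```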
Next, let’s pipe the squeezed orders.txt output to the cut command and extract the two required fields:
$ tr -s " " <orders.txt | cut -d " " -f 2,3
Date Cost(USD)
2022-02-20 200
2022-02-21 300
2022-02-22 250
Good, the output above shows that we’ve solved the problem. However, if a column’s values contain spaces themselves, this solution breaks down. We’ll take a closer look at this problem in a later section.
As we’ve seen, the cut command is good at processing column-based inputs. Next, we’ll introduce another solution with a command-line utility more powerful than cut.
4. Using the awk Command
awk is a great tool for processing text, especially if the input is column-based. Moreover, awk covers all functionalities that cut can do.
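By default, the field separator (FS) is a single space, which awk treats specially: it splits fields on runs of spaces, tabs, and newlines, roughly like the regular expression [ \t\n]+, and it ignores leading and trailing whitespace. In other words, awk treats continuous whitespace characters as FS by default.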
Therefore, we can solve the problem in one shot:
$ awk '{ print $2, $3 }' orders.txt
Date Cost(USD)
2022-02-20 200
2022-02-21 300
2022-02-22 250
The command above is pretty compact and straightforward. awk uses the default FS to parse the input and print the second and third fields out.
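The default FS is also forgiving about the kind of whitespace involved. Here is a quick sketch on a hypothetical line (not from orders.txt) that starts with spaces and mixes a tab into the padding; NF is awk's built-in count of fields in the current record:

```shell
# Leading spaces, a tab, and mixed runs of blanks all act as one separator.
fields=$(printf '  3 \t 2022-02-22    250\n' | awk '{ print NF, $2, $3 }')
printf '%s\n' "$fields"
```

awk reports three fields here, so neither the leading spaces nor the tab create empty fields.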
Next, let’s extend the requirement a bit and feel the power of awk.
5. awk Is More Flexible and Powerful Than cut
So far, we’ve solved the problem using cut and awk. Now, let’s say we would like to extract one more field, the Details column, to know what the costs are.
First, let’s solve it using our tr | cut approach:
$ tr -s " " <orders.txt | cut -d " " -f 2,3,4
Date Cost(USD) Details
2022-02-20 200 Orange
2022-02-21 300 Apple
2022-02-22 250 Apple
As we’ve seen in the command above, we added a “4” to the cut command to extract the fourth field. It worked. The Details column appears in the output. However, if we check the output carefully, we find the “details” values have been truncated. This is because the values contain space characters. Even when we “squeeze” the three spaces down to one, the cut command still cannot tell if a space is a field delimiter or a character in fields’ values.
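In this particular layout there is a partial workaround, sketched below on a single line of the input (the variable names are only for illustration). Since Details is the last column, an open-ended range such as -f 2- selects everything from field 2 to the end of the line:

```shell
# Recreate a line of orders.txt (three spaces between columns).
line='3   2022-02-22   250   Apple 100kg and Orange 100kg'

# After squeezing, "-f 2-" selects field 2 through the end of the line,
# so the spaces inside the Details value are printed along with it.
rest=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 2-)
printf '%s\n' "$rest"
```

This only works because the value containing spaces sits in the final column and we want every field after the first. It doesn't help if a middle column contains spaces, or if we need to reorder the fields.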
On the other hand, the awk command can easily handle this case with a small adjustment:
$ awk -F"   " '{print $2,$3,$4}' orders.txt
Date Cost(USD) Details
2022-02-20 200 Orange 100kg
2022-02-21 300 Apple 250kg
2022-02-22 250 Apple 100kg and Orange 100kg
We don’t use the default FS value in the awk command above. Instead, we set three spaces as FS, so the single spaces inside a Details value no longer split the field. As we can see, the Details column is printed in full.
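When FS is longer than one character, awk interprets it as a regular expression, so the separator doesn't need a fixed width. The sketch below assumes a hypothetical variant of the input in which the padding between columns varies from two to four spaces:

```shell
# Columns separated by two, three, or four spaces; Details contains single spaces.
line='3  2022-02-22   250    Apple 100kg and Orange 100kg'

# A multi-character FS is a regular expression: two spaces followed by "+"
# matches a run of two or more spaces, so single spaces stay inside the field.
details=$(printf '%s\n' "$line" | awk -F'  +' '{ print $4 }')
printf '%s\n' "$details"
```

This relies on the data never containing two consecutive spaces inside a value.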
Of course, awk can do much more than this.
Next, let’s see a few simple examples that adapt to different requirements.
If we review the output we’ve got, although we’ve solved the problems, the output is no longer in the “three spaces separated value” format. So, first, let’s keep that format in the output:
$ awk 'BEGIN{ FS=OFS="   "}{print $2,$3,$4}' orders.txt
Date   Cost(USD)   Details
2022-02-20   200   Orange 100kg
2022-02-21   300   Apple 250kg
2022-02-22   250   Apple 100kg and Orange 100kg
Now, we’ve kept the original field separators in the output by setting three spaces to both the FS and OFS variables.
Second, let’s say we still want to extract these three columns, but we would like to put the Cost column at the very front:
$ awk 'BEGIN{ FS=OFS="   "}{print $3, $2, $4}' orders.txt
Cost(USD)   Date   Details
200   2022-02-20   Orange 100kg
300   2022-02-21   Apple 250kg
250   2022-02-22   Apple 100kg and Orange 100kg
Finally, let’s print the orders only if the Cost value is more than 200:
$ awk 'BEGIN{ FS=OFS="   "}NR==1 || $3>200 {print $3, $2, $4}' orders.txt
Cost(USD)   Date   Details
300   2022-02-21   Apple 250kg
250   2022-02-22   Apple 100kg and Orange 100kg
As we can see from the examples above, awk can flexibly control the processing logic and output.
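As one more illustration of that flexibility, awk can also aggregate a column. The sketch below recreates orders.txt in a temporary file so that it is self-contained (the orders and total variable names are assumptions for this example) and sums the Cost column:

```shell
# Recreate orders.txt in a temporary file (three spaces between columns).
orders=$(mktemp)
printf '%s\n' \
  'Order-id   Date   Cost(USD)   Details' \
  '1   2022-02-20   200   Orange 100kg' \
  '2   2022-02-21   300   Apple 250kg' \
  '3   2022-02-22   250   Apple 100kg and Orange 100kg' > "$orders"

# Skip the header (NR > 1), add up the Cost column, and print the total at the end.
total=$(awk -F'   ' 'NR > 1 { sum += $3 } END { print sum }' "$orders")
printf 'Total cost: %s USD\n' "$total"
rm -f "$orders"
```

The three cost values (200, 300, and 250) add up to 750, which is something cut alone cannot compute.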
6. Conclusion
In this article, we’ve explored two approaches to extracting fields from input when the delimiter is multiple spaces: squeezing the spaces with tr before calling cut, and using awk.
Further, we’ve seen through some examples that awk is more powerful and flexible than the cut command.