1. Overview
Extracting a substring from a string is a fundamental and common text-processing operation in Linux.
In this tutorial, we’ll examine various ways to extract substrings using the Linux command line.
2. Introduction to the Problem
As the name suggests, a substring is a part of a string. The problem is pretty straightforward; we want to extract a part of a given string. However, there are two different types of extraction requirements: index-based and pattern-based.
Let’s illustrate the two different requirements with a couple of examples.
An index-based substring is defined by the start and end indexes of the original string. Let’s look at a scenario of extracting an index-based substring.
Given that we have an input string, “0123Linux9”, we want to extract the substring from index positions 4 through 8. The expected result will be “Linux”.
Next, let’s see an example of the pattern-based substring.
For instance, say we have an input string, “Eric,Male,28,USA”. It’s a string of comma-separated values (Name,Gender,Age,Country).
Now, let’s say we want to extract the third field, 28, which is the age of Eric. In this case, we can’t predict the start index of the target substring, since the Name and Gender have dynamic length. Therefore, the implementation will be different from the index-based extraction.
In this article, we’ll address some common ways to extract substrings in the Linux command line. Of course, we’ll cover both extraction types.
3. Extracting an Index-Based Substring
First, let’s have a look at how to extract index-based substrings. We’ll introduce four ways to do this:
- Using the cut command
- Using the awk command
- Using Bash’s substring expansion
- Using the expr command
Next, we’ll see them in action.
3.1. Using the cut Command
We can extract from the Nth to the Mth character from the input string using the cut command: cut -c N-M.
As we discussed in an earlier section, our requirement is to take the substring from index 4 through index 8.
Here, the indexes follow Bash’s convention, which means they’re 0-based. The cut command, however, numbers characters starting from 1.
Therefore, if we want to solve the problem using the cut command, we need to add one to both the beginning and ending indexes. Thus, the range will become 5-9.
Now let’s see if the cut command can solve the problem:
$ cut -c 5-9 <<< '0123Linux9'
Linux
As the output shows, we got the expected substring, “Linux”, so problem solved.
In the example above, we passed the input string to the cut command via a here-string, which avoids piping it in from a separate echo command.
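The index conversion described above can be wrapped in a small function. This is only a sketch: substr_cut is a hypothetical helper name, not a standard command, and it assumes Bash because it uses a here-string.

```shell
#!/usr/bin/env bash
# substr_cut is a hypothetical helper, not a standard command.
# Usage: substr_cut STRING START END   (START and END are 0-based, inclusive)
substr_cut() {
    local str=$1 start=$2 end=$3
    # cut numbers characters from 1, so shift both bounds by one.
    cut -c "$((start + 1))-$((end + 1))" <<< "$str"
}

substr_cut '0123Linux9' 4 8    # prints Linux
```

Keeping the arithmetic in one place means callers can keep thinking in 0-based indexes.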
3.2. Using the awk Command
When we need to solve a text processing problem in Linux, we shouldn’t forget the Swiss army knife: awk.
The awk language has a built-in substr() function, so we can call it directly to get the substring.
The substr(s, i, n) function accepts three arguments. Let’s take a closer look at them:
- s – The input string
- i – The start index of the substring (awk uses the 1-based index system)
- n – The length of the substring. If it’s omitted, awk will return from index i until the last character in the input string as the substring.
Now let’s see if awk’s substr() function can give us the expected result:
$ awk '{print substr($0, 5, 5)}' <<< '0123Linux9'
Linux
Good! The awk command works as expected.
Here we pass i=5 because awk needs the 1-based index, so the 0-based start index 4 becomes 5. The third argument, 5, is the length of the target substring, and we get it by 8-4+1.
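As a rough illustration of the two points above, the following sketch first omits the length argument and then derives both substr() arguments from the 0-based bounds. The variable names start and end are our own choice, passed in with awk’s -v option.

```shell
#!/usr/bin/env bash
# Omit the length: everything from position 5 to the end of the string.
to_end=$(awk '{ print substr($0, 5) }' <<< '0123Linux9')

# Derive the 1-based start and the length from 0-based, inclusive bounds.
by_bounds=$(awk -v start=4 -v end=8 \
    '{ print substr($0, start + 1, end - start + 1) }' <<< '0123Linux9')

echo "$to_end"       # prints Linux9
echo "$by_bounds"    # prints Linux
```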
3.3. Using Bash’s Substring Expansion
We’ve seen how cut and awk can easily extract index-based substrings.
Alternatively, Bash is sufficient to solve the problem, since it supports substring expansion via ${VAR:start_index:length}.
Today, Bash is the default shell for many modern Linux distros, so we can often solve the problem without using any external command:
$ STR="0123Linux9"
$ echo "${STR:4:5}"
Linux
As we can see in the output above, we solved the problem using pure Bash.
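Substring expansion has a couple of variations that are worth a quick sketch: the length can be omitted, and the offset can be negative to count from the end of the string. The variable names tail_part and from_end are just for illustration.

```shell
#!/usr/bin/env bash
STR="0123Linux9"

# Omitting the length takes everything from the offset to the end.
tail_part="${STR:4}"

# A negative offset counts from the end of the string. The space before
# the minus sign is required; otherwise Bash reads ":-" as the
# "use a default value" expansion.
from_end="${STR: -6:5}"

echo "$tail_part"    # prints Linux9
echo "$from_end"     # prints Linux
```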
3.4. Using the expr Command
Even if Bash is available on most Linux distros, there are still a few Linux systems that ship without Bash, particularly in the embedded Linux world.
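The expr command is part of the GNU Coreutils package, so it’s available on practically every GNU/Linux system.
Further, GNU expr has a substr subcommand that we can use to extract index-based substrings easily. Note that substr is a GNU extension that POSIX doesn’t specify, so other expr implementations may not provide it: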
expr substr <input_string> <start_index> <length>
It’s worth mentioning that the expr command uses the 1-based index system.
Let’s use expr with the substr command to solve our problem:
$ expr substr "0123Linux9" 5 5
Linux
The output above shows that the expr command has solved the problem.
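Since substr is a GNU extension, a script that must run with a POSIX-only expr could fall back on the standard “string : regex” match operator instead. The sketch below is an alternative technique rather than the substr approach shown above: it skips the first four characters and captures the next five.

```shell
# The pattern is a basic regular expression that expr anchors at the
# start of the string: skip 4 characters, then capture the next 5.
result=$(expr "0123Linux9" : '.\{4\}\(.\{5\}\)')

echo "$result"    # prints Linux
```

Here the skip count is the 0-based start index, and the capture count is the substring length.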
4. Extracting a Pattern-Based Substring
We’ve learned several ways to extract index-based substrings. Next, in this section, we’ll look into the pattern-based substrings.
The solutions may look different from the index-based ones, but they’re also pretty straightforward to learn.
We’ll address two approaches to solve our problem:
- Using the cut command
- Using the awk command
Further, we’ll have a look at a different pattern-based substring extraction problem.
4.1. Using the cut Command
The cut command is a handy tool for working with field-based data.
Let’s review our problem quickly. Our input string is comma-separated values, “Eric,Male,28,USA”, and our goal is to extract the third field, “28”.
To solve the problem, we can tell cut that the string is separated by commas (-d ,), and ask cut to give us the third field (-f 3):
$ cut -d , -f 3 <<< "Eric,Male,28,USA"
28
We got the expected result and solved the problem.
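The -f option also accepts lists and open-ended ranges, which is handy when we need more than one field. The following sketch reuses the same input; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
INPUT="Eric,Male,28,USA"

# Fields 1 and 3; cut keeps the delimiter between the selected fields.
name_age=$(cut -d , -f 1,3 <<< "$INPUT")

# From the third field through the end of the line.
from_third=$(cut -d , -f 3- <<< "$INPUT")

echo "$name_age"      # prints Eric,28
echo "$from_third"    # prints 28,USA
```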
4.2. Using the awk Command
awk is also good at handling field-based data. A compact awk one-liner can solve the problem:
$ awk -F',' '{print $3}' <<< "Eric,Male,28,USA"
28
Moreover, since awk’s field separator (FS) supports regex, we can build more general solutions with awk.
For instance, if we change the input string by adding a space after each comma, we have “Eric, Male, 28, USA”. This is a common format we can see in the real world.
In that case, the cut command wouldn’t be a good choice to solve the problem. This is because the cut command only supports a single character as the field delimiter.
However, it’s still a piece of cake for awk:
$ awk -F', ' '{print $3}' <<< "Eric, Male, 28, USA"
28
We can even write one awk command to work for both cases. This could be a useful trick in the real world:
$ awk -F', ?' '{print $3}' <<< "Eric, Male, 28, USA"
28
$ awk -F', ?' '{print $3}' <<< "Eric,Male,28,USA"
28
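To make the difference between the two tools concrete, the sketch below runs both on the spaced input and wraps each result in brackets so that any stray whitespace is visible. It also loosens the separator to a comma followed by any number of spaces, which is our own variation on the pattern above.

```shell
#!/usr/bin/env bash
INPUT="Eric, Male, 28, USA"

# cut splits on the comma only, so the space stays attached to the field.
with_cut=$(cut -d , -f 3 <<< "$INPUT")

# A regex separator: a comma followed by zero or more spaces.
with_awk=$(awk -F', *' '{ print $3 }' <<< "$INPUT")

printf '[%s]\n' "$with_cut"    # prints [ 28]
printf '[%s]\n' "$with_awk"    # prints [28]
```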
4.3. A Different Pattern-Based Substring Case
So far, we’ve solved our “Eric’s age” problem. In this problem, our input is a field-based value.
However, in practice, the pattern-based substring may not always be located in a CSV entry. Let’s see another example.
Given that we have an input string, “whatever dataBEGIN:Interesting dataEND:something else”, our goal is to extract the substring between “BEGIN:” and “END:”. That is, between two patterns.
Obviously, the cut command can’t help us in this case. But it’s still not a challenge for awk. It can solve this problem in different ways.
So let’s see how awk solves it. We save the input string in a variable $STR to make the commands easier to read:
$ STR="whatever dataBEGIN:Interesting dataEND:something else"
$ awk -F'BEGIN:|END:' '{print $2}' <<< "$STR"
Interesting data
$ awk '{ sub(/.*BEGIN:/, ""); sub(/END:.*/, ""); print }' <<< "$STR"
Interesting data
The first awk command defines “BEGIN:” or “END:” as the field separator and takes the second field.
However, the second awk solution doesn’t tweak the field separator. Instead, it applies two regex substitutions to achieve the goal:
- sub(/.*BEGIN:/, "") – Removes everything from the beginning of the string up to and including “BEGIN:”
- sub(/END:.*/, "") – Removes everything from “END:” until the end of the input string
After the execution of these two substitutions, we’ll have our expected result. All we need to do is print it out.
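The same two-step idea can also be sketched without awk, using the shell’s prefix and suffix removal expansions. This is an alternative shown for comparison, not one of the awk solutions above; tmp and result are illustrative variable names.

```shell
#!/usr/bin/env bash
STR="whatever dataBEGIN:Interesting dataEND:something else"

# Drop the longest prefix ending in "BEGIN:", like sub(/.*BEGIN:/, "").
tmp="${STR##*BEGIN:}"

# Drop the longest suffix starting with "END:", like sub(/END:.*/, "").
result="${tmp%%END:*}"

echo "$result"    # prints Interesting data
```

Note that these expansions use glob patterns rather than regular expressions, so they suit fixed markers like these better than complex patterns.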
5. Conclusion
Extracting a substring is a fundamental technique of text processing in Linux. Depending on the requirement, the substring extraction can be index-based or pattern-based.
In this article, we addressed how to extract substrings of both types through examples.
We also explored the power of the handy text processing utility awk.