1. Overview
In Linux, we often need to handle files or text in CSV format. In this quick tutorial, we’ll explore how to split CSV data and store the result in Bash.
2. Introduction to the Problem
As usual, we’ll understand the problem through an example. First, let’s see our input:
$ echo "$INPUT"
Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features
As the above output shows, we have one line of comma-separated values stored in the variable $INPUT. Note that the values themselves contain spaces, so splitting on whitespace would break them apart.
Our goal is to parse the value of the $INPUT variable in Bash.
The input is a single line, so we’ll first focus on solving the single-line scenario.
In the end, we’ll extend the solution to the multi-line case.
3. How About Using awk?
We know awk is a powerful text processing tool on the Linux command line. In particular, it’s good at handling column-based data, such as CSV. So, for example, we can parse $INPUT’s value easily with awk:
$ awk -F',' '{ for( i=1; i<=NF; i++ ) print $i }' <<<"$INPUT"
Kotlin Corroutines
Java Stream API
Ruby On Rails
Other Language Features
As the output shows, awk has split the input text correctly.
However, it’s worth mentioning that the parsed values exist only inside the awk process. If we just want to transform the input text into another format, awk is a good choice. But if, for example, we want to save each parsed value in a shell variable for further use, we need to do that in the shell itself.
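When only one field matters, the shell can capture awk’s output with command substitution. The following is a minimal sketch, assuming the same $INPUT value as above; the variable name second_field is chosen purely for illustration:

```shell
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"

# awk prints only the second comma-separated field;
# command substitution captures that output in a shell variable.
second_field=$(awk -F',' '{ print $2 }' <<<"$INPUT")

echo "Second field: $second_field"
```

This works for a single value, but it needs one awk call per field, which is why an array is usually the better fit when we want all values.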
4. Preprocessing the Data Then Storing Them in an Array
We’ve discussed that awk cannot directly provide the parsed results to the shell. However, since it can parse the CSV data and transform it into the multi-line format, we can use awk to transform the input and then save the multi-line values in an array.
If our Bash is version 4 or above, we can use the built-in readarray command to read the awk command’s output:
$ readarray -t the_array < <(awk -F',' '{ for( i=1; i<=NF; i++ ) print $i }' <<<"$INPUT")
$ declare -p the_array
declare -a the_array=([0]="Kotlin Corroutines" [1]="Java Stream API" [2]="Ruby On Rails" [3]="Other Language Features")
We use the readarray command to store the preprocessed data into the array variable the_array. Also, as the declare command’s output shows, the the_array array contains the expected data.
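As a side note, readarray in Bash 4.4 and later also accepts a -d option for a custom delimiter, so the awk step can be skipped for simple input like ours. Here is a sketch under that version assumption; the array name direct_array is illustrative:

```shell
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"

# -d , makes readarray split on commas instead of newlines,
# and -t strips that delimiter from each element.
# printf '%s' is used because a here-string would append a newline
# to the last element.
readarray -d , -t direct_array < <(printf '%s' "$INPUT")

declare -p direct_array
```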
The readarray command is convenient, but it’s not available for older Bash versions.
However, the read command is available in all Bash versions. Therefore, we can adjust the IFS variable and use the read command to store the preprocessed data into an array variable:
$ IFS=$'\n' read -r -d '' -a the_array2 < <(awk -F',' '{ for( i=1; i<=NF; i++ ) print $i }' <<<"$INPUT")
$ declare -p the_array2
declare -a the_array2=([0]="Kotlin Corroutines" [1]="Java Stream API" [2]="Ruby On Rails" [3]="Other Language Features")
It’s worth mentioning that the IFS assignment is a prefix of the read command, so it applies only to that single command. The IFS value of the current shell stays unchanged.
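We can check the scoping of IFS with a small sketch. It uses tr instead of awk to keep the preprocessing short, and the names ifs_before and lines are illustrative. It also guards the read call with "|| true", because read with -d '' returns a non-zero status when the input ends without a NUL byte:

```shell
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"

# Remember the shell's IFS before running read.
ifs_before=$IFS

# The IFS assignment is a prefix of this one read command only.
# read exits non-zero here because the input ends before a NUL
# delimiter is found, so "|| true" keeps "set -e" scripts running.
IFS=$'\n' read -r -d '' -a lines < <(tr ',' '\n' <<<"$INPUT") || true

# The shell's own IFS is the same as before.
if [[ $IFS == "$ifs_before" ]]; then
    echo "IFS is unchanged"
fi
declare -p lines
```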
5. Setting IFS=,
In the previous example, we set IFS=$'\n' and used the read command to store the preprocessed data in an array. Alternatively, we can set IFS=, (comma) and read each value from the raw input directly.
For example, if we know the number of fields, we can read and assign each value to an individual shell variable:
$ cat four_fields.sh
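#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"
# Split on commas only, so spaces inside a field are preserved
IFS=,
# -r keeps backslashes literal; quoting protects the here-string from word splitting
read -r KOTLIN JAVA RUBY OTHER <<<"$INPUT"
# Verify the result
echo 'Var $KOTLIN has the value:'"$KOTLIN"
echo 'Var $JAVA has the value:'"$JAVA"
echo 'Var $RUBY has the value:'"$RUBY"
echo 'Var $OTHER has the value:'"$OTHER"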
For easier demonstration, we’ve put everything in a shell script named four_fields.sh. The script first sets IFS=,. Since we know the input has four fields, read then assigns each value to its own variable. Finally, we print the four variables to verify that the input was parsed correctly.
Now, let’s execute the script:
$ ./four_fields.sh
Var $KOTLIN has the value:Kotlin Corroutines
Var $JAVA has the value:Java Stream API
Var $RUBY has the value:Ruby On Rails
Var $OTHER has the value:Other Language Features
As the output shows, this approach works as expected.
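It helps to know what read does when the number of variables doesn’t match the number of fields: the last variable receives the remainder of the line, delimiters included. A small sketch, with the illustrative variable names first and rest:

```shell
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"

# Two variables but four fields: the last variable receives the rest
# of the line, including the commas.
IFS=, read -r first rest <<<"$INPUT"

echo "first: $first"
echo "rest:  $rest"
```

So, assigning to individual variables is only reliable when the field count is fixed and known.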
In case we don’t know the number of fields, we can still set IFS=, and use read to store the values into an array:
$ cat ./fields_in_array.sh
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"
IFS=,
read -r line <<<"$INPUT"
# Unquoted on purpose: word splitting on IFS=, produces the array elements
FIELDS=( $line )
# Verify the result
declare -p FIELDS
$ ./fields_in_array.sh
declare -a FIELDS=([0]="Kotlin Corroutines" [1]="Java Stream API" [2]="Ruby On Rails" [3]="Other Language Features")
As the example above shows, we can use the read command to parse CSV data and store values in an array by setting IFS=,.
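A more compact variant lets read fill the array itself through its -a option, which avoids changing IFS for the whole script. This is a sketch of that alternative, not a replacement for the script above; the names fields and field are illustrative:

```shell
#!/bin/bash
INPUT="Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features"

# IFS=, applies only to this read; -a stores the fields in an array
# and -r keeps backslashes literal.
IFS=, read -r -a fields <<<"$INPUT"

# Quote the expansion so each field stays one word despite its spaces.
for field in "${fields[@]}"; do
    echo "[$field]"
done
```

Because nothing is expanded unquoted here, a field such as * isn’t subject to filename expansion, which the unquoted $line in the array assignment would be.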
6. When the Input Is a Multi-Line CSV File
So far, we’ve explored how to parse and store the single-line CSV input. If we need to handle a multi-line CSV file, we can extend the solutions with a while loop.
Let’s see an example:
$ cat input.csv
Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features
Kotlin Corroutines 2,Java Stream API 2,Ruby On Rails 2,Other Language Features 2
Kotlin Corroutines 3,Java Stream API 3,Ruby On Rails 3,Other Language Features 3
Now, we have a CSV file input.csv containing three lines.
Let’s see how to parse every line using while and read:
$ cat ./fields_in_file.sh
#!/bin/bash
IFS=,
while read -r line; do
    # Unquoted on purpose: word splitting on IFS=, produces the array elements
    FIELDS=( $line )
    # Use the array for further operations
    # ...
    declare -p FIELDS
done < input.csv
$ ./fields_in_file.sh
declare -a FIELDS=([0]="Kotlin Corroutines" [1]="Java Stream API" [2]="Ruby On Rails" [3]="Other Language Features")
declare -a FIELDS=([0]="Kotlin Corroutines 2" [1]="Java Stream API 2" [2]="Ruby On Rails 2" [3]="Other Language Features 2")
declare -a FIELDS=([0]="Kotlin Corroutines 3" [1]="Java Stream API 3" [2]="Ruby On Rails 3" [3]="Other Language Features 3")
As we can see, the script fields_in_file.sh extends the previous fields_in_array.sh with a while loop.
After parsing a line and storing its values in the array FIELDS, we just print the content of FIELDS via the declare command. In a real script, we’d use the array for further processing instead.
When we execute the script, we can see that it prints the expected output.
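As an illustration of such further processing, the sketch below collects the second column of every line. It creates its own two-line sample file with mktemp so that it’s self-contained; the names csv_file, fields, and second_column are illustrative, and it uses read -a with a per-command IFS rather than a script-wide one:

```shell
#!/bin/bash
# Create a small sample file so the sketch is self-contained.
csv_file=$(mktemp)
printf '%s\n' \
    "Kotlin Corroutines,Java Stream API,Ruby On Rails,Other Language Features" \
    "Kotlin Corroutines 2,Java Stream API 2,Ruby On Rails 2,Other Language Features 2" \
    > "$csv_file"

second_column=()
# IFS=, is scoped to each read call; -a fills the array fields per line.
while IFS=, read -r -a fields; do
    # Array indexes start at 0, so index 1 is the second field.
    second_column+=( "${fields[1]}" )
done < "$csv_file"

rm -f "$csv_file"
declare -p second_column
```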
7. Conclusion
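In this article, we’ve learned how to parse CSV data in Bash through examples. We can either preprocess the data with awk and read the result into an array, or set IFS=, and let read split the values directly. With a while loop, the same approach extends to multi-line CSV files.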