1. Overview
In this tutorial, we’ll learn how we can add up a column of numbers in a Bash shell. We’ll take a closer look at some of the Bash utilities that can be used for this purpose. We’ll also benchmark the performance of the solutions offered.
2. Setup
First, let’s set up our input file, which we’re going to use in most of the tutorial:
$ for i in `seq 1000000`; do echo $(($RANDOM%100)); done >numbers.csv
Here, we’re generating a file, numbers.csv, which contains one million random numbers within the range 1-100. We’re using the seq command to run a for loop for generating 1,000,000 numbers using the RANDOM built-in variable.
In the coming sections, we’ll also look at the local speed of the solutions offered using the time command to get a feel for how each command performs.
3. Using the awk Tool
Let’s start with the awk command to calculate the sum of numbers in a column:
$ awk '{Total=Total+$1} END{print "Total is: " Total}' numbers.csv
Total is: 49471228
Now, let’s take a look at the execution time using the time command:
$ time awk '{Total=Total+$1} END{print "Total is: " Total}' numbers.csv
Total is: 49471228
real 0m0.228s
user 0m0.141s
sys 0m0.047s
It’s pretty fast! We could calculate the sum of a million numbers in 0.228 seconds. In fact, awk is one of the most powerful tools in Bash for file processing.
3.1. When the File Contains Multiple Columns
So far, we know a way to add up numbers in a column using awk. Let’s take a look at the case where we have multiple columns in a file and we’re interested in calculating the sum for a particular column only:
$ cat prices.csv
Books,40
Bag,70
Dress,80
Box,10
Here, the file prices.csv contains two columns. Now, let’s calculate the sum of elements in the second column:
$ awk -F "," '{Total=Total+$2} END{print "Total is: " Total}' prices.csv
Total is: 200
3.2. When the File Contains a Header Line
Occasionally, text or CSV files also contain a header line. This header line generally holds the column names, for better readability. Let’s modify our prices.csv and add a header line:
$ cat prices.csv
Item,Value
Books,40
Bag,70
Dress,80
Box,10
When the file contains a header line, we’ll want to eliminate this header line before the text processing takes place. There are several ways to achieve this. In this case, we’ll use the awk tool to ignore the header line. So, let’s go ahead and modify our command to calculate the column sum:
$ awk -F "," 'NR!=1{Total=Total+$2} END{print "Total is: " Total}' prices.csv
Total is: 200
In the next sections, we’ll check a few other ways to add up numbers in a column and assess how the awk solution performs relative to those methods.
4. Iterating with the Bash Loops
awk is a great tool, however, we could also use a loop to iterate over the column values.
4.1. Using the expr Command
Let’s run an experiment and check the effectiveness of the expr command to calculate the sum within the for loop:
$ time (sum=0;for number in `cat numbers.csv`; do sum=`expr $sum + $number`; done; echo "Total is: $sum")
Total is: 49471228
real 212m48.418s
user 7m19.375s
sys 145m48.203s
The processing was terribly slow. With the expr command, it took over 3.5 hours to add a million numbers. Notably, the expr utility is a holdover from the early days of Bash and we should only use it in cases where our scripts need to be interoperable with legacy (pre-POSIX) implementations.
4.2. Using Arithmetic Expansion
Since the use of the expr command didn’t help much, let’s try another approach using arithmetic expansion:
$ time (sum=0;for number in `cat numbers.csv`; do sum=$((sum+number)); done; echo "Total is: $sum")
Total is: 49471228
real 0m1.961s
user 0m1.813s
sys 0m0.125s
Here, we are calculating the sum using the arithmetic expansion, with the $((..)) form. Contrary to the expr command, using arithmetic expansion, we were able to add a million numbers within two seconds. Arithmetic expansion allows us to perform simple integer arithmetic. However, it doesn’t work with floating-point numbers. Hence, for floating-point operations, we must use the bc command. We’ll check the implementation of the bc command in the next section.
5. Adding Values With the bc Command
The bc command performs a calculation on a single-line expression. Therefore, we’ll need to combine the numbers into a single line, separated by the addition operator. We’ll then pass the expression to bc to calculate the total. Let’s look at a few ways to accomplish this.
5.1. Using the paste Command
First, let’s look at the paste command to arrange the first 10 numbers of our dataset on a single line, with a plus (+) operator between them:
$ cat numbers.csv| head -10 | paste -sd+ -
2+44+6+15+23+0+15+88+82+1
The option -s ensures that paste joins all the numbers in a single line. We also specified the d+ option to add the ‘+’ character as a delimiter while joining the entries.
With that, we’re ready to provide this sequence as stdin to the bc command:
$ time echo "Total is: $(cat numbers.csv | paste -sd+ - | bc)"
Total is: 49471228
real 0m0.244s
user 0m0.203s
sys 0m0.063s
Notably, the performance is better than what we observed with Bash loops (~ 2 seconds). Also, it came close but couldn’t beat the performance of the awk command (0.228 seconds).
5.2. Using the tr Command
Similar to the paste command, let’s generate a sequence again using the tr command:
$ cat numbers.csv | head -10 |tr "\n" "+"
2+44+6+15+23+0+15+88+82+1+
Here, we translated each newline (‘\n’) to the plus (‘+’) character. But, notice the extra ‘+’ at the end of the sequence. As a workaround, we can add an extra zero at the end to take care of this before we pass it to the bc command:
$ cat numbers.csv | head -10 |tr "\n" "+" ; echo "0"
2+44+6+15+23+0+15+88+82+1+0
Now, let’s redirect the output to the bc command:
$ time ((cat numbers.csv | tr "\n" "+" ; echo "0") | bc)
49471228
real 0m0.217s
user 0m0.203s
sys 0m0.031s
A combination of tr and bc command executes faster than the awk solution.
5.3. Using the sed Command
Finally, we’ll use the sed command to generate the sequence:
$ cat numbers.csv | head -10 | sed -z 's#\n#+#g'
2+44+6+15+23+0+15+88+82+1+
Again, we’ve replaced newlines (‘\n’) with plus (‘+’) characters using the search and replace option of the sed command. Also, we’re printing zero at the end to deal with the extra plus operator, similar to the previous section:
$ time ((cat numbers.csv | sed -z 's#\n#+#g' ; echo "0") | bc)
49471228
real 0m0.343s
user 0m0.281s
sys 0m0.109s
Here, using the -z option changes the meaning of the newline for the sed command. It will no longer consider \n to be the end of the line, and instead, the null character will be interpreted as the end of the line. In effect, we could replace the newline (‘\n’) with the plus (‘+’) character.
Note that replacing the characters with sed is slow as compared to the tr and the paste options.
Before we conclude, we should be aware that non-awk alternatives might run faster for our file containing a single numeric column. But in many real-world scenarios, the files will contain multiple columns and some additional information is to be stripped off (somewhat similar to our discussion in Section 3.1) before the actual calculation happens.
In such cases, awk should be the go-to tool because all the speed advantages of non-awk alternatives will be eaten up by the time taken to pre-process the file for extracting a single column before calculating the sum of its elements.
6. Conclusion
In this tutorial, we discussed ways to add up a column of numbers in a Bash shell.
We started by discussing the solution using the awk tool. Additionally, we also explored ways to handle files containing multiple columns or a header line.
Then we implemented the solution using Bash loops. Finally, we presented techniques to calculate the sum with the bc command by converting the number into a single-line expression with the tr, paste, and sed utilities.