1. Overview

Nowadays, data science impacts virtually all aspects of our lives. We usually associate it with sophisticated algorithms and specialized software. However, we can easily perform data processing with Bash tools as well.

In this tutorial, we’ll cover the basics of data science with Bash.

2. Dataset

As an example dataset, let's collect the IPv4 activity data of our Linux box. To do so, we'll use the sar command to dump the number of datagrams every three minutes (180 s):

$ sar -n IP 180 > net_data.txt
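
By default, this invocation keeps sampling until we stop it with Ctrl+C, at which point sar also prints the Average row. Alternatively, we can pass a sample count after the interval so that sar finishes on its own; the count of 20 below is just an assumption matching the dataset shown in this tutorial:

$ sar -n IP 180 20 > net_data.txt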

Let’s take a look at the collected data in the net_data.txt file:

Linux 6.2.0-37-generic (ubuntu)     29.11.2023     _x86_64_    (4 CPU)

17:49:00       irec/s  fwddgm/s    idel/s     orq/s   asmrq/s   asmok/s  fragok/s fragcrt/s
17:52:00        70.64      0.00     70.64     48.79      0.00      0.00      0.00      0.00
17:55:00        79.14      0.00     79.14     50.12      0.00      0.00      0.00      0.00
17:58:00        17.99      0.00     17.99     12.46      0.00      0.00      0.00      0.00
18:01:00        19.12      0.00     19.12     16.00      0.00      0.00      0.00      0.00
18:04:00        13.93      0.00     13.93     10.84      0.00      0.00      0.00      0.00
18:07:00        44.97      0.00     44.97     36.77      0.00      0.00      0.00      0.00
18:10:00         3.13      0.00      3.13      3.72      0.00      0.00      0.00      0.00
18:13:00        48.21      0.00     48.21     28.32      0.00      0.00      0.00      0.00
18:16:00        81.50      0.00     81.50     57.01      0.00      0.00      0.00      0.00
18:19:00         9.73      0.00      9.73     10.46      0.00      0.00      0.00      0.00
18:22:00        18.03      0.00     18.03     14.08      0.00      0.00      0.00      0.00
18:25:00       100.17      0.00    100.17     84.67      0.00      0.00      0.00      0.00
18:28:00        56.29      0.00     56.29     37.79      0.00      0.00      0.00      0.00
18:31:00        45.04      0.00     45.04     34.93      0.00      0.00      0.00      0.00
18:34:00        71.61      0.00     71.61     67.84      0.00      0.00      0.00      0.00
18:37:00       103.88      0.00    103.88     61.68      0.00      0.00      0.00      0.00
18:40:00        57.96      0.00     57.96     50.42      0.00      0.00      0.00      0.00
18:43:00         7.61      0.00      7.61      8.07      0.00      0.00      0.00      0.00
18:46:00        45.97      0.00     45.97     39.38      0.00      0.00      0.00      0.00
18:49:00        24.08      0.00     24.08     22.77      0.00      0.00      0.00      0.00


Average:        45.95      0.00     45.95     34.81      0.00      0.00      0.00      0.00

Throughout this tutorial, we’ll focus on the irec/s column with the number of received datagrams per second.

3. Preprocessing Data

Before we start the analysis, we should extract the interesting part of the data. Therefore, we need to remove the header and footer and eliminate redundant columns.

3.1. Removing the Header and Footer

To remove leading and trailing rows, we'll use a well-known combination of the tail and head commands in a less obvious way. First, let's print the file starting from the fourth line with tail:

$ tail -n +4 net_data.txt

Note the plus sign before the number of lines in the tail invocation. In this way, we can get rid of the header.
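
To see how the plus sign changes tail's behavior, we can compare both invocations on a trivial input; seq here just stands in for our data file:

$ seq 5 | tail -n 2
4
5
$ seq 5 | tail -n +2
2
3
4
5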

Similarly, let’s insert minus at the front of a number of lines in the head -n call to print all but NUM last lines. So, let’s remove the three-line footer with:

$ head -n -3 net_data.txt
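
As a quick sanity check, let's chain both commands and count the remaining lines; for our sample, we expect exactly the 20 data rows:

$ tail -n +4 net_data.txt | head -n -3 | wc -l
20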

3.2. Extracting Columns

Let’s extract only the relevant columns with cut:

$ cut -f1-2 -d' ' <input_file>

Thanks to the -f1-2 option, we obtain the first and second columns of input_file. We've assumed that fields are separated by a single space, as indicated by the -d' ' option.

However, the problem is that our numbers are separated by multiple spaces. Therefore, we need to squeeze delimiters before processing with cut. Let's use the tr command for this purpose. With the -s 'character' option, it merges all consecutive occurrences of character into one. So, let's chain both commands:

$ cat net_data.txt | tr -s ' ' | cut -f1-2 -d' '
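
To see what tr -s does on its own, we can squeeze a single line whose spacing mimics one row of our file:

$ echo '18:10:00         3.13      0.00' | tr -s ' '
18:10:00 3.13 0.00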

Finally, let’s combine all operations and save cleaned data into a new file, net_data_clean.txt:

$ tail -n +4 net_data.txt | head -n -3 | tr -s ' ' | cut -f1-2 -d' ' > net_data_clean.txt

3.3. Truncating Floating-Point Numbers

An inherent limitation of Bash is its inability to perform floating-point arithmetic. Therefore, let's convert the floating-point numbers to integers. However, we don't want to modify the data file. Instead, let's truncate the numbers on the fly with awk:

$ cat net_data_clean.txt | awk '{print $1, int($2)}'

Here, awk splits each line into fields. By default, the command uses whitespace as a delimiter, exactly as in our data. Then, the first field, $1, is left unchanged, while the second one, $2, is truncated. We can use this construct as a part of a pipe if required.
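
It's worth remembering that int() simply drops the fractional part instead of rounding. A quick check on a made-up line illustrates this:

$ echo '18:52:00 99.87' | awk '{print $1, int($2)}'
18:52:00 99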

At the very end, let’s show the result of applying all the discussed commands:

$ tail -n +4 net_data.txt | head -n -3 | tr -s ' ' | cut -f1-2 -d' ' | awk '{print $1, int($2)}'
17:52:00 70
17:55:00 79
17:58:00 17
18:01:00 19
18:04:00 13
18:07:00 44
18:10:00 3
18:13:00 48
18:16:00 81
18:19:00 9
18:22:00 18
18:25:00 100
18:28:00 56
18:31:00 45
18:34:00 71
18:37:00 103
18:40:00 57
18:43:00 7
18:46:00 45
18:49:00 24

4. Basic Sample Statistics

Let’s describe irec/s data with statistical figures.

4.1. Number of Data Points

The number of data points in the net_data_clean.txt file is equal to the number of lines, so let’s use wc to count them:

$ wc -l net_data_clean.txt
20 net_data_clean.txt
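
Since wc also echoes the file name, we can redirect the file to its standard input if we need the bare number, for example, to keep it in a variable:

$ n_points=$(wc -l < net_data_clean.txt)
$ echo "$n_points"
20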

4.2. Minimum and Maximum

Next, let’s find minimal and maximal values of datagrams per second with sort. It compares arguments as numbers, thanks to the -n option. Then we take the minimal value with head:

$ cut -f2 -d' ' net_data_clean.txt | sort -n | head -1
3.13 #minimal value

Similarly, we use tail to find the maximal value:

$ cut -f2 -d' ' net_data_clean.txt | sort -n | tail -1
103.88 #maximal value

In both cases, we extract the second column with cut's -f2 option.
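
Alternatively, we can obtain both extremes in a single pass without sorting. This is just a sketch that assumes the values sit in the second column of net_data_clean.txt:

$ awk 'NR==1 || $2<min {min=$2} NR==1 || $2>max {max=$2} END {print min, max}' net_data_clean.txt
3.13 103.88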

4.3. Mean Value

Let’s calculate the average value of the number of datagrams in the sample with the help of awk:

$ awk '{ total_sum += $2 } END { print total_sum/NR }' net_data_clean.txt
45.95

Inside awk, we define the total_sum variable to keep the sum of entries. We update it with the current value from the $2 awk variable, which refers to the second field in each line.

At the end, we obtain the mean value by dividing total_sum by the number of processed lines provided in the NR awk variable.
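
If we want all of these figures at once, a single awk pass can print the count, the minimal and maximal values, and the mean together. Again, this is only a sketch based on the layout of net_data_clean.txt:

$ awk 'NR==1 || $2<min {min=$2} NR==1 || $2>max {max=$2} {sum+=$2} END {print NR, min, max, sum/NR}' net_data_clean.txt
20 3.13 103.88 45.95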

5. Plotting Data

Now, let’s visualize our data. We’ll use the command line so that the graph will be somehow sketchy. The plotxy script accepts the file name and shows a simple plot in the terminal:

#!/bin/bash

bar_max="$(printf '=%.0s' {1..99})>" #max value to be shown = 99

while read -r x y; do
    printf '%s:\t|%s\n' "${x}" "${bar_max:0:$y}"
done <"$1"

The bar_max variable is a graphical representation of the maximal y value, which can be displayed in the terminal.

We use the while loop to read the input file line by line. Each line is divided into x and y variables, corresponding to the space-separated columns.

The next part is responsible for the actual plotting. With printf, we print the x value first. Then, we print only the first y characters of bar_max with ${bar_max:0:$y}, thanks to Bash's substring expansion. If the value exceeds the maximal range, the whole bar_max string is shown, and the trailing > character indicates data clipping. This approach requires y to be an integer.
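
A tiny example makes the substring expansion ${parameter:offset:length} clearer; the string and the lengths below are arbitrary:

$ bar="==========>"
$ echo "${bar:0:5}"
=====
$ echo "${bar:0:100}"
==========>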

Let’s take a look at the graph. Of course, the y-axis is horizontal in this approach:

$ ./plotxy <(cat net_data_clean.txt | awk '{print $1, int($2)}')
            17:52:00:|======================================================================
            17:55:00:|===============================================================================
            17:58:00:|=================
            18:01:00:|===================
            18:04:00:|=============
            18:07:00:|============================================
            18:10:00:|===
            18:13:00:|================================================
            18:16:00:|=================================================================================
            18:19:00:|=========
            18:22:00:|==================
            18:25:00:|==================================================================================================>
            18:28:00:|========================================================
            18:31:00:|=============================================
            18:34:00:|=======================================================================
            18:37:00:|==================================================================================================>
            18:40:00:|=========================================================
            18:43:00:|=======
            18:46:00:|=============================================
            18:49:00:|========================

We need to truncate the y values with awk before they reach the script. Therefore, we use process substitution, <(), to pass the command's output as if it were a file.
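
If process substitution feels obscure, we can achieve the same result with an intermediate file; net_data_int.txt is just a throwaway name used for this sketch:

$ awk '{print $1, int($2)}' net_data_clean.txt > net_data_int.txt
$ ./plotxy net_data_int.txt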

6. Histogram

A histogram shows the data distribution. It consists of bins, and the height of each bin tells us how many data points fall into the range spanned by its base. We'll make a histogram of the irec/s reads.

6.1. The awk Solution

We’ll calculate the histogram with an awk script hist_awk:

#!/usr/bin/awk -f

BEGIN{
    bin_width=20;
}

{
    a[NR]=$1 #read whole file
}

END{
    idata = 1;
    ibin = 1;
    while (idata<=NR)
    { 
        if ( a[idata] < ibin * bin_width )
        {
            hist[ibin]+=1; # increase count in this bin
            idata++;       # move to the next data point
        }
        else
        {
            ibin++;        # move to the next bin
            hist[ibin]=0;  # do not count current data here
        }
    }
        
    for (h in hist)
        printf "<%-2.2f %i \n", h*bin_width, hist[h]

}

Let’s examine the algorithm’s assumptions. We create bins of a predefined width of 20. Initially, we don’t know how many bins will be produced. Next, we start the data range from zero. Our data span a range from around three to one hundred datagrams per second, so it’s a quite reasonable assumption. Last but not least, the algorithm works with data points sorted in ascending order.

To fill the bins, we iterate over all data points in the while loop. If the value fits into the current bin, we bump up the bin's count. Otherwise, we move on to the next bin. However, we don't assign the current data point to this new bin right away, as the value can fall outside it too. Instead, the next iteration of the while loop rechecks the same data point.
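
For comparison, let's sketch a one-liner that bins each value by integer division, so it doesn't need sorted input. It's an alternative to hist_awk, not part of it, and the final sort only orders the printed bins:

$ cut -f2 -d' ' net_data_clean.txt | awk -v w=20 '{h[int($1/w)]++} END {for (b in h) printf "<%.2f %d\n", (b+1)*w, h[b]}' | sort -t'<' -k2 -n
<20.00 7
<40.00 1
<60.00 6
<80.00 3
<100.00 1
<120.00 2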

6.2. Running the Script

Now, let’s run our script. Beforehand, we need to extract the second column from our data file net_data_clean.txt and sort it:

$ cat net_data_clean.txt | cut -f2 -d ' ' | sort -n | ./hist_awk
<20.00 7 
<40.00 1 
<60.00 6 
<80.00 3 
<100.00 1 
<120.00 2 

We can learn from the output that, for example, we have seven data points with irec/s lower than 20 and two reads with this figure of at least 100 but lower than 120.

Finally, let’s plot the histogram with plotxy:

$ ./plotxy <(cat net_data_clean.txt | cut -f2 -d ' ' | sort -n | ./hist_awk)
    <20.00:    |=======
    <40.00:    |=
    <60.00:    |======
    <80.00:    |===
   <100.00:    |=
   <120.00:    |==

7. Conclusion

In this article, we used various Bash tools to deal with sample statistics. First, we collected data related to the network IPv4 activity. Then, we extracted the interesting part of it in the preprocessing step. Next, we calculated figures such as the number of data points, the minimal and maximal values in the sample, and the mean value.

After that, we used an ASCII plot to show how the network activity varied over time. Finally, we calculated the histogram of the data sample. Throughout, we focused on combining well-known Bash commands into pipelines to perform more complicated tasks.