1. Overview
In this tutorial, we’ll focus on tools that print aggregated statistics for the numbers in a file. We’ll be evaluating statistics such as the mean, median, mode, and standard deviation.
2. Setup
Let’s create a sample.txt file containing a list of numbers separated by newlines:
$ echo '1 2 3 4 5 6 7 8 9 10' | tr ' ' '\n' > sample.txt
Here, we’re using echo to output all the numbers from 1 to 10, separated by spaces. Next, we’re piping the results to the tr command, which converts all spaces to newlines.
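Alternatively, if the seq utility is available on our system, we can generate the same file in a single step, since seq prints the integers from 1 to 10 one per line:
$ seq 1 10 > sample.txt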
We can use the cat command to view the contents of the sample.txt file:
$ cat sample.txt
1
2
3
4
5
6
7
8
9
10
Next, we’ll evaluate statistics like the mean, median, mode, and standard deviation of these numbers.
3. Using awk
awk is a powerful scripting language designed for text processing, extraction, and generation of data reports.
awk doesn’t require compilation and allows us to use logical operators, variables, string functions, and numeric functions.
Let’s print the mean of the numbers in the sample.txt file:
$ awk '{a+=$1} END{print "mean = " a/NR}' sample.txt
mean = 5.5
Here, we create a variable named a and add up all the numbers in the first field of our file. In awk, the first field of each input record is represented as $1. In the END block, we divide the value of a by the total number of records (NR) and print the result.
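Along the same lines, we can track the minimum and maximum with a couple of extra variables; here’s a small sketch of the same idea:
$ awk 'NR==1{min=max=$1} {if($1<min)min=$1; if($1>max)max=$1} END{print "min = " min ", max = " max}' sample.txt
min = 1, max = 10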
To get the median, we use gawk, the GNU implementation of awk. gawk provides extra built-in functions, such as asort(), that aren’t available in the standard awk utility.
First, let’s install gawk:
$ sudo apt install gawk
Once installed, let’s get the median:
$ gawk -v max=100 '
  function median(r, s) {
    asort(s, t)
    if (r % 2)
      return t[(r + 1) / 2]
    else
      return (t[r / 2 + 1] + t[r / 2]) / 2
  }
  {
    count++
    values[count] = $1
    if (count >= max) {
      print median(count, values); count = 0
    }
  }
  END {
    print "median = " median(count, values)
  }
' sample.txt
median = 5.5
Here, we’re using the -v flag to set the value of max to 100. This acts as a chunk size: whenever 100 values have been read, the script prints an intermediate median and resets the counter. Since our file contains only 10 numbers, that branch never runs, and only the final median from the END block is printed.
We’re also defining a median() function that sorts the collected values with asort() and returns the middle value, or the average of the two middle values when the count is even.
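When the whole data set comfortably fits in memory, we can skip the chunking logic and compute the median with a more compact gawk one-liner along the same lines:
$ gawk '{v[NR]=$1} END{n=asort(v,s); if(n%2) print "median = " s[(n+1)/2]; else print "median = " ((s[n/2]+s[n/2+1])/2)}' sample.txt
median = 5.5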
Let’s also get the standard deviation of the numbers in the sample.txt file:
$ awk '{total+=$1; totalsq+=$1*$1} END {print "stdev = " sqrt(totalsq/NR - (total/NR)^2)}' sample.txt
stdev = 2.87228
We’re accumulating the sum of the numbers and the sum of their squares, then using them to calculate the standard deviation. Note that this formula yields the population standard deviation; dividing by NR - 1 instead of NR would give the sample standard deviation.
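For completeness, here’s the same one-liner adjusted to divide by NR - 1, which yields the sample standard deviation instead, matching the Stddev values that ministat and st report below:
$ awk '{total+=$1; totalsq+=$1*$1} END {print "sample stdev = " sqrt((totalsq - total*total/NR)/(NR - 1))}' sample.txt
sample stdev = 3.02765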
4. Using ministat
ministat is a statistics utility that calculates core statistical properties of numerical data read from input files or standard input.
It’s a tool from FreeBSD but also packaged for popular distributions like Debian and Ubuntu.
On Linux, we can install ministat using the package manager:
$ sudo apt install ministat
Alternatively, we can download, build and install it.
Once it’s installed, let’s print statistical data based on our sample.txt file:
$ cat sample.txt | awk '{print $1}' | ministat -w 70
x <stdin>
+--------------------------------------------------------------------------+
|x x x x x x x x x x|
| |________________________A___M___________________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1 10 6 5.5 3.0276504
Here, we’re printing the data in the sample.txt file with the cat command. Next, we’re piping the result to awk, which prints the first field of each line. Finally, we’re piping the output to ministat, which performs the statistical calculations.
We’ve used the -w flag to set the width of the plot to 70 characters, which takes effect when standard output isn’t a terminal.
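Incidentally, ministat also accepts file names as arguments, so the cat and awk stages aren’t strictly required; the statistics are identical, and only the data set label in the header changes:
$ ministat -w 70 sample.txt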
5. Using perl
perl stands for Practical Extraction and Report Language. It’s very effective in printing reports based on data input through a file or standard input. It has grown into a general-purpose language widely utilized for writing programs from quick one-liners to full-scale applications.
Let’s print the aggregated statistical data of the numbers in the sample.txt file:
$ cat sample.txt | perl -e '
use List::Util qw(max min sum);
@r = ();
while (<>) {
    chomp;
    $sqtotal += $_ * $_;
    push(@r, $_);
}
$count = @r; $total = sum(@r); $average = $total / @r;
$m_num = max(@r); $mm_num = min(@r);
$stdev = sqrt($sqtotal / $count - ($total / $count) * ($total / $count));
$middle_num = int(@r / 2);
@srtd = sort { $a <=> $b } @r;
if (@r % 2) {
    $median = $srtd[$middle_num];
} else {
    $median = ($srtd[$middle_num - 1] + $srtd[$middle_num]) / 2;
}
print "records:$count\nsum:$total\navg:$average\nstd:$stdev\nmed:$median\nmax:$m_num\nmin:$mm_num\n";'
records:10
sum:55
avg:5.5
std:2.87228132326901
med:5.5
max:10
min:1
We’re using the -e flag to pass our Perl code to the interpreter on the command line. Here’s a breakdown of some parts of the script:
- use List::Util qw(max min sum): this imports the max, min, and sum functions from the List::Util module.
- @r=(): we’re declaring an array variable named @r and initializing it to an empty list
- while(<>)…: this loop reads the input line by line, strips the trailing newline with chomp, adds the square of each number to $sqtotal, and pushes the number onto the @r array.
Then, we’re creating and evaluating variables that represent the number of records ($count), sum ($total), average ($average), standard deviation ($stdev), median ($median), max ($m_num), and min ($mm_num). Note that we sort with sort { $a <=> $b } to get a numeric sort; Perl’s default sort is lexicographic and would place 10 before 2, which skews the median.
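If we only need one or two of these values, Perl’s command-line switches let us shrink the script considerably. As a minimal sketch, this one-liner imports sum0 from List::Util and prints just the average:
$ perl -MList::Util=sum0 -lne 'push @n, $_; END { print "avg:", sum0(@n)/@n }' sample.txt
avg:5.5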
6. Using datamash
GNU datamash is a command-line utility that performs textual, numerical, and statistical operations on data files or standard input. It’s portable and helps automate analysis pipelines without writing custom code or even short scripts.
Let’s install datamash from the local package manager:
$ sudo apt install datamash
Once installed, let’s print aggregated statistical data based on the numbers in the sample.txt file:
$ cat sample.txt | datamash sum 1 mean 1 median 1 mode 1 sstdev 1
55 5.5 5.5 1 3.0276503540975
Here, we’re using datamash to print the sum, mean, median, mode, and sample standard deviation. Since every number in our file appears exactly once, the reported mode of 1 isn’t particularly meaningful; the mode is only informative when some values repeat.
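datamash supports many other operations as well; for example, here’s a quick sketch that reports the minimum, maximum, range, and count, with the output fields separated by tabs:
$ datamash min 1 max 1 range 1 count 1 < sample.txt
1 10 9 10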
7. Using st
st is a simple command-line utility to display statistics of numbers from standard input or a file.
To install it, we first download it from its repository on GitHub:
$ git clone https://github.com/nferraz/st.git
Then, let’s navigate into the directory and use the perl command to generate the build files:
$ cd st && perl Makefile.PL
Generating a Unix-style Makefile
Writing Makefile for App::St
Writing MYMETA.yml and MYMETA.json
Finally, we use the make command to build and install st:
$ sudo make install
Manifying 1 pod document
Manifying 1 pod document
Appending installation info to /usr/local/lib/x86_64-linux-gnu/perl/5.30.0/perllocal.pod
After installation, we can navigate back to our working directory to generate aggregated statistical data:
$ st sample.txt
N min max sum mean stddev
10 1 10 55 5.5 3.02765
It’s also possible to print only specific statistics by using the available options, such as --sum:
$ st --sum sample.txt
55
8. Using clistats
clistats is a command-line utility for the computation of statistical data from a set of delimited input numbers.
The numbers can be separated by either commas or tabs, with a comma being the default delimiter.
We can pass input from a file, through a pipe, or via standard input.
To use clistats, let’s first download it from its repository:
$ git clone https://github.com/dpmcmlxxvi/clistats.git
Next, we can navigate into the downloaded directory and run the make command to build clistats:
$ cd clistats && make
g++ -O2 src/clistats.cpp -o clistats
This creates an executable named clistats in the directory, which we’ll use to generate the reports.
Finally, let’s copy the sample.txt file to the clistats directory and then generate aggregated statistical data:
$ ./clistats < sample.txt
#=================================================================
# Statistics
#=================================================================
# Dimension Count Minimum Mean Maximum Stdev
#-----------------------------------------------------------------
1 10 1.000000 5.500000 10.000000 2.872281
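The Dimension column in the output above suggests that clistats can also handle multi-column input, treating each delimited column as a separate dimension. Assuming it parses comma-separated columns that way, a quick sketch with comma-separated pairs would look like this (output omitted):
$ printf '1,10\n2,20\n3,30\n' | ./clistats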
9. Conclusion
In this article, we’ve looked at some Linux tools that are useful for generating aggregated statistical reports. These reports include statistics such as the max, min, mean, median, mode, and standard deviation.