如何在Bash中按第N列对CSV文件进行排序

1. Overview

In this tutorial, we’ll explore how to sort a CSV file with a header row by nth column in Bash.

All commands have been tested on Debian 12 running GNU Bash 5.2.15, tail 9.1, awk 5.2.1, sort 9.1, csvkit 1.0.7, and miller 6.6.0.

2. Sample File

Let’s create a sample CSV file consisting of a header row and ten data records:

$ mkdir sortcsv && cd sortcsv
$ cat > sample.csv << EOF
id,title,author,published_date
1,Linux from Scratch,Gerard Beekmans,1999
2,Practical Rust,Rustacean Team,2023
3,System Design Interview,Alex Xu,2020
4,The Amulet of Samarkand,Jonathan Stroud,2003
5,Atomic Habits,James CLear,2018
6,The Anatomy of the State,Murray N. Rothbard,1974
7,The Richest Man in Babylon,George S. Clason,1926
8,The Great Reset,Klaus Schwab,2020
9,1984,George Orwell,1949
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
EOF

We created a directory using the mkdir command, and then entered the directory with cd. Afterward, we used the cat command to append lines to a file by reading the input until it encountered a certain text (EOF).

3. Using tail and sort

We’re going to use the sort command to sort the records. However, sort doesn’t support the option to ignore the first or header line.

Therefore, we’re going to use tail to get the content of the file from the second line to the end of the file, and pass them to the sort command:

$ tail -n+2 sample.csv | sort
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
1,Linux from Scratch,Gerard Beekmans,1999
2,Practical Rust,Rustacean Team,2023
...

The option -n+2 for the tail command means show the file content starting with line two. Then, we pipe the tail output using the | operator to the sort command.

3.1. Sorting the Lines Alphabetically vs by the ASCII Code

The sort command sorts the entire lines alphabetically by default. Occasionally, the command might sort the data slightly differently, such as:

$ tail -n+2 sample.csv | sort
1,Linux from Scratch,Gerard Beekmans,1999
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
2,Practical Rust,Rustacean Team,2023
...

The sort command above sorted the lines by the ASCII code, as indicated by printing the line “1,Linux …” first, followed by “10,The …“. This is because the ASCII code of the comma (,) and 0 characters is 44 and 48 respectively.

This sorting mode can be triggered by setting the LC_ALL=C environment variable:

$ tail -n+2 sample.csv | LC_ALL=C sort
1,Linux from Scratch,Gerard Beekmans,1999
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
2,Practical Rust,Rustacean Team,2023
...

In the command above, we temporarily set the LC_ALL=C for the sort command, so it won’t affect the LC_ALL of the shell.

3.2. Sorting the CSV File by the Nth Column

Let’s sort the CSV records by the published_date or the fourth column:

$ tail -n+2 sample.csv | sort -t "," -k 4
7,The Richest Man in Babylon,George S. Clason,1926
9,1984,George Orwell,1949
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
...

Let’s review the options we used for the sort command:

-t “,”: specifies the delimiter as a comma character
-k 4: the column index

At this point, *we’ve successfully sorted the records by the nth column using the tail and sort commands*.

4. Using awk and sort

In the previous section, we used tail to get the file content without the header line.

An alternative to tail is awk:

$ awk 'NR==1; NR > 1 {print $0 | "sort -t, -k4"}' sample.csv
id,title,author,published_date
7,The Richest Man in Babylon,George S. Clason,1926
9,1984,George Orwell,1949
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
...

The command printed out the same result, plus the header line.

The command above consists of three parts:

awk: the awk command itself
‘NR==1; NR > 1 {print $0 | “sort -t, -k4”}’: the awk script or program
sample.csv: the CSV file

Let’s break down the awk script:

NR==1: if the record number is 1, then simply print out the line
NR > 1 {print $0 | “sort -t, -k4”}: if the record number is greater than 1, print the line and pipe it to sort

While the awk command may seem more complicated than the tail command, awk is more sophisticated and flexible due to its program feature.

Additionally, we also got the header line printed out without any extra command.

5. Using csvkit

csvkit is a command-line tools suite designed to work with CSV files specifically.

5.1. Installation

The csvkit package is available on the Debian official repository under the canonical name csvkit:

$ apt search csvkit
csvkit/stable,stable 1.0.7-1 all
command-line tools for working with CSV
$ sudo apt install csvkit
$ csvsort --version
csvsort 1.0.7

Since csvkit is a suite, it has a bunch of executables. So, to verify that the installation finished successfully, we could use one of its binaries, such as csvsort, to check its version.

5.2. Sorting the CSV File by the Nth Column

Let’s sort the sample file by the fourth column with csvsort:

$ csvsort -c4 sample.csv
id,title,author,published_date
7,The Richest Man in Babylon,George S. Clason,1926
9,1984,George Orwell,1949
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
...

The csvsort command assumes the CSV file has the header line, otherwise, we need to pass the -H option.

The only option we used is -c, which is to specify the column index.

The command is effortless and straightforward, although we might notice a slight decrease in performance. Nevertheless, it does the job.

In case we have a huge CSV file, we might need to explore this further.

6. Using miller

Similar to csvkit, miller is also a command-line tool. However, miller supports various data formats, including but not limited to CSV, TSV, and JSON.

6.1. Installation

The miller package is also available on the Debian official repository under the package name miller:

$ apt search miller
miller/stable,now 6.6.0-2+b3 amd64
name-indexed data processing tool
...
$ sudo apt install miller
$ mlr --version
mlr 6.6.0

Once the installation is finished, we can verify if it was installed successfully by running the miller binary (mlr) to check its version.

6.2. Sorting the CSV File by the Nth Column

Let’s sort the sample file by the published_date column with mlr:

$ mlr --csv sort -f published_date sample.csv
id,title,author,published_date
7,The Richest Man in Babylon,George S. Clason,1926
9,1984,George Orwell,1949
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
...

Let’s review the options:

–csv: specifies the input file is in CSV format
sort -f published_date: sorts the table by the published_date column
sample.csv: the CSV file

The mlr command has various sorting options:

-f: lexical ascending
-r: lexical descending
-c: case-folded lexical ascending
-cr: case-folded lexical descending
-n: numerical ascending; nulls sort last
-nf: same as -n
-nr: numerical descending; nulls sort first
-t: natural ascending
-tr|-rt: natural descending

Let’s sort the table by the id column in descending order:

$ mlr --csv sort -r id sample.csv
id,title,author,published_date
9,1984,George Orwell,1949
8,The Great Reset,Klaus Schwab,2020
7,The Richest Man in Babylon,George S. Clason,1926
...
2,Practical Rust,Rustacean Team,2023
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
1,Linux from Scratch,Gerard Beekmans,1999

Notice that row ten with id 10 got sorted in lexical descending order. If we want the table to be sorted naturally, we can use the -t option:

$ mlr --csv sort -t -r id sample.csv
id,title,author,published_date
10,The Gulag Archipelago,Aleksandr Solzhenitsyn,1973
9,1984,George Orwell,1949
8,The Great Reset,Klaus Schwab,2020
...

Further, we can prettify the output by using the –opprint option:

$ mlr --csv --opprint sort -t -r id sample.csv
id title author published_date
10 The Gulag Archipelago Aleksandr Solzhenitsyn 1973
9 1984 George Orwell 1949
8 The Great Reset Klaus Schwab 2020
...

At this point, we’ve successfully sorted the CSV file by a specific column name using miller.

As a reference, the miller package has many more features that we can utilize to manipulate data files, which we can learn further from its website.

7. Conclusion

In this article, we learned how to sort a CSV file by the nth column using various tools.

First, we used the basic Linux tools, such as tail and awk, to get the file’s content. Afterward, we used the sort command to sort them. Finally, we used more sophisticated tools such as csvkit and miller, which offer more features to deal with data files.

While all tools provide a solution to sort a CSV file, they might have different performance characteristics, which we may need to explore further if we’re dealing with big data files.

Persistence

REST

Security