如何在Linux中比较两个文本文件中的特定列

1. Overview

In this tutorial, we’re going to compare a specific column from two text files that contain comma-separated or any character-separated values. Then, we print out the differences along with the entire line.

To compare, we’ll use essential Linux tools such as the awk, join, and grep commands.

All examples have been tested on Debian 12 with awk 5.2.1, join 9.1, and grep 3.8.

2. Sample Files

To start, let’s create two comma-separated files:

$ mkdir compare && cd compare
$ cat > file1 << EOF
aaa,111,red
bbb,222,green
ccc,333,yellow
ddd,444,blue
eee,555,white
EOF
$ cat > file2 << EOF
aaa,112,red
bbc,222,green
ccp,333,yellowish
ccp,333,yellowish
ddd,444,blue
EOF

We created a directory (compare) using mkdir and then entered the directory with cd. Next, we used the cat command to append lines to a file by reading the input until it encountered a certain text (EOF).

Both file1 and file2 have five rows, each containing three columns. In addition, several values differ within each column.

For example, in file1, column one contains bbb, ccc, and eee. While in file2, it’s slightly different with bbc, two ccp‘s, and no eee.

Another example is that in file1, column two has 111, while in file2, it has 112.

3. Using awk

Let’s compare the first column of both files with the awk command:

$ awk -F ',' 'NR==FNR{a[$1];next}!($1 in a)' file1 file2
bbc,222,green
ccp,333,yellowish
ccp,333,yellowish
$ awk -F ',' 'NR==FNR{a[$1];next}!($1 in a)' file2 file1
bbb,222,green
ccc,333,yellow
eee,555,white

The first awk command printed out the differences that exist in file2. The second one printed out the ones in file1.

Let’s break down the command.

First, we tell awk that the delimiter is a comma character (-F ‘,’). We can change the delimiter accordingly.

Then, we set the awk script (NR==FNR{a[$1];next}!($1 in a)).

This part of the script, NR=FNR, checks if the total number of records processed (NR) equals the number in the current input file (FNR). This returns true while processing the first input file. It’s similar to a for-loop mechanism:

for (i = 1; i <= 10; i++)

where NR is i, and FNR is 10.

If NR=FNR is true, then awk executes {a[$1];next}:

a[$1]: create an element in the array (a) with the value of the first column ($1) as the key
next: continue to the next line in the input file

If NR=FNR is false, then awk executes !($1 in a). This script means that if the value in the first column of the second input file doesn’t exist as a key in the array a (note the ! negation operator), then awk prints the line from the second input file.

4. Using join

Another alternative is using the join command. It’s a bit uncommon, but it does the job.

First, we must sort both files by the column we want to compare. In this case, it’s the first column:

$ sort -t ',' -k1,1 file1 > sorted_file1
$ sort -t ',' -k1,1 file2 > sorted_file2

In the command above, we tell the sort command that the delimiter is a comma character (-t ‘,’), and we want to sort the first column (-k,1). Afterward, we store the results into sorted_file* files with the redirection operator (>).

Next, we use the join command to compare the first column of both files:

$ join -t ',' -1 1 -2 1 -a 1 -a 2 -v 1 -v 2 sorted_file1 sorted_file2
bbb,222,green
bbc,222,green
ccc,333,yellow
ccp,333,yellowish
ccp,333,yellowish
eee,555,white

As we can see from the output above, the join command printed out all the differences along with the entire line from both files.

First, we tell the join command that the delimiter is a comma character (-t ‘,’). Next, we compare the first column of both files (-1 1 -2 1), including the unpairable lines from both files (-a 1 -a 2). Finally, we select non-matching lines from each file (-v 1 -v 2). The last two parameters (sorted_file1, sorted_file2) are the input files.

To avoid creating temporary files for the sorted files:

$ join -t ',' -1 1 -2 1 -a 1 -a 2 -v 1 -v 2 <(sort -t ',' -k1,1 file1) <(sort -t ',' -k1,1 file2)

The <(…) template is Bash’s process substitution, which enables the output of a command to be treated as a temporary file. In this case, we used two process substitutions to pass the sorted outputs of the sort commands to the join command.

5. Using grep

Yet another less conventional approach is by using grep, combined with the cut and sort commands:

$ grep -v -F -f <(cut -d ',' -f1 file2 | sort -u) file1
bbb,222,green
ccc,333,yellow
eee,555,white
$ grep -v -F -f <(cut -d ',' -f1 file1 | sort -u) file2
bbc,222,green
ccp,333,yellowish
ccp,333,yellowish

Let’s break down the command.

First, we need to extract the first column’s values from each file (cut -d ‘,’ -f1 file*). We tell the cut command that the delimiter is a comma character (-d ‘,’), and the input file is file* (-f1 file*):

$ cut -d ',' -f1 file2
aaa
bbc
ccp
ccp
ddd

Afterward, we redirect the results (|) to the sort command (sort -u). The sort command then sorts them uniquely (-u):

$ cut -d ',' -f1 file2 | sort -u
aaa
bbc
ccp
ddd

Notably, the -u option removed the duplicate ccp.

Finally, we use the grep command (grep -v -F -f <(…) file1) to find and print lines in file1 where its first column does not exist in the first column of file2:

$ grep -v -F -f <(cut -d ',' -f1 file2 | sort -u) file1
bbb,222,green
ccc,333,yellow
eee,555,white

The options that we use for grep:

-v: find non-matching lines
-F: interpret search pattern as fixed strings, not regex
-f <(…): search the patterns received from Bash’s process substitution
file1: input file

Like the awk command, we need to run the grep command twice to print out the differences between both files.

6. Conclusion

In this article, we learned how to compare a specific column in comma-delimited files using Linux basic tools like awk, join, and sort. While all three commands can achieve our goal, the awk command might be the most suitable for this type of operation.

Persistence

REST

Security