1. Overview
When we talk about removing duplicate lines on the Linux command line, many of us may come up with the uniq command and the sort command with the -u option.
Indeed, both commands can remove duplicate lines from input, for example, a text file. However, the uniq command only removes adjacent duplicates, so it requires the input to be sorted, and sort -u sorts the lines before removing duplicates.
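As a quick illustration, uniq only collapses adjacent duplicate lines, which is why unsorted input usually has to pass through sort first (here we use printf simply to generate a few sample lines):
$ printf 'b\na\nb\n' | uniq
b
a
b
$ printf 'b\na\nb\n' | sort | uniq
a
b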
In this tutorial, we’ll explore a method to remove duplicate lines from an input file without sorting.
2. Introduction to the Problem
Before we come to the solution to the problem, let’s discuss the scenarios in which we cannot or shouldn’t sort a file before removing duplicates.
The first reason most of us may think of for avoiding a sort is performance. If our final goal is only to remove the duplicate lines, the sorting step isn’t necessary. Moreover, sorting is relatively expensive, especially for huge input files.
Additionally, sorting a file may change the original order of the lines. Therefore, we shouldn’t sort the file when we want to preserve the order of the lines.
A simple example makes this clear. Let’s say we have a file called input.txt:
$ cat input.txt
Linux
is
Linux
nice
is
In the input file, we have duplicate lines, such as “Linux” and “is”. If we remove duplicate lines while keeping them in the original order, we should get:
Linux
is
nice
However, if we first sort the file and then remove duplicates, we’ll have:
$ sort -u input.txt
is
Linux
nice
As the output above shows, the duplicate lines are removed. However, the lines’ order is not what we expect. Further, it’s pretty hard to restore the original order.
Next, let’s see how to remove duplicate lines from a file without sorting.
3. Removing Duplicate Lines Without Sorting
First, let’s say we have a file called price_log.txt, which holds products’ price updates:
$ cat price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 150, 2019-10-10
Table, 150, 2019-10-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Chair, 57, 2019-05-05
Chair, 57, 2020-02-04
Bed, 400, 2020-07-07
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
Bed, 420, 2020-07-10
As the output shows, the records aren’t sorted, since the file is maintained manually. Apart from that, the file contains some duplicate records.
Now, we would like to keep the records’ original order and remove duplicate lines.
We’ll use the awk command to solve the problem. Let’s first see the solution, then understand how it works:
$ awk '!a[$0]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 150, 2019-10-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Chair, 57, 2020-02-04
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
Bed, 420, 2020-07-10
As the output above shows, such a compact awk one-liner has solved the problem. Next, let’s see how it works.
First, in awk, a non-zero number pattern evaluates to true. Further, a true pattern triggers the default action: print. So, for example, awk '42' input_file will print all lines in input_file.
Conversely, a false pattern triggers no action. For instance, awk '0' input_file outputs nothing, no matter how many lines the file input_file has.
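We can quickly verify both behaviors (using printf just to produce a small three-line input for demonstration):
$ printf 'one\ntwo\nthree\n' | awk '42'
one
two
three
$ printf 'one\ntwo\nthree\n' | awk '0'
As expected, the first command prints every line, while the second prints nothing.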
Now, let’s have a look at the command:
- $0 refers to the whole current line in awk.
- When awk reads a line, say “A LINE”, it creates an associative array element a[“A LINE”] with the default value 0.
- So, the expression a[“A LINE”]++ first returns 0, then increments the value by 1.
- Therefore, !a[“A LINE”]++ becomes !0 the first time the line is seen. As we’ve mentioned earlier, !0 is true, so it triggers the default action: print the current line. After the ++ operation, a[“A LINE”] holds the value 1.
- When a duplicate “A LINE” comes again, awk does the same routine: take a[“A LINE”]’s value (1), negate it (!1 is false), and increment a[“A LINE”]’s value (to 2).
- We’ve learned that when the pattern is false, awk prints nothing. Therefore, the duplicate “A LINE” won’t be printed.
In this way, only the first “A LINE” gets printed by the awk command, as all later “A LINE” lines make !a[“A LINE”]++ evaluate to false.
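To see the implicit pattern-action logic spelled out, we can write the one-liner in an expanded form. The following is an equivalent, more verbose version (the same technique, just with the default print action written explicitly):
$ awk '{ if (!a[$0]) print; a[$0]++ }' price_log.txt
This command produces the same output as the compact version above.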
Once we understand how the awk solution works, we can easily adjust the solution to fit new requirements. Next, let’s look at some examples.
4. Adjusting the Solution for New Requirements
We’ve learned we can use the compact one-liner awk '!a[$0]++' input to remove duplicate lines from an input file. Here, $0 means the whole line.
Let’s say we’ve now got a new requirement: in our price_log.txt file, for each product, we’d like to keep only the records with unique prices. In other words, we need to check the combination of Product and Price for duplicates.
Since we understand how the awk one-liner works, we know the key to solving this problem is to use the combination of Product and Price as the associative array’s key:
$ awk -F', ' '!a[$1 FS $2]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
As the command above shows, we set “, ” (a comma followed by a space) as the field separator (FS) and use $1 FS $2 as the key to the associative array a.
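Following the same idea, if we instead wanted to keep only the first record per product, regardless of price (a hypothetical variation of the requirement), we could key the array on $1 alone:
$ awk -F', ' '!a[$1]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Chair, 57, 2019-05-05
Bed, 400, 2020-07-07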
5. Conclusion
In this article, we first discussed when we may want to remove duplicate lines from a file without sorting it. Then, we walked through a compact awk one-liner solution using an example.
Further, we’ve shown how to adjust the awk solution to solve similar problems, for example, when the duplicate check is based on a combination of fields instead of the whole line.