1. Overview
When we work on the Linux command line, manipulating text files is a standard operation.
In this tutorial, we’ll address how to remove the lines that appear in file B from another file A.
2. Introduction to the Problem
Let’s understand the problem quickly through an example.
First of all, let’s create a text file called books.txt:
$ cat books.txt
PRIDE AND PREJUDICE - Austen,Jane
AROUND THE WORLD IN 80 DAYS - Verne, Jules
A TALE OF TWO CITIES - Dickens, Charles
LES MISERABLES - Hugo,Victor
ANNA KARENINA - Tolstoy, Leo
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
THE COUNT OF MONTE CRISTO - Dumas, Alexandre, pere
JANE EYRE - Bronte, Charlotte
MADAME BOVARY - Flaubert,Gustave
THE HUNCHBACK OF NOTRE DAME - Hugo, Victor
The books.txt file contains ten lines. We want to remove some of them, and we list the lines to delete in another text file:
$ cat delete_list.txt
MADAME BOVARY - Flaubert,Gustave
THE COUNT OF MONTE CRISTO - Dumas, Alexandre, pere
A TALE OF TWO CITIES - Dickens, Charles
THE HUNCHBACK OF NOTRE DAME - Hugo, Victor
In this tutorial, we’ll address several approaches to solving the problem:
- Using the comm and sort commands
- Using the join and sort commands
- Using the grep command
- Using the awk command
3. Using the comm and sort Commands
We can use the comm command to get the common or unique lines from two input files.
Removing the lines listed in delete_list.txt from books.txt gives the same result as keeping only the lines that are unique to books.txt.
The output of the comm FILE1 FILE2 command has three columns:
- 1 – The lines unique to FILE1
- 2 – The lines unique to FILE2
- 3 – The lines in both FILE1 and FILE2
To suppress a column in the result, we pass its number as an option. For example, comm -12 FILE1 FILE2 suppresses columns 1 and 2, so only column 3, the lines common to both input files, is shown.
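To see the column options in action, here is a minimal sketch on two tiny, already sorted files. The file names a.txt and b.txt and their contents are made up for illustration, and LC_ALL=C is set only so that the ordering comm expects is predictable:

```shell
# Hypothetical sample files, already sorted; not part of the books example.
workdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$workdir/a.txt"
printf '%s\n' banana cherry date > "$workdir/b.txt"

export LC_ALL=C  # assumption: byte-order collation, so comm and the data agree

only_a=$(comm -23 "$workdir/a.txt" "$workdir/b.txt")  # keep column 1 only
only_b=$(comm -13 "$workdir/a.txt" "$workdir/b.txt")  # keep column 2 only
both=$(comm -12 "$workdir/a.txt" "$workdir/b.txt")    # keep column 3 only

echo "only in a.txt: $only_a"
echo "only in b.txt: $only_b"
echo "in both files:"
echo "$both"

rm -rf "$workdir"
```

Here, only_a holds apple, only_b holds date, and both holds banana and cherry.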
In our scenario, since we’re only interested in the first column, we can use the comm command to solve the problem:
$ comm -23 books.txt delete_list.txt
However, the comm command works only with sorted input files. We can use process substitution to pass sorted input files to the comm command:
$ comm -23 <(sort books.txt) <(sort delete_list.txt)
ANNA KARENINA - Tolstoy, Leo
AROUND THE WORLD IN 80 DAYS - Verne, Jules
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
JANE EYRE - Bronte, Charlotte
LES MISERABLES - Hugo,Victor
PRIDE AND PREJUDICE - Austen,Jane
If we check the output above, the lines in delete_list.txt have been removed. So we’ve solved the problem using the comm command.
4. Using the join and sort Commands
The join command joins the lines of two sorted input files on a given field. Further, the command provides the -v FILENUM option to print the lines from file FILENUM that cannot be joined.
In our example, the books.txt file’s un-joinable lines are exactly the result we’re looking for.
Similarly, let’s combine the join and sort commands as we’ve done in the comm solution above:
$ join -v 1 <(sort books.txt) <(sort delete_list.txt)
ANNA KARENINA - Tolstoy, Leo
AROUND THE WORLD IN 80 DAYS - Verne, Jules
join: /proc/self/fd/11:3: is not sorted: A TALE OF TWO CITIES - Dickens, Charles
A TALE OF TWO CITIES - Dickens, Charles
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
JANE EYRE - Bronte, Charlotte
LES MISERABLES - Hugo,Victor
PRIDE AND PREJUDICE - Austen,Jane
join: input is not in sorted order
Oops! The output contains error messages:
join: /proc/self/fd/11:3: is not sorted: A TALE OF TWO CITIES - Dickens, Charles
join: input is not in sorted order
But we did sort the input files in the process substitutions. Why does the join command still complain?
This is a common pitfall of combining the sort and join commands. Let’s understand the cause of the mistake.
If we don’t define any key, the sort command will sort the input files by the entire line. However, by default, the join command will join two input files by the first field. The default field separator is a blank.
When we use the join and the sort commands together, we should make sure both commands are joining and sorting by the same field.
Let’s go back to our problem. In this case, we want to join the two sorted files by the entire line and print the un-joined lines. Therefore, we need to use the -t option to set newline (\n) as the field separator of the join command:
$ join -t $'\n' -v 1 <(sort books.txt) <(sort delete_list.txt)
ANNA KARENINA - Tolstoy, Leo
AROUND THE WORLD IN 80 DAYS - Verne, Jules
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
JANE EYRE - Bronte, Charlotte
LES MISERABLES - Hugo,Victor
PRIDE AND PREJUDICE - Austen,Jane
Great! Our problem is solved.
It’s worth mentioning that when we passed the newline character to the -t option, we used the ANSI-C quoted $'\n' instead of '\n'.
This is because, inside plain single quotes, '\n' is a two-character string (a backslash followed by n) rather than a newline, and join implementations may behave differently when we pass multiple characters. For example, if we pass '\n' directly to the -t option, the widely used GNU join will complain: “join: multi-character tab ‘\\n’.”
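Joining on the first field instead of the whole line doesn’t always produce an error; it can also silently drop the wrong lines. The sketch below illustrates this with two made-up files (left.txt and right.txt, not part of the books example). It assumes GNU join, sets LC_ALL=C so the ordering is predictable, and stores the newline in a variable so that it also runs in shells without ANSI-C quoting:

```shell
# Hypothetical sample files, already sorted in the C locale.
workdir=$(mktemp -d)
printf '%s\n' 'THE ILIAD' 'THE ODYSSEY' > "$workdir/left.txt"
printf '%s\n' 'THE ILIAD' > "$workdir/right.txt"

export LC_ALL=C  # assumption: byte-order collation for join's order check

# Default: join on the first blank-separated field, which is "THE" on every
# line, so both lines of left.txt count as paired and nothing is printed.
by_field=$(join -v 1 "$workdir/left.txt" "$workdir/right.txt")

# Newline as the separator: each line is a single field, so the join key
# is the whole line and only the exact match is treated as paired.
nl='
'
by_line=$(join -t "$nl" -v 1 "$workdir/left.txt" "$workdir/right.txt")

echo "first-field join: [$by_field]"
echo "whole-line join:  [$by_line]"

rm -rf "$workdir"
```

With the default field, THE ODYSSEY is wrongly treated as matched because it shares its first word with THE ILIAD; with the newline separator, it is kept.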
So far, we’ve seen two solutions to our problem: the comm solution and the join solution. Both of them require sorted input files.
Moreover, if we take a closer look at our result, we’ll see that the lines we want to remove are gone. However, the original order of lines in the books.txt file has been changed by the sort operation.
Next, let’s see two other approaches to solve the problem without sorting the input files.
5. Using the grep Command
The grep command is good at searching and matching patterns. If we look at the lines in delete_list.txt as patterns, our problem can be converted into: “find the lines in the books.txt file that don’t match the patterns in the delete_list.txt file.”
Thus, we can solve it using the grep command. Let’s first have a look at the solution and understand the options we used in the command later:
$ grep -Fvxf delete_list.txt books.txt
PRIDE AND PREJUDICE - Austen,Jane
AROUND THE WORLD IN 80 DAYS - Verne, Jules
LES MISERABLES - Hugo,Victor
ANNA KARENINA - Tolstoy, Leo
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
JANE EYRE - Bronte, Charlotte
The output above shows the lines saved in the delete_list.txt file have been removed. Also, the order of lines in the books.txt file is preserved.
We’ve used four options in the above grep command:
- -f delete_list.txt – grep will take patterns from the file delete_list.txt
- -x – We tell grep to consider matches only if the entire line matches
- -v – Here, we do an inverted match since we only want those unmatched lines
- -F – We are going to do a fixed string match instead of the regex match
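To see why -F and -x matter, consider a made-up delete list whose single entry, a.c, contains a regex metacharacter and is also a substring of another line. The file names below are hypothetical and only serve this sketch:

```shell
# Hypothetical sample data; not part of the books example.
workdir=$(mktemp -d)
printf '%s\n' 'a.c' 'abc' 'a.c.bak' > "$workdir/files.txt"
printf '%s\n' 'a.c' > "$workdir/remove.txt"

# Regex and substring match: "." matches any character, so all three lines
# are dropped. grep exits with status 1 when it prints nothing, hence "|| true".
regex_result=$(grep -vf "$workdir/remove.txt" "$workdir/files.txt" || true)

# Fixed string, but still a substring match: a.c.bak is dropped as well.
fixed_result=$(grep -Fvf "$workdir/remove.txt" "$workdir/files.txt")

# Fixed string and whole-line match: only the exact line a.c is dropped.
exact_result=$(grep -Fvxf "$workdir/remove.txt" "$workdir/files.txt")

echo "without -F and -x: [$regex_result]"
echo "with -F only:      [$fixed_result]"
echo "with -F and -x:"
echo "$exact_result"

rm -rf "$workdir"
```

Only the last command keeps both abc and a.c.bak, which is the behavior we want when the delete list contains literal lines rather than patterns.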
6. Using the awk Command
The awk command is a powerful command-line text processing utility. It can work with multiple input files.
Therefore, we can solve the problem using a straightforward awk one-liner:
$ awk 'NR==FNR{del[$0];next} !($0 in del)' delete_list.txt books.txt
PRIDE AND PREJUDICE - Austen,Jane
AROUND THE WORLD IN 80 DAYS - Verne, Jules
LES MISERABLES - Hugo,Victor
ANNA KARENINA - Tolstoy, Leo
CRIME AND PUNISHMENT - Dostoevsky, Fyodor
JANE EYRE - Bronte, Charlotte
Finally, let’s understand how the short command works:
- NR==FNR{del[$0];next} – While awk reads the first file, delete_list.txt, NR equals FNR; we save each of its lines as a key in the associative array del and skip to the next line
- !($0 in del) – When we read the second file, books.txt, we print only the lines that are not keys in the array del
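One edge case is worth knowing: NR==FNR can only tell the two files apart if the first file has at least one line. With an empty delete list, the condition stays true for the second file, so every line lands in del and nothing is printed. A possible variant, sketched here with made-up file names, compares FILENAME with ARGV[1] instead:

```shell
# Hypothetical sample data; not part of the books example.
workdir=$(mktemp -d)
printf '%s\n' 'first line' 'second line' > "$workdir/input.txt"
: > "$workdir/empty_list.txt"   # an empty delete list

# Original idiom: with an empty first file, NR==FNR is true for input.txt too,
# so all of its lines are stored in del and none are printed.
idiom_result=$(awk 'NR==FNR{del[$0];next} !($0 in del)' \
    "$workdir/empty_list.txt" "$workdir/input.txt")

# Variant: identify the delete list by its name rather than by record counters.
variant_result=$(awk 'FILENAME==ARGV[1]{del[$0];next} !($0 in del)' \
    "$workdir/empty_list.txt" "$workdir/input.txt")

echo "NR==FNR idiom:    [$idiom_result]"
echo "FILENAME variant:"
echo "$variant_result"

rm -rf "$workdir"
```

The variant assumes that the two arguments are different paths; if the same file is passed twice, every line is treated as part of the delete list.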
7. Conclusion
In this article, we addressed how to remove the lines that appear in file B from another file A through examples. We discussed four different approaches to solving the problem.
Further, while discussing the join solution, we learned about a common pitfall of using the join and sort commands together and how to fix it.