1. Overview
When we work on the command line under Linux, we often need to process text files. In this tutorial, we’ll address different ways to remove the last n lines from an input file.
Also, we’ll discuss the performance of those approaches.
2. Introduction to the Example
First of all, let’s create an input file to understand the problem:
$ cat input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
08 is my line number. Delete me please!
09 is my line number. Delete me please!
10 is my line number. Delete me please!
As the output above shows, our input.txt contains ten lines.
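To follow along, the sample file can be recreated with a few lines of shell. This is a minimal sketch; the helper name make_sample is hypothetical and not part of the original example:

```shell
# Hypothetical helper: write the ten-line sample file used in this tutorial.
# The output path defaults to input.txt.
make_sample() (
    out=${1:-input.txt}
    # Lines 01-07 are the ones we want to keep.
    for i in 01 02 03 04 05 06 07; do
        printf '%s is my line number. Keep me please!\n' "$i"
    done > "$out"
    # Lines 08-10 are the ones we want to delete.
    for i in 08 09 10; do
        printf '%s is my line number. Delete me please!\n' "$i"
    done >> "$out"
)
```

Calling make_sample input.txt writes the file into the current directory.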
Now, suppose we want to remove the last three (n=3) lines from the input.txt file.
In this tutorial, we’ll address solutions to the problem using four techniques:
- Using the head command
- Using the wc and the sed commands
- Using the tac and the sed commands
- Using the awk command
After that, we’ll discuss the performance of the solutions and find out which is the most efficient approach to the problem.
3. Using the head Command
Using the head command, we can print all lines but the last x lines of a file by passing a negative number to the -n option, for instance, -n -x. Negative counts are a GNU extension, so this form may not work with other head implementations.
Therefore, we can use this option to solve our problem in a straightforward way:
$ head -n -3 input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
But the head command prints the result to stdout. We can save the result back to input.txt via a temp file:
$ head -n -3 input.txt > tmp.txt && mv tmp.txt input.txt
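The fixed name tmp.txt could clash with an existing file. As a hedged sketch, the same idea can be wrapped in a small function that uses mktemp instead. The name drop_last_lines is hypothetical, and the code assumes GNU head and mktemp are available:

```shell
# Hypothetical helper: remove the last N lines of FILE in place.
# Assumes GNU head (negative -n counts) and mktemp.
drop_last_lines() (
    n=$1
    file=$2
    # Create the temp file next to the target so mv stays on one filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || exit 1
    if head -n "-$n" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        # Leave the original untouched if head failed.
        rm -f "$tmp"
        exit 1
    fi
)
```

For our example, drop_last_lines 3 input.txt would do the job. Note that the resulting file takes over the permissions of the temporary file, which mktemp creates as readable and writable by the owner only.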
4. Using the wc and sed Commands
Using the sed command and its address range, we can quickly delete lines from a file starting at a given line number until the last line:
sed 'GIVEN_LINE_NO, $ d' input_file
For example, let’s delete from line 5 until the end of our input.txt:
$ sed '5,$ d' input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
However, our problem is to delete the last three lines from the input file. Since our input file has ten lines, the command sed '8,$ d' input.txt will be the solution to the problem.
Thus, the problem turns into how to calculate the line number 8, which is the first line number to be deleted.
Now, it’s time to introduce the wc command. Using the wc command with the -l option, we can easily get the total number of lines (TOTAL) in a file:
$ wc -l input.txt
10 input.txt
Further, we can get the first line number to delete by calculating TOTAL - n + 1. In our example, we have n=3:
$ echo $(( $(wc -l <input.txt)-3+1 ))
8
Let’s take a closer look at the command above:
- wc -l <input.txt: Here we redirect the input.txt file to stdin to skip the filename from the output
- $(wc -l <input.txt): We used a command substitution to capture the TOTAL result
- $(( TOTAL-3+1 )): The arithmetic expansion will evaluate the math expression
Now, let’s assemble the two parts together and try to solve our problem:
$ sed '$(( $(wc -l <input.txt)-3+1 )),$ d' input.txt
sed: -e expression #1, char 2: unknown command: `('
Oops! Why does the sed command complain about the “(”?
This is because bash expansions and command substitutions will not get expanded between single quotes.
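The difference between the two quoting styles can be seen in isolation with a short sketch. The variable names here are only for illustration, and the total of 10 is hard-coded to match our example file:

```shell
n=3
# Single quotes: the shell passes the text through untouched.
single='$(( 10 - n + 1 )),$ d'
# Double quotes: the arithmetic expansion is evaluated first.
# The backslash keeps the $ of the sed address literal.
double="$(( 10 - n + 1 )),\$ d"
printf '%s\n' "$single"   # prints: $(( 10 - n + 1 )),$ d
printf '%s\n' "$double"   # prints: 8,$ d
```

Only the second string is a valid sed script.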
Let’s change the single quotes in our sed command into double quotes and test again:
$ sed "$(( $(wc -l <input.txt)-3+1 )),$ d" input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
Great! Now the problem is solved.
If we are using the popular GNU sed, we can use the -i option to write the change back to the input file:
$ sed -i "$(( $(wc -l <input.txt)-3+1 )),$ d" input.txt
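One edge case is worth guarding against: if n is greater than or equal to the number of lines, the computed start line drops below 1, which is not a valid sed address. A minimal sketch of a wrapper that handles this follows; the function name sed_drop_last is hypothetical:

```shell
# Hypothetical helper: print FILE without its last N lines using wc and sed.
sed_drop_last() (
    n=$1
    file=$2
    total=$(wc -l < "$file")
    first=$(( total - n + 1 ))
    if [ "$first" -lt 1 ]; then
        # N covers the whole file, so there is nothing to print.
        :
    else
        # \$ keeps the last-line address literal inside double quotes.
        sed "${first},\$ d" "$file"
    fi
)
```

Since wc -l counts newline characters, this sketch assumes the file ends with a trailing newline.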
5. Using the tac and the sed Commands
In this section, we’ll still solve the problem using the sed command, but from a different perspective.
We have learned that the difficulty of solving the problem using sed is to calculate the first line number to delete.
However, if we can reverse the order of lines in the input file, the problem will turn into “remove the first n lines from a file.” A straightforward sed one-liner, sed '1,n d', can remove the top n lines. After that, if we reverse the lines again, our problem gets solved.
The tac command can reverse the order of lines in a file. That is, we can try to solve our problem through the command tac INPUT_FILE | sed '1,n d' | tac.
Finally, let’s test if it will work for our example:
$ tac input.txt | sed '1,3 d' | tac
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
Yes! It works. We get the expected result.
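To make n a parameter, the sed script needs double quotes so that the shell expands the variable. The following is a sketch only, with the hypothetical function name tac_drop_last. It also guards n=0, because sed treats a range whose end is not greater than its start as a single line, so '1,0 d' would still delete one line:

```shell
# Hypothetical helper: print FILE without its last N lines using tac and sed.
tac_drop_last() (
    n=$1
    file=$2
    if [ "$n" -le 0 ]; then
        # Nothing to remove; avoid the range 1,0, which matches line 1.
        cat "$file"
    else
        tac "$file" | sed "1,${n} d" | tac
    fi
)
```

For our example, tac_drop_last 3 input.txt prints the first seven lines.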
6. Using the awk Command
The awk command is a powerful text-processing utility. We can let awk go through the input file twice to solve the problem.
In the first pass, it’ll find out the total number of lines in the file, and in the second pass, we print those lines we want to keep:
$ awk -v n=3 'NR==FNR{total=NR;next} FNR==total-n+1{exit} 1' input.txt input.txt
01 is my line number. Keep me please!
02 is my line number. Keep me please!
03 is my line number. Keep me please!
04 is my line number. Keep me please!
05 is my line number. Keep me please!
06 is my line number. Keep me please!
07 is my line number. Keep me please!
As the output above shows, the awk command solved our problem.
Finally, let’s understand how the one-liner works:
- -v n=3: We declared an awk variable n=3
- NR==FNR{total=NR;next}: This is the first pass. In this pass, the awk command saves the current line number to a variable called total. After the first pass, the total variable holds the total number of lines in the input file
- FNR==total-n+1{exit} 1: This is the second pass. If FNR==total-n+1, we have reached the first line that needs to be removed, so we exit. Otherwise, we just print the line. Here, the non-zero number 1 is evaluated as true and triggers the default action of awk: print
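The same two-pass logic can be laid out as a commented multi-line program. This is a sketch under one deliberate change: it compares with FNR > total - n instead of FNR == total-n+1, so that it also prints nothing when n is larger than the number of lines. The function name awk_drop_last is hypothetical:

```shell
# Hypothetical helper: print FILE without its last N lines using two awk passes.
awk_drop_last() {
    awk -v n="$1" '
        # First pass: NR equals FNR only while reading the first copy of the file.
        NR == FNR { total = NR; next }
        # Second pass: stop at the first line that belongs to the last n lines.
        FNR > total - n { exit }
        # Every other line is kept.
        { print }
    ' "$2" "$2"
}
```

For our example, awk_drop_last 3 input.txt gives the same seven lines as the one-liner.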
7. Performance
So far, we’ve learned different ways to solve the problem. Now, let’s discuss their performance.
We’ll create a big input file with 100 million lines and test each solution on it to remove the last 1 million lines:
$ wc -l big.txt
100000000 big.txt
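The commands used to build big.txt aren’t shown here. One possible way to generate a file of that size is sketched below; the function name make_big_file and the line content are assumptions made for illustration:

```shell
# Hypothetical helper: write LINES numbered lines to OUT.
# The line text is an assumption; any content works for the benchmark.
make_big_file() (
    lines=$1
    out=$2
    awk -v n="$lines" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "%d is my line number.\n", i
    }' > "$out"
)
```

For a 100-million-line file, we would call make_big_file 100000000 big.txt, which may take a while and needs a couple of gigabytes of disk space.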
To benchmark their performance, we’ll use the time command:
- The head solution: time head -n -1000000 big.txt > /dev/null
- The wc and sed solution: time sed "$(( $(wc -l <big.txt)-1000000+1 )),$ d" big.txt > /dev/null
- The tac and sed solution: time tac big.txt | sed '1,1000000 d' | tac > /dev/null
- The awk solution: time awk -v n=1000000 'NR==FNR{total=NR;next} FNR==total-n+1{exit} 1' big.txt big.txt > /dev/null
Now, let’s have a look at the test result:
Solution                    real         user         sys
The head solution           0m0.238s     0m0.087s     0m0.150s
The wc and sed solution     0m6.328s     0m6.062s     0m0.254s
The tac and sed solution    0m7.780s     0m8.284s     0m2.239s
The awk solution            0m36.553s    0m36.234s    0m0.297s
As the table shows, the head solution is the fastest. Measured by real time, it’s roughly 27 to 33 times faster than the two sed solutions and about 150 times faster than the awk solution.
This is because the head command only reads the newline characters without doing any pre-processing or holding the text of a line. It seeks until it finds the target line number and dumps the contents into the output.
On the other hand, the sed and the awk commands will read every line of the input file and do some pre-processing. For example, the awk command initializes some internal attributes depending on the given FS and RS, such as fields, NF, records, and so on. Therefore, it adds a lot of overhead that isn’t needed for our problem.
Even though the sed and the awk solutions are much slower than the head solution to solve this problem, it’s still worthwhile to learn them and understand how they work. That’s because they are much more extendable than the head command.
For instance, let’s say we face a new problem: changing every “foo” into “bar” in the last n lines of an input file. Now, the head command cannot solve the problem. However, we can extend the sed or the awk command to solve it.
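As a rough sketch of such an extension, the wc and sed approach can keep its address range and swap the d command for a substitution. The function name replace_in_last is hypothetical:

```shell
# Hypothetical helper: replace foo with bar in the last N lines of FILE only.
replace_in_last() (
    n=$1
    file=$2
    total=$(wc -l < "$file")
    first=$(( total - n + 1 ))
    # If N exceeds the file length, start from the first line.
    [ "$first" -lt 1 ] && first=1
    # Same address range as before, but substitute instead of delete.
    sed "${first},\$ s/foo/bar/g" "$file"
)
```

Lines before the computed start line are printed unchanged.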
8. Conclusion
In this article, we have addressed different ways to remove the last n lines from an input file. After that, we discussed the performance of the solutions.
If we need to solve this problem on a large input file, the head solution will give us the best performance.