1. Overview
When we work on the Linux command line, we often need to process text files. In this tutorial, we’ll address different ways to remove the first line from an input file.
Also, we’ll discuss the performance of those approaches.
2. Introduction to the Example
Let’s first create an example text file to process. We’ll go with a CSV file for our use case since these often contain column names in the first line. If we can remove the first line from the CSV file, it can make later processing easier.
So, let’s create the file books.csv for our example:
$ cat books.csv
ID, BOOK_TITLE, AUTHOR, PRICE($)
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99
In this tutorial, we’ll remove the first line from books.csv using three techniques: the sed command, the awk command, and the tail command.
As the example shows, our books.csv contains only six lines. However, in the real world, we might face much bigger files.
Therefore, after we address the solutions, we’ll discuss the performance of the solutions and find out which is the most efficient approach to the problem.
3. Using the sed Command
sed is a common text processing utility on the Linux command line. Removing the first line from an input file using the sed command is pretty straightforward.
Let’s see how to solve the problem with sed:
$ sed '1d' books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99
The sed command in the example above isn’t hard to understand. The parameter '1d' tells the sed command to apply the 'd' (delete) action on line number '1'.
It’s worth mentioning that if we use GNU sed, we can add the -i (in-place) option to write the change back to the input file instead of printing the result to stdout:
sed -i '1d' books.csv
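BSD/macOS sed also supports -i, but it requires an argument. Passing a backup suffix such as .bak works with both GNU and BSD sed, and keeps a copy of the original file as a bonus. A small sketch, using a hypothetical sample file:

```shell
# Create a small sample file (hypothetical data for illustration)
printf 'HEADER\nrow1\nrow2\n' > sample.csv

# -i.bak works with both GNU sed and BSD/macOS sed;
# the original file is preserved as sample.csv.bak
sed -i.bak '1d' sample.csv

cat sample.csv        # only row1 and row2 remain
```

If we don’t want the backup file, we can delete sample.csv.bak afterward; with GNU sed alone, a plain -i with no suffix also works.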
4. Using the awk Command
awk is another powerful Linux command-line text processing tool. A short awk one-liner can solve our problem:
$ awk 'NR>1' books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99
The awk command above prints each line of the input file whose line number (the built-in variable NR) is greater than 1.
Since version 4.1.0, GNU awk supports the “inplace” extension to emulate the -i (in-place) option of GNU sed:
gawk -i inplace 'NR>1' books.csv
If our awk implementation doesn’t ship with the inplace extension, we can always make the change in place using a temp file:
awk 'NR>1' books.csv > tmp.csv && mv tmp.csv books.csv
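One caveat with the temp-file approach: a hardcoded name like tmp.csv could clobber an existing file. A slightly safer sketch uses mktemp to create a unique temporary file (shown here on a hypothetical sample file):

```shell
# Hypothetical sample file for illustration
printf 'HEADER\nrow1\nrow2\n' > books_demo.csv

# mktemp creates a unique temporary file, avoiding name collisions;
# the mv only runs if the awk step succeeded
tmp=$(mktemp) &&
  awk 'NR>1' books_demo.csv > "$tmp" &&
  mv "$tmp" books_demo.csv

cat books_demo.csv    # the header line is gone
```

Note that mv may lose the original file’s ownership and permissions if the temp file lives on another filesystem, so for files with special permissions, we should check the result.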
5. Using the tail Command
Usually, we use the "tail -n x file" command to get the last x lines from an input file. If we prepend a "+" sign to x, the "tail -n +x file" command will print from the xth line to the end of the file.
Therefore, we can convert our “removing the first line from a file” problem into “get the second line until the end of the file”:
$ tail -n +2 books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99
Similarly, we can write the change back to the input file through a temp file:
tail -n +2 books.csv > tmp.csv && mv tmp.csv books.csv
6. Performance
Our books.csv has only six lines, so all the commands we’ve seen finish almost instantly.
However, in the real world, we usually need to process bigger files. Let’s discuss the performance of our approaches and find the most efficient solution to the problem.
First of all, we’ll create a big input file with 100 million lines:
$ wc -l big.txt
100000000 big.txt
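One simple way to generate such a file is the seq command, which prints one number per line. The sketch below uses 1,000 lines to keep it quick; for the benchmark above, we’d use 100000000 instead:

```shell
# Generate a test file with one number per line
# (1000 lines here; the benchmark used 100 million)
seq 1 1000 > big_demo.txt

wc -l < big_demo.txt    # reports 1000 lines
```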
Then, we’ll test each solution on our big input file to remove the first line.
To benchmark their performance, we’ll use the time command:
- The sed solution: time sed '1d' big.txt > /dev/null
- The awk solution: time awk 'NR>1' big.txt > /dev/null
- The tail solution: time tail -n +2 big.txt > /dev/null
Now, let’s have a look at the result:
Solution    real         user         sys
sed         0m6.630s     0m6.053s     0m0.559s
awk         0m15.799s    0m15.282s    0m0.499s
tail        0m0.582s     0m0.097s     0m0.474s
As the table shows, the tail command is the most efficient solution to the problem. It’s more than 11 times faster than the sed command and roughly 27 times faster than the awk command.
This is because the tail command doesn’t parse lines at all: it only scans the input for newline characters until it reaches the target line, then copies the rest of the file to the output without processing or holding each line’s text.
On the other hand, the sed and awk commands read and process every line of the input file. For example, awk splits each record into fields according to the given FS and RS and updates built-in variables such as NF for every record. This adds a lot of overhead even though none of it is needed for our problem.
Although the sed and awk solutions are much slower than the tail solution for this particular problem, it’s still worthwhile to learn sed and awk because they’re far more powerful and versatile than the tail command.
7. Conclusion
In this article, we’ve addressed different ways to remove the first line from an input file. After that, we discussed the performance of the solutions.
If we need to solve this problem on a large input file, the tail solution will give us the best performance.