1. Overview

The split command can break a big file into smaller files, each holding a given number of lines.

However, if the input file contains a header line, we sometimes want the header line to be copied to each split file. By default, the split command is not able to do that.

In this tutorial, we’ll discuss how to solve this problem.

2. Introduction to the Problem

A concrete example can help us to understand the problem quickly.

First, let’s take a look at our input example. The tokyo_medal.tsv file holds the data of the top 10 from the Tokyo Olympics medal table:

$ cat tokyo_medal.tsv
Rank    Country    Gold    Silver    Bronze    Total
1    United States    39    41    33    113
2    China    38    32    18    88
3    Japan    27    14    17    58
4    Great Britain    22    21    22    65
5    ROC    20    28    23    71
6    Australia    17    7    22    46
7    Netherlands    10    12    14    36
8    France    10    12    11    33
9    Germany    10    11    16    37
10    Italy    10    10    20    40

As the output above shows, the file is a TSV file. Its first line is a header that names each column, which is common in TSV and CSV files.

Now, our goal is to split the tokyo_medal.tsv file into pieces. Let’s say we want each piece to have three records. Moreover, each piece must have a header line as well.
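To see why plain split falls short, here is a minimal sketch. It assumes GNU split and uses a small stand-in file, sample.tsv, created in a scratch directory (both names are made up for this illustration). Because split treats the header as an ordinary line, only the first piece receives it:

```shell
# Scratch directory and sample.tsv are hypothetical stand-ins for this demo
workdir=$(mktemp -d)
printf 'Rank\tCountry\n1\tUnited States\n2\tChina\n3\tJapan\n4\tGreat Britain\n' > "$workdir/sample.tsv"

# Plain split: the header counts as one of the three lines in the first piece
split -d -l 3 "$workdir/sample.tsv" "$workdir/plain_"

head -n 1 "$workdir/plain_00"   # prints the header line
head -n 1 "$workdir/plain_01"   # prints the record for Japan, not the header
```

The first piece also holds only two records, since the header uses up one of its three lines.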

In this tutorial, we’ll address three different ways to solve the problem:

  • Using the split command from GNU Coreutils 8.13 or later
  • Writing a simple shell script
  • Using the awk command

Next, let’s see them in action.

3. Using the Newer split Command

The split command is a member of the GNU Coreutils package.

Since version 8.13, the split utility has supported a new --filter=COMMAND option.

We’ll solve the problem using this option. First, we’ll have a look at the command that does the job. Then, we’ll understand why it works.
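Before relying on the option, it may help to check which split version is installed. The following is a sketch only: version_ge is a hypothetical helper, and it assumes GNU sort (for the -V version-sort option) and a GNU split whose --version output ends its first line with the version number.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: the first line looks like "split (GNU coreutils) X.Y"
split_version=$(split --version | head -n 1 | awk '{ print $NF }')

if version_ge "$split_version" 8.13; then
    echo "split $split_version supports --filter"
else
    echo "split $split_version is too old for --filter"
fi
```

On systems without GNU split, the version string will be empty and the check reports the option as unavailable.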

3.1. The Solution

The --filter=COMMAND option lets us pipe each split piece to a shell command instead of writing it straight to a file. In other words, we can post-process the split pieces using the filter command.

Next, let’s see how this option helps us to solve our problem:

$ tail -n +2 tokyo_medal.tsv | split -d -l 3 - --filter='sh -c "{ head -n1 tokyo_medal.tsv; cat; } > $FILE"' part_
$ ls -1 part*
part_00
part_01
part_02
part_03

As we’ve seen in the output above, four files have been created after we execute the command. Now, let’s check the content of the files:

$ head part*
==> part_00 <==
Rank    Country    Gold    Silver    Bronze    Total
1    United States    39    41    33    113
2    China    38    32    18    88
3    Japan    27    14    17    58

==> part_01 <==
Rank    Country    Gold    Silver    Bronze    Total
4    Great Britain    22    21    22    65
5    ROC    20    28    23    71
6    Australia    17    7    22    46

==> part_02 <==
Rank    Country    Gold    Silver    Bronze    Total
7    Netherlands    10    12    14    36
8    France    10    12    11    33
9    Germany    10    11    16    37

==> part_03 <==
Rank    Country    Gold    Silver    Bronze    Total
10    Italy    10    10    20    40

So, we’ve got the expected result: each piece starts with the header line and holds at most three records.
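Beyond eyeballing the pieces, we can check them mechanically. The sketch below defines a hypothetical helper, pieces_match, which rebuilds the original file from the header plus every piece without its header copy, and compares the result with the input. It assumes the pieces sort in the right order by name, which holds for fixed-width numeric suffixes such as part_00, part_01, and so on.

```shell
# Hypothetical check: pieces_match INPUT PREFIX
# Succeeds when header + (each piece minus its header) equals INPUT.
pieces_match() {
    {
        head -n 1 "$1"                 # the header, once
        for piece in "$2"*; do
            tail -n +2 "$piece"        # each piece without its header copy
        done
    } | cmp -s - "$1"
}
```

For the example in this section, we could call it as pieces_match tokyo_medal.tsv part_ and inspect the exit status.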

3.2. How the Command Works

Now, let’s walk through each part of the command and understand how it works:

tail -n +2 tokyo_medal.tsv | split -d -l 3 - --filter='sh -c "{ head -n1 tokyo_medal.tsv; cat; } > $FILE"' part_
  • tail -n +2 tokyo_medal.tsv – The tail command outputs the file starting from line 2, which drops the header line, and we pipe the data records to the next command
  • … | split -d -l 3 - --filter='…' part_ – Let’s skip the --filter='…' part first. The split command reads the data records from stdin (-) and splits them every three lines (-l 3). The -d option tells split to use numeric suffixes in the generated filenames
  • --filter='sh -c "{ head -n1 tokyo_medal.tsv; cat; } > $FILE"' – The command in the --filter option post-processes each piece. The split command sets $FILE to the name of the piece it would otherwise write (part_00, part_01, and so on) and feeds the piece to the command’s stdin. Our command group first prints the header line with head, then copies the piece with cat, and the redirection writes both to $FILE
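Since split already runs the filter through a shell, the extra sh -c wrapper is not strictly required. The following sketch shows one possible simplification. The function name split_keep_header and the HEADER variable are made up for this illustration; the idea is to read the header once, export it, and let the filter print it, which also avoids re-reading the input file for every piece.

```shell
# Hypothetical helper: split_keep_header INPUT LINES PREFIX
split_keep_header() {
    input=$1
    lines=$2
    prefix=$3

    # Read the header once and export it so the filter's shell can see it
    HEADER=$(head -n 1 "$input")
    export HEADER

    # split runs the filter in a shell with FILE set to the piece's name;
    # the single quotes delay expansion of $HEADER and $FILE until then
    tail -n +2 "$input" |
        split -d -l "$lines" --filter='{ printf "%s\n" "$HEADER"; cat; } > "$FILE"' - "$prefix"
}
```

Calling split_keep_header tokyo_medal.tsv 3 part_ should produce the same four pieces as the command above. Quoting "$FILE" inside the filter keeps the redirection intact if the prefix contains spaces.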

However, if the version of the Coreutils package on our system is older than 8.13, we need to solve the problem in different ways. So, we’ll now turn our attention to some other approaches.

4. Writing a Simple Shell Script

Even though the older split command cannot solve the problem on its own, we can wrap it with a shell script to handle the header line.

4.1. Solving the Problem

Simply put, we can solve the problem in two steps:

  • Step 1: Splitting the input file without the header line
  • Step 2: Adding the header line to each split file

Following this idea, we can build a script:

#!/bin/bash
INPUT=tokyo_medal.tsv

# Step 1: split the input file without the header line
tail -n +2 "$INPUT" | split -d -l 3 - sh_part_

# Step 2: add the header line to each split file
for file in sh_part_*
do
    head -n 1 "$INPUT" > with_header_tmp
    cat "$file" >> with_header_tmp
    mv -f with_header_tmp "$file"
done

As the script shows, in step 2 we create a temp file, with_header_tmp, write the header line to it, append the split piece, and then move the temp file over the original piece.

Note that the argument handling is skipped in this example script. For example, the input file and split options are hardcoded in the script.

That’s because this tutorial focuses on the file splitting implementation. In a real-world script, however, we should add argument processing to make it reusable.
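As a rough idea of what such argument processing could look like, here is a sketch of a reusable variant written as a function. The name split_with_header, the argument order, and the defaults (3 lines, the sh_part_ prefix) are assumptions made for this example. It also swaps the fixed temp filename for one created by mktemp, so parallel runs would not overwrite each other's temp file.

```shell
# Hypothetical reusable variant: split_with_header INPUT [LINES] [PREFIX]
split_with_header() {
    if [ "$#" -lt 1 ] || [ ! -r "$1" ]; then
        echo "usage: split_with_header INPUT [LINES] [PREFIX]" >&2
        return 1
    fi
    input=$1
    lines=${2:-3}            # assumed default: 3 records per piece
    prefix=${3:-sh_part_}    # assumed default prefix

    # Step 1: split the records without the header line
    tail -n +2 "$input" | split -d -l "$lines" - "$prefix"

    # Step 2: prepend the header to each piece via a unique temp file
    tmp=$(mktemp) || return 1
    for file in "$prefix"*
    do
        [ -f "$file" ] || continue   # no pieces were produced
        { head -n 1 "$input"; cat "$file"; } > "$tmp" && cat "$tmp" > "$file"
    done
    rm -f "$tmp"
}
```

In a standalone script, the last line could simply be split_with_header "$@". Like the original script, this sketch rewrites every file that matches the prefix, so the prefix should not match unrelated files.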

4.2. Test the Script

Now, let’s name our script split_with_header.sh and test if it works as we expected:

$ ./split_with_header.sh

$ head sh_part_*

==> sh_part_00 <==
Rank    Country    Gold    Silver    Bronze    Total
1    United States    39    41    33    113
2    China    38    32    18    88
3    Japan    27    14    17    58

==> sh_part_01 <==
Rank    Country    Gold    Silver    Bronze    Total
4    Great Britain    22    21    22    65
5    ROC    20    28    23    71
6    Australia    17    7    22    46

==> sh_part_02 <==
Rank    Country    Gold    Silver    Bronze    Total
7    Netherlands    10    12    14    36
8    France    10    12    11    33
9    Germany    10    11    16    37

==> sh_part_03 <==
Rank    Country    Gold    Silver    Bronze    Total
10    Italy    10    10    20    40

Great! Our script works.

Usually, when we’re facing file splitting problems, the split command will come up first. But, actually, other Linux commands can do this kind of file splitting task as well.

Next, let’s solve the problem using the awk command.

5. Using the awk Command

awk is a powerful text-processing tool with its own C-like scripting language. It can solve this problem without calling any external command.

5.1. The awk Solution

First, let’s look at how awk solves the problem:

$ awk -v lines="3" -v pre="awk_part_" '
        NR==1 { header=$0; next}
        (NR-1) % lines ==1 { fname=pre c++; print header > fname}
        {print > fname}' tokyo_medal.tsv

$ head awk_part_*
==> awk_part_0 <==
Rank    Country    Gold    Silver    Bronze    Total
1    United States    39    41    33    113
2    China    38    32    18    88
3    Japan    27    14    17    58

==> awk_part_1 <==
Rank    Country    Gold    Silver    Bronze    Total
4    Great Britain    22    21    22    65
5    ROC    20    28    23    71
6    Australia    17    7    22    46

==> awk_part_2 <==
Rank    Country    Gold    Silver    Bronze    Total
7    Netherlands    10    12    14    36
8    France    10    12    11    33
9    Germany    10    11    16    37

==> awk_part_3 <==
Rank    Country    Gold    Silver    Bronze    Total
10    Italy    10    10    20    40

As the output above shows, the input file has been split with the header line as we expected.

5.2. How the awk Command Works

Now, let’s walk through the awk command and understand how it works:

  • awk -v lines="3" -v pre="awk_part_" – First, we declare two awk variables: lines, the number of records in each split file, and pre, the prefix of the filenames
  • NR==1 { header=$0; next} – When awk reads the first line, it stores the line in the header variable, and next skips the remaining rules for that line
  • (NR-1) % lines ==1 { fname=pre c++; print header > fname} – When the current line is the first record of a chunk, we build a new filename (fname) from the prefix and a counter (c), which we then increment. Also, since this is a new file, we write the value of the header variable to the file as the first line
  • {print > fname}' tokyo_medal.tsv – Finally, we write each record line to the current fname file

In this way, awk reads through the input file only once and solves the problem.
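For larger inputs, a slightly more defensive variant may be worth considering. The sketch below wraps the awk program in a hypothetical function, awk_split_keep_header, and makes three changes that are not part of the command above: it closes each finished piece, so awk does not keep every output file open; it zero-pads the suffix with sprintf; and it tests (NR - 2) % lines == 0, because the test (NR-1) % lines == 1 can never be true when lines is 1.

```shell
# Hypothetical helper: awk_split_keep_header INPUT LINES PREFIX
awk_split_keep_header() {
    awk -v lines="$2" -v pre="$3" '
        NR == 1 { header = $0; next }
        # (NR - 2) % lines == 0 marks the first record of each chunk,
        # and also holds when lines is 1
        (NR - 2) % lines == 0 {
            if (fname != "") close(fname)        # release the finished piece
            fname = sprintf("%s%02d", pre, c++)  # zero-padded suffix: 00, 01, ...
            print header > fname
        }
        { print > fname }' "$1"
}
```

With this sketch, awk_split_keep_header tokyo_medal.tsv 3 awk_part_ would name the pieces awk_part_00 through awk_part_03 rather than awk_part_0 through awk_part_3.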

6. Conclusion

In this article, we’ve learned how to split an input file while keeping its header line in every split file.

If our system’s Coreutils version is 8.13 or later, we can use the split command’s new --filter=COMMAND option to achieve our goal.

Otherwise, we can still write a simple bash script to solve the problem in two steps: splitting the file without the header line and adding the header line to each split file.

Also, we’ve seen an example of how we can use the powerful awk command to do the job.