1. Overview

awk is a convenient and powerful command-line utility for processing text. Sometimes, we need to read and process multiple input files.

In this tutorial, we’ll learn how to process multiple input files using the awk command.

2. Processing Multiple Files

Sometimes, we want to process a collection of data files and generate some output.

For example, suppose we have three input files containing user scores:

$ head score*.txt
==> score1.txt <==
Tom 20
Jerry 40
Mark 25
Amanda 37

==> score2.txt <==
Mark 75
Tom 70
Jerry 7
Amanda 40

==> score3.txt <==
Mark 73
Amanda 47
Jerry 79
Tom 40

Notice that all files share the same format: each line contains a name and a score, separated by whitespace.

Let’s calculate the sum of scores for each user from the files above:

$ awk '{ sum[$1]+=$2 } END { for(user in sum) print user, sum[user] }' score*.txt
Tom 130
Jerry 126
Mark 173
Amanda 124

In the code above, we created an associative array sum to calculate and store the sum of scores of each user. Finally, in the END block, we printed elements in the array.
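
One detail worth keeping in mind: awk doesn’t specify the traversal order of for (user in sum), so the output lines may come out in a different order on another awk implementation. As a minimal sketch, assuming we want a stable, name-sorted report, we can pipe the result through sort. The scratch directory and the totals variable are only illustrative:

```shell
# Recreate the three score files from the example in a scratch directory.
tmp=$(mktemp -d)
printf 'Tom 20\nJerry 40\nMark 25\nAmanda 37\n' > "$tmp/score1.txt"
printf 'Mark 75\nTom 70\nJerry 7\nAmanda 40\n' > "$tmp/score2.txt"
printf 'Mark 73\nAmanda 47\nJerry 79\nTom 40\n' > "$tmp/score3.txt"

# Sum the scores per user, then sort by name so the order
# no longer depends on how awk walks the array.
totals=$(awk '{ sum[$1] += $2 } END { for (user in sum) print user, sum[user] }' \
    "$tmp"/score*.txt | sort)
printf '%s\n' "$totals"
```

With the sample data, the sums are the same as above, listed as Amanda, Jerry, Mark, Tom.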

When our input files share the same format, we can treat multiple input files as a single merged input. This is a relatively simple situation.

However, in practice, we often need to handle the associations between input files. In the following sections, we’ll see these situations in detail.

3. Processing Two Associated Input Files

In our next example, we’ll show how to process two associated input files using line numbers and awk’s built-in NR and FNR variables.

3.1. Understanding NR and FNR

NR and FNR are two built-in awk variables. NR tells us the total number of records that we’ve read so far, while FNR gives us the number of records we’ve read in the current input file.

Let’s understand the two variables through an example. First, let’s create two files:

$ head file1.txt file2.txt 
==> file1.txt <==
file1-1
file1-2
file1-3
file1-4
file1-5

==> file2.txt <==
file2-1
file2-2
file2-3
file2-4
file2-5

Then we create a simple awk one-liner, which takes the two files above as input and prints lines in each file together with the values of NR and FNR:

$ awk '{ printf "Line:%s, NR:%d, FNR:%d\n", $0, NR, FNR}' file1.txt file2.txt
Line:file1-1, NR:1, FNR:1
Line:file1-2, NR:2, FNR:2
Line:file1-3, NR:3, FNR:3
Line:file1-4, NR:4, FNR:4
Line:file1-5, NR:5, FNR:5
Line:file2-1, NR:6, FNR:1
Line:file2-2, NR:7, FNR:2
Line:file2-3, NR:8, FNR:3
Line:file2-4, NR:9, FNR:4
Line:file2-5, NR:10, FNR:5

The output above shows us:

  • For the first input file, the values of NR and FNR are always the same
  • When awk reads a new input file, the FNR variable will be reset to 1, whereas NR keeps incrementing
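
To make these two observations concrete, here’s a small sketch that uses NR==FNR to label each line as coming from the first file or from a later one. The shortened two-line files and the label variable are assumptions made for illustration:

```shell
# Two short sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'file1-1\nfile1-2\n' > "$tmp/file1.txt"
printf 'file2-1\nfile2-2\n' > "$tmp/file2.txt"

# NR==FNR holds only while awk is still reading the first file.
labels=$(awk '{ label = (NR == FNR) ? "first" : "later"; print label, $0 }' \
    "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$labels"
```

One caveat: if the first file is empty, NR==FNR also holds for the second file, because no record has been read before it.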

In the next section, we’ll see how to use NR and FNR to distinguish between the input files and handle the relations between them.

3.2. Print Lines by Defined Line Numbers

Let’s start with an example.

We prepared two files:

$ head all_lines.txt lines_to_show.txt 
==> all_lines.txt <==
line-01
line-02
line-03
line-04
line-05
line-06
line-07
line-08
line-09
line-10

==> lines_to_show.txt <==
2
3
4
5
7

In the file all_lines.txt, we have ten lines of text, while the file lines_to_show.txt stores line numbers. Now, we want to output a line from the all_lines.txt file only if its line number is defined in the file lines_to_show.txt.

Let’s have a look at the solution, then understand how it works:

$ awk 'NR==FNR { out[$1]=1; next } { if (out[FNR]==1) print $0 }' lines_to_show.txt all_lines.txt 
line-02
line-03
line-04
line-05
line-07

We solved this problem in two steps:

  1. Read the file lines_to_show.txt and save the line numbers in an array.
  2. As we read lines from file all_lines.txt, we print the line if the current line number exists in the array.

Now, let’s take a closer look at the awk code above to understand how it works.

Step 1: NR==FNR{ out[$1]=1; next }

  • awk reads the first line from the first file lines_to_show.txt, which is: 2
  • Both NR and FNR now have the same value 1, so we create an associative array named out and set out[2]=1
  • The next statement will make awk skip the remaining processing and read the next record
  • Because during the processing of the first input file, NR==FNR is always True, after awk processes the file lines_to_show.txt, we have: out[2]=out[3]=out[4]=out[5]=out[7]=1

Step 2: { if (out[FNR]==1) print $0 }

  • When we start processing the second file, all_lines.txt, FNR is reset to 1 while NR keeps incrementing, so FNR and NR have different values and the first block no longer applies
  • In the array out, we don’t have an element out[1], so we don’t print
  • awk reads the next line, line-02; now FNR is 2, and we have out[2]=1, so this line will be printed out by print $0
  • In this way, after awk goes through the second input file, we’ll get the required output

It’s worthwhile to mention that, in awk:

  • A non-zero number evaluates to True, so { if (out[FNR] == 1) print $0 } can be written as { if (out[FNR]) print $0 }
  • A pattern without an action triggers the default action, printing the current record, whenever it evaluates to True, so { if (out[FNR]) print $0 } can be written as just out[FNR]

Therefore, we can write the awk one-liner solution to this problem more compactly:

$ awk 'NR==FNR { out[$1]=1; next } out[FNR]' lines_to_show.txt all_lines.txt
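
A related variant, sketched here as an alternative rather than as part of the solution above, tests membership with the in operator. Merely referencing out[FNR] creates an empty element for every line number we look up, whereas FNR in out only checks whether the key exists. The scratch directory and the selected variable are illustrative:

```shell
# Recreate the two input files in a scratch directory.
tmp=$(mktemp -d)
printf 'line-%02d\n' 1 2 3 4 5 6 7 8 9 10 > "$tmp/all_lines.txt"
printf '%s\n' 2 3 4 5 7 > "$tmp/lines_to_show.txt"

# First file: record each wanted line number as an array key (no value needed).
# Second file: print a line only if its line number is one of those keys.
selected=$(awk 'NR == FNR { out[$1]; next } FNR in out' \
    "$tmp/lines_to_show.txt" "$tmp/all_lines.txt")
printf '%s\n' "$selected"
```

This prints the same five lines as the one-liner above. For large inputs, it also keeps the array from growing with every line of the second file.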

3.3. Join and Calculate

In this section, we’ll see another practical example. As usual, let’s first take a look at the two input files:

$ head price.txt purchasing.txt
==> price.txt <==
Product Price(USD/Kg) Supplier
Apple 3.20 Supplier_X
Orange 3.00 Supplier_Y
Peach 5.35 Supplier_Y
Pear 5.00 Supplier_X
Mango 12.00 Supplier_Y
Pineapple 7.70 Supplier_X

==> purchasing.txt <==
Product Volume(Kg) Date
Orange 120 2020-04-02
Apple 400 2020-04-03
Peach 70 2020-04-05
Pear 50 2020-04-17

We want to generate a cost report containing Product, Date, and a new column, Cost, where Cost = Price * Volume.

Let’s look at the solution first:

$ awk 'BEGIN { print "Product Cost Date" }
       FNR>1 && NR==FNR { price[$1]=$2; next }
       FNR>1 { printf "%s $%.2f %s\n",$1, price[$1]*$2, $3}' price.txt purchasing.txt

Product Cost Date
Orange $360.00 2020-04-02
Apple $1280.00 2020-04-03
Peach $374.50 2020-04-05
Pear $250.00 2020-04-17

Now let’s take a closer look at the code and understand how it works:

  • The BEGIN block prints the header
  • FNR>1 skips the header line from the input file
  • NR==FNR { price[$1]=$2; next } runs for each line of the first input file and stores Product:Price as Key:Value elements in the associative array price
  • When we process the second file, we find the price value from the associative array price, calculate the Cost, and print the output using printf
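
In the sample data, every purchased product has a price. If that weren’t the case, price[$1] would be empty and the cost would silently come out as $0.00. As a hedged sketch, assuming we’d rather flag such rows, we can check the key with the in operator first. The Kiwi row and the N/A placeholder are hypothetical additions for illustration:

```shell
# Shortened inputs; the Kiwi purchase has no matching price on purpose.
tmp=$(mktemp -d)
printf 'Product Price(USD/Kg) Supplier\nApple 3.20 Supplier_X\nOrange 3.00 Supplier_Y\n' > "$tmp/price.txt"
printf 'Product Volume(Kg) Date\nOrange 120 2020-04-02\nKiwi 10 2020-04-09\n' > "$tmp/purchasing.txt"

report=$(awk 'FNR == 1 { next }                      # skip the header of every file
     NR == FNR { price[$1] = $2; next }              # first file: remember prices
     !($1 in price) { printf "%s N/A %s\n", $1, $3; next }   # unknown product
     { printf "%s $%.2f %s\n", $1, price[$1] * $2, $3 }' \
     "$tmp/price.txt" "$tmp/purchasing.txt")
printf '%s\n' "$report"
```

Here, the Orange row is priced as before, while the Kiwi row is reported with N/A instead of a misleading zero cost.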

3.4. Common Pattern for Handling Two Input Files

If we need to handle two input files using awk, we can consider using this typical pattern to solve the problem:

awk 'NR==FNR {
    # read lines from the first input file
    # do calculation and save required value
    # in variables or arrays
    next
}

{
    # process the lines from the second file
    # with the variables or arrays we prepared above
}'  inputFile1 inputFile2

4. Processing More Than Two Associated Input Files

We’ve learned the compact way to handle two input files by comparing the values of FNR and NR.

However, if we have more than two input files, this method alone is no longer enough.

NR==FNR still identifies the first input file, but FNR is reset to 1 every time the input file changes. Therefore, we can’t tell the second file from the third by looking at NR and FNR only.
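
A quick sketch shows the limitation. Assuming three hypothetical two-line files, a.txt, b.txt, and c.txt, we print whether NR==FNR holds, together with FNR, for every line:

```shell
# Three tiny sample files in a scratch directory.
tmp=$(mktemp -d)
printf 'a1\na2\n' > "$tmp/a.txt"
printf 'b1\nb2\n' > "$tmp/b.txt"
printf 'c1\nc2\n' > "$tmp/c.txt"

# "first" tells us whether NR==FNR holds for the current record.
flags=$(awk '{ first = (NR == FNR) ? "yes" : "no"; print "first:" first, "FNR:" FNR, $0 }' \
    "$tmp/a.txt" "$tmp/b.txt" "$tmp/c.txt")
printf '%s\n' "$flags"
```

The lines from b.txt and c.txt carry exactly the same flags, first:no with FNR 1 and 2, so nothing in NR or FNR separates the two files.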

4.1. The FILENAME Variable

FILENAME is a built-in variable that stores the name of the input file the awk command is currently processing:

$ awk '{ print $0 " => " FILENAME}' file1.txt file2.txt file3.txt
file1-1 => file1.txt
file1-2 => file1.txt
file1-3 => file1.txt
file1-4 => file1.txt
file1-5 => file1.txt
file2-1 => file2.txt
file2-2 => file2.txt
file2-3 => file2.txt
file2-4 => file2.txt
file2-5 => file2.txt
file3-1 => file3.txt
file3-2 => file3.txt
file3-3 => file3.txt
file3-4 => file3.txt
file3-5 => file3.txt

We can make use of this variable to distinguish the input files and apply different processing logic.
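
As a small illustration, FILENAME can also serve as an array key. The sketch below counts the records in each input file. The file contents are shortened, and the counts variable and the trailing sort, which is there only to get a stable order, are assumptions of this example:

```shell
# Two sample files of different lengths in a scratch directory.
tmp=$(mktemp -d)
printf 'file1-1\nfile1-2\nfile1-3\n' > "$tmp/file1.txt"
printf 'file2-1\n' > "$tmp/file2.txt"

# Count records per file, keyed by FILENAME; cd happens in a subshell only.
counts=$(cd "$tmp" && awk '{ n[FILENAME]++ } END { for (f in n) print f, n[f] }' \
    file1.txt file2.txt | sort)
printf '%s\n' "$counts"
```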

4.2. Join and Calculate Revisited

In an earlier section, we generated a report on the fruit purchasing cost.

Let’s review the example quickly. We have two input files:

  • price.txt: containing the price and supplier data: Product, Price(USD/Kg), Supplier
  • purchasing.txt: storing the purchasing activities: Product, Volume(Kg), Date

Thanks to our good partnership, the suppliers have agreed to offer us some discounts. So, we’ll add a third file, discount.txt:

$ cat discount.txt
Supplier Discount
Supplier_X 0.10
Supplier_Y 0.20

Let’s generate a new report on purchasing cost from the three input files:

$ awk 'fname != FILENAME { fname = FILENAME; idx++ }
        FNR > 1 && idx == 1 { discount[$1] = $2 }
        FNR > 1 && idx == 2 { price[$1] = $2 * ( 1 - discount[$3] ) }
        FNR > 1 && idx == 3 { printf "%s $%.2f %s\n",$1, price[$1]*$2, $3 }
       ' discount.txt price.txt purchasing.txt
Orange $288.00 2020-04-02
Apple $1152.00 2020-04-03
Peach $299.60 2020-04-05
Pear $225.00 2020-04-17

In the code above, we used FNR>1 to skip the header lines of the input files. Also, we used associative arrays to carry data from one input file to the next.

However, the key to distinguishing between input files is this line of code:

fname != FILENAME{ fname = FILENAME; idx++ }

Now, let’s understand how it works:

  1. We use a variable, fname, to store the current FILENAME, and a variable, idx, to store the index of the current input file. awk doesn’t need declarations: both variables start out empty.
  2. When the current input file changes, fname != FILENAME will be True.
  3. Then we update fname with the new FILENAME and increment the idx variable.
  4. Later, we distinguish the input files by the idx variable and process each input file differently.
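
Another common way to maintain such an index, shown here as an alternative sketch and not as the method used above, is to increment idx whenever FNR is 1, that is, on the first record of each file. Since every input file in this example starts with a header, the same rule can skip it with next. The input files are shortened for illustration:

```shell
# Shortened versions of the three input files in a scratch directory.
tmp=$(mktemp -d)
printf 'Supplier Discount\nSupplier_X 0.10\nSupplier_Y 0.20\n' > "$tmp/discount.txt"
printf 'Product Price(USD/Kg) Supplier\nApple 3.20 Supplier_X\nOrange 3.00 Supplier_Y\n' > "$tmp/price.txt"
printf 'Product Volume(Kg) Date\nOrange 120 2020-04-02\nApple 400 2020-04-03\n' > "$tmp/purchasing.txt"

report=$(awk 'FNR == 1 { idx++; next }   # new file: bump the index and skip its header
     idx == 1 { discount[$1] = $2 }
     idx == 2 { price[$1] = $2 * (1 - discount[$3]) }
     idx == 3 { printf "%s $%.2f %s\n", $1, price[$1] * $2, $3 }' \
     "$tmp/discount.txt" "$tmp/price.txt" "$tmp/purchasing.txt")
printf '%s\n' "$report"
```

Unlike the fname comparison, this variant also advances the index when the same file is passed twice in a row. Both approaches ignore empty input files, because awk never reads a record from them.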

This is one of the common techniques for handling multiple input files.

4.3. Input File Index vs. Filename

We’ve seen that the built-in FILENAME variable stores the name of the current input file. While reading the code in the previous section, we may come up with a question: why do we distinguish between input files by the index of each input instead of comparing the filename directly, as in the example:

FNR > 1 && FILENAME == "discount.txt" {...}
FNR > 1 && FILENAME == "price.txt" {...}
FNR > 1 && FILENAME == "purchasing.txt" {...}

Comparing the FILENAME variable with the filename works for this example, too. However, it has some disadvantages.

Most notably, it brings hardcoded filenames into our awk script. That is, when we change the name of a file, we must update the code, too.

For example, if we change the second file from price.txt to /full/path/to/price.txt, we’d have to change our script.

Sometimes, we have to pass the filename with shell variables, such as "$PWD/price.txt". In this case, we don’t know the exact value of the FILENAME variable.

A workaround is using the regular expression match operator ~ instead of == as in:

FNR > 1 && FILENAME ~ /\/price[.]txt$/ {...}
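
To see the difference between the two comparisons, here’s a minimal sketch that passes price.txt by its full path inside a scratch directory. The exact and regex variables are only illustrative:

```shell
# A one-product price file, referenced by its full path.
tmp=$(mktemp -d)
printf 'Product Price(USD/Kg) Supplier\nApple 3.20 Supplier_X\n' > "$tmp/price.txt"

matches=$(awk 'FNR > 1 {
    exact = (FILENAME == "price.txt") ? "yes" : "no"       # literal comparison
    regex = (FILENAME ~ /\/price[.]txt$/) ? "yes" : "no"   # match on the path suffix
    print "exact:" exact, "regex:" regex
}' "$tmp/price.txt")
printf '%s\n' "$matches"
```

The literal comparison fails because FILENAME holds the whole path, while the regular expression still matches. Note that this pattern expects a slash before price.txt, so it wouldn’t match a bare price.txt argument.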

However, the workaround fails when we feed the awk command a process substitution as an input “file”.

With a process substitution, the shell generates the name of the input automatically, typically a file descriptor path such as /dev/fd/N or /proc/self/fd/N, or a named pipe. The filename is dynamic and doesn’t contain the name of the original file.

Let’s see an example of this case:

$ echo "a dummy line" > dummy.txt
$ awk '{print FILENAME}' dummy.txt <(cat dummy.txt )
dummy.txt
/proc/self/fd/11

Therefore, we prefer to distinguish between input files using the index of an input file over the filename.

5. Conclusion

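In this article, we’ve discussed how to handle multiple input files with the awk command: treating files of the same format as a single merged input, using NR==FNR to process two associated files, and tracking a file index to handle three or more.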

