跨两个文件过滤表格数据

1. Overview

Tabular data refers to data organized in a row and column format. When these tabular data are in a database, SQL syntax makes it a breeze to perform tasks like comparison and filtering. But tabular data doesn’t always exist in the database, it can exist in text files.

In this tutorial, we’ll look at several command-line tools to compare text file tabular data on a Linux terminal.

2. Problem Statement

To visualize the problem, we’ll consider two text files: the staff-info file:

$ cat staff-info
id|name|origin
1|evelyn|canada
2|elizabeth|thailand
3|roberto|usa
4|albert|malaysian
5|eve|spain

and the staff-department file:

$ cat staff-department
id|year_of_service|department
1|7years|engineering
2|7years|engineering
4|3years|hr
9|1year|finance

The text files contain tabular data that are delimited by the pipe character (|). Additionally, the staff-info table contains a mapping between an ID and personal details like name and country of origin. Then, the staff-department table maps the same ID to their duration in the company as well as the department they are in.

Given the files, the problem we’re trying to solve here is to print the rows in the staff-info file if the id value of the row appears in both the staff-info and staff-department files. In SQL lingo, we want to perform an INNER JOIN on the two files based on the id column.

For example, id values of 1, 2, and 4 appear in both staff-info and staff-department files. Therefore we want to print the name and origin value of those rows. However, we don’t want the id value of 3 and 5 from the staff-info file to be printed as it doesn’t exist in the staff-department file. Similarly, the id value of 9 should be omitted as well in the final result.

While this might look trivial for this example, it can be complex when the text file contains a large number of rows if done manually. Fortunately, in Linux, several Linux command-line tools can allow us to perform joining on text file-based tabular data. Let’s look at them in detail.

3. awk

The awk command-line tool is a powerful text-processing command-line utility in Linux. With some awk scripting, we can filter the current file’s line by value from another file.

To print the row in the staff-info if the corresponding value in the first column appears in the staff-department, one possible solution with the awk command is the following script:

$ awk -F'|' 'NR==FNR{id[$1]++;next};id[$1] > 0' staff-department staff-info
id|name|origin
1|evelyn|canada
2|elizabeth|thailand
4|albert|malaysian

In the one-liner above, we’re passing the staff-department and staff-info files as the arguments to the awk command-line tool. Besides that, we specify that the delimiter character is the pipe character using the -F option of the awk command.

Importantly, we use an awk script to conditionally print rows in the second file. Specifically, we only print the rows if the id value has been observed in the first file. Let’s break down the script in detail.

The script consists of two parts. The first part of the script notes down the value of the first column into an array id. Do take note that an array in awk language behaves like a key-value map, in the sense that its index value is not necessarily an integer and can be a string.

Critically, we only run the first part of the code if we’re currently processing the first file. Essentially, what we’re doing is to record all the first column’s values we’re seeing in the first file. To do that, we rely on the NR==FNR comparison logic. The NR variable will always equal the FNR variable when we’re on the first file and false otherwise. For a more detailed explanation regarding the NR and FNR variables in the awk command, check out the write-up in a separate article.

Then, the second part of the code runs when we’re processing the second file. Specifically, for each row of the second file, we return true if the current row’s first column value exists in the id array. By returning true, we’re instructing the awk command to print the current line.

4. join

The join command in Linux is a terminal command-line tool for joining lines of two files based on a common field. The field here is synonymous with the column in the context of our tabular data.

4.1. Joining Files on a Common Column

The join command takes input from two files or input streams. Then, the join command joins the two input streams side by side based on a shared column. For example, we can join the staff-info and staff-department files using the join command:

$ join -t '|' <(cat staff-info) <(cat staff-department)
id|name|origin|year_of_service|department
1|evelyn|canada|7years|engineering
2|elizabeth|thailand|7years|engineering
4|albert|malaysian|3years|hr

By default, the join command joins the two input streams by their first column. We pass the -t option to overwrite the default delimiter of whitespace to pipe character. Then, we use the <() syntax to pipe the content of the staff-info and staff-department files to the join command.

4.2. Ordering the Rows

Importantly, the join command expects the rows of the files to be ordered by the join column. Let’s change the order of the rows of the staff-department file and save it as staff-department-out-of-order:

$ cat staff-department-out-of-order
id|year_of_service|department
1|7years|engineering
4|3years|hr
2|7years|engineering

Then, joining the file using the same one-liner will cause an error:

$ join -t '|' <(cat staff-info) <(cat staff-department-out-of-order)
id|name|origin|year_of_service|department
1|evelyn|canada|7years|engineering
join: /dev/fd/62:4: is not sorted: 2|7years|engineering
4|albert|malaysian|3years|hr

One quick fix is to replace the cat command with the sort command to sort the content before we pipe them to the join command:

$ join -t '|' <(sort staff-info) <(sort staff-department-out-of-order)
1|evelyn|canada|7years|engineering
2|elizabeth|thailand|7years|engineering
4|albert|malaysian|3years|hr
id|name|origin|year_of_service|department

In this example, the header row shows up at the bottom due to the alphabetical ordering. To remove it, we can use the head command on the output to skip the last line:

$ join -t '|' <(sort staff-info) <(sort staff-department-out-of-order) | head
-n -1
1|evelyn|canada|7years|engineering
2|elizabeth|thailand|7years|engineering
4|albert|malaysian|3years|hr

5. The q Tool

The q tool is a command-line tool in Linux that allows us to perform SQL-like operations on delimited text files on a Linux terminal.

5.1. Installation

To obtain the q command, we can install the q-text-as-data package using our package manager. For example, in Ubuntu Linux, we can use the apt-get command to install the package:

$ sudo apt-get install -y q-tools

5.2. Running SQL on Delimited Text File

As a starter, we can use the q tool to perform some SQL operations on our files as if they are on a database. For example, we can quickly select rows in the staff-info file where the id column has a value of 1:

$ q -H -d '|' "SELECT * FROM ./staff-department department WHERE id = 1"
1|7years|engineering

We use the -H option to specify that the first row is the column header row. Then, we specify our delimiter character as the pipe character using the -d option. Finally, we pass the SQL statement that selects all the column of the row that has an id value of 1.

This demonstration is only scratching the surface of what the q command-line tool can do. For more examples, readers are encouraged to read the official documentation.

5.3. Filtering Between Files Using INNER JOIN

Essentially, SQL’s INNER JOIN operation gives us the intersection of the tables (or files in our case) based on the common column.

We can use the INNER JOIN function to merge the staff-department and staff-info tables. Then, we can use the SELECT statement to only prints the columns in the staff-info table:

$ q -H -d '|' "SELECT info.* FROM ./staff-department department INNER JOIN ./s
taff-info info ON (info.id = department.id)"
1|evelyn|canada
2|elizabeth|thailand
4|albert|malaysian

6. Conclusion

In this tutorial, we’ve looked at the problem of filtering text-based tabular data in Linux. Specifically, we want to filter rows of tabular data in one file based on a shared column in another file.

We’ve first explored how we can write an awk script to iterate over the rows in both of the files and print them when there’s a match. Then, we’ve demonstrated joining the files with the join command in Linux, which will only display rows if the shared column value appears in both files.

Finally, we’ve discussed the q tool that allows us to perform SQL functionality on text-based tabular data. Specifically, we’ve shown that using the INNER JOIN functionality of the SQL, we can essentially select rows that have matching shared column values.

Persistence

REST

Security