1. Overview
In this tutorial, we’ll look at how we can merge files together in Linux. We can use almost any filetype, including .txt, .csv, etc.
Most of the solutions involve external packages. To demonstrate how different commands and programs work, we will use .csv files to show the results.
2. Using the csvkit Package
Luckily, Linux offers a package named csvkit, which is dedicated to optimizing the use of .csv files on the command line.
First, let’s install this package. The installation command varies between different Linux distributions:
$ sudo apt install csvkit
Now, we can use the command csvstack to merge two .csv files together. To do that, it’s best to ensure both have the same number of columns with the same titles. Otherwise, the merging can still happen, but sometimes the output may not be of much use.
We’ll create A2.csv and B2.csv to demonstrate:
$ cat A2.csv
A,B
1,2
$ cat B2.csv
A,B
7,9
Once we call the csvstack command on these two files, it will merge them as tables:
$ csvstack A2.csv B2.csv > out2.csv
$ cat out2.csv
A,B
1,2
7,9
If two columns have the same title, this command will simply merge their data from all files into a single column rather than creating a file with duplicated columns.
3. Using the awk Command
Linux also offers a powerful tool named awk, after a pattern scanning and processing language. There are many different versions of awk compilers.
After installing awk, let’s create the following two files as an example:
$ cat A3.csv
A,B
1,2
$ cat B3.csv
A,B
7,9
When merging two .csv files with awk, we can use its built-in variables to guide the process. NR (the current line overall) can lock in the first line of the first file as the initial one. Next, the FNR (the current line of the current file) variable excludes line 1 to prevent duplication of header lines. Both of the conditions must be satisfied at the same time:
$ awk '(NR == 1) || (FNR > 1)' A3.csv B3.csv > out3.csv
$ cat out3.csv
A,B
1,2
7,9
Using this method, we can merge files as long as they share the same columns. Otherwise, we’ll end up with broken data.
4. Using the miller Package
Another tool we can use is the miller package. It’s designed for handling different types of data, including .csv files.
To merge .csv files, we need to include a verb from the miller library. In this case, we’ll use cat. Also, we need to specify the file types so that miller can recognize them.
Let’s create our test data:
$ cat A4.csv
A,B
1,2
$ cat B4.csv
A,B
7,9
Now we merge them via the mlr command:
$ mlr --csv cat A4.csv B4.csv > out4.csv
$ cat out4.csv
A,B
1,2
7,9
When it encounters data files with different category columns, miller is able to manage record heterogeneity.
For example, we’ll create two other files with differing columns:
$ cat A4.csv
A,B
1,2
$ cat C4.csv
A,B,C
7,9,2
Next, we merge them using unsparsify verb, that prints records with the union of field names over all input records:
$ mlr --csv unsparsify A4.csv C4.csv > out4.csv
$ cat out4.csv
A,B,C
1,2,
7,9,2
5. Inferior Alternatives
Besides the packages and tools above, Linux offers commands that have some potential to handle our issues. However, many of them don’t handle .csv files well. In fact, there are multiple potential problems if we’re not careful with our data.
5.1. The cat Command
Often, the cat command comes up when trying to merge files. However, cat only sees files as a series of lines but does not consider format or other structure. The same also applies to our .csv files, as merging them may result in corruption.
For instance, we’ll use these two files:
$ cat A51.csv
A,B
1,2
$ cat B51.csv
A,B
7,9
Simply merging with cat produces the following result:
$ cat A51.csv B51.csv > out51.txt
$ cat out51.txt
A,B
1,2
A,B
7,9
As we can find here, the column headers are not integrated with the data. We find duplicated column titles.
On the other hand, this method still works when we want to merge two data files, with one file holding the column names for both of the files, while the other one holds the properly structured data.
5.2. Merging Files With Extra Spaces
Sometimes we want to merge different .csv files with more data. To do so, we can use the echo command with process substitution.
First, we’ll create a couple of .csv files:
$ cat A52.csv
A,B
1,2
$ cat B52.csv
A,B
7,9
After that, we can merge them, separating them with a new line:
$ cat A52.csv <(echo) B52.csv > out52.txt
$ cat out52.txt
A,B
1,2
A,B
7,9
In this case, we can separate different .csv files with extra spaces between them. Doing so makes the resulting file more readable, similar to miller’s output when columns differ. However, we still can’t merge .csv files properly even when their columns match.
5.3. Using the paste Command
Other than the three methods above, we can also use the paste command to merge .csv files. This command glues the lines of two files horizontally, with a delimiter specified with the -d parameter. This method is handy if we want to combine two .csv files with different columns for the same data.
Let’s take two simple .csv files as an example:
$ cat A53.csv
A,B
1,2
$ cat B53.csv
A,B
7,9
To merge them, we need to specify the delimiter as , (comma):
$ cat A53.csv
$ paste -d ',' A53.csv B53.csv > out53.csv
$ cat out53.csv
A,B,A,B
1,2,7,9
In this case, the paste command only places the two files side by side using the specified delimiter. Therefore, we may get duplicated column titles. Also, paste doesn’t check the row count, which should match between the files.
5.4. Using csvstack With Differing Columns
However, when two .csv files have a difference in the number of columns or column names, the result may not be useful. This method makes sense for files, which have:
- unique data
- an equal number of columns
- the same column names
- the same order of columns
We’ll use the csvstack here to demonstrate this. Consider the following example files with a differing column count:
$ cat A54.csv
A,B
1,2
$ cat C54.csv
A,B,C
7,9,2
Importantly, the number of columns is only adapted from the first file:
$ csvstack C54.csv A54.csv > out54.csv
$ cat out54.csv
A,B,C
7,9,2
1,2
If we have to merge two files of different sizes, it is always better to feed the largest file first to prevent data loss. The csvstack command only uses the columns from the first file.
The rule mentioned above also applies when merging two .csv files with the same number of columns but different column names.
It’s always recommended to check if two data files have the same column name and types, otherwise, errors may occur.
6. Conclusion
Linux offers us many ways to merge different .csv files together. These options can either be built-in commands or third-party tools. We also need to note that the .csv files have their own consistencies, and merging two files incorrectly may cause problems.