How to Compare Two Files With the Same Content but on Different Lines

1. Introduction

There are many tools we can use to compare two files. However, they usually compare the files line by line. If we want to check whether two files contain the same data in any order, we need to adapt the way we use these tools.

In this tutorial, we’ll compare two files with the same content but on different lines and explore methods for comparing files in plain text, XML, and JSON formats.

All examples have been tested on Debian 12 (Bookworm) with diff 3.8, jq 1.6, xsltproc 1.1.35, and libxml2-utils 2.9.14 for xmllint.

2. Scope of the Solutions

We’ll be discussing three file formats: plain text, XML, and JSON.

For the plain text files, we’ll sort the files and then compare them.

For the XML files, we’ll use XSLT templates to sort the XML attributes, elements, and their children. Furthermore, we’ll also remove the white spaces and adjust indentations. Then, we’ll compare them.

For the JSON files, we’ll sort the JSON objects by key and the book array by author value. We’ll also remove the white spaces and adjust the indentations. Finally, we’ll compare them.

3. Sample Files

Let’s create three sample files for each file format.

The first two files will have the same content, but the second file will have the content arranged differently, such as on different lines, elements, or properties.

The third file is a copy of the second file but with additional data.

3.1. Plain Text Files

Here, we’re creating three plain text files, a.txt, b.txt, and c.txt:

$ mkdir compare && cd compare
$ cat > a.txt << EOF
This is line 1 of a plain text file.
This is line 2 of a plain text file.
This is line 3 of a plain text file.
This is line 4 of a plain text file.
This is line 5 of a plain text file.
EOF
$ cat > b.txt << EOF
This is line 2 of a plain text file.
This is line 5 of a plain text file.
This is line 3 of a plain text file.
This is line 4 of a plain text file.
This is line 1 of a plain text file.
EOF
$ cat > c.txt << EOF
This is line 2 of a plain text file.
This is line 5 of a plain text file.
This is line 3 of a plain text file.
This is line 4 of a plain text file.
This is line 1 of a plain text file.
This is line 6 of a plain text file.
EOF

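We created a directory using the mkdir command. Next, we used the cat command with a here-document to write each file: the shell feeds cat every line up to the delimiter (EOF), and the redirection operator (>) stores that text in the file, replacing any existing content.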
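As a side note, the here-document delimiter can be quoted. The following is a minimal sketch, assuming a scratch directory from mktemp and two made-up file names (plain.txt and literal.txt). It shows that an unquoted delimiter lets the shell expand variables, while a quoted one keeps the text literal, which matters once file content contains $ characters:

```shell
# Work in a throwaway directory so the sample files stay untouched
workdir=$(mktemp -d)
name="world"

# Unquoted delimiter: the shell expands $name before cat reads the text
cat > "$workdir/plain.txt" << EOF
Hello, $name
EOF

# Quoted delimiter: the text is written exactly as typed
cat > "$workdir/literal.txt" << 'EOF'
Hello, $name
EOF

cat "$workdir/plain.txt" "$workdir/literal.txt"
```

The sample files in this tutorial contain no $ characters, so the unquoted form works for them as written.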

Both a.txt and b.txt have the same content, but they’re not on the same lines. Meanwhile, c.txt has an extra line compared to the other two.

3.2. XML Files

Next, we create three simple XML files – a.xml, b.xml, and c.xml:

$ cat > a.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="CHILDREN" format="PAPERBACK">
        <title lang="en">The Amulet of Samarkand</title>
        <author>Jonathan Stroud</author>
        <year>2003</year>
        <price>8.49</price>
    </book>
    <book category="POLITICS">
        <title lang="en">The Anatomy of the State</title>
        <author>Murray N. Rothbard</author>
        <year>1974</year>
        <price>Free</price>
    </book>
</bookstore>
EOF
$ cat > b.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="POLITICS">
        <title lang="en">The Anatomy of the State</title>
        <author>Murray N. Rothbard</author>
        <year>1974</year>
        <price>Free</price>
    </book>
    <book format="PAPERBACK" category="CHILDREN">
        <title lang="en">The Amulet of Samarkand</title>
        <author>Jonathan Stroud</author>
        <year>2003</year>
        <price>8.49</price>
    </book>
</bookstore>
EOF
$ cat > c.xml << EOF
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="POLITICS">
        <title lang="en">The Anatomy of the State</title>
        <author>Murray N. Rothbard</author>
        <year>1974</year>
        <price>Free</price>
    </book>
    <book format="PAPERBACK" category="CHILDREN">
        <title lang="en">The Amulet of Samarkand</title>
        <author>Jonathan Stroud</author>
        <year>2003</year>
        <price>8.49</price>
    </book>
    <book category="ECONOMICS">
        <title lang="en">The Richest Man in Babylon</title>
        <author>George S. Clason</author>
        <year>1926</year>
        <price>29.99</price>
    </book>
</bookstore>
EOF

Both a.xml and b.xml contain the same root element, bookstore, with the same book elements. However, the book elements appear in a different order, and the attributes of the CHILDREN book are swapped in b.xml. Meanwhile, c.xml has an extra book element not contained in the others.

3.3. JSON Files

Finally, let’s create three simple JSON files – a.json, b.json, and c.json:

$ cat > a.json << EOF
{
    "bookstore": {
        "book": [
            {
                "category": "CHILDREN",
                "format": "PAPERBACK",
                "title": "The Amulet of Samarkand",
                "author": "Jonathan Stroud",
                "year": 2003,
                "price": 8.49
            },
            {
                "category": "POLITICS",
                "title": "The Anatomy of the State",
                "author": "Murray N. Rothbard",
                "year": 1974,
                "price": "Free"
            }
        ]
    }
}
EOF
$ cat > b.json << EOF
{
    "bookstore": {
        "book": [
            {
                "category": "POLITICS",
                "title": "The Anatomy of the State",
                "author": "Murray N. Rothbard",
                "year": 1974,
                "price": "Free"
            },
            {
                "category": "CHILDREN",
                "format": "PAPERBACK",
                "title": "The Amulet of Samarkand",
                "author": "Jonathan Stroud",
                "year": 2003,
                "price": 8.49
            }
        ]
    }
}
EOF
$ cat > c.json << EOF
{
    "bookstore": {
        "book": [
            {
                "category": "POLITICS",
                "title": "The Anatomy of the State",
                "author": "Murray N. Rothbard",
                "year": 1974,
                "price": "Free"
            },
            {
                "category": "CHILDREN",
                "format": "PAPERBACK",
                "title": "The Amulet of Samarkand",
                "author": "Jonathan Stroud",
                "year": 2003,
                "price": 8.49
            },
            {
                "category": "ECONOMICS",
                "title": "The Richest Man in Babylon",
                "author": "George S. Clason",
                "year": 1926,
                "price": 29.99
            }
        ]
    }
}
EOF

As with the XML example, a.json and b.json contain the same root element – bookstore, and within that element, they each have a book array. However, the order of the elements within the book array differs between the two. Additionally, c.json has an extra element in the book array.

4. Comparing Plain Text Files

Let’s compare the plain text files.

4.1. Comparing the First and the Second Files

We start with the first two files, a.txt and b.txt.

When we compare them with the diff command, it prints the differences:

$ diff a.txt b.txt 
1d0
< This is line 1 of a plain text file.
2a2
> This is line 5 of a plain text file.
5c5
< This is line 5 of a plain text file.
---
> This is line 1 of a plain text file.

So, let’s sort the files first with the sort command:

$ sort a.txt > a-sorted.txt
$ sort b.txt > b-sorted.txt
$ cat a-sorted.txt 
This is line 1 of a plain text file.
This is line 2 of a plain text file.
This is line 3 of a plain text file.
This is line 4 of a plain text file.
This is line 5 of a plain text file.
$ cat b-sorted.txt 
This is line 1 of a plain text file.
This is line 2 of a plain text file.
This is line 3 of a plain text file.
This is line 4 of a plain text file.
This is line 5 of a plain text file.

We sorted the data in both files and stored the sorted data in separate files using the redirection operator (>).

Next, we compare the sorted files with diff:

$ diff a-sorted.txt b-sorted.txt

The diff command shouldn’t print anything now as both files are identical.

We can avoid creating temporary files for the sorted files:

$ diff <(sort a.txt) <(sort b.txt)

The <(…) syntax is Bash’s process substitution, which lets the output of a command be read as if it were a file. In this case, we used two process substitutions to pass the outputs of the sort commands to the diff command.
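Process substitution isn’t available in every shell (dash, for example, doesn’t support it). As a minimal sketch for such cases, the hypothetical helper same_lines below performs the same comparison with temporary files from mktemp and removes them afterwards. The demo files x.txt and y.txt are made up for illustration:

```shell
# same_lines FILE1 FILE2
# Returns 0 when both files contain the same lines in any order.
# same_lines is a hypothetical helper name, not a standard command.
same_lines() {
    left=$(mktemp) || return 2
    right=$(mktemp) || { rm -f "$left"; return 2; }
    # LC_ALL=C makes the sort order independent of the current locale
    LC_ALL=C sort "$1" > "$left"
    LC_ALL=C sort "$2" > "$right"
    diff "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return "$status"
}

# Small demo with two files that hold the same lines in a different order
demo=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$demo/x.txt"
printf 'three\none\ntwo\n' > "$demo/y.txt"
same_lines "$demo/x.txt" "$demo/y.txt" && echo "same content"
```

Because the function returns diff’s exit status, it can be used directly in an if statement or with && in a script.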

4.2. Comparing the First and the Third Files

Next, let’s compare the first and the third files with the same method:

$ diff <(sort a.txt) <(sort c.txt)
5a6
> This is line 6 of a plain text file.

The diff command printed the only line that differs, line 6, which exists only in the right (>) input, corresponding to the c.txt file.
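When we only need to know which lines are unique to each file, the comm command is an alternative to reading diff output. Here’s a minimal sketch, assuming small stand-in files (old.txt and new.txt are hypothetical names) rather than the samples above:

```shell
work=$(mktemp -d)
printf 'line 2\nline 1\nline 3\n' > "$work/old.txt"
printf 'line 3\nline 4\nline 1\nline 2\n' > "$work/new.txt"

# comm requires sorted input; use the same locale for sort and comm
LC_ALL=C sort "$work/old.txt" > "$work/old.sorted"
LC_ALL=C sort "$work/new.txt" > "$work/new.sorted"

# -1 hides lines found only in the first file and -3 hides common lines,
# so only the lines unique to the second file remain
only_in_new=$(LC_ALL=C comm -13 "$work/old.sorted" "$work/new.sorted")
# -2 hides lines found only in the second file instead
only_in_old=$(LC_ALL=C comm -23 "$work/old.sorted" "$work/new.sorted")

echo "only in new: $only_in_new"
echo "only in old: $only_in_old"
```

In this sketch, only_in_new holds the single extra line and only_in_old stays empty, which mirrors the relationship between a.txt and c.txt.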

5. Comparing XML Files

Before comparing the XML files, we need to sort them first using XSLT templates with the xsltproc command.

After that, we format the sorted files with the xmllint command to remove unnecessary white spaces and adjust indentations.

Finally, we compare them with diff.

5.1. Installation

To begin, let’s install the tools that we’re going to use: xsltproc and xmllint.

The xsltproc package is available in most operating system repositories:

$ sudo apt install xsltproc

xmllint is part of the libxml2-utils package, so let’s install that too:

$ sudo apt install libxml2-utils
$ xmllint --version
xmllint: using libxml version 20914

If the installation was successful, we should be able to run the xmllint command.

5.2. Creating the XSLT Templates

Let’s create XSLT templates for our XML files and store them in bookstore.xslt:

$ cat > bookstore.xslt << EOF
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Identity transform template -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*">
                <xsl:sort select="name()"/> <!-- Sort attributes by name -->
            </xsl:apply-templates>
            <xsl:apply-templates select="node()">
                <xsl:sort select="name()"/> <!-- Sort child elements by name -->
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
    <!-- Template to match the bookstore element -->
    <xsl:template match="bookstore">
        <xsl:copy>
            <xsl:apply-templates select="book">
                <xsl:sort select="author"/> <!-- Sort book elements by author -->
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
EOF

In this XSLT file, we created two templates.

The first one is where we copy the whole XML data (<xsl:template match="@*|node()">) and sort the attributes and child elements by name.

The second one is where we create a rule for sorting the book elements by the author element value (<xsl:template match="bookstore">).

5.3. Comparing the First and the Second Files

If we compare the first and the second files using diff as-is, it shows the differences even though both files have the same content:

$ diff a.xml b.xml
1d0
< <?xml version="1.0" encoding="UTF-8"?>
3,14c2,13
<     <book category="CHILDREN" format="PAPERBACK">
<         <title lang="en">The Amulet of Samarkand</title>
<         <author>Jonathan Stroud</author>
<         <year>2003</year>
<         <price>8.49</price>
<     </book>
<     <book category="POLITICS">
<         <title lang="en">The Anatomy of the State</title>
<         <author>Murray N. Rothbard</author>
<         <year>1974</year>
<         <price>Free</price>
<     </book>
---
>   <book category="POLITICS">
>     <title lang="en">The Anatomy of the State</title>
>     <author>Murray N. Rothbard</author>
>     <year>1974</year>
>     <price>Free</price>
>   </book>
>   <book format="PAPERBACK" category="CHILDREN">
>     <title lang="en">The Amulet of Samarkand</title>
>     <author>Jonathan Stroud</author>
>     <year>2003</year>
>     <price>8.49</price>
>   </book>

The diff command reported nearly every line as different, even though both files hold the same content in a different order.

Therefore, let’s sort the book elements first with the XSLT templates using the xsltproc command:

$ xsltproc -o a-sorted.xml bookstore.xslt a.xml
$ xsltproc -o b-sorted.xml bookstore.xslt b.xml
$ cat a-sorted.xml
<?xml version="1.0"?>
<bookstore><book category="CHILDREN" format="PAPERBACK">
...
    <author>Jonathan Stroud</author><price>8.49</price><title lang="en">The Amulet of Samarkand</title><year>2003</year></book><book category="POLITICS">
...
    <author>Murray N. Rothbard</author><price>Free</price><title lang="en">The Anatomy of the State</title><year>1974</year></book></bookstore>
$ cat b-sorted.xml
<?xml version="1.0"?>
<bookstore><book category="CHILDREN" format="PAPERBACK">
...
  <author>Jonathan Stroud</author><price>8.49</price><title lang="en">The Amulet of Samarkand</title><year>2003</year></book><book category="POLITICS">
...
  <author>Murray N. Rothbard</author><price>Free</price><title lang="en">The Anatomy of the State</title><year>1974</year></book></bookstore>

The xsltproc commands above applied the templates in bookstore.xslt to a.xml and b.xml, generating two output files (-o), namely a-sorted.xml and b-sorted.xml.

However, the content of the sorted XML files appears to have additional white spaces and inconsistent indentations.

Therefore, let’s format the sorted files using the xmllint command to remove the unnecessary white spaces and adjust the indentations:

$ xmllint --format a-sorted.xml > a-formatted.xml
$ xmllint --format b-sorted.xml > b-formatted.xml
$ cat a-formatted.xml 
<?xml version="1.0"?>
<bookstore>
  <book category="CHILDREN" format="PAPERBACK">
    <author>Jonathan Stroud</author>
    <price>8.49</price>
    <title lang="en">The Amulet of Samarkand</title>
    <year>2003</year>
  </book>
...
$ cat b-formatted.xml 
<?xml version="1.0"?>
<bookstore>
  <book category="CHILDREN" format="PAPERBACK">
    <author>Jonathan Stroud</author>
    <price>8.49</price>
    <title lang="en">The Amulet of Samarkand</title>
    <year>2003</year>
  </book>
...

As we can see from the output of the cat commands above, both formatted files now have identical content.

Finally, we compare them with the diff command:

$ diff a-formatted.xml b-formatted.xml

Since there’s no difference between the two, the diff command didn’t print any output.

Using process substitution, we can avoid creating temporary files for the formatted files:

$ diff <(xmllint --format a-sorted.xml) <(xmllint --format b-sorted.xml)

The diff command above should print no output as well because both sorted files are now identical after we formatted them with xmllint.
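The sorting and formatting steps can also be chained so that no intermediate files remain. The sketch below assumes that bookstore.xslt from Section 5.2 is in the current directory; normalize_xml and xml_same are hypothetical helper names:

```shell
# normalize_xml FILE: sort with the stylesheet, then re-indent.
# xsltproc writes to stdout when -o is omitted, and "-" tells
# xmllint to read the document from stdin.
normalize_xml() {
    xsltproc bookstore.xslt "$1" | xmllint --format -
}

# xml_same FILE1 FILE2: returns 0 when the normalized documents match
xml_same() {
    first=$(mktemp) || return 2
    second=$(mktemp) || { rm -f "$first"; return 2; }
    normalize_xml "$1" > "$first"
    normalize_xml "$2" > "$second"
    diff "$first" "$second"
    status=$?
    rm -f "$first" "$second"
    return "$status"
}
```

With these helpers, xml_same a.xml b.xml should print nothing and return 0, while xml_same a.xml c.xml should print the extra book element and return 1.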

5.4. Comparing the First and the Third Files

Now, let’s compare the first and the third files:

$ xsltproc -o c-sorted.xml bookstore.xslt c.xml
$ xmllint --format c-sorted.xml > c-formatted.xml
$ diff a-formatted.xml c-formatted.xml
2a3,8
>   <book category="ECONOMICS">
>     <author>George S. Clason</author>
>     <price>29.99</price>
>     <title lang="en">The Richest Man in Babylon</title>
>     <year>1926</year>
>   </book>

The diff command printed a list of differences between the two files. In this case, the differences are all in the right (>) input, corresponding to the c-formatted.xml file.

6. Comparing JSON Files

Let’s now compare the JSON files. We’re going to use the jq command to remove white spaces, adjust indentations, and sort the JSON by key and author value. Then, we compare them with diff.

6.1. Installation

To start, let’s install jq, which is available in most operating system repositories:

$ sudo apt install jq
$ jq
jq - commandline JSON processor [version 1.6]

Once we’ve installed jq successfully, we should be able to run the command and see its version.

6.2. Comparing the First and the Second Files

Let’s compare the first and the second files as-is:

$ diff a.json b.json 
4,12d3
<             {   
<                 "category": "CHILDREN",
<                 "format": "PAPERBACK",
<                 "title": "The Amulet of Samarkand",
<                 "author": "Jonathan Stroud",
<                 "year": 2003,
<                 "price": 8.49
<             },
< 
18a10,17
>             },
>             {
>                 "category": "CHILDREN",
>                 "format": "PAPERBACK",
>                 "title": "The Amulet of Samarkand",
>                 "author": "Jonathan Stroud",
>                 "year": 2003,
>                 "price": 8.49

The diff command printed out the differences between the two, which showed that there’s a list of differences in the left (<) and the right (>) inputs, corresponding to a.json and b.json respectively.

Now, let’s sort the files first before we compare them:

$ jq --sort-keys '.bookstore.book |= sort_by(.author)' a.json > a-sorted.json
$ jq --sort-keys '.bookstore.book |= sort_by(.author)' b.json > b-sorted.json
$ cat a-sorted.json 
{
  "bookstore": {
    "book": [
      {
        "author": "Jonathan Stroud",
        "category": "CHILDREN",
        "format": "PAPERBACK",
        "price": 8.49,
        "title": "The Amulet of Samarkand",
        "year": 2003
      },
...
$ cat b-sorted.json 
{
  "bookstore": {
    "book": [
      {
        "author": "Jonathan Stroud",
        "category": "CHILDREN",
        "format": "PAPERBACK",
        "price": 8.49,
        "title": "The Amulet of Samarkand",
        "year": 2003
      },
...

As we can see from the output of the cat commands above, both files now have the same content.

Let’s break down the jq command.

The --sort-keys option tells jq to sort the keys of every JSON object alphabetically.

In addition, we used a jq filter expression ('.bookstore.book |= sort_by(.author)').

The expression consists of three parts:

selecting the book array (.bookstore.book)
using the modifying operator (|=) to modify the array
sorting the array based on the author value (sort_by(.author))
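To try the three parts on a smaller input, here’s a minimal sketch. The function name sort_books is made up for illustration, and the -c option is added only to print the result on a single line:

```shell
# sort_books: read a bookstore document on stdin and print it on one
# line (-c) with object keys sorted and the book array ordered by author
sort_books() {
    jq -c --sort-keys '.bookstore.book |= sort_by(.author)'
}
```

For instance, piping {"bookstore":{"book":[{"year":2,"author":"B"},{"year":1,"author":"A"}]}} into sort_books should print {"bookstore":{"book":[{"author":"A","year":1},{"author":"B","year":2}]}}.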

Finally, let’s compare the sorted JSON files:

$ diff a-sorted.json b-sorted.json

Since both files are now identical, the diff command didn’t print any output.
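The filter above relies on knowing that every book has an author. When no such key is known in advance, one option is to sort every array by value instead. The following is a sketch under that assumption, using jq’s walk function as available in the jq 1.6 used here; normalize_json is a hypothetical helper name. It treats all arrays as unordered, so it isn’t appropriate when element order carries meaning:

```shell
# normalize_json FILE: print FILE with object keys sorted and every
# array, at any depth, sorted by value. walk visits nested values first,
# so inner arrays are already sorted when an outer array is compared.
normalize_json() {
    jq --sort-keys 'walk(if type == "array" then sort else . end)' "$1"
}
```

The normalized outputs can then be stored and compared with diff in the same way as the sorted files above.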

6.3. Comparing the First and the Third Files

We can follow the same method to compare the first and the third files:

$ jq --sort-keys '.bookstore.book |= sort_by(.author)' c.json > c-sorted.json
$ diff a-sorted.json c-sorted.json
4a5,11
>         "author": "George S. Clason",
>         "category": "ECONOMICS",
>         "price": 29.99,
>         "title": "The Richest Man in Babylon",
>         "year": 1926
>       },
>       {

The diff command printed out the differences between the two. In this case, it showed there are extra lines in the right (>) input, corresponding to the c-sorted.json file.

7. Conclusion

In this article, we learned how to compare files with the same content but arranged on different lines, focusing on three file formats: plain text, XML, and JSON.

For the plain text files, we simply used the sort and diff commands.

For the XML files, we used xsltproc with XSLT templates to sort the XML attributes, elements, and their child elements. Before comparing the sorted data with diff, we used xmllint to remove the white spaces and adjust the indentations.

For the JSON files, we used jq to sort the JSON objects by key and the JSON arrays by value. Additionally, the same jq command also removed the white spaces and adjusted the indentations. Afterward, we compared the sorted files with diff.
