1. Overview

In this tutorial, we’ll explore how to use the uniq command.

2. uniq

The uniq command provides us with an easy way to filter text files and remove duplicate lines from a stream of data.

We can use uniq in a few ways: we can print only the unique lines or only the repeated ones. Additionally, uniq can print each distinct line with a count of how many times it appears in a file.

An important aspect to keep in mind is that uniq only compares adjacent lines. This means we usually need to sort our data first so that duplicate lines end up next to each other. Luckily, in Linux, we can use the sort command to achieve that.
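To see why sorting matters, here's a minimal sketch using inline data rather than a file. Without sort, the two duplicate lines never sit next to each other, so uniq keeps both:

```shell
# uniq only collapses *adjacent* duplicates: without sorting,
# the two "Japan" lines are separated by "Spain" and both survive.
printf 'Japan\nSpain\nJapan\n' | uniq

# Sorting first makes the duplicates adjacent, so uniq can drop one.
printf 'Japan\nSpain\nJapan\n' | sort | uniq
```

The first pipeline prints all three lines unchanged, while the second prints only Japan and Spain.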

Let’s try using uniq on a list of countries of visitors to our web server. First, we’ll create a file called countries.txt:

$ cat << EOF > countries.txt
Germany
South Africa
Japan
USA
England
Spain
Italy
Cameroon
Japan
EOF

3. Printing Duplicate Lines

In this example, we’ll use the uniq command to print the duplicate lines in our file. Let’s sort our data and pipe it through uniq to see how this works:

$ sort countries.txt | uniq -d
Japan

Here we’ve sorted the data and piped it to uniq. The -d flag prints just one instance of each duplicated line. We’re presented with the output of “Japan” since that’s the only duplicate.

Now let’s take a look at a quick variation of this:

$ sort countries.txt | uniq -D
Japan
Japan

In our variation, we passed the -D flag to uniq, which prints all instances of the duplicate lines.

4. Counting Duplicate Lines

Let’s have a look at how we can get a quick and easy count of the duplicates in our data:

$ sort countries.txt | uniq -c
      1 Cameroon
      1 England
      1 Germany
      1 Italy
      2 Japan
      1 South Africa
      1 Spain
      1 USA

Using the -c flag, uniq prefixes each line with the number of times it appears in the file and prints it to the screen.
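A common follow-up is to rank the counts so the most frequent entry comes first. This sketch uses inline data so it stands alone, but the same pipeline works on countries.txt:

```shell
# Count each line, then sort the counts numerically (-n)
# in reverse (-r) so the most frequent entry tops the list.
printf 'Japan\nSpain\nJapan\n' | sort | uniq -c | sort -nr
```

Here Japan, with a count of 2, is printed before Spain.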

5. Removing Duplicate Lines

Now we’re going to use uniq to remove the duplicated lines entirely and present us with just those countries that occur only once in our countries.txt file.

We accomplish that with the -u flag to uniq:

$ sort countries.txt | uniq -u
Cameroon
England
Germany
Italy
South Africa
Spain
USA

As expected, “Japan” is not in the output because it occurs more than once in our file and is therefore not considered a unique record.

6. Case Sensitivity

In the real world, our data might be more inconsistent. Let’s update our sample data file and use a mix of different cases as a test:

$ cat << EOF > countries.txt
GERMANY
South AFRICA
Japan
USA
england
Spain
ItaLY
CaMeRoon
JAPAN
EOF

Now let’s attempt to print the duplicates in this file:

$ sort countries.txt | uniq -D

Oddly, our output is blank. We know that Japan is duplicated and should be printed, but “Japan” and “JAPAN” differ in case, so uniq doesn’t consider them duplicates.

Let’s see how we can account for that in uniq using the -i flag:

$ sort countries.txt | uniq -D -i
Japan
JAPAN

We can get further confirmation by counting how many times Japan appears in the file:

$ sort countries.txt  | uniq -c -i -d
      2 Japan

By using -i, we’ve asked uniq to perform a case-insensitive comparison when searching for duplicates.
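An alternative approach, sketched here with inline data, is to normalize the case up front with tr. That way sort, uniq, and any later steps in the pipeline all see one canonical spelling:

```shell
# Lower-case everything before sorting, so case differences
# never reach uniq in the first place.
printf 'Japan\nJAPAN\nSpain\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -d
```

The trade-off is that the output is lower-cased too, whereas -i preserves the original spelling of the first occurrence.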

7. Skipping Characters

Sometimes we might want to skip over or ignore a certain number of characters while looking for duplicate values. We can achieve this in uniq with the -s flag.

First, let’s create some sample data for this example:

$ cat << EOF > visitors.txt
Visitor from Cameroon
Visitor from England
Visitor from Germany
Visitor from Italy
Visitor from Japan
Visitor from Japan
Visitor from South Africa
Visitor from Spain
Visitor from USA
EOF

Now that we’ve created our data, we’ll pass -s the number of characters from the start of the line to skip over:

$ uniq -s 13 -c visitors.txt
      1 Visitor from Cameroon
      1 Visitor from England
      1 Visitor from Germany
      1 Visitor from Italy
      2 Visitor from Japan
      1 Visitor from South Africa
      1 Visitor from Spain
      1 Visitor from USA

In this example, we’ve used the -s flag to tell uniq to skip over the first 13 characters of each line. Doing this leaves uniq with just the country names to filter and, as expected, it’s just Japan that appears twice in our visitors.txt file.

8. First n Characters

We’re able to limit the number of characters that uniq uses for comparison when searching for duplicates.

Let’s take a look at how the -w option can be used to compare the first seven characters of each line in our visitors.txt file:

$ uniq -w 7 -c visitors.txt
      9 Visitor from Cameroon

We must take care not to get confused by uniq’s output when using the -w flag. What the output tells us is that all nine lines share the same first seven characters (“Visitor”), so uniq treats the entire file as nine repetitions of its first line.
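A case where -w is genuinely useful is when lines share a meaningful fixed-width prefix, such as a date. The log lines below are hypothetical, but they show uniq grouping entries from the same day:

```shell
# Compare only the first 10 characters (the date), so two
# different events on the same day count as one group.
printf '2023-01-01 login\n2023-01-01 logout\n2023-01-02 login\n' | uniq -c -w 10
```

This prints two groups: a count of 2 for 2023-01-01 and a count of 1 for 2023-01-02.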

9. Ignoring Fields

We may want uniq to ignore a certain number of fields on each line when performing duplicate searches, and this is where the -f option comes into play:

$ uniq -f 2 -D visitors.txt
Visitor from Japan
Visitor from Japan

We’ve asked uniq to ignore the first two fields on each line in this example. A field is a run of non-blank characters separated by whitespace, so effectively we’re ignoring the “Visitor from” text on each line.
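To make the effect of -f concrete, here's a small sketch with inline data: the first fields differ, yet uniq still counts the lines as duplicates because it only compares what remains after skipping two fields:

```shell
# "Visitor" and "Guest" differ, but after skipping two fields
# both lines reduce to " Japan", so uniq counts them as one group.
printf 'Visitor from Japan\nGuest from Japan\n' | uniq -f 2 -c
```

uniq reports a count of 2 and prints the first line of the group, “Visitor from Japan”.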

10. Conclusion

In this tutorial, we explored the uniq command and listed some of its common uses. We then used uniq in a few examples to highlight how it works.

As always, we can refer to the man page for more information about it.
