How to Find Lines Exceeding a Specific Length in a File

1. Overview

Checking or processing text files is a common operation when we work on the Linux command line.

In this tutorial, we’ll explore how to find lines exceeding a given length in a text file.

2. Introduction to the Problem

Checking the length of lines in a file can be necessary for a variety of reasons. For example, when we need to import data from a file into a database table, we may want to check the data length before starting the import process.

An example can quickly explain the problem. Let’s say we have a file called myFile.txt:

$ cat -n myFile.txt
     1  The pipe marks the 35th column -> |
     2
     3  No Man Is An Island - by John Donne
     4
     5  No man is an island,
     6  Entire of itself,
     7  Every man is a piece of the continent,
     8  A part of the main.
     9  If a clod be washed away by the sea,
    10  Europe is the less.
    11  As well as if a promontory were.
    12  As well as if a manor of thy friend's
    13  Or of thine own were:
    14  Any man's death diminishes me,
    15  Because I am involved in mankind,
    16  And therefore never send to know for whom the bell tolls;
    17  It tolls for thee.

As the output above shows, we’ve used the cat command with the -n option to show the file content with line numbers. The file contains a short poem by John Donne.

Now, we want to detect whether the file has lines that hold more than 35 characters. In other words, we’d like to find lines longer than 35 characters.

To make those lines easier to spot, we’ve marked column 35 with a pipe (|) character in the first line. Therefore, we can see that four lines in the file exceed 35 characters:

Line #7:  Every man is a piece of the continent,
Line #9:  If a clod be washed away by the sea,
Line #12: As well as if a manor of thy friend's
Line #16: And therefore never send to know for whom the bell tolls;

Now, we’ll use powerful Linux command-line tools to find these four lines.

As in the list above, the output should contain both the line numbers and the content of the lines.

There are several ways to achieve that. In this tutorial, we’ll address three approaches. So next, let’s see them in action.

3. Using the grep Command

One way to accomplish this task is using the grep command. Some of us might be surprised we’re picking grep to solve the problem. This is because grep is a Regex-based tool, and Regex isn’t good at arithmetic calculation and comparison, such as calculating the length of a line and comparing it to 35.

That’s true. However, if we look at the problem from a different angle, we’ll see that Regex can solve it perfectly: if a line’s length exceeds 35, the line must have at least 36 characters. Therefore, we can ask the grep command to match lines that hold at least 36 characters using the Regex '.\{36\}':

$ grep '.\{36\}' myFile.txt
Every man is a piece of the continent,
If a clod be washed away by the sea,
As well as if a manor of thy friend's
And therefore never send to know for whom the bell tolls;

As the output above shows, grep has successfully found the four long lines and printed their content. We can use the -n option to tell the grep command to output matched lines with their line numbers:

$ grep -n '.\{36\}' myFile.txt
7:Every man is a piece of the continent,
9:If a clod be washed away by the sea,
12:As well as if a manor of thy friend's
16:And therefore never send to know for whom the bell tolls;

As we can see, the grep command has found the four long lines with the line numbers. So, it solves the problem.

It’s worth mentioning that grep uses BRE (Basic Regular Expressions) by default. That’s why we must escape ‘{’ and ‘}’ to give them special meaning. Alternatively, we can pass the -E option to grep to use ERE (Extended Regular Expressions). Then, we can remove the two backslashes to make the command easier to read:

$ grep -nE '.{36}' myFile.txt
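
The length limit doesn’t have to be hard-coded in the pattern. As a minimal sketch, assuming a hypothetical shell variable max and a two-line sample file created only for this demo, we can let the shell compute the interval:

```shell
# Demo file (sample data for illustration): a 35-character line and a 38-character line
tmpfile=$(mktemp)
printf '%s\n' \
  'The pipe marks the 35th column -> |' \
  'Every man is a piece of the continent,' > "$tmpfile"

max=35   # hypothetical variable: the longest length we still accept

# A line longer than $max holds at least max+1 characters.
# Double quotes let the shell expand $((max + 1)) before grep sees the pattern.
long_lines=$(grep -n ".\{$((max + 1))\}" "$tmpfile")
printf '%s\n' "$long_lines"

rm -f "$tmpfile"
```

Here, only the second sample line is reported, as the first one has exactly 35 characters.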

4. Using the sed Command

The sed command is a handy command-line utility to process text inputs. Further, sed supports Regex too. Therefore, we can use the same idea to solve the problem:

$ sed -n '/.\{36\}/p' myFile.txt 
Every man is a piece of the continent,
If a clod be washed away by the sea,
As well as if a manor of thy friend's
And therefore never send to know for whom the bell tolls;

The above sed command only prints lines that have at least 36 characters.

Unlike grep, sed doesn’t have an option to output lines with their line numbers. However, sed has the ‘=’ command, which prints the current line number followed by a linebreak:

$ sed -n '/.\{36\}/{=;p}' myFile.txt
7
Every man is a piece of the continent,
9
If a clod be washed away by the sea,
12
As well as if a manor of thy friend's
16
And therefore never send to know for whom the bell tolls;

Although the output format is different from grep’s, the sed command does the job.

Many sed implementations can use ERE too. For example, the widely used GNU sed treats patterns as ERE when we pass the -r (or -E) option. Therefore, this command produces the same output:

$ sed -nr '/.{36}/{=;p}' myFile.txt
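
If we’d rather have the line number and the content on one line, one possible approach, shown here only as a sketch, is to pipe the two-line output into a second sed call that joins each pair of lines. The sample file and the variable name joined are made up for this demo:

```shell
# Demo file (sample data for illustration): one short line and one 38-character line
tmpfile=$(mktemp)
printf '%s\n' \
  'Europe is the less.' \
  'Every man is a piece of the continent,' > "$tmpfile"

# First sed: print the line number and the content of each long line.
# Second sed: N appends the next input line to the pattern space,
# then s replaces the embedded newline with ": ".
joined=$(sed -n '/.\{36\}/{=;p}' "$tmpfile" | sed 'N;s/\n/: /')
printf '%s\n' "$joined"

rm -f "$tmpfile"
```

This works because the first sed command always emits lines in pairs: a number followed by the matching line.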

5. Using the awk Command

awk is another powerful command-line text processing tool. Of course, awk supports Regex too, so it could solve the problem with the same Regex approach. However, as awk provides a C-like scripting language, it can solve the problem in a more straightforward way:

$ awk 'length > 35 { print NR ": " $0 }' myFile.txt
7: Every man is a piece of the continent,
9: If a clod be washed away by the sea,
12: As well as if a manor of thy friend's
16: And therefore never send to know for whom the bell tolls;

As the output above shows, awk’s built-in length function, when called without an argument, returns the length of the current line. Therefore, we can simply print the line number (NR) and the content of that line ($0) if the current line is longer than 35 characters.
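
The limit can also be passed in from the shell instead of being written into the script. The following sketch assumes a hypothetical awk variable named max, set with the standard -v option, and a small sample file created for the demo:

```shell
# Demo file (sample data for illustration): a short line and a 36-character line
tmpfile=$(mktemp)
printf '%s\n' \
  'Entire of itself,' \
  'If a clod be washed away by the sea,' > "$tmpfile"

max=35   # hypothetical shell variable holding the limit

# -v copies the shell value into the awk variable max before the script runs;
# length($0) is the explicit form of calling length without an argument.
awk_result=$(awk -v max="$max" 'length($0) > max { print NR ": " $0 }' "$tmpfile")
printf '%s\n' "$awk_result"

rm -f "$tmpfile"
```

Only the second sample line is printed, prefixed with its line number.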

We’ve seen that both grep and sed solve the problem. However, we may have noticed that customizing their output isn’t easy. For example, the sed solution puts the line number and the content on two separate lines unless we post-process its output. In contrast, changing the output format is a piece of cake for awk.

Now, let’s say apart from the line numbers and the contents, we want to know the lengths of those long lines:

$ awk 'length > 35 { printf "Line #%d (Length: %d): %s\n", NR, length, $0 }' myFile.txt
Line #7 (Length: 38): Every man is a piece of the continent,
Line #9 (Length: 36): If a clod be washed away by the sea,
Line #12 (Length: 37): As well as if a manor of thy friend's
Line #16 (Length: 57): And therefore never send to know for whom the bell tolls;

As we can see, the awk script allows us to control the output flexibly.
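
For instance, if we only need a summary before an import, a sketch like the following counts the long lines instead of printing them. The variable names max and n and the sample file are assumptions made for this illustration:

```shell
# Demo file (sample data for illustration): one short line and two long lines
tmpfile=$(mktemp)
printf '%s\n' \
  'No man is an island,' \
  'If a clod be washed away by the sea,' \
  'And therefore never send to know for whom the bell tolls;' > "$tmpfile"

summary=$(awk -v max=35 '
  length($0) > max { n++ }    # count every line longer than max
  END { printf "%d line(s) longer than %d characters\n", n, max }
' "$tmpfile")
printf '%s\n' "$summary"

rm -f "$tmpfile"
```

The END block runs once after the last line has been read, so the report appears a single time regardless of how many lines match.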

6. Conclusion

In this article, we’ve learned three ways to find lines exceeding a specific length in a file through examples.

grep and sed solve the problem using Regex. However, they offer little control over the output format.

The awk command can solve the problem straightforwardly. Further, it allows us to control the output freely.
