How to Remove Lines With Fewer Than n Characters

1. Overview

Linux offers plenty of tools for text processing. While we can use these tools to transform the output of Linux commands via inter-process communication like pipes, we can also use them to manipulate text files.

In this tutorial, we’ll learn some ways to remove lines with character lengths less than a given limit using Linux commands and shell scripts.

2. Sample Text File With Random Words

First, we’ll create a text file with random words to use as a sample. For this purpose, we’ll use the wbritish package, a British-English word dictionary available in Ubuntu:

$ sudo apt install wbritish

Indeed, after installing the package with apt, we can find a file with British-English words under /usr/share/dict/words:

$ cat /usr/share/dict/words | tail
zoos
zorch
zucchini
zucchini's
zucchinis
zwieback
...

As we can see, this file contains one word per line. Thus, we can create a shell script that generates a small text file with randomly chosen words from the /usr/share/dict/words file:

$ for i in {1..10}; do
    shuf -n 5 /usr/share/dict/words | tr '\n' ' ' >> randomtxt;
    echo -e '' >> randomtxt;
  done

Here, we use a for loop to run the shuf command 10 times, so we generate a file with 10 lines. On each iteration, shuf with 5 after the -n option selects at most 5 random lines from the words file. Then, it forwards its output to the tr command, which replaces each newline character with a space. As a result, the five words are packed into a single line. Finally, the line is appended to the randomtxt file, and the echo -e command adds a single newline character at the end of the line.
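One detail worth noting is that tr also turns the last newline of each group into a space, so every line in randomtxt ends with a trailing space that counts toward its length. As a minimal sketch of an alternative, the snippet below joins the words with paste instead. The words.txt file, the make_line helper, and the randomtxt_demo output file are hypothetical names used only for illustration:

```shell
# words.txt is a hypothetical stand-in for /usr/share/dict/words
printf '%s\n' alpha beta gamma delta epsilon zeta eta theta > words.txt

# make_line is an illustrative helper: pick 5 random words and join them
# with single spaces; paste -s adds no trailing space
make_line() {
  shuf -n 5 "$1" | paste -s -d ' ' -
}

# build a 10-line file, one make_line call per line
for i in 1 2 3 4 5 6 7 8 9 10; do
  make_line words.txt
done > randomtxt_demo
```

Lines built this way are one character shorter than the ones produced with tr, because they carry no trailing space.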

At this point, the randomtxt file is ready:

$ cat randomtxt
tricolour Fukuyama eyelids espadrille legislate
Andrianampoinimerina's permutation's complying impeachment clog
stowaway Baotou vandalise Galloway's aureole
Fosse attend molest soon conjurer's
...

As expected, the file contains 10 lines, with 5 words per line.

3. Remove Lines With a Shell Script

First, we’ll create a shell script to filter the lines of the file. Such a script loops over the lines of the file and counts the characters in each line. Then, it prints the lines with n or more characters to another file or to the standard output.

To begin, let’s run a while loop to read the file line by line and count the characters of each line:

$ cat fileLines.sh
while read line; do
  linecount=$(echo $line | wc -m);
  echo $linecount $line;
done <randomtxt
$ bash fileLines.sh
48 tricolour Fukuyama eyelids espadrille legislate
64 Andrianampoinimerina's permutation's complying impeachment clog
45 stowaway Baotou vandalise Galloway's aureole
...

We can see that the script prepended the number of characters to each line of the file. Each count includes the newline character that echo appends to the line.

Let’s review the commands that we used in this short script:

the read built-in command together with while reads the lines of the randomtxt file and assigns them to the line variable
inside the while’s body, we use command substitution with the $() form to count the characters of the line
we store the character count of each line in the linecount variable
echo sends the line to the wc command through a pipe, and wc with the -m option counts the characters
the linecount variable and the line are printed to the standard output
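The steps above can also be written without the external wc call. The following is a minimal sketch, not the script from this tutorial: it relies on the shell’s ${#line} expansion, and the sample.txt file and the print_lengths helper are hypothetical names introduced for illustration:

```shell
# sample.txt is a hypothetical stand-in for randomtxt
printf '%s\n' 'short line' 'a somewhat longer line of text' > sample.txt

# print_lengths is an illustrative helper: prefix each line with its length
print_lengths() {
  # IFS= and -r keep leading/trailing whitespace and backslashes intact
  while IFS= read -r line; do
    # ${#line} expands to the number of characters in the line,
    # not counting the newline
    printf '%d %s\n' "${#line}" "$line"
  done < "$1"
}

print_lengths sample.txt
```

For the two sample lines, this prints the counts 10 and 30. Unlike the wc-based version, the count here doesn’t include a newline, and quoting the variable preserves any whitespace at the ends of the line.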

Next, let’s change the script a little so that it prints only lines with 49 or more characters:

$ cat fileLines.sh
n=49
while read line; do
  linecount=$(echo $line | wc -m);
  if [ $linecount -ge $n ]; then
    echo $linecount $line;
  fi
done <randomtxt
$ bash fileLines.sh
64 Andrianampoinimerina's permutation's complying impeachment clog
49 slippages denominator's shoos judgeship's nerves

Indeed, the script printed only the lines with 49 or more characters. The main difference between this version of the script and the previous example is that we echo lines conditionally, inside an if statement.
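If we want to reuse the loop with different limits and files, we could wrap it in a function. This is only a sketch under a few assumptions: filter_min_length, sample.txt, and filtered.txt are hypothetical names, and the length is taken from ${#line} rather than from wc:

```shell
# sample.txt is a hypothetical stand-in for randomtxt
printf '%s\n' 'abc' 'abcdefghij' 'abcdefghijklmno' > sample.txt

# filter_min_length is an illustrative helper
# usage: filter_min_length N FILE  -> prints lines with at least N characters
filter_min_length() {
  min=$1
  while IFS= read -r line; do
    if [ "${#line}" -ge "$min" ]; then
      printf '%s\n' "$line"
    fi
  done < "$2"
}

# keep lines with 10 or more characters and save them to a new file
filter_min_length 10 sample.txt > filtered.txt
```

Here, the 3-character line is dropped, while the lines with exactly 10 and with 15 characters are kept, which shows that the -ge comparison includes the limit itself.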

4. Using grep

The grep command is frequently used in shell scripting, especially when filtering is needed.

Here, we can use a regular expression that matches any character repeated at least a given number of times, from the beginning to the end of the line:

$ grep -E '^.{49,}$' randomtxt
Andrianampoinimerina's permutation's complying impeachment clog
slippages denominator's shoos judgeship's nerves

Indeed, the result is the same as in the previous section. The regular expression consists of several elements:

^$: when surrounded by these two characters, the pattern matches characters from the beginning to the end of a line
.: matches any single character except for new lines
{49,}: matches the preceding element 49 or more times

A key point here is that we used the -E option for extended regular expressions. This way we avoid escaping special characters like {}.
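To avoid hard-coding the limit, we could build the pattern from a shell variable. The snippet below is a minimal sketch: sample.txt, kept.txt, and dropped.txt are hypothetical file names, and n is an illustrative variable holding the limit:

```shell
# sample.txt is a hypothetical stand-in for randomtxt
printf '%s\n' 'abc' 'abcdefghij' 'abcdefghijklmno' > sample.txt

n=10   # minimum number of characters a line must have to be kept

# double quotes let the shell expand $n inside the pattern;
# \$ keeps the end-of-line anchor literal
grep -E "^.{$n,}\$" sample.txt > kept.txt

# -v inverts the match: print the lines that the filter removes
grep -v -E "^.{$n,}\$" sample.txt > dropped.txt
```

The -v run is a convenient way to review which lines would be removed before we overwrite anything.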

5. Using awk

The awk program is a well-known tool for text processing. It features a specialized language for manipulating text files. We can invoke awk by entering a pattern to select records, an action to perform on the selected records, and a text file. We can use the command with both a pattern and an action, or we can skip either the pattern or the action, but not both.

When executed, the awk tool splits the text file into records and each record into fields. By default, records are separated by the newline character and fields by whitespace. As a result, in the case of plain text files, the records are lines, while the fields are words. Moreover, we can set other record and field separator characters if we want.
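To make records and fields concrete, here is a small sketch that prints, for each line, the record number (NR), the number of fields (NF), the record length, and the first field. The sample.txt file is a hypothetical example:

```shell
# sample.txt is a hypothetical two-line example file
printf '%s\n' 'one two three' 'four five' > sample.txt

# NR: record number, NF: number of fields, length($0): characters in the record
awk '{ print NR, NF, length($0), $1 }' sample.txt

# -F sets a different field separator, here a colon
printf 'a:b:c\n' | awk -F ':' '{ print NF }'
```

The first command prints "1 3 13 one" and "2 2 9 four", while the second one prints 3, since the colon splits the record into three fields.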

5.1. Using an Action Alone

For example, we can use an action alone to print the lines with n (here, 49) or more characters:

$ awk '{if(length() >= 49) print $0}' randomtxt
Andrianampoinimerina's permutation's complying impeachment clog
slippages denominator's shoos judgeship's nerves

Indeed, we can see that the command printed lines with 49 or more characters. Since there’s no pattern, awk applies the action to all the lines.

Let’s break down the command:

$0 holds the content of the record
length() returns the length of the record in characters
the if statement checks whether the length is greater than or equal to 49 characters
if the above condition is true, we print the line

Finally, the action is enclosed in curly braces.

5.2. Using a Pattern With an Expression

Interestingly, we can use a pattern alone to filter out lines with fewer than n characters:

$ awk 'length() >= 49' randomtxt
Andrianampoinimerina's permutation's complying impeachment clog
slippages denominator's shoos judgeship's nerves

As we can see, we printed only the lines with 49 or more characters. The awk command selected only the records that matched the comparison.
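Rather than writing the limit into the program text, we could pass it in with the -v option. This is a minimal sketch, in which sample.txt is a hypothetical sample file and n is an illustrative awk variable name:

```shell
# sample.txt is a hypothetical stand-in for randomtxt
printf '%s\n' 'abc' 'abcdefghij' 'abcdefghijklmno' > sample.txt

# -v assigns the awk variable n before the program starts;
# the pattern alone selects records, and the default action prints them
awk -v n=10 'length($0) >= n' sample.txt

# the opposite comparison lists the lines that would be removed
awk -v n=10 'length($0) < n' sample.txt
```

With this sample, the first command prints the 10-character and 15-character lines, and the second prints only abc.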

5.3. Using a Regular Expression

Another option that we have is to use awk with regular expressions as patterns. As a result, we can invoke awk with the regular expression of the earlier section about grep:

$ awk '$0 ~ /^.{49,}$/' randomtxt
Andrianampoinimerina's permutation's complying impeachment clog
slippages denominator's shoos judgeship's nerves

As we can see, the command indeed printed lines with 49 or more characters. Furthermore, we used the ~ operator to match the string on its left side to the regular expression on its right side. Another point here is that the regular expression is enclosed in slashes (//).

Finally, awk uses extended regular expressions by default, although some older awk implementations don’t support interval expressions such as {49,}.

6. Using sed

The sed program is another well-known tool for editing files. Similarly to awk, we can invoke sed with a script and a text file. The script consists of an address that matches some of the lines in the file, a command such as s for substitute or p for print, and optional flags. The address can be a line number, a range of lines, or a regular expression.

Consequently, we can execute sed with the regular expression we used previously:

$ sed -E -n '/^.{49,}$/p' randomtxt
Andrianampoinimerina's permutation's complying impeachment clog
slippages denominator's shoos judgeship's nerves

Indeed, the result is the same as in the previous sections. Here, we used the -E option for extended regular expressions and the -n option to suppress output unless requested (by p in this case).

Furthermore, we’ve placed the regular expression in slashes (//). Immediately after the closing slash, we can see the p command, to print the selected lines.
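So far, the commands only print the surviving lines. As a sketch of how we might actually remove the short lines from the file, the snippet below uses the d command to delete them and then replaces the original through a temporary file. The sample.txt and sample.tmp names are hypothetical:

```shell
# sample.txt is a hypothetical stand-in for randomtxt
printf '%s\n' 'abc' 'abcdefghij' 'abcdefghijklmno' > sample.txt

# d deletes every line that matches the address; .{0,9} matches lines
# with at most 9 characters, i.e. fewer than 10 (including empty lines);
# the original is replaced only if sed succeeds
sed -E '/^.{0,9}$/d' sample.txt > sample.tmp && mv sample.tmp sample.txt
```

GNU sed also offers the -i option to edit a file in place, which would make the temporary file unnecessary. The redirect-and-move form shown here should work with other sed implementations as well.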

7. Conclusion

In this article, we’ve examined some ways to remove lines with fewer than n characters from a given file. First, we wrote a shell script for the purpose. Afterward, we saw how some text processing tools like grep, awk, and sed can help do the same.
