How to Find Non-ASCII Characters in Text Files in Linux

1. Overview

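Non-ASCII characters are characters that fall outside the ASCII character set, such as accented letters, CJK ideographs, or currency symbols like €. They come from larger character sets and encodings such as Unicode or EBCDIC. ASCII itself is limited to 128 characters and was initially developed for the English language.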
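To make the 128-character limit concrete, here is a minimal sketch that dumps the bytes of two characters with od. The sample characters are our own choice, and the sketch assumes UTF-8, where every non-ASCII character is stored as bytes with a value of 0x80 or above:

```shell
# Dump the bytes of an ASCII letter and of a UTF-8 encoded "é".
# The "é" is written as octal escapes so the script itself stays pure ASCII.
ascii_bytes=$(printf 'e' | od -An -tx1 | tr -d ' \n')
utf8_bytes=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')

echo "ASCII letter e: $ascii_bytes"
echo "UTF-8 e-acute:  $utf8_bytes"
```

The ASCII letter fits in one byte below 0x80, while the accented letter takes two bytes that are both above 0x7F. This is the property that the byte ranges in the following sections rely on.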

In this tutorial, we’ll look at some tools to find and highlight non-ASCII characters within text files.

2. Setup

Let’s create a file sample.txt with multiple lines and non-ASCII characters in some lines:

$ cat >> sample.txt

This is an article on finding non-ASCII characters on Baeldung
日本人 中國的 ~=[]()%+{}@;’#!$_&-  éè  ;∞¥₤€
We hopè you find it inform@tiv€
Thank You!

We’ll use this as the sample file throughout the tutorial.
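Since cat >> sample.txt waits for typed input until we press Ctrl+D, a here-document may be handier when the file has to be created from a script. This is only a sketch: the quoted EOF delimiter is our own choice (it keeps the shell from expanding $ inside the text), and it uses > rather than >>, so it overwrites an existing sample.txt:

```shell
# Write the sample lines to sample.txt; quoting 'EOF' keeps the text literal.
cat > sample.txt <<'EOF'
This is an article on finding non-ASCII characters on Baeldung
日本人 中國的 ~=[]()%+{}@;’#!$_&-  éè  ;∞¥₤€
We hopè you find it inform@tiv€
Thank You!
EOF

# Count the lines to confirm that the file was written.
line_count=$(grep -c '' sample.txt)
echo "sample.txt has $line_count lines"
```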

3. Using grep

grep stands for global regular expression print. It searches for particular patterns of characters in the input and outputs all lines that match.

The grep command has several variants and is available by default on almost every Linux distribution. Here, we’ll focus on the most widely used one, GNU grep.

We can use this command to find all non-ASCII characters:

$ grep --color='auto' -P -n "[\x80-\xFF]" sample.txt

Now, let’s understand this command by breaking it down:

--color='auto': specifies when parts of the matched pattern should be colored. The 'auto' value highlights matching strings if the output is written directly to the terminal. Other options include 'always', 'never', 'tty', etc.
-P: interprets patterns as Perl-compatible regular expressions
-n: displays each matched line with a line number
"[\x80-\xFF]": regular expression that matches values in the range 0x80 to 0xFF, which lie just above the ASCII range

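Depending on our locale settings, the above command may display only some of the non-ASCII characters present. In a UTF-8 locale, grep -P typically reads \x80-\xFF as the code points U+0080 to U+00FF, so it matches characters such as é and ¥ but skips anything beyond that range, such as 日 or €. An alternative is to negate the ASCII range instead, which matches every character outside it: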

$ grep --color='auto' -P -n "[^\x00-\x7F]" sample.txt

Or, we can grep the inverse using the [:ascii:] character class. It’s a PCRE extension rather than a POSIX class, so we keep the -P flag:

$ grep --color='auto' -P -n "[^[:ascii:]]" sample.txt

Here’s the response we get:

[Screenshot: grep output for non-ASCII characters]

This command only displays lines containing non-ASCII characters. Additionally, it highlights non-ASCII characters in red.

We’ve used "[^\x00-\x7F]" and "[^[:ascii:]]" as regular expressions. Both are negated character classes, so they match any character outside the ASCII range.
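When we only need the line numbers or a count, for example in a script, the same pattern can be combined with other grep options. The following is a sketch under a few assumptions of our own: demo.txt is a throwaway file, LC_ALL=C forces byte-by-byte matching regardless of the system locale, and the -a (treat input as text) and -c (count matching lines) flags are additions that the commands above don’t use:

```shell
# Build a small throwaway file; lines 2 and 3 contain UTF-8 bytes (é and €).
printf 'plain ascii line\ncaf\303\251 au lait\nprice: 5\342\202\254\nbye\n' > demo.txt

# List the numbers of the lines that contain at least one non-ASCII byte.
matches=$(LC_ALL=C grep -a -n -P '[^\x00-\x7F]' demo.txt | cut -d: -f1 | paste -sd ' ' -)

# Count those lines instead of printing them.
count=$(LC_ALL=C grep -a -c -P '[^\x00-\x7F]' demo.txt)

echo "lines with non-ASCII bytes: $matches"
echo "total: $count"
```

Because grep exits with status 0 only when something matches, the same command can also serve as a condition in an if statement.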

4. Using pcregrep

pcregrep is a utility for performing file searches for specific character patterns. It utilizes the PCRE regular expressions library to support specific patterns compatible with Perl 5. We can install it through the package manager.

Let’s use this command to install it on Debian, Ubuntu, or Kali:

$ sudo apt install pcregrep

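On Red Hat-based distributions, we can use yum. If yum can’t find a package with this name, the tool may be shipped in the pcre-tools package instead:

$ sudo yum install pcregrep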

Next, let’s use pcregrep to check for non-ASCII characters in our sample.txt file:

$ pcregrep --color='auto' -n "[\x80-\xFF]" sample.txt

Alternatively, we can use character classes:

$ pcregrep --color='auto' -n "[^[:ascii:]]" sample.txt

We get this output:

[Screenshot: pcregrep output for non-ASCII characters]

Here is a breakdown of the command:

--color='auto': specifies when parts of the matched pattern should be colored
-n: displays each matched line with a line number
"[\x80-\xFF]": regular expression that matches any byte outside the ASCII range

We can see in the output that all non-ASCII characters are highlighted in red.
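Since pcregrep isn’t installed everywhere, a script may want to fall back to grep -P when it’s missing. The sketch below shows one way to do that. The function name find_non_ascii and the file demo.txt are hypothetical, and the fallback assumes GNU grep with PCRE support:

```shell
# Throwaway input; only line 2 contains a non-ASCII character (é as UTF-8 bytes).
printf 'plain ascii line\ncaf\303\251 au lait\n' > demo.txt

# find_non_ascii is a hypothetical helper, not part of pcregrep or grep.
find_non_ascii() {
    if command -v pcregrep >/dev/null 2>&1; then
        # Without the -u option, pcregrep compares single bytes.
        pcregrep -n '[\x80-\xFF]' "$1"
    else
        # Fallback: GNU grep in the C locale also works byte by byte.
        LC_ALL=C grep -a -n -P '[\x80-\xFF]' "$1"
    fi
}

hits=$(find_non_ascii demo.txt | cut -d: -f1)
echo "non-ASCII bytes found on line: $hits"
```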

5. Using perl

perl was originally built for scanning, extracting, and printing information in arbitrary text files.

It has grown into a general-purpose programming language widely used for many system management tasks.

perl is available by default on most Linux distributions. If it’s missing, we can install it:

$ sudo apt-get install perl

Next, let’s find non-ASCII characters using perl:

$ perl -ne 'print if /[^[:ascii:]]/' sample.txt
日本人 中國的 ~=[]()%+{}@;’#!$_&- éè ;∞¥₤€
We hopè you find it inform@tiv€

By default, the command above displays only lines that contain non-ASCII characters.

Let’s break down the command to understand it:

-ne: two combined flags; -n makes perl loop over the input line by line without printing automatically, and -e passes the program on the command line
'print if…': a small program that prints the current line if it matches the regular expression
/[^[:ascii:]]/: regular expression matching any non-ASCII character

6. Using tr

We can use the tr or translate command to translate or delete specific characters. It comes pre-installed in all the major Linux distributions.

tr allows us to perform text transformations such as uppercase to lowercase, deleting specific character patterns, etc.

Let’s use it to delete all the printable ASCII characters in our sample file:

$ tr -d '[:print:]' < sample.txt

日本人中國的’éè∞¥₤€
è€

We’ve used the -d flag to delete characters, and '[:print:]' to match all printable ASCII characters, including spaces.

The command above therefore prints only what’s left: the non-ASCII characters and the newlines, which aren’t printable characters. That’s why the first line of the output is empty.
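A variation on this idea gives a quick check of whether any non-ASCII bytes are present at all. In this sketch, the octal range \000-\177 covers every ASCII byte, control characters and newlines included, and LC_ALL=C is an assumption we add so that tr works on single bytes. The input string is made up for illustration:

```shell
# "café €" as octal escapes: é takes 2 bytes and € takes 3 bytes in UTF-8.
# Deleting \000-\177 removes every ASCII byte, so wc -c counts what remains.
non_ascii_bytes=$(printf 'caf\303\251 \342\202\254\n' | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' ')
echo "non-ASCII bytes: $non_ascii_bytes"
```

A result of 0 means that the input is pure ASCII.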

7. Using sed

sed is a stream editor that performs operations such as search and find-and-replace on text. It lets us edit files quickly from the command line without opening them in an editor.

Let’s remove all non-ASCII characters using sed:

$ LC_ALL=C sed -i 's/[^\x00-\x7F]//g' sample.txt

Now, let’s understand each part of the command:

LC_ALL=C: environment variable that overrides all other localization settings. Here, we’ve set it to the simplest C setting, so sed treats the input as single bytes.
-i: edits the file in place instead of printing the result
's/[^\x00-\x7F]//g': substitution that matches every byte outside the ASCII range (0x00 to 0x7F) and replaces it with nothing

The upper bound of the range matters. With a larger value such as \xB1, bytes between 0x80 and 0xB1 survive, and the leftover fragments of multi-byte characters show up as garbled � symbols.

The command above edits our original input file, so it’s advisable to create a copy of the input file before running it.

By default, the command above doesn’t display any output. We can use the cat command to inspect the changes:

$ cat sample.txt

This is an article on finding non-ASCII characters on Baeldung
  ~=[]()%+{}@;#!$_&-    ;
We hop you find it inform@tiv
Thank You!
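Since an in-place edit can’t be undone, one option is to let sed keep the backup itself. The sketch below assumes that the installed sed accepts a suffix attached to -i (as GNU sed does) and understands \x escapes; demo.txt is a throwaway file name:

```shell
# Throwaway input: "café au lait" with é written as UTF-8 octal escapes.
printf 'caf\303\251 au lait\n' > demo.txt

# -i.bak edits demo.txt in place and keeps the original as demo.txt.bak.
LC_ALL=C sed -i.bak 's/[^\x00-\x7F]//g' demo.txt

cleaned=$(cat demo.txt)
backup_bytes=$(wc -c < demo.txt.bak | tr -d ' ')

echo "cleaned text: $cleaned"
echo "backup size in bytes: $backup_bytes"
```

The backup still holds all 14 bytes of the original line, so the change can be reverted by copying demo.txt.bak back over demo.txt.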

On an unmodified copy of sample.txt, we can also use the l command of sed to make non-ASCII characters visible:

$ sed -n 'l' sample.txt

This is an article on finding non-ASCII characters on Baeldung$
\346\227\245\346\234\254\344\272\272 \344\270\255\345\234\213\347\232\
\204 ~=[]()%+{}@;\342\200\231#!$_&- \303\251\303\250 ;\342\210\236\
...truncated

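The l command prints each line in an unambiguous form: it writes every non-ASCII byte as an octal escape, marks the end of each line with $, and wraps long lines with a trailing backslash.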

8. Conclusion

In this article, we’ve learned about non-ASCII characters. We also discussed different tools that we can use to find non-ASCII characters within text files.
