How to Remove Non-UTF-8 Characters From a File

1. Overview

Sometimes a file contains bytes that aren’t valid UTF-8, often because part of it was saved in a different encoding, and a program that reads it crashes with an “invalid character” error.

In this tutorial, we’re going to take a deeper dive into this topic and find out what non-UTF-8 characters are and how we can automatically remove all invalid characters from our files.

2. What Are Non-UTF-8 Characters

UTF-8 is an encoding for Unicode that maps every Unicode character to a unique sequence of bytes and can map those bytes back to the character. The “UTF” prefix stands for Unicode Transformation Format.

The “-8” suffix refers to its 8-bit code units: each character is encoded as one to four bytes.

Since UTF-8 can encode every Unicode character, “non-UTF-8 characters” are really byte sequences that aren’t valid UTF-8. They usually appear when text written in another encoding, such as ISO-8859-1, is read as if it were UTF-8.
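As a small illustration, the Python sketch below counts the bytes UTF-8 uses for a few sample characters and checks whether a byte string decodes as UTF-8. The sample characters and the helper name is_valid_utf8 are our own choices for this example.

```python
# How many bytes UTF-8 uses for a few sample characters (chosen for illustration).
samples = ["A", "ç", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same letter in ISO-8859-1 is the single byte 0xE7, which is not
# a complete UTF-8 sequence on its own.
latin1_bytes = "ç".encode("latin-1")
print(is_valid_utf8("ç".encode("utf-8")))  # True
print(is_valid_utf8(latin1_bytes))         # False
```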

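Let’s take a look at what text looks like when its encoding is misread. The first line is the original text, and the lines below it show the kind of garbled output a wrong decoding produces: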

İnanç Esasları
Ä°nanÃ§ EsaslarÄ±
æ���� ����
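The second string above appears to be the first one with its UTF-8 bytes read as ISO-8859-1. A minimal Python sketch of that assumption reproduces the garbling and then reverses it:

```python
original = "İnanç Esasları"

# Encode as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")

# Undo the mistake: back to the raw bytes, then decode them correctly.
restored = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes.hex(" "))   # c4 b0 6e 61 6e c3 a7 20 45 73 61 73 6c 61 72 c4 b1
print(restored == original)  # True
```

This round trip only works while no bytes have been lost. Once a tool has substituted replacement characters, as in the third string, the original text can’t be recovered this way.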

We may get an error if a program tries to decode bytes like these as UTF-8, or if we run a source file that contains them.

3. Filtering Invalid UTF-8 Characters

Files that contain non-UTF-8 characters produce errors when processed by utilities or when opened by some text editors. Let’s take a look at the kind of errors to expect in different languages.

3.1. An Error in Python

Here’s an error we can expect in Python:

#### Truncated ####
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte
None
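A small sketch of how such an error can arise: we encode a word as ISO-8859-1, so that “ñ” becomes the single byte 0xF1, and then try to decode the result as UTF-8. The sample word is our own and isn’t taken from the truncated traceback above.

```python
# "mañana" encoded as ISO-8859-1: the "ñ" becomes the single byte 0xF1.
data = "mañana".encode("latin-1")

message = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xF1 announces a 4-byte sequence, but the next byte is a plain "a".
    message = str(err)
    print(f"{type(err).__name__}: {message}")
```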

3.2. An Error in JavaScript

Let’s take a look at the error to expect in JavaScript:

#### Truncated ####
Uncaught SyntaxError: Unexpected identifier

3.3. An Error in Perl

Finally, let’s see the error in Perl:

Malformed UTF-8 character (fatal)

4. How to Find Non-UTF-8 Characters in a File

Assuming we’ve set our locale to UTF-8, we can easily find all non-UTF-8 characters in a file using grep.

Let’s type in the following command in our terminal to print out all lines containing non-UTF-8 characters:

grep -axv '.*' FILE

Here’s what each part of this command represents:

-a, --text: Treats our FILE as text, preventing grep from treating it as a binary file once it finds an invalid character.
-x '.*', --line-regexp: Matches only whole lines. With the pattern '.*', it matches every line made up entirely of valid UTF-8 characters.
-v, --invert-match: Inverts the match, so grep prints the lines that didn’t match.
FILE: Represents the file we want to check for invalid characters.

Let’s create a file named test.txt and add some random text to it with invalid characters:

$ touch test.txt

Then let’s add the following text to it:

2.3.1 U-0000D7FF = ed 9f bf = "퟿������"
This just some random text
More random text. Baeldung is awesome!

Let’s now use our grep command to find the lines with invalid characters in our newly created test file:

$ grep -axv '.*' test.txt
2.3.1  U-0000D7FF = ed 9f bf = "퟿������"
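The same check can be sketched in Python by decoding each line strictly and collecting the ones that fail. The function name find_invalid_lines and the sample bytes are our own. The sample also shows that ed 9f bf on its own is a well-formed sequence, so a line is only reported if it contains other, invalid bytes.

```python
def find_invalid_lines(data: bytes) -> list[tuple[int, bytes]]:
    """Return (line_number, raw_line) for each line that isn't valid UTF-8."""
    invalid = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            invalid.append((number, line))
    return invalid


sample = (
    b"valid: \xed\x9f\xbf\n"     # U+D7FF, a well-formed 3-byte sequence
    b"broken: \xff\xfe here\n"   # 0xFF and 0xFE never occur in UTF-8
    b"More random text.\n"
)
print(find_invalid_lines(sample))  # [(2, b'broken: \xff\xfe here')]
```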

But this is only useful to us when we need to find invalid characters. In the next section, we’ll find out how we can find and delete invalid characters in our file.

5. How to Automatically Remove Non-UTF-8 Characters

To automatically find and delete non-UTF-8 characters, we’re going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.

Let’s look at how we can combine this command with a few flags to remove invalid characters:

$ iconv -f utf-8 -t utf-8 -c FILE

We can break down the command above to find out what each part is doing:

-f: Represents the encoding of the input. We’ve defined it as utf-8 in our example above.
-t: Represents the target encoding that we want to convert to.
-c: Omits any invalid sequences from the output.
FILE: Represents the file we want to remove invalid characters from.
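For comparison, Python’s decoder has an error handler with a similar effect to -c. This sketch, using made-up sample bytes, drops invalid sequences with errors="ignore" and, as an alternative, marks them with errors="replace":

```python
raw = b"caf\xc3\xa9 \xff\xfeok"  # valid "café", then two invalid bytes

# Drop invalid sequences, similar in spirit to iconv -c.
cleaned = raw.decode("utf-8", errors="ignore").encode("utf-8")

# Or keep a visible marker (U+FFFD) where each invalid byte was.
marked = raw.decode("utf-8", errors="replace")

print(cleaned)                 # b'caf\xc3\xa9 ok'
print(marked.count("\ufffd"))  # 2
```

Dropping bytes silently loses information, so the replace variant can be the safer choice when we still need to see where the damage was.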

By default, the cleaned data is written to standard output on our terminal. To save it, we need to specify an output file. We can use either of the following commands:

$ iconv -f utf-8 -t utf-8 -c FILE.txt -o NEW_FILE

$ iconv -f utf-8 -t utf-8 -c FILE.txt > NEW_FILE

Let’s use the test file we created above to remove all invalid characters and save the changes to a different file named “test_clean.txt”:

$ iconv -f utf-8 -t utf-8 -c test.txt > test_clean.txt

$ iconv -f utf-8 -t utf-8 -c test.txt -o test_clean.txt
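If iconv isn’t available, the same file-to-file cleanup can be approximated in Python. This is a minimal sketch: the function name clean_file and the sample content are our own, and it reads the whole file into memory, so it only suits files of modest size.

```python
import tempfile
from pathlib import Path


def clean_file(source: Path, target: Path) -> int:
    """Write source to target without invalid UTF-8; return bytes dropped."""
    raw = source.read_bytes()
    cleaned = raw.decode("utf-8", errors="ignore").encode("utf-8")
    target.write_bytes(cleaned)
    return len(raw) - len(cleaned)


# Demonstrate on a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "test.txt"
    target = Path(tmp) / "test_clean.txt"
    source.write_bytes(b"bad \xff\xfe bytes\nMore random text.\n")

    dropped = clean_file(source, target)
    result = target.read_bytes()

print(dropped)  # 2
```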

6. Conclusion

In this article, we took a closer look at what non-UTF-8 characters are and how they can cause compatibility issues. We also looked at how to find invalid characters with grep and how to remove them from a file automatically with the iconv command.
