1. Overview
The illegal character compilation error is a file encoding error. It occurs when a source file is saved with an encoding, or an encoding marker, that the compiler doesn’t expect. As a result, in languages like Java, we can get this type of error when we try to compile our project. In this tutorial, we’ll describe the problem in detail along with some scenarios where we may encounter it, and then we’ll present some examples of how to resolve it.
2. Illegal Character Compilation Error
2.1. Byte Order Mark (BOM)
Before we go into the byte order mark, we need to take a quick look at the UCS (Unicode) Transformation Format (UTF). UTF is a character encoding format that can encode all of the possible character code points in Unicode. There are several kinds of UTF encodings. Among these, UTF-8 is the most widely used.
UTF-8 uses a variable-width encoding built from 8-bit code units to maximize compatibility with ASCII. A file saved in this encoding may begin with the bytes EF BB BF, which are the UTF-8 encoding of the code point U+FEFF, known as the byte order mark (BOM). This mark, correctly handled, is invisible. However, in some cases, it can lead to data errors.
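To make the mark concrete, here’s a minimal sketch that encodes the code point U+FEFF as UTF-8 and prints the resulting bytes. The class and method names are ours, chosen only for illustration:

```java
import java.nio.charset.StandardCharsets;

final class BomBytes {

    // Encodes the code point U+FEFF as UTF-8 and returns the bytes as hex.
    static String utf8BomHex() {
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bom) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(utf8BomHex()); // prints "EF BB BF"
    }
}
```

These three bytes are what a BOM-writing editor places at the very start of a UTF-8 file.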
In UTF-8, the BOM is optional: the encoding has no byte-order ambiguity, so the mark only acts as a signature. Even so, it may still appear in UTF-8 encoded text. It can be added either by an encoding conversion or by a text editor that uses it to flag the content as UTF-8.
Text editors like Notepad on Windows can add the BOM when saving a file. As a consequence, when we use a Notepad-like text editor to create a code example and try to compile it, we could get a compilation error. In contrast, most modern IDEs encode created files as UTF-8 without the BOM. The next sections will show some examples of this problem.
2.2. Class with Illegal Character Compilation Error
Typically, we work with advanced IDEs, but sometimes, we use a text editor instead. Unfortunately, as we’ve learned, some text editors could create more problems than solutions because saving a file with a BOM could lead to a compilation error in Java. The “illegal character” error occurs in the compilation phase, so it’s quite easy to detect. The next example shows us how it works.
First, let’s write a simple class in our text editor, such as Notepad. This class is just a placeholder, and we could write any code to test. Next, we save our file with the BOM:
public class TestBOM {
    public static void main(String... args) {
        System.out.println("BOM Test");
    }
}
Now, let’s try to compile this file using the javac command:
$ javac ./TestBOM.java
As a result, we get the following error message:
public class TestBOM {
^
.\TestBOM.java:1: error: illegal character: '\u00bf'
public class TestBOM {
^
2 errors
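One plausible reading of the '\u00bf' in this message is that the compiler decoded the file with a single-byte Western encoding instead of UTF-8. The source doesn’t state which encoding javac used, so the following is only a sketch under that assumption. It decodes the three BOM bytes as ISO-8859-1 (for these byte values, windows-1252 gives the same characters) and prints the resulting code points. The class name is hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class MisdecodedBom {

    // Decodes the UTF-8 BOM bytes with a single-byte Latin charset and
    // returns the resulting characters in U+XXXX notation.
    static String decodeBomAsLatin1() {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        String decoded = new String(bom, StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        for (char c : decoded.toCharArray()) {
            out.append(String.format("U+%04X ", (int) c));
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(decodeBomAsLatin1()); // prints "U+00EF U+00BB U+00BF"
    }
}
```

Under this assumption, the last of the three characters is U+00BF, which matches the character reported in the error.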
To fix this problem, we only need to save the file as UTF-8 without a BOM. After that, the file compiles. We should always check that our source files are saved without a BOM.
Another way to fix this issue is with a tool like dos2unix. This tool removes the BOM and also takes care of other idiosyncrasies of Windows text files, such as CRLF line endings.
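If neither option is at hand, the same cleanup can be done with a few lines of plain Java. The following is a minimal sketch, not a replacement for dos2unix: it only handles the UTF-8 BOM, and the class and method names are hypothetical. It assumes Java 11 or later for Path.of:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class BomStripper {

    // Returns the content without a leading UTF-8 BOM (EF BB BF), if present.
    static byte[] stripUtf8Bom(byte[] content) {
        boolean hasBom = content.length >= 3
            && (content[0] & 0xFF) == 0xEF
            && (content[1] & 0xFF) == 0xBB
            && (content[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(content, 3, content.length) : content;
    }

    // Rewrites the file in place, but only if a BOM was actually removed.
    static void stripInPlace(Path path) throws IOException {
        byte[] original = Files.readAllBytes(path);
        byte[] stripped = stripUtf8Bom(original);
        if (stripped.length != original.length) {
            Files.write(path, stripped);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of a file to clean, e.g. TestBOM.java
        for (String arg : args) {
            stripInPlace(Path.of(arg));
        }
    }
}
```

Reading the whole file into memory keeps the sketch short, which is reasonable for source files but not for very large inputs.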
3. Reading Files
Next, let’s analyze some examples of reading files that start with a BOM.
First, we need to create a file with a BOM to use for our test. This file contains our sample text, “Hello world with BOM.”, which will be our expected string. Next, let’s start testing.
3.1. Reading Files Using BufferedReader
First, we’ll test the file using the BufferedReader class:
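@Test
public void whenInputFileHasBOM_thenUseInputStream() throws IOException {
    String line;
    String actual = "";
    // file is assumed to be an InputStream over the BOM-prefixed test file;
    // decoding explicitly as UTF-8 avoids depending on the platform default charset
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file, StandardCharsets.UTF_8))) {
        while ((line = br.readLine()) != null) {
            actual += line;
        }
    }
    // fails: actual still starts with the BOM character U+FEFF
    assertEquals(expected, actual);
}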
In this case, when we try to assert that the strings are equal, we get an error:
org.opentest4j.AssertionFailedError: expected: <Hello world with BOM.> but was: <Hello world with BOM.>
Expected :Hello world with BOM.
Actual :Hello world with BOM.
At first glance, both strings look identical in the test output. However, the actual string starts with the invisible BOM character U+FEFF. As a result, the strings aren’t equal.
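Since the difference can’t be seen in the output, it helps to compare lengths and inspect the first character. The sketch below simulates the file with an in-memory byte array rather than the real test resource. The class and method names are ours:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomVisibility {

    // Reads the first line from the given bytes, decoding them as UTF-8.
    static String readFirstLine(byte[] utf8Bytes) throws IOException {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String expected = "Hello world with BOM.";
        // Simulate the file content: the BOM followed by the sample text.
        byte[] bytes = ("\uFEFF" + expected).getBytes(StandardCharsets.UTF_8);
        String actual = readFirstLine(bytes);

        System.out.println(expected.length());       // 21
        System.out.println(actual.length());         // 22
        System.out.println((int) actual.charAt(0));  // 65279, which is 0xFEFF
        System.out.println(expected.equals(actual)); // false
    }
}
```

The extra character at index 0 is the BOM that the UTF-8 decoder passed through unchanged.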
A quick fix is to remove the BOM character:
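@Test
public void whenInputFileHasBOM_thenUseInputStreamWithReplace() throws IOException {
    String line;
    String actual = "";
    // file is assumed to be an InputStream over the BOM-prefixed test file;
    // the replace below only matches if the bytes are decoded as UTF-8
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file, StandardCharsets.UTF_8))) {
        while ((line = br.readLine()) != null) {
            // remove the BOM character U+FEFF wherever it occurs in the line
            actual += line.replace("\uFEFF", "");
        }
    }
    assertEquals(expected, actual);
}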
The replace method removes the BOM from our string, so our test passes. However, we need to use this approach carefully: it scans every line, although the BOM only appears at the very start of the file, so processing a large number of files this way can lead to performance issues.
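One way to avoid scanning every line is to check only the first character of the stream. The following is a minimal sketch of that idea using the reader’s mark and reset methods. It assumes the input is UTF-8, and the helper name is hypothetical:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class BomSkippingReader {

    // Wraps the stream in a UTF-8 reader and consumes a leading U+FEFF, if any.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        br.mark(1);                // remember the start of the stream
        if (br.read() != 0xFEFF) { // first char isn't a BOM, or the stream is empty
            br.reset();            // rewind so that nothing is lost
        }
        return br;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "\uFEFFHello world with BOM.".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader br = open(new ByteArrayInputStream(bytes))) {
            System.out.println(br.readLine().length()); // 21, the BOM is gone
        }
    }
}
```

The check runs once per file, so the cost no longer depends on how many lines the file contains.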
3.2. Reading Files Using Apache Commons IO
In addition, the Apache Commons IO library provides the BOMInputStream class. This class wraps an InputStream, detects an encoded ByteOrderMark in its first bytes, and can exclude it from the data we read. Let’s see how it works:
@Test
public void whenInputFileHasBOM_thenUseBOMInputStream() throws IOException {
    String line;
    String actual = "";
    ByteOrderMark[] byteOrderMarks = new ByteOrderMark[] {
        ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE
    };
    // include = false: a detected BOM is skipped instead of being passed on
    InputStream inputStream = new BOMInputStream(ioStream, false, byteOrderMarks);
    Reader reader = new InputStreamReader(inputStream);
    // try-with-resources closes the whole reader chain
    try (BufferedReader br = new BufferedReader(reader)) {
        while ((line = br.readLine()) != null) {
            actual += line;
        }
    }
    assertEquals(expected, actual);
}
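The code is similar to the previous examples, but we pass the BOMInputStream as a parameter into the InputStreamReader. Because we pass false as the second constructor argument, the BOM is excluded from the data that reaches the reader.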
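To illustrate what detecting one of several byte order marks involves, here’s a plain-JDK sketch that matches the start of a byte array against the five signatures listed in the test. This isn’t how Commons IO implements it. It’s only an illustration of the idea, and the class name is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BomSniffer {

    // Longest signatures first: the UTF-32LE BOM starts with the UTF-16LE BOM.
    private static final Map<String, int[]> BOMS = new LinkedHashMap<>();
    static {
        BOMS.put("UTF-32BE", new int[] {0x00, 0x00, 0xFE, 0xFF});
        BOMS.put("UTF-32LE", new int[] {0xFF, 0xFE, 0x00, 0x00});
        BOMS.put("UTF-8", new int[] {0xEF, 0xBB, 0xBF});
        BOMS.put("UTF-16BE", new int[] {0xFE, 0xFF});
        BOMS.put("UTF-16LE", new int[] {0xFF, 0xFE});
    }

    // Returns the name of the encoding whose BOM prefixes the data, or null.
    static String detect(byte[] data) {
        for (Map.Entry<String, int[]> entry : BOMS.entrySet()) {
            int[] bom = entry.getValue();
            if (data.length < bom.length) {
                continue;
            }
            boolean match = true;
            for (int i = 0; i < bom.length; i++) {
                if ((data[i] & 0xFF) != bom[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        System.out.println(detect(utf8)); // prints "UTF-8"
    }
}
```

The ordering matters: checking the two-byte UTF-16LE signature before the four-byte UTF-32LE one would misclassify UTF-32LE input.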
3.3. Reading Files Using Google Data (GData)
Another helpful library for handling the BOM is Google Data (GData). This is an older library, but it helps manage the BOM inside the files. It uses XML as its underlying format. Let’s see it in action:
@Test
public void whenInputFileHasBOM_thenUseGoogleGdata() throws IOException {
    // 21 is the length of the expected string, "Hello world with BOM."
    char[] actual = new char[21];
    try (Reader r = new UnicodeReader(ioStream, null)) {
        r.read(actual);
    }
    assertEquals(expected, String.valueOf(actual));
}
As we observed in the previous examples, handling the BOM when reading files is important. If we don’t handle it properly, we get unexpected results when the data is read. That’s why we need to be aware of the existence of this mark in our files.
4. Conclusion
In this article, we covered several topics regarding the illegal character compilation error in Java. First, we learned what UTF is and how the BOM fits into it. Second, we showed a sample class created using a text editor, Windows Notepad in this case. The generated class failed to compile with the illegal character error. Finally, we presented some code examples of how to read files with a BOM.
As usual, all the code used for this example can be found over on GitHub.