1. Introduction
A Byte Order Mark (BOM) at the start of a file signals its encoding, but it can cause subtle problems in text processing if we don’t handle it correctly. Text files that begin with a BOM aren’t uncommon, so it’s worth knowing how to deal with them.
In this tutorial, we’ll explore how to detect and remove BOM characters when reading from a file in Java, focusing specifically on UTF-8 encoding.
2. Understanding BOM Characters
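A BOM is the special Unicode character U+FEFF placed at the very start of a text file or stream. In UTF-16 and UTF-32, it signals the endianness (byte order). UTF-8 has no byte-order ambiguity, so there the BOM serves only as an encoding signature: the three-byte sequence EF BB BF (0xEF 0xBB 0xBF).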
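While useful for encoding detection, a BOM can interfere with text processing if we don’t remove it. Java’s UTF-8 decoder doesn’t strip it, so it ends up as an invisible leading U+FEFF character in the decoded text, which can break string comparisons and parsing.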
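To see the interference concretely, here’s a minimal sketch. The class name BomDemo and the sample bytes are made up for illustration; the only thing it relies on is that decoding with StandardCharsets.UTF_8 keeps the BOM as a leading U+FEFF character:

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // The UTF-8 BOM (EF BB BF) followed by the ASCII text "hi"
    static final byte[] WITH_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decode(WITH_BOM);
        System.out.println(text.length());              // 3, not 2
        System.out.println(text.charAt(0) == '\uFEFF'); // true
        System.out.println(text.equals("hi"));          // false
    }
}
```

The decoded string looks like "hi" when printed, yet it has three characters and doesn’t equal "hi", which is exactly the kind of mismatch that makes a stray BOM hard to spot.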
3. Using InputStream and Reader
The traditional approach to handling BOMs involves using InputStream and Reader in Java. This approach lets us manually detect and remove BOMs from the input stream before processing the file’s content.
First, let’s define a helper method that reads the entire content of a Reader into a String:
private String readFully(Reader reader) throws IOException {
    StringBuilder content = new StringBuilder();
    char[] buffer = new char[1024];
    int numRead;
    while ((numRead = reader.read(buffer)) != -1) {
        content.append(buffer, 0, numRead);
    }
    return content.toString();
}
Here, we utilize a StringBuilder to accumulate the content read from the Reader. By repeatedly reading chunks of characters into a buffer array and appending them to the StringBuilder, we ensure that the entire content of the file is captured. Finally, the accumulated content is returned as a string.
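As a quick sanity check of that loop, the sketch below runs the same logic against an in-memory StringReader whose content is larger than the 1024-character buffer, so the loop has to go around several times. The ReadFullyDemo class and its roundTrip() helper are hypothetical names used only for this illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ReadFullyDemo {
    // Same loop as readFully(), made static so it can run outside the test class
    static String readFully(Reader reader) throws IOException {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[1024];
        int numRead;
        while ((numRead = reader.read(buffer)) != -1) {
            content.append(buffer, 0, numRead);
        }
        return content.toString();
    }

    // Hypothetical helper: reads the given text back through a StringReader
    static String roundTrip(String input) {
        try (Reader reader = new StringReader(input)) {
            return readFully(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        char[] chars = new char[3000];
        Arrays.fill(chars, 'x');
        String input = new String(chars);
        // 3000 characters need three passes through the 1024-character buffer
        System.out.println(roundTrip(input).equals(input)); // true
    }
}
```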
Now, let’s apply the readFully() method within a test case to demonstrate how we can effectively handle BOMs using InputStream and Reader:
@Test
public void givenFileWithBOM_whenUsingInputStreamAndReader_thenRemoveBOM() throws IOException {
    try (InputStream is = new FileInputStream(filePath)) {
        // Read the first three bytes and compare them with the UTF-8 BOM
        byte[] bom = new byte[3];
        int n = is.read(bom, 0, bom.length);
        boolean hasBom = n == 3 && (bom[0] & 0xFF) == 0xEF && (bom[1] & 0xFF) == 0xBB && (bom[2] & 0xFF) == 0xBF;
        // With a BOM, keep reading from the current position; without one, reopen the file from the start.
        // The inner try-with-resources closes whichever stream backs the reader.
        try (Reader reader = new InputStreamReader(hasBom ? is : new FileInputStream(filePath), StandardCharsets.UTF_8)) {
            assertEquals(expectedContent, readFully(reader));
        }
    }
}
Here, filePath is set up from a class loader resource, handling potential URI syntax exceptions; that setup isn’t part of the snippet. We open the file with a FileInputStream and use the stream’s read() method to read the first three bytes into a byte array, so we can check for the presence of a BOM.
If those bytes are the UTF-8 BOM (0xEF, 0xBB, 0xBF), the stream is already positioned just past it, so we wrap the same stream in an InputStreamReader with UTF-8 encoding and assert the content using the readFully() method we defined earlier. Otherwise, the bytes we consumed belong to the content, so we start over by creating a new InputStreamReader over a fresh FileInputStream and perform the same assertion.
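Reopening the file works for files on disk, but not for streams we can only read once, such as a network stream. One possible alternative, sketched below, wraps the stream in a PushbackInputStream and pushes the inspected bytes back when they aren’t a BOM. The BomSkippingReader class and its open() and decode() methods are hypothetical names for this illustration, and readNBytes() assumes Java 9 or later:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {
    // Hypothetical helper: returns a UTF-8 Reader positioned after the BOM, if there is one
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        // readNBytes (Java 9+) keeps reading until it has 3 bytes or hits end of stream
        int n = pushback.readNBytes(head, 0, head.length);
        boolean hasBom = n == 3
          && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so they are decoded as content
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    // Hypothetical helper for trying the idea out on in-memory bytes
    static String decode(byte[] bytes) {
        try (Reader reader = open(new ByteArrayInputStream(bytes))) {
            StringBuilder content = new StringBuilder();
            char[] buffer = new char[1024];
            int numRead;
            while ((numRead = reader.read(buffer)) != -1) {
                content.append(buffer, 0, numRead);
            }
            return content.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        byte[] withoutBom = {'o', 'k'};
        System.out.println(decode(withBom).equals("ok"));    // true
        System.out.println(decode(withoutBom).equals("ok")); // true
    }
}
```

Because the stream is only ever opened once, closing the returned Reader is enough to release it.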
4. Using Apache Commons IO
An alternative to the manual detection and removal of BOMs is provided by Apache Commons IO, a library offering various utilities for common I/O operations. Among these utilities is the BOMInputStream class, which simplifies handling BOMs by automatically detecting and removing them from an input stream.
Here’s how we implement this approach:
@Test
public void givenFileWithBOM_whenUsingApacheCommonsIO_thenRemoveBOM() throws IOException {
    try (BOMInputStream bomInputStream = new BOMInputStream(new FileInputStream(filePath));
         Reader reader = new InputStreamReader(bomInputStream, StandardCharsets.UTF_8)) {
        assertTrue(bomInputStream.hasBOM());
        assertEquals(expectedContent, readFully(reader));
    }
}
In this test case, we wrap the FileInputStream with a BOMInputStream, which by default detects a UTF-8 BOM and excludes it from the bytes it returns. The hasBOM() method reports whether a BOM was found, and we verify this with assertTrue().
We then create a Reader using the BOMInputStream and assert the content using the readFully() method to ensure that the content matches the expected content without being affected by the BOM.
5. Using NIO (New I/O)
Java’s NIO (New I/O) package provides efficient file-handling capabilities, including support for reading file contents into memory buffers. Leveraging NIO, we can detect and remove BOMs from a file using ByteBuffer and Files classes.
Here’s how we can implement a test case using NIO for BOM handling:
@Test
public void givenFileWithBOM_whenUsingNIO_thenRemoveBOM() throws IOException, URISyntaxException {
    byte[] fileBytes = Files.readAllBytes(Paths.get(filePath));
    ByteBuffer buffer = ByteBuffer.wrap(fileBytes);
    if (buffer.remaining() >= 3) {
        // Relative get() calls advance the buffer's position past the first three bytes
        byte b0 = buffer.get();
        byte b1 = buffer.get();
        byte b2 = buffer.get();
        if ((b0 & 0xFF) == 0xEF && (b1 & 0xFF) == 0xBB && (b2 & 0xFF) == 0xBF) {
            // BOM found: the position is already past it, so decoding starts at the content
            assertEquals(expectedContent, StandardCharsets.UTF_8.decode(buffer).toString());
        } else {
            // No BOM: rewind so the three bytes we consumed are decoded as content
            buffer.position(0);
            assertEquals(expectedContent, StandardCharsets.UTF_8.decode(buffer).toString());
        }
    } else {
        // Fewer than three bytes: the file can't contain a UTF-8 BOM
        assertEquals(expectedContent, StandardCharsets.UTF_8.decode(buffer).toString());
    }
}
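In this test case, we read the file’s contents into a byte array using the readAllBytes() method and wrap it in a ByteBuffer. We then check for a BOM by reading the buffer’s first three bytes. If they form a UTF-8 BOM, the buffer’s position is already past it, so decoding starts at the content; otherwise, we reset the position to 0 so that no bytes are lost. Since this approach loads the whole file into memory, it suits small files best.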
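The same check can be written more compactly by computing an offset and wrapping only the part of the array that follows the BOM. This is a sketch rather than the article’s test code: the NioBomStripper class and its decodeWithoutBom() method are hypothetical names, and the input is an in-memory array standing in for the result of Files.readAllBytes():

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class NioBomStripper {
    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if present
    static String decodeWithoutBom(byte[] bytes) {
        boolean hasBom = bytes.length >= 3
          && (bytes[0] & 0xFF) == 0xEF && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF;
        int offset = hasBom ? 3 : 0;
        // wrap(array, offset, length) sets the buffer's position to offset,
        // so decoding starts right after the BOM
        ByteBuffer content = ByteBuffer.wrap(bytes, offset, bytes.length - offset);
        return StandardCharsets.UTF_8.decode(content).toString();
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        byte[] withoutBom = {'o', 'k'};
        System.out.println(decodeWithoutBom(withBom).equals("ok"));    // true
        System.out.println(decodeWithoutBom(withoutBom).equals("ok")); // true
    }
}
```

With a real file, we’d pass in the array returned by Files.readAllBytes(Paths.get(filePath)).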
6. Conclusion
In this tutorial, we looked at three ways to detect and remove a UTF-8 BOM when reading a file in Java: checking the first bytes manually with InputStream and Reader, using the BOMInputStream class from Apache Commons IO, and inspecting a ByteBuffer with NIO.
As always, the complete code samples for this article can be found over on GitHub.