1. Overview
In this tutorial, we’ll discuss the basics of character encoding and how we handle it in Java.
2. Importance of Character Encoding
We often have to deal with texts belonging to multiple languages with diverse writing scripts like Latin or Arabic. Every character in every language needs to somehow be mapped to a set of ones and zeros. Really, it’s a wonder that computers can process all of our languages correctly.
To do this properly, we need to think about character encoding. Not doing so can often lead to data loss and even security vulnerabilities.
To understand this better, let’s define a method to decode a text in Java:
String decodeText(String input, String encoding) throws IOException {
    return
      new BufferedReader(
        new InputStreamReader(
          new ByteArrayInputStream(input.getBytes()),
          Charset.forName(encoding)))
        .readLine();
}
Note that the input text we feed here uses the default platform encoding.
If we run this method with input as “The façade pattern is a software design pattern.” and encoding as “US-ASCII”, it’ll output:
The fa��ade pattern is a software design pattern.
Well, not exactly what we expected.
What could have gone wrong? We’ll try to understand and correct this in the rest of this tutorial.
3. Fundamentals
Before digging deeper, though, let’s quickly review three terms: encoding, charsets, and code point.
3.1. Encoding
Computers can only understand binary representations like 1 and 0. Processing anything else requires some kind of mapping from the real-world text to its binary representation. This mapping is what we know as character encoding or simply just as encoding.
For example, the first letter in our message, “T”, in US-ASCII encodes to “01010100”.
3.2. Charsets
The mapping of characters to their binary representations can vary greatly in terms of the characters they include. The number of characters included in a mapping can vary from only a few to all the characters in practical use. The set of characters that are included in a mapping definition is formally called a charset.
For example, ASCII has a charset of 128 characters.
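To make the size of a charset concrete, here is a minimal sketch that counts how many characters of the Basic Multilingual Plane a charset can encode. The class name CharsetSizeSketch and the helper countEncodableBmpChars are our own illustrative names, not part of the Java API; the sketch relies only on CharsetEncoder.canEncode:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CharsetSizeSketch {

    // Counts how many chars in U+0000..U+FFFF the given charset is able to encode.
    static int countEncodableBmpChars(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        int count = 0;
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (encoder.canEncode((char) c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // US-ASCII covers exactly the 128 characters U+0000..U+007F
        System.out.println(countEncodableBmpChars(StandardCharsets.US_ASCII)); // 128

        // U+00E7 (c with cedilla) is outside that set
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode('\u00E7')); // false
    }
}
```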
3.3. Code Point
A code point is an abstraction that separates a character from its actual encoding. A code point is an integer reference to a particular character.
We can represent the integer itself in plain decimal or in alternate bases like hexadecimal or octal. We use alternate bases to make large numbers easier to refer to.
For example, the first letter in our message, T, in Unicode has a code point “U+0054” (or 84 in decimal).
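As a quick illustration, the following sketch reads a code point from a String and prints it in the U+ notation. The class CodePointSketch and the helper toUnicodeNotation are hypothetical names chosen for this example:

```java
final class CodePointSketch {

    // Formats the first code point of the text in the conventional U+XXXX notation.
    static String toUnicodeNotation(String text) {
        int codePoint = text.codePointAt(0);
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println("T".codePointAt(0));          // 84
        System.out.println(toUnicodeNotation("T"));      // U+0054
        // "\u8A9E" is a CJK character, written as an escape to keep the source ASCII-only
        System.out.println(toUnicodeNotation("\u8A9E")); // U+8A9E
    }
}
```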
4. Understanding Encoding Schemes
A character encoding can take various forms depending upon the number of characters it encodes.
The number of characters encoded has a direct relationship to the length of each representation which typically is measured as the number of bytes. Having more characters to encode essentially means needing lengthier binary representations.
Let’s go through some of the popular encoding schemes in practice today.
4.1. Single-Byte Encoding
One of the earliest encoding schemes, called ASCII (American Standard Code for Information Interchange), uses a single-byte encoding scheme. This essentially means that each character in ASCII is represented with a seven-bit binary number. This still leaves one bit free in every byte!
ASCII’s 128-character set covers the English alphabet in lower and upper cases, digits, and some special and control characters.
Let’s define a simple method in Java to display the binary representation for a character under a particular encoding scheme:
String convertToBinary(String input, String encoding) {
    // getBytes returns exactly the encoded bytes, with no spare buffer capacity
    byte[] encodedInput = input.getBytes(Charset.forName(encoding));
    return IntStream.range(0, encodedInput.length)
      // mask to 0..255 so that negative byte values are not sign-extended
      .map(i -> encodedInput[i] & 0xFF)
      .mapToObj(Integer::toBinaryString)
      // left-pad every byte to eight binary digits
      .map(e -> String.format("%1$" + Byte.SIZE + "s", e).replace(" ", "0"))
      .collect(Collectors.joining(" "));
}
Now, character ‘T’ has a code point of 84 in US-ASCII (ASCII is referred to as US-ASCII in Java).
And if we use our utility method, we can see its binary representation:
assertEquals(convertToBinary("T", "US-ASCII"), "01010100");
This, as we expected, is the seven-bit binary representation for the character ‘T’, padded with a leading zero to fill the byte.
The original ASCII left the most significant bit of every byte unused. At the same time, ASCII had left quite a lot of characters unrepresented, especially for non-English languages.
This led to an effort to utilize that unused bit and include an additional 128 characters.
There were several variations of the ASCII encoding scheme proposed and adopted over time. These loosely came to be referred to as “ASCII extensions”.
The ASCII extensions had different levels of success, but none of them was good enough for wider adoption, as many characters were still not represented.
One of the more popular ASCII extensions was ISO-8859-1, also referred to as “ISO Latin 1”.
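To see the eighth bit at work, here is a small sketch that encodes ‘ç’ (U+00E7) in ISO-8859-1 and in US-ASCII. The class Latin1Sketch and its helper latin1Byte are illustrative names of our own:

```java
import java.nio.charset.StandardCharsets;

final class Latin1Sketch {

    // Returns the unsigned value of the first byte of the text encoded in ISO-8859-1.
    static int latin1Byte(String singleChar) {
        byte[] bytes = singleChar.getBytes(StandardCharsets.ISO_8859_1);
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        String cCedilla = "\u00E7"; // c with cedilla, written as an escape

        // ISO-8859-1 uses the eighth bit: 0xE7 is 11100111 in binary
        System.out.println(Integer.toBinaryString(latin1Byte(cCedilla))); // 11100111

        // US-ASCII has no mapping, so getBytes substitutes the replacement byte '?'
        byte[] ascii = cCedilla.getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) ascii[0]); // ?
    }
}
```

Note that String.getBytes silently substitutes unmappable characters, so the loss goes unnoticed unless we check for it.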
4.2. Multi-Byte Encoding
As the need to accommodate more and more characters grew, single-byte encoding schemes like ASCII were not sustainable.
This gave rise to multi-byte encoding schemes which have a much better capacity albeit at the cost of increased space requirements.
BIG5 and SHIFT-JIS are examples of multi-byte character encoding schemes which started to use one as well as two bytes to represent wider charsets. Most of these were created for the need to represent Chinese and similar scripts that have a significantly higher number of characters.
Let’s now call the method convertToBinary with input as ‘語’, a Chinese character, and encoding as “Big5”:
assertEquals(convertToBinary("語", "Big5"), "10111011 01111001");
The output above shows that Big5 encoding uses two bytes to represent the character ‘語’.
A comprehensive list of character encodings, along with their aliases, is maintained by the Internet Assigned Numbers Authority (IANA).
5. Unicode
It is not difficult to understand that while encoding is important, decoding is equally vital to make sense of the representations. This is only possible in practice if a consistent or compatible encoding scheme is used widely.
Different encoding schemes developed in isolation and practiced in local geographies started to become challenging.
This challenge gave rise to a singular encoding standard called Unicode which has the capacity for every possible character in the world. This includes the characters which are in use and even those which are defunct!
Well, that must require several bytes to store each character? Honestly yes, but Unicode has an ingenious solution.
Unicode as a standard defines code points for every possible character in the world. The code point for character ‘T’ in Unicode is 84 in decimal. We generally refer to this as “U+0054” in Unicode which is nothing but U+ followed by the hexadecimal number.
We use hexadecimal as the base for code points in Unicode as there are 1,114,112 points, which is a pretty large number to communicate conveniently in decimal!
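We can check this figure against the constants that Java exposes on the Character class. The class CodeSpaceSketch below is only an illustrative wrapper:

```java
final class CodeSpaceSketch {

    // Total number of Unicode code points: U+0000 through U+10FFFF inclusive.
    static int codeSpaceSize() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(codeSpaceSize());                                 // 1114112
        System.out.println(String.format("U+%X", Character.MAX_CODE_POINT)); // U+10FFFF
    }
}
```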
How these code points are encoded into bits is left to specific encoding schemes within Unicode. We will cover some of these encoding schemes in the sub-sections below.
5.1. UTF-32
UTF-32 is an encoding scheme for Unicode that employs four bytes to represent every code point defined by Unicode. Obviously, it is space inefficient to use four bytes for every character.
Let’s see how a simple character like ‘T’ is represented in UTF-32. We will use the method convertToBinary introduced earlier:
assertEquals(convertToBinary("T", "UTF-32"), "00000000 00000000 00000000 01010100");
The output above shows the usage of four bytes to represent the character ‘T’ where the first three bytes are just wasted space.
5.2. UTF-8
UTF-8 is another encoding scheme for Unicode which employs a variable number of bytes to encode each code point. It uses a single byte for the characters in the ASCII range and two to four bytes for everything else, thus saving space for text that is mostly ASCII.
Let’s again call the method convertToBinary with input as ‘T’ and encoding as “UTF-8”:
assertEquals(convertToBinary("T", "UTF-8"), "01010100");
The output is identical to that of ASCII, using just a single byte. In fact, UTF-8 is completely backward compatible with ASCII.
Let’s again call the method convertToBinary with input as ‘語’ and encoding as “UTF-8”:
assertEquals(convertToBinary("語", "UTF-8"), "11101000 10101010 10011110");
As we can see here UTF-8 uses three bytes to represent the character ‘語’. This is known as variable-width encoding.
UTF-8, due to its space efficiency, is the most common encoding used on the web.
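The variable width is easy to observe by measuring the encoded length of characters from different ranges. This is a minimal sketch; Utf8WidthSketch and utf8Length are names we made up for the example:

```java
import java.nio.charset.StandardCharsets;

final class Utf8WidthSketch {

    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("T"));            // 1 (U+0054, ASCII range)
        System.out.println(utf8Length("\u00E7"));       // 2 (U+00E7, c with cedilla)
        System.out.println(utf8Length("\u8A9E"));       // 3 (U+8A9E, CJK character)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, outside the BMP)
    }
}
```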
5.3. Difference Between UTF-8 and UTF-16
UTF-8 and UTF-16 are just two of the established standards for encoding. They differ only in the number of bytes they use to encode each character. As both are variable-width encodings, they can use up to four bytes to encode a character, but when it comes to the minimum, UTF-8 uses only one byte (8 bits) and UTF-16 uses two bytes (16 bits). This has a huge impact on the size of encoded files. Using only ASCII characters, a file encoded in UTF-16 would be about twice as large as the same file encoded in UTF-8.
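A short sketch can confirm the size difference. We use UTF_16BE here so that no byte order mark is added to the output; EncodedSizeSketch and encodedSize are illustrative names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedSizeSketch {

    // Number of bytes the text occupies in the given charset.
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String ascii = "The facade pattern"; // 18 ASCII characters
        System.out.println(encodedSize(ascii, StandardCharsets.UTF_8));    // 18
        System.out.println(encodedSize(ascii, StandardCharsets.UTF_16BE)); // 36

        // For a CJK character the relationship is reversed
        String cjk = "\u8A9E";
        System.out.println(encodedSize(cjk, StandardCharsets.UTF_8));      // 3
        System.out.println(encodedSize(cjk, StandardCharsets.UTF_16BE));   // 2
    }
}
```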
Moreover, many common characters have different lengths in UTF-8, which makes indexing by code point and counting code points slow, because the text has to be scanned from the start. On the other hand, UTF-16 is well suited to BMP (Basic Multilingual Plane) characters, each of which can be represented with 2 bytes. This speeds up indexing and counting code points as long as the text does not contain supplementary characters.
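Java strings are sequences of UTF-16 code units, so the distinction is visible directly on String. The sketch below, with the hypothetical helper countCodePoints, shows that length() and the number of code points diverge as soon as a supplementary character appears:

```java
final class CodePointCountSketch {

    // Counts Unicode code points rather than UTF-16 code units.
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // One BMP character (U+8A9E) followed by one supplementary character (U+1F600)
        String text = "\u8A9E\uD83D\uDE00";

        System.out.println(text.length());          // 3 UTF-16 code units
        System.out.println(countCodePoints(text));  // 2 code points
        System.out.println(Integer.toHexString(text.codePointAt(1))); // 1f600
    }
}
```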
As for the BOM (Byte Order Mark), it is neither required nor recommended with UTF-8 because it serves no purpose except to mark the start of a UTF-8 stream. Since UTF-8 works in single-byte units, the problem of endianness does not arise, unlike with UTF-16, where the BOM, in addition to potentially allowing encoding detection, is mainly used to indicate the byte order in which to read the file.
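We can make the byte order visible by printing the bytes that Java produces for the letter ‘T’. When encoding, Java’s UTF-16 charset writes a big-endian BOM, while the UTF-16BE and UTF-16LE variants write none. ByteOrderSketch and toHex are names invented for this sketch:

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderSketch {

    // Renders bytes as space-separated two-digit hexadecimal values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("T".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 54
        System.out.println(toHex("T".getBytes(StandardCharsets.UTF_16BE))); // 00 54
        System.out.println(toHex("T".getBytes(StandardCharsets.UTF_16LE))); // 54 00
        System.out.println(toHex("T".getBytes(StandardCharsets.UTF_8)));    // 54
    }
}
```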
Furthermore, UTF-8 ensures there are no NULL bytes in the data except when encoding the null character. This introduces a great deal of backward compatibility.
To summarize, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.
6. Encoding Support in Java
Java supports a wide array of encodings and their conversions to each other. The class Charset defines a set of standard encodings which every implementation of the Java platform is mandated to support.
This includes US-ASCII, ISO-8859-1, UTF-8, and UTF-16 to name a few. A particular implementation of Java may optionally support additional encodings.
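As a rough sketch, we can reach the mandated charsets through the constants in StandardCharsets and query everything else by name. Whether an optional charset such as Big5 is present depends on the Java implementation, so the example only prints that result. SupportedCharsetsSketch and canonicalName are illustrative names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SupportedCharsetsSketch {

    // Resolves a charset name or alias to its canonical name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // Mandated charsets are available as constants, with no lookup by name
        System.out.println(StandardCharsets.UTF_8.name());    // UTF-8
        System.out.println(StandardCharsets.US_ASCII.name()); // US-ASCII

        // Aliases resolve to the same charset
        System.out.println(canonicalName("latin1"));          // ISO-8859-1

        // Optional charsets have to be checked at runtime
        System.out.println(Charset.isSupported("Big5"));
        System.out.println(Charset.availableCharsets().size());
    }
}
```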
There are some subtleties in the way Java picks up a charset to work with. Let’s go through them in more detail.
6.1. Default Charset Before Java 18
The Java platform depends heavily on a property called the default charset. The Java Virtual Machine (JVM) determines the default charset during start-up.
This depends on the locale and the charset of the underlying operating system on which the JVM is running. For example, on macOS, the default charset is UTF-8.
Let’s see how we can determine the default charset:
Charset.defaultCharset().displayName();
If we run this code snippet on a Windows machine, the output we get is:
windows-1252
Now, “windows-1252” is the default charset of the Windows platform in English, which in this case has determined the default charset of the JVM running on Windows.
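To inspect what a particular JVM picked up, we can print the default charset together with the related system properties. This is only a sketch: the output differs from machine to machine, and the native.encoding property exists only on newer Java versions (it was added in Java 17), so it prints null elsewhere. DefaultCharsetSketch and describeDefaults are names of our own:

```java
import java.nio.charset.Charset;

final class DefaultCharsetSketch {

    // Summarizes the charset-related settings of the running JVM.
    static String describeDefaults() {
        return "defaultCharset=" + Charset.defaultCharset().name()
            + ", file.encoding=" + System.getProperty("file.encoding")
            + ", native.encoding=" + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        // The printed values depend on the operating system, locale, and Java version
        System.out.println(describeDefaults());
    }
}
```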
6.2. Who Uses the Default Charset?
Many of the Java APIs make use of the default charset as determined by the JVM. To name a few:
- InputStreamReader and FileReader
- OutputStreamWriter and FileWriter
- Formatter and Scanner
- URLEncoder and URLDecoder
So, this means that if we’d run our example without specifying the charset:
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();
then it would use the default charset to decode it.
Several APIs make this same choice by default.
The default charset hence assumes an importance that we cannot safely ignore.
6.3. Problems With the Default Charset Before Java 18
As we have seen, the default charset in Java is determined dynamically when the JVM starts. This makes the platform less reliable and more error-prone when used across different operating systems.
For example, if we run
new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();
on macOS, it will use UTF-8.
If we try the same snippet on Windows, it will use Windows-1252 to decode the same text.
Or, imagine writing a file on macOS and then reading that same file on Windows.
It’s not difficult to understand that because of different encoding schemes, this may lead to data loss or corruption.
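We can simulate this mismatch without two machines by encoding with one charset and decoding with another. In the sketch below, UTF-8 stands in for the macOS default and windows-1252 for the Windows default; writeThenRead is a hypothetical helper, and we assume the windows-1252 charset is available on the JVM:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CrossPlatformSketch {

    // Encodes with one charset and decodes with another, as two platforms might.
    static String writeThenRead(String text, Charset writer, Charset reader) {
        byte[] stored = text.getBytes(writer);
        return new String(stored, reader);
    }

    public static void main(String[] args) {
        String original = "fa\u00E7ade"; // the word with c-cedilla, written as an escape
        Charset windows1252 = Charset.forName("windows-1252");

        // The two UTF-8 bytes 0xC3 0xA7 are read as two separate characters
        String garbled = writeThenRead(original, StandardCharsets.UTF_8, windows1252);
        System.out.println(garbled.equals("fa\u00C3\u00A7ade")); // true

        // In the other direction the single byte 0xE7 is not valid UTF-8 on its own
        String lossy = writeThenRead(original, windows1252, StandardCharsets.UTF_8);
        System.out.println(lossy.equals(original)); // false
    }
}
```

The first case produces the familiar “faÃ§ade” corruption, while the second loses the character altogether.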
6.4. Can We Override the Default Charset?
Java exposes the outcome of this determination through two system properties:
- file.encoding: The value of this system property is the name of the default charset
- sun.jnu.encoding: The value of this system property is the name of the charset used when encoding/decoding file paths
Now, it’s intuitive to override these system properties through command line arguments:
-Dfile.encoding="UTF-8"
-Dsun.jnu.encoding="UTF-8"
However, it is important to note that these properties are read-only in Java. Their usage as above is not present in the documentation. Overriding these system properties may not have desired or predictable behavior.
Hence, we should avoid overriding the default charset in Java.
6.5. Solving This Problem in Our Programs
We should normally choose to specify a charset when dealing with text instead of relying on the default settings. We can explicitly declare the encoding we want to use in classes that deal with character-to-byte conversions.
Luckily, our example is already specifying the charset. We just need to select the right one and let Java do the rest.
We should realize by now that accented characters like ‘ç’ are not present in the ASCII encoding scheme, and hence we need an encoding that includes them. Perhaps UTF-8?
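Let’s try that. We’ll now run the method decodeText with the same input but with “UTF-8” as the encoding. Since the method encodes the input with the platform default, this assumes that the default is UTF-8 as well:
The façade pattern is a software design pattern.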
Bingo! We can see the output we were hoping to see now.
Here we have set the encoding we think best suits our need in the constructor of InputStreamReader. This is usually the safest method of dealing with characters and byte conversions in Java.
Similarly, OutputStreamWriter and many other APIs support setting an encoding scheme through their constructor.
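Putting both sides together, the following sketch writes text through an OutputStreamWriter and reads it back through an InputStreamReader, naming the charset explicitly in both constructors. It works on in-memory streams to stay self-contained; ExplicitCharsetSketch, write, and read are illustrative names rather than part of the original example code:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetSketch {

    // Encodes the text with the given charset instead of the platform default.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes the first line of the bytes with the given charset.
    static String read(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "fa\u00E7ade"; // six characters, one of them non-ASCII
        byte[] stored = write(original, StandardCharsets.UTF_8);
        System.out.println(stored.length); // 7, because c-cedilla takes two bytes
        System.out.println(read(stored, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

Because both ends name the same charset, the round trip gives the same result regardless of the platform default.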
6.6. Default Charset Since Java 18
JEP 400 solves issues related to the default charset in Java. Java 18 makes UTF-8 the default charset, bringing an end to most issues related to the default charset in versions before Java 18.
UTF-8 is widely used on the world wide web. Also, most Java programs use UTF-8 to process JSON and XML. Additionally, Java APIs like java.nio.file.Files use UTF-8 by default.
Furthermore, properties files have been loaded in UTF-8 by default since Java 9, as it’s convenient for representing non-Latin characters.
These reasons led to UTF-8 finally being set as the default charset starting in Java 18.
Notably, for earlier Java versions, it’s encouraged to check for charset issues by starting the program with this command-line option:
java -Dfile.encoding=UTF-8
The above command overrides the default charset in the old JDK and sets it to UTF-8 for the specified program.
6.7. MalformedInputException
When we decode a byte sequence, there are cases in which it’s not legal for the given Charset. Likewise, a character sequence that we encode may not be a legal sixteen-bit Unicode sequence. In other words, the given byte sequence has no mapping in the specified Charset.
There are three predefined strategies (or CodingErrorAction) when the input sequence has malformed input:
- IGNORE will ignore malformed characters and resume coding operation
- REPLACE will replace the malformed characters in the output buffer and resume the coding operation
- REPORT will throw a MalformedInputException
The default malformedInputAction for the CharsetDecoder is REPORT, and the default malformedInputAction of the default decoder in InputStreamReader is REPLACE.
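The difference between these defaults can be observed with just two bytes. In the sketch below, the UTF-8 bytes of ‘ç’ are decoded as US-ASCII, once with a fresh CharsetDecoder, which reports, and once with the String constructor, which replaces. DecoderDefaultsSketch and strictDecoderRejects are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class DecoderDefaultsSketch {

    // c with cedilla encoded as UTF-8: two bytes, both outside the ASCII range
    static final byte[] UTF8_C_CEDILLA = {(byte) 0xC3, (byte) 0xA7};

    // Returns true if a fresh US-ASCII decoder, left at its default action, rejects the bytes.
    static boolean strictDecoderRejects(byte[] bytes) {
        try {
            StandardCharsets.US_ASCII.newDecoder().decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true; // MalformedInputException is a CharacterCodingException
        }
    }

    public static void main(String[] args) {
        // A new CharsetDecoder reports malformed input
        System.out.println(strictDecoderRejects(UTF8_C_CEDILLA)); // true

        // The String constructor replaces each malformed byte with U+FFFD instead
        String replaced = new String(UTF8_C_CEDILLA, StandardCharsets.US_ASCII);
        System.out.println(replaced.equals("\uFFFD\uFFFD")); // true
    }
}
```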
Let’s define a decoding function that receives a specified Charset, a CodingErrorAction type, and a string to be decoded:
String decodeText(String input, Charset charset,
  CodingErrorAction codingErrorAction) throws IOException {
    CharsetDecoder charsetDecoder = charset.newDecoder();
    charsetDecoder.onMalformedInput(codingErrorAction);
    // Always encode the input as UTF-8; the charset parameter only controls decoding
    return new BufferedReader(
      new InputStreamReader(
        new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), charsetDecoder)).readLine();
}
It’s worth noting that we pass an explicit charset to the getBytes() method instead of relying on the platform default encoding, so that the example behaves the same on every platform. The input is always encoded as UTF-8, while the charset parameter determines how the resulting bytes are decoded.
So, if we decode “The façade pattern is a software design pattern.” with US_ASCII, the output for each strategy would be different. In UTF-8, ‘ç’ takes two bytes, and neither of them is legal in US-ASCII. First, we use CodingErrorAction.IGNORE, which skips the illegal bytes:
Assertions.assertEquals(
  "The faade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.IGNORE));
For the second test, we use CodingErrorAction.REPLACE that puts � instead of the illegal characters:
Assertions.assertEquals(
  "The fa��ade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.REPLACE));
For the third test, we use CodingErrorAction.REPORT which leads to throwing MalformedInputException:
Assertions.assertThrows(
  MalformedInputException.class,
  () -> CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.REPORT));
7. Other Places Where Encoding Is Important
We don’t just need to consider character encoding while programming. Text can be irreversibly corrupted in many other places.
The most common cause of problems in these cases is the conversion of text from one encoding scheme to another, thereby possibly introducing data loss.
Let’s quickly go through a few places where we may encounter issues when encoding or decoding text.
7.1. Text Editors
In most cases, a text editor is where texts originate. There are numerous text editors in popular use, including vi, Notepad, and MS Word. Most of these text editors allow us to select the encoding scheme. Hence, we should always make sure that the selected encoding is appropriate for the text we are handling.
7.2. File System
After we create texts in an editor, we need to store them in some file system. The file system depends on the operating system on which it is running. Most operating systems have inherent support for multiple encoding schemes. However, there may still be cases where an encoding conversion leads to data loss.
7.3. Network
Texts when transferred over a network using a protocol like File Transfer Protocol (FTP) also involve conversion between character encodings. For anything encoded in Unicode, it’s safest to transfer over as binary to minimize the risk of loss in conversion. However, transferring text over a network is one of the less frequent causes of data corruption.
7.4. Databases
Most of the popular databases like Oracle and MySQL support the choice of the character encoding scheme at the installation or creation of databases. We must choose this in accordance with the texts we expect to store in the database. This is one of the more frequent places where the corruption of text data happens due to encoding conversions.
7.5. Browsers
Finally, in most web applications, we create texts and pass them through different layers with the intention to view them in a user interface, like a browser. Here as well, it is imperative for us to choose the right character encoding so that the characters are displayed properly. Browsers normally rely on the encoding declared by the page, and some also allow overriding it through their settings.
8. Conclusion
In this article, we discussed how encoding can be an issue while programming.
We further discussed the fundamentals including encoding and charsets. Moreover, we went through different encoding schemes and their uses.
We also picked up an example of incorrect character encoding usage in Java and saw how to get that right. Finally, we discussed some other common error scenarios related to character encoding.
As always, the code for the examples is available over on GitHub.