1. Overview

When dealing with Strings in Java, we sometimes need to encode them into a specific charset.

This tutorial is a practical guide showing different ways to encode a String to the UTF-8 charset.

For a more technical deep-dive, see our Guide to Character Encoding.

2. Defining the Problem

To showcase the Java encoding, we’ll work with the German String “Entwickeln Sie mit Vergnügen”:

String germanString = "Entwickeln Sie mit Vergnügen";
byte[] germanBytes = germanString.getBytes(StandardCharsets.UTF_8);

String asciiEncodedString = new String(germanBytes, StandardCharsets.US_ASCII);

assertNotEquals(asciiEncodedString, germanString);

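Decoding these bytes as US_ASCII doesn’t give us back the original String, because US_ASCII can’t represent the non-ASCII ü character. Every byte outside the ASCII range is replaced with the Unicode replacement character (U+FFFD), which many consoles print as “?”, so the output resembles “Entwickeln Sie mit Vergn?gen”.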
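
As a minimal sketch of this behavior, the example below encodes the text as UTF-8 explicitly and then decodes the bytes as US_ASCII. The class and method names are made up for illustration, and the ü is written as the escape \u00FC so that the result doesn't depend on the source file encoding:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only for this illustration.
class AsciiDecodingDemo {

    // Encodes the text as UTF-8, then decodes the resulting bytes as US-ASCII.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String german = "Entwickeln Sie mit Vergn\u00FCgen";
        String decoded = decodeUtf8BytesAsAscii(german);

        // The two UTF-8 bytes of the u-umlaut each become U+FFFD,
        // so the decoded String is one character longer than the original.
        System.out.println(german.length());
        System.out.println(decoded.length());
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
    }
}
```

The decoded String contains two replacement characters here because ü takes two bytes in UTF-8 and neither byte is valid ASCII.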

But when we do the same with a String that contains only ASCII characters, such as plain English text, we get the same String back:

String englishString = "Develop with pleasure";
byte[] englishBytes = englishString.getBytes();

String asciiEncodedEnglishString = new String(englishBytes, StandardCharsets.US_ASCII);

assertEquals(asciiEncodedEnglishString, englishString);

Let’s see what happens when we use the UTF-8 encoding.

3. Encoding With Core Java

Let’s start with the core library.

Strings are immutable in Java, and a String itself has no charset we could change: an encoding only comes into play when we convert the String to or from bytes. To achieve what we want, we encode the String into bytes using the desired charset and then create a new String from those bytes.

First, we get the String bytes in UTF-8, and then we create a new String from the retrieved bytes using the same charset:

String rawString = "Entwickeln Sie mit Vergnügen";
byte[] bytes = rawString.getBytes(StandardCharsets.UTF_8);

String utf8EncodedString = new String(bytes, StandardCharsets.UTF_8);

assertEquals(rawString, utf8EncodedString);
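
To make the effect of the charset visible, the following sketch prints the bytes that different charsets produce for the single character ü. The names Utf8BytesDemo, toHex, and hexOf are hypothetical and only serve this illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used only for this illustration.
class Utf8BytesDemo {

    // Formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Mask with 0xFF so negative bytes are shown as unsigned values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the bytes as hex.
    static String hexOf(String text, Charset charset) {
        return toHex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String umlaut = "\u00FC"; // the character u-umlaut

        System.out.println(hexOf(umlaut, StandardCharsets.UTF_8));      // C3BC
        System.out.println(hexOf(umlaut, StandardCharsets.ISO_8859_1)); // FC
        System.out.println(hexOf(umlaut, StandardCharsets.UTF_16BE));   // 00FC
    }
}
```

The same character yields different byte sequences depending on the charset, which is why the charset used for decoding has to match the one used for encoding.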

4. Encoding With Java 7 StandardCharsets

Alternatively, we can call the encode() and decode() methods directly on the Charset constants provided by the StandardCharsets class, which was introduced in Java 7.

First, we’ll encode the String into a ByteBuffer, and second, we’ll decode that buffer back into a String:

String rawString = "Entwickeln Sie mit Vergnügen";
ByteBuffer buffer = StandardCharsets.UTF_8.encode(rawString); 

String utf8EncodedString = StandardCharsets.UTF_8.decode(buffer).toString();

assertEquals(rawString, utf8EncodedString);
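
If we need a plain byte array rather than a ByteBuffer, one possible approach is to copy the buffer's remaining bytes, as sketched below. Calling array() on the buffer is best avoided here, since the backing array can be larger than the encoded content. The class and method names are assumptions made for this example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, used only for this illustration.
class ByteBufferDemo {

    // Encodes the text as UTF-8 and copies exactly the encoded bytes out of the buffer.
    static byte[] toUtf8Bytes(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String rawString = "Entwickeln Sie mit Vergn\u00FCgen";
        byte[] fromBuffer = toUtf8Bytes(rawString);
        byte[] fromGetBytes = rawString.getBytes(StandardCharsets.UTF_8);

        // Both approaches should produce the same byte sequence.
        System.out.println(Arrays.equals(fromBuffer, fromGetBytes));
    }
}
```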

5. Encoding With Commons-Codec

Besides core Java, we can use Apache Commons Codec to achieve the same result.

Apache Commons Codec is a handy package containing simple encoders and decoders for various formats.

First, let’s start with the project configuration.

When using Maven, we have to add the commons-codec dependency to our pom.xml:

<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.14</version>
</dependency>

In our case, the most interesting class is StringUtils from the org.apache.commons.codec.binary package, which provides methods to encode Strings into bytes and decode them again.

Using this class, getting a UTF-8 encoded String is pretty straightforward:

String rawString = "Entwickeln Sie mit Vergnügen"; 
byte[] bytes = StringUtils.getBytesUtf8(rawString);
 
String utf8EncodedString = StringUtils.newStringUtf8(bytes);

assertEquals(rawString, utf8EncodedString);

6. Conclusion

Encoding a String into UTF-8 isn’t difficult, but it’s not that intuitive. This article presents three ways of doing it, using either core Java or Apache Commons Codec.

As always, the code samples can be found over on GitHub.

