1. Overview
In this tutorial, we’ll look at what character encoding is and how it’s used. Then, we’ll discuss how locale determines the encoding used in Linux. Lastly, we’ll look at the various character encodings in Linux.
2. Understanding Character Encoding
For computers to store text, numbers, and symbols that humans can understand, there must be an agreed standard that maps every character, from any language, to a numeric form that keeps its meaning when it's stored or transmitted. Character encoding is the process of converting characters into byte sequences according to a specific character set.
Essentially, the written characters of human language are changed into a format that digital computers can store, transmit, and transform. As long as the writer and the reader use the same encoding, text survives the trip between systems unchanged. Applications store the encoded characters as sequences of bytes.
When we discuss character encoding, we'll often encounter these terms:
- Characters are letters, numbers, punctuation marks, and mathematical or monetary symbols in a language.
- A code point is an integer value that maps to a specific character. Code points usually represent a single letter, digit, punctuation mark, or whitespace character, but they can also represent symbols and control or formatting characters.
- Character sets (charmaps) are defined mappings between characters and byte sequences. Different character sets can assign different byte values to the same character.
Using the same encoding on both ends makes internationalization easier and keeps communication between different platforms effective and reliable.
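To make these terms concrete, here's a minimal Python sketch. Python, the sample character é, and the two character sets are our own choices for illustration. It shows one character, its code point, and the different bytes that two character sets assign to it:

```python
# Illustrative only: the sample character and both charsets are arbitrary choices.
char = "é"

code_point = ord(char)                      # integer code point of the character
utf8_bytes = char.encode("utf-8")           # byte sequence under UTF-8
latin1_bytes = char.encode("iso-8859-1")    # byte sequence under ISO-8859-1

print(f"U+{code_point:04X}")   # U+00E9
print(utf8_bytes.hex())        # c3a9
print(latin1_bytes.hex())      # e9
```

The same character maps to two bytes in UTF-8 and to one byte in ISO-8859-1, which is why both sides must agree on the character set.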
3. What Determines Character Encoding in Linux?
All characters are encoded before being displayed, written to a disk, printed on paper, or transmitted through a digital medium. Computers and most applications use the character set defined by the system's environment variables to set the default encoding. Often, the default encoding is set during installation.
On Linux systems, the locale specifies language- and country-specific conventions for application behavior, such as the character encoding, measurement units, and date and time formats. Locale names usually combine a two-letter ISO 639-1 language code and a two-letter ISO 3166-1 country code in the format ‘ll_CC’, e.g., en_US and de_DE (English in the US and German in Germany). An optional codeset suffix, as in en_US.UTF-8, names the character encoding.
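As a rough illustration of this naming scheme, the following Python sketch splits a locale name into its parts. The helper parse_locale_name and its regular expression are hypothetical and written for this tutorial. They assume the general form ll_CC[.codeset][@modifier] and don't try to cover every name a system may accept:

```python
import re

# Hypothetical helper: handles names shaped like ll_CC[.codeset][@modifier].
LOCALE_NAME = re.compile(
    r"^(?P<language>[a-z]{2,3})"       # ISO 639 language code
    r"(?:_(?P<territory>[A-Z]{2}))?"   # ISO 3166-1 country code
    r"(?:\.(?P<codeset>[^@]+))?"       # character encoding, e.g. UTF-8
    r"(?:@(?P<modifier>.+))?$"         # optional modifier
)

def parse_locale_name(name):
    """Return the parts of a locale name as a dict, or None if it doesn't match."""
    match = LOCALE_NAME.match(name)
    return match.groupdict() if match else None

print(parse_locale_name("en_US.UTF-8"))
# {'language': 'en', 'territory': 'US', 'codeset': 'UTF-8', 'modifier': None}
```

Special names such as C and POSIX don't follow the ll_CC pattern, so this sketch returns None for them.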
The locale environment variables dictate the character encoding used on a Linux system. Let’s run the locale command:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The LANGUAGE variable is empty here. When set, it contains a colon-separated priority list of languages, e.g. “en:ru:de” (often set by GUI applications), used for selecting message translations. The LC_ALL variable is also empty: when set, it overrides LANG and every other LC_* variable, so it’s normally reserved for testing or troubleshooting.
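The lookup order behind this output can be sketched in a few lines of Python. The function name effective_ctype_locale is hypothetical, and the fallback to "C" when nothing is set is an assumption about typical behavior rather than a guarantee:

```python
import os

def effective_ctype_locale(env=os.environ):
    """Return the locale name that governs character encoding (LC_CTYPE)."""
    # Precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumption: fall back to the C/POSIX locale

# An empty LC_ALL, as in the output above, doesn't override anything.
print(effective_ctype_locale({"LANG": "en_US.UTF-8", "LC_ALL": ""}))  # en_US.UTF-8
```

Passing a plain dictionary instead of the real environment makes it easy to try different combinations without changing any shell settings.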
Let’s display a list of all available character sets:
$ locale -m
ANSI_X3.110-1983
ANSI_X3.4-1968
IBM1026
ISO-8859-1
ISO-8859-10
UTF-8
VIDEOTEX-SUPPL
Next, we can run the locale charmap command to see which character set is in use:
$ locale charmap
UTF-8
If we want to list the locales available on our system, we run the locale -a command:
$ locale -a
C
C.utf8
en_US.utf8
POSIX
We should remember that not every character set listed by locale -m has a corresponding locale definition.
4. Unicode Character Encoding
Unicode is a standard character set that indexes and defines characters and symbols from many languages by assigning each one a unique code point. Unicode itself isn't a byte format. Instead, its encoding forms define how code points are turned into bytes.
There are three Unicode encoding forms: UTF-8, UTF-16, and UTF-32. They use 8-bit, 16-bit, and 32-bit code units, respectively. UTF-8 and UTF-16 are variable-length encodings, while UTF-32 is fixed-length.
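A short Python sketch can show how the three encodings differ in size. The sample characters are arbitrary, and we use the little-endian codec variants so that no byte order mark is added to the counts:

```python
# Arbitrary sample characters: A, é, € and an emoji, written as escapes.
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001f600"]

def encoded_lengths(char):
    """Return the number of bytes each Unicode encoding needs for one character."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # -le variants add no BOM
    return {name: len(char.encode(name)) for name in encodings}

for char in samples:
    print(f"U+{ord(char):04X}", encoded_lengths(char))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only UTF-32 uses the same number of bytes for every character.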
Let’s look at UTF-8. UTF-8 transforms Unicode code points into variable-length, ASCII-safe byte strings. ASCII characters, including the control characters, are represented by a single byte, while all other characters are represented by two to four bytes.
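To see how the variable-length scheme works, here's a simplified Python sketch of a UTF-8 encoder for a single code point. The function utf8_encode is our own illustration, not a library API. It skips some checks a real encoder performs (for example, it doesn't reject surrogate code points), and in practice we'd simply call str.encode("utf-8"):

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (simplified sketch)."""
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x20AC).hex())  # e282ac
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, and every continuation byte starts with the bits 10.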
UTF-8 is backward compatible with ASCII: the first 128 Unicode code points correspond one-to-one to the ASCII characters. They're encoded using a single byte with the same binary value as in ASCII, so valid ASCII text is also a valid UTF-8-encoded string.
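We can check this compatibility with a small Python sketch that decodes every 7-bit value with both codecs. The variable names are our own:

```python
ascii_bytes = bytes(range(128))          # every ASCII value, 0 through 127

as_ascii = ascii_bytes.decode("ascii")   # decode as ASCII
as_utf8 = ascii_bytes.decode("utf-8")    # decode the same bytes as UTF-8

print(as_ascii == as_utf8)                     # True
print(as_utf8.encode("utf-8") == ascii_bytes)  # True
```

Both decoders produce the same text, and encoding that text as UTF-8 gives back the original bytes.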
5. ASCII Character Encoding
ASCII is a character encoding for electronic communication. It has only 128 code points, of which 95 are printable characters and 33 are control characters. For a long time, it was the most common character encoding for text data on computers, and most modern encodings, including UTF-8, build on it. ASCII encodes uppercase and lowercase English letters, numerals, and punctuation symbols.
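As a quick check of these numbers, the following Python sketch counts the printable and control characters among the 128 ASCII code points. It relies on Python's str.isprintable(), which treats the space character as printable:

```python
# Split the 128 ASCII code points into printable and control characters.
printable = [chr(i) for i in range(128) if chr(i).isprintable()]
control = [i for i in range(128) if not chr(i).isprintable()]

print(len(printable), len(control))       # 95 33
print(repr(printable[0]), printable[-1])  # ' ' ~
```

The printable range runs from the space character (32) to the tilde (126). The control characters are the values 0 to 31 plus 127.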
Each ASCII character can be written as a decimal number between 0 and 127, a two-digit hexadecimal number (base 16), a three-digit octal number (base 8), or a 7-bit binary number (often stored in an 8-bit byte).
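For example, this Python sketch prints the four notations for one character. The helper ascii_forms is hypothetical and exists only for this illustration:

```python
def ascii_forms(char):
    """Return the decimal, hex, octal, and binary notations of an ASCII character."""
    value = ord(char)
    if value > 127:
        raise ValueError("not an ASCII character")
    return {
        "decimal": str(value),
        "hex": format(value, "02x"),     # two hexadecimal digits
        "octal": format(value, "03o"),   # three octal digits
        "binary": format(value, "07b"),  # seven binary digits
    }

print(ascii_forms("A"))
# {'decimal': '65', 'hex': '41', 'octal': '101', 'binary': '1000001'}
```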
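Extended ASCII character sets use all 8 bits of a byte and add 128 characters to the standard ASCII character set. The added characters include special symbols, letters from other languages, and box-drawing characters. There's no single extended ASCII: encodings such as ISO-8859-1 each assign their own characters to the upper 128 values.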
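The Python sketch below shows why the choice of 8-bit character set matters. It decodes the same byte with ISO-8859-1 and with cp437, the character set of the original IBM PC. The byte value 0xE9 and cp437 are our own picks for the example and aren't tied to the system configuration shown earlier:

```python
import unicodedata

raw = b"\xe9"  # one byte above the 7-bit ASCII range

# The same byte maps to a different character in each 8-bit character set.
for charset in ("iso-8859-1", "cp437"):
    char = raw.decode(charset)
    print(charset, f"U+{ord(char):04X}", unicodedata.name(char))

# Plain ASCII has no character for values above 127.
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("0xE9 is not valid 7-bit ASCII")
```

In ISO-8859-1, the byte decodes to U+00E9 (LATIN SMALL LETTER E WITH ACUTE), while cp437 yields a different character for the same value.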
6. Conclusion
In this article, we looked at character encoding and why both sides need to use the same encoding scheme when exchanging text.
We further discussed how Linux uses the locale to determine which character encoding to use in different regions around the world. Also, we looked at Unicode and discussed UTF-8. Lastly, we discussed ASCII and its extended variants.