1. Introduction

Terminals are usually the preferred way to use Linux. Despite their comparatively basic text-based command-line interface (CLI) or terminal user interface (TUI), as opposed to a graphical user interface (GUI), we might still encounter problems when the encoding is set incorrectly.

In this tutorial, we look at the locale and ways to see the encoding set for the current terminal. First, we go over the basic idea of a locale. Next, we understand how Linux configures and uses it. After that, we explore the main command to check locale settings. Finally, we show how to check the encoding in different contexts.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. Locale

When it comes to software, a locale is a group of settings usually related to regional specifics:

  • number formats like 1666000 versus 1,666,000
  • character formats like ъ versus X
  • date formats like 20.10.2010 versus 2010/10/20
  • time formats like 16:56 versus 04:56PM
  • currency formats like лв versus $
  • paper sizes like A4 versus letter
  • other settings

In most instances, we can just use a country and language code to define a set of the above characteristics. For example, bg_BG might define the first from each set of examples in the items above, while en_US might define the second.

2.1. POSIX Format and Standardization

While the BCP 47 and ISO/IEC 15897 standards are similar, POSIX uses the latter:

[language[_territory][.codeset][@modifier]]

Here, we see the abovementioned country code as the territory and the language code as the language. However, there are two other optional parameters:

  • codeset specifies the encoding
  • modifier is a name for even more specific or custom variants of a locale

In detail, code sets contain the encoding values for a character set.
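
To make the format more tangible, here's a minimal sketch that decomposes a sample locale name with Bash parameter expansion and sed (de_DE.UTF-8@euro is just an illustrative value, not necessarily installed on the system):

$ loc=de_DE.UTF-8@euro
$ echo "language: ${loc%%[_.@]*}"
language: de
$ echo "territory: $(echo "$loc" | sed 's/^[^_]*_//; s/[.@].*//')"
territory: DE
$ echo "codeset: $(echo "$loc" | sed 's/^[^.]*\.//; s/@.*//')"
codeset: UTF-8
$ echo "modifier: ${loc##*@}"
modifier: euro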

2.2. Encoding

Basically, to encode a character means to assign it a numerical value, also called a code point. Multiple code points can get grouped into code pages, otherwise known as a character map.

Simply assigning 1 to a, 2 to b, and continuing from there is a possible encoding. However, its overly simplistic nature would make it inefficient to use in computers.
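
To see an actual code point, we can ask the shell for the numeric value of a character, for example a under ASCII/UTF-8 (a quick sketch using printf and od):

$ printf '%d\n' "'a"
97
$ printf 'a' | od -An -tu1
  97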

When it comes to software character encodings, there are many:

  • basic ASCII encoding
  • Unicode encodings like UTF-8, UTF-16, UTF-32
  • ISO encodings like ISO 8859-5
  • extended Cyrillic KOI-8 encodings
  • Windows encodings like Windows-1251

For example, a character in Unicode might not exist or be completely different in another encoding. Hence, knowing the context of a given numeric value can change its character translation and visual appearance.
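
As a concrete illustration, the very same byte value maps to different characters under different encodings. Here's a minimal sketch with iconv, assuming the ISO-8859-5 and WINDOWS-1251 conversions are available (they usually are with glibc):

$ printf '\xd0' | iconv -f ISO-8859-5 -t UTF-8
а
$ printf '\xd0' | iconv -f WINDOWS-1251 -t UTF-8
Р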

3. Linux Locale

Like any other operating system (OS), Linux offers options to change its locale via a set of environment variables:

  • $LANG – general language specification
  • $LC_CTYPE – character map, lowercase, uppercase, and alphanumeric detection
  • $LC_NUMERIC – number formats
  • $LC_TIME – date and time formats
  • $LC_COLLATE – collation control and string comparison
  • $LC_MONETARY – currency formatting
  • $LC_MESSAGES – message control
  • $LC_PAPER – paper sizes and formats
  • $LC_NAME – person naming convention
  • $LC_ADDRESS – format of addresses
  • $LC_TELEPHONE – format of telephone numbers
  • $LC_MEASUREMENT – measurement units and formats
  • $LC_IDENTIFICATION – further customization

Further, the values of the $LC_ALL and $LANG variables are used, in that order, when other $LC_* values are missing. Also, the $LANGUAGE variable, which has a similar function, is independent and can even override $LC_ALL.
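
To see the override in action, we can set a variable for a single command only. In this minimal sketch, forcing the C locale (virtually always available) changes how sort collates its input compared to our en_US.UTF-8 setup:

$ printf 'a\nB\n' | sort
a
B
$ printf 'a\nB\n' | LC_ALL=C sort
B
a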

The encoding of any context can depend on many factors:

  • user settings
  • desired language
  • current region
  • device and software capability

On the last point, both the GUI and CLI can pose limitations when it comes to character presentation.

4. The locale Command

Indeed, the main Linux command to provide locale information is locale.

By default, locale returns the variable values we talked about earlier:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

For example, here, we have the fairly common UTF-8 encoding in an English-based (en) environment with the US-region formats.

To get a list of supported locales, we can use the -a or --all-locales flag:

$ locale --all-locales
C
C.UTF-8
en_US.utf8
POSIX

Notably, this is a very limited choice, but two of the four entries cover all of UTF-8. The other two locales, American National Standards Institute (ANSI) C and POSIX, are predefined with 7-bit ASCII and fixed values for the other $LC_* variables.

Notably, we can still represent values via escape sequences, but our current context might not be able to show them.
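
For instance, Bash's printf builtin understands \u Unicode escapes (in Bash 4.2 and later), so we can emit a character by its code point even without typing it, although whether it displays correctly still depends on the terminal:

$ printf '\u044A\n'
ъ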

To see other supported encodings, -m or --charmaps is useful:

$ locale --charmaps
ANSI_X3.110-1983
ANSI_X3.4-1968
ARMSCII-8
ASMO_449
BIG5
BIG5-HKSCS
BRF
[...]

In fact, in this case, we can pick from 236 character maps.
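
We can verify the count by piping the list through wc (the exact number depends on the glibc version in use):

$ locale --charmaps | wc -l
236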

Finally, we’re able to add a list of space-separated category or keyword names to get targeted output:

$ locale LC_TIME
Sun;Mon;Tue;Wed;Thu;Fri;Sat
Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday
Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec
January;February;March;April;May;June;July;August;September;October;November;December
AM;PM
%a %d %b %Y %r %Z
[...]
$ locale --category-name --keyword-name am_pm date_fmt
LC_TIME
am_pm="AM;PM"
LC_TIME
date_fmt="%a %d %b %Y %r %Z"

In the last example, -c or --category-name prepends a line with the category name before each output block, while -k or --keyword-name prepends the name of the keyword to its values, e.g., am_pm. Using locale --keyword-name with an LC_* category name, we can acquire a list of the possible keyword names for that category.
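
For instance, here's a quick sketch that lists the keywords of the LC_PAPER category along with their current values (the numbers shown assume a letter-format locale like en_US and are in millimeters):

$ locale --keyword-name LC_PAPER
height=279
width=216
paper-codeset="UTF-8"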

5. Encoding by Context

After looking at the locale, its constituents, as well as the main command to output their values, let’s continue with encoding checks in different contexts.

5.1. File Encoding

With tools like file and enca, we can get the encoding of a given file.

Let’s see a simple example with file:

$ file --mime /etc/hosts
/etc/hosts: text/plain; charset=us-ascii

Here, we verify the encoding of /etc/hosts is ASCII.
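
Alternatively, if the enca package mentioned above is installed (e.g., via apt-get install enca), a sketch like this gives a more human-readable answer, with -L none disabling language-specific guessing:

$ enca -L none /etc/hosts
7bit ASCII characters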

Of course, tools like vi can also tell us the same.

5.2. GUI Encoding

In most window management systems like the X Window System, we can configure our encoding. Consequently, text in any visual element uses our setting by default.

In addition, applications that run in the context of our GUI know the configuration and can choose to apply or override it. This is valid for most environments like GNOME, KDE, Xfce, and others.

Importantly, most GUI environments also have one or more terminal emulators. To set the encoding of a terminal emulator, we modify its settings in the interface or directly via a file.

For example, GNOME Terminal has the Edit -> Preferences -> Encodings settings, with their counterparts under the /org/gnome/terminal/legacy/encodings dconf key.
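
For instance, we might read that key from a shell with the dconf tool (a sketch, assuming dconf is installed; the key only exists if the defaults were changed, so the command may return nothing, and the list shown here is purely illustrative):

$ dconf read /org/gnome/terminal/legacy/encodings
['UTF-8', 'ISO-8859-5']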

5.3. Terminal Encoding

Of course, we can always manually echo a specific $LC_* variable or $LANG:

$ echo $LC_CTYPE

$ echo $LANG
en_US.UTF-8

Notably, the value of $LC_CTYPE is empty here. As we discussed, any empty $LC_* variables fall back to the values of $LC_ALL, $LANG, and $LANGUAGE.

Moreover, the locale command can be useful to clear up any inheritance confusion:

$ locale --category-name --keyword-name charmap
LC_CTYPE
charmap="UTF-8"

Here, we verify the encoding as UTF-8.
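
In fact, locale accepts charmap directly as a keyword, which gives us the same information in its shortest form:

$ locale charmap
UTF-8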

Finally, we can also use interpreters like Perl and Python to get the information we’re after:

$ perl -e 'use Term::Encoding; print Term::Encoding::get_encoding();'
utf-8
$ python -c "import sys; print(sys.stdout.encoding)"
UTF-8

In the case of Perl, we need to preinstall the additional Term::Encoding module via cpan install Term::Encoding.

6. Summary

In this article, we talked about locales, how they are used in Linux, and checking the encoding in different environments.

In conclusion, since they control the visual representation of information, knowing the current locale and encoding can be vital, especially in a terminal.