1. Overview
ASCII is short for American Standard Code for Information Interchange. As the name suggests, ASCII is a standard character set that assigns numerical values to characters.
Before ASCII, different companies used multiple, incompatible text encodings. Clearly, there was a need for a standard to ensure that data could be transferred easily between different computers.
In 1963, ASCII was adopted as the national standard by the American Standards Association (ASA), which later became the American National Standards Institute (ANSI). It subsequently spread worldwide and gained the popularity it has today.
2. What Is ASCII?
ASCII consists of lowercase and uppercase Latin letters (English alphabet), punctuation marks and symbols, control codes, and the digits 0 to 9.
The ASCII character set maps each supported character to its numerical encoding. The encoded number can be written in decimal, hexadecimal, or binary format. Each format has its own use.
For example, when describing the ASCII character set to an audience, the decimal form is the most relatable. The binary encoding represents how the computer actually stores the input. The hexadecimal form is what we normally write in a program to get the matching character.
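To make the three formats concrete, here's a quick sketch that prints the same encoding in decimal, hexadecimal, and binary using Scala's built-in conversion methods:
val code = 65 // the ASCII encoding of 'A'

println(code)                // 65 (decimal)
println(code.toHexString)    // 41 (hexadecimal)
println(code.toBinaryString) // 1000001 (binary)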
ASCII characters can be grouped into control codes and printing characters. Control codes are non-printing characters used to control how the hardware behaves. They are a group of 32 codes ranging from 0 to 31, including characters for operations like End of Text, Horizontal Tab, and End of Transmission Block.
Printing characters are those that can be displayed onscreen (or printed). They range from 32 to 127. However, the last code (127) isn't a visible character; it's the Delete character, used to delete characters.
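As an illustration of this grouping, the following sketch classifies a character by its ASCII code (classify is a hypothetical helper of our own, not part of any standard API):
// Classify a character by its ASCII code
def classify(c: Char): String = c.toInt match {
  case code if code <= 31  => s"control code ($code)"
  case 127                 => "delete character (127)"
  case code if code <= 126 => s"printing character ($code)"
  case code                => s"outside standard ASCII ($code)"
}

println(classify('\t')) // control code (9)
println(classify('A'))  // printing character (65)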
The standard ASCII characters are 7-bit values. However, due to the need for more characters, the Extended ASCII character set was introduced. It uses 8-bit values and therefore supports 256 characters.
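Since standard ASCII fits in 7 bits, a simple range check tells us whether a string is pure ASCII (isAscii is an illustrative helper of our own, not a standard method):
// Standard ASCII occupies codes 0 to 127, i.e., values below 2^7
def isAscii(s: String): Boolean = s.forall(_.toInt < 128)

println(isAscii("hello")) // true
println(isAscii("héllo")) // false: é is outside the 7-bit range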
3. Limitations of ASCII
As we have discussed, ASCII only covers the Latin alphabet. So if, for example, we want to use characters from the Russian alphabet, ASCII can't represent them.
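We can see this limitation by checking the code point of a Cyrillic letter, which lies well beyond the ASCII range:
// The Cyrillic capital letter De sits at U+0414, far outside 0 to 127
println('Д'.toInt) // 1044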
At the moment, most systems are shifting to the Unicode encoding format. However, Unicode doesn’t make ASCII obsolete.
Unicode is a character set that supports all human languages, including ancient ones such as Aramaic. Simply put, Unicode is a superset of ASCII.
4. Getting the ASCII Value in Scala
ASCII forms the first block of the Unicode character set and contains the standard 128 ASCII values, at the hexadecimal code points U+0000 to U+007F. In Scala, the escape sequence for accessing them is \u.
To output the ASCII encoding of a character, we’ll use the toInt method:
println('a'.toInt)
The above code will print out 97, which is 61 in hexadecimal.
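Since Scala accepts hexadecimal integer literals, we can confirm this equivalence directly:
println(0x61 == 'a'.toInt) // true: 0x61 is 97 in decimal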
If, however, we want to output the ASCII character that matches a numerical encoding, we would have to use the format below:
'\' 'u' hexDigit hexDigit hexDigit hexDigit
That is, we use the \u escape sequence followed by a four-digit hex number. If the hex value has fewer than four digits, we pad it with leading zeros:
println("\u006D")
println("\u004D")
The above example outputs m and M. In either of the statements above, writing the hex value with a lowercase letter (d) would produce the same characters, since the hex digits in the escape sequence are case-insensitive.
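If we already have the numerical encoding as an Int, the toChar method performs the inverse of toInt, with no escape sequence needed:
println(97.toChar)   // a
println(0x4D.toChar) // M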
The range of letters from U+0041 to U+005A defines uppercase letters, while U+0061 to U+007A is for lowercase letters in the Latin (English) alphabet.
The difference between an uppercase letter and its lowercase counterpart is always 32 (in decimal). For example, A is encoded as 65, while a is 97.
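Relying on this fixed offset, here's a minimal sketch of case conversion through ASCII arithmetic (toLowerAscii is our own illustrative name, not a standard library method):
// Convert an uppercase Latin letter to lowercase by adding 32
def toLowerAscii(c: Char): Char =
  if (c >= 'A' && c <= 'Z') (c + 32).toChar else c

println(toLowerAscii('M')) // m
println(toLowerAscii('!')) // ! (non-letters pass through unchanged)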
5. Conclusion
ASCII is one of the most widely used character sets. To ensure a universal character set, ASCII was made part of the Unicode character set, which a number of programming languages, including Scala, use.