1. Overview
Sorting and ordering elements is one of the most critical parts of data structures in any programming language. While we can order elements based on any criteria, most sets have the notion of lexicographical or natural ordering.
In this tutorial, we’ll discuss lexicographical ordering, its origin, and its importance in computer science. We won’t fully explore the mathematical underlying and set theory, but we’ll emphasize practical examples.
2. Origin
We can think about the lexicographical ordering as a generalized version of an alphabetical order. It takes the name from the ordering convention used in the dictionaries and encyclopedias. At the same time, we can use a similar idea of natural ordering with elements of any type.
For example, consider the lexicographical ordering of numbers and dates. Two follows one, and three follows two. This is an intuitive way to compare the numbers. While it’s intuitive, let’s get to a more theoretical explanation.
3. Total Order
The lexicographical order uses a total order. To understand it better, let’s dive a bit into the theory.
3.1. Requirements
The total order has a set of strict requirements. And we should respect all of them:
- Reflexive – the element cannot be greater than itself: a <= a. While <= isn’t intuitive, it’s used for mathematical reasons
- Transitive – this requirement states that we can compare the elements transitively: a <= b and b <= c, then a <= c
- Antisymmetric – the elements cannot be greater and lesser than each other, but they can be equal: a <= b and b <= a, then a = b
- Strongly connected – means we can compare any two elements in the set
We can sequence the letters alphabetically and use them to compare individual words of different lengths. The same applies to the numerical order. We order simple digits and use them to compare numbers. The total order is applied to the alphabet, which we can use to order the words.
3.2. Alphabets and Words
The total order operates on alphabets and words. The alphabet is a set of elements that can be grouped into sequences called words. It might be an actual alphabet and words if we talk about letters.
However, if we talk about numbers, the alphabet would be the set of digits: {0,1,2,3,4,5,6,7,8,9}. The words wouldn’t be the words in terms of dictionary words, but numbers we can construct from the elements in the alphabet. For example: 84372, 23243, and even 42.
3.3. Alphabets vs. Words
Sometimes, the context can affect the idea of an alphabet. Let’s take playing cards as an example. Assume we have a simple game: two players draw a card in turn; the player with the higher card wins. We can ignore the suits for this game.
Technically, in this case, we’ll be using partial ordering, as non-identical words (cards) could be considered equal: the king of diamonds will have the same value as the king of clubs, while they’re non-identical. However, this doesn’t affect the point of this explanation.
In this case, we have the alphabet, which contains the nominal values of the cards from two to an ace. Based on this, we can order all the cards in the deck and identify whether a card is higher, lower, or equal to another. Thus, the alphabet and the words match almost exactly (given that we ignored the suits).
Let’s complicate the game and allow players to draw five cards. In this case, we’ll compare the cards based on the highest card among the five. If they’re equal, we will compare the next highest card, and so on.
Here, the words and the alphabet won’t match. The alphabet consists of thirteen cards; the words can be any combination of five cards from a deck. However, the alphabet would still match the nominal values of the cards.
Let’s now change the game entirely and compare these five cards based on the poker ranking rules. The comparison rules are much more complex, and we must change the alphabet. The alphabet of thirteen cards won’t give us an adequate way to identify a winner.
The size of an alphabet would jump to a high number as we need to represent every combination separately. It would still be less than the number of combinations but would grow significantly. A single entry of an alphabet might contain all the five cards.
Thus, our alphabet would grow from individual cards to combinations, taking into account the suits and combinations. What we used as words previously now became a part of the alphabet. Thus, the alphabet relies entirely on context and the comparison logic between the elements.
4. Cardinality
It’s clear how to compare the sequences of the same cardinality, for example, words with the same length (e.g., “car” and “cat”) or numbers of the same order of magnitude (e.g., 15 and 17).
However, it’s not clear what we should do if the sequences have different numbers of elements. In this case, the lexicographical order allows two different approaches. We can see this while sorting numbers and letters.
4.1. Blank Paddings
The first approach is to append blank elements to a shorter sequence. This empty element always has a lesser value than any element in the set. We can see this way of ordering the words in a dictionary.
The word “off” would appear before the word “offer.” We can think about this by comparing “off__” to “offer.” At the same time, we’ll see the word “algorithm” before the word “byte.”
This is a natural way of comparing the words. Some dictionaries order the words by their length, but they’re more specialized versions of the dictionaries and are usually aimed at crossword puzzles or Scrabble fans.
4.2. Shortlex
Another approach is called shortlex. This approach assumes that a shorter sequence is always less than a longer one.
We can see it in the numbers. When we first compare by the order of the magnitude, tens are lesser than thousands. That’s why two is less than ten, although two is greater than one.
We often see the issue with sorting the numbers as strings; for example, we can get an incorrect result when we don’t treat them as numerical values.
The ASCII table would have the same digit order even if the digits were treated as characters. However, different cardinality rules would result in an incorrect order. That’s why sorting the numbers of the same magnitude using alphabetical order produces the correct result.
5. Format
The fact that the words have a specific lexicographical order, or at least it’s reasonable to assume that they have it, doesn’t mean it’s always possible to apply it.
As we saw previously, we know general rules for sorting numbers in a lexicographical order. However, this is possible when we use the Arabic numeral system. Applying it to the Roman numeric representation won’t yield the expected result.
We can sort the Roman numerals:
I < V < X < L < C < D < M
However, according to this order, IX would appear before VI, which isn’t correct.
The same is true for dates. For example, we generally sort them chronologically if we need to order them. However, this does not always match the lexicographical order, as it depends on the order of elements in the sequence.
If we format the dates as YYYY-MM-DD, we can apply simple digit order to get the correct order. At the same time, if we do the lexicographic sort on the dates in the following format: DD-MM-YYYY, it won’t sort them chronologically.
6. Cartesian Product
We can apply the lexicographical order to their cartesian product if we have several independent sets. We can use the total or partial ordering of alphabets and order the cartesian product of alphabets.
Let’s take musical notes as an example. We can represent a note by pitch and octave. Both these alphabets can have total ordering. The octaves have simple numerical ordering:
0 < 1 < 2 < 3 < 4 < 5 < 6 < 7
The pitch, while expressed with letters, doesn’t follow the alphabetic ordering:
C < D < E < F < G < A < B
Thus, we can compare two notes with each other using these two alphabets and their order:
A3 < C4 < E4 < A4 < D5
In this example, we used related ideas of pitches and octaves. However, the lexicographical sorting of a cartesian product can be done using even unrelated alphabets. For example, books can be ordered by category and then by the author in a bookstore.
7. Conclusion
Except for its importance in set theory and mathematics, the lexicographical order has a more applicational benefit.
Programming languages and libraries use it as the default order for the most common data types, which fits our understanding of the natural order.