1. Overview
The Burrows-Wheeler Transform (BWT) is an algorithm that rearranges the characters of a string to group similar characters. The transformed string is more suitable for compression. The transformation is reversible, i.e., the original string can be reconstructed back.
The well-known bzip2 utility uses the BWT for file compression.
In this tutorial, we’ll discuss how the BWT works. We’ll first describe the transform and the inverse transform. Then, we’ll see how it works using an example.
2. How Does the Transform Work?
In this section, we’ll see the transformation and its inverse.
2.1. Transform
To transform the input string, we first need to add a character. The additional character shouldn’t exist within the string; it can be added anywhere in the string. We prefer to add it to the end of the string to denote the end.
Then, we shift the string circularly to obtain all possible circular shifts. If the string has N characters including the added character, then the number of all circular shifts is also N.
Finally, we sort all the circular shifts of the input string in alphabetical order and create a table. The last column of the table is the BWT of the input string. We’ll refer to this sorted table as the BWT Table.
Here’s the pseudocode of the transformation:
algorithm BWT(str):
// INPUT
// str = input string
// OUTPUT
// bwt = the BWT of str
str_with_eof <- append eof character to str
create a table whose rows are all possible circular shifts of str_with_eof
sort rows of the table alphabetically
bwt <- last column of the alphabetically sorted table
return bwt
We get the BWT table’s last column as the BWT of the input string since the same characters tend to group in this column. For example, when we transform a long text in English containing many instances of the “the” article, the sorted circular shifts beginning with “he“ will group together. Consequently, the circular shifts ending with the “t” character will also group together. It’s easier to compress strings with repeating characters grouped together.
The transformed string using the BWT may not be the maximum compressible cyclic shift. However, we can reconstruct the original string transformed by the BWT, but this isn’t possible using the other cyclic shifts.
2.2. Inverse Transform
We can reconstruct the original input string from the transformed string using a sequence of additions and sorting. We can reconstruct the first column of the BWT Table, which consists of the alphabetically sorted circular shifts of the input string, simply by sorting the last column. The last column is the BWT of the input string. Therefore, we obtain the first and last columns of the BWT Table.
Combining the first and last columns and sorting them alphabetically reconstructs the first and second columns in the BWT table. Then, appending the BWT of the input string once more as a column and sorting the strings consisting of three characters reconstructs the first three columns in the BWT table. Continuing in the same way, we finally reconstruct the whole BWT table. Naturally, the string in this table ending with the special character we’ve added before is the original input string.
Here’s the pseudocode of the inverse transformation:
algorithm inverseBWT(bwt):
// INPUT
// bwt = a string transformed by the BWT() algorithm
// OUTPUT
// str = the original string
create an empty table
n <- length of bwt
for i <- 1 to n:
append bwt as a column to the table from the left side
sort the rows of the table alphabetically
str <- the row that ends with the eof character
return str without the eof character
3. An Example
In this section, we’ll first transform the string PEPPER using the BWT. Then, we’ll transform the result back to obtain the original string.
3.1. Transform
We add the dollar sign, $, to the end of the string PEPPER to indicate the end of the string. The added character can be any character that doesn’t exist in the string.
All circular shifts of PEPPER$ and the sorted counterparts, namely the BWT Table, are in the first and second columns of the table below:
Circular Shifts
Sorted Circular Shits
(The BWT Table)
BWT (Last Column)
PEPPER$
$PEPPER
R
$PEPPER
EPPER$P
P
R$PEPPE
ER$PEPP
P
ER$PEPP
PEPPER$
$
PER$PEP
PER$PEP
P
PPER$PE
PPER$PE
E
EPPER$P
R$PEPPE
E
Therefore, the BWT of the input string PEPPER$ is RPP$PEE, the last column of the BWT Table. We assume that $ is alphabetically smaller than the other characters while sorting.
There are two groups consisting of the same characters in the transformed string, namely PP and EE. However, the only group in the original string is PP. Therefore, even for this short string, the transformed string is more suitable for compression.
Of course, an ordering such as $PPPEER is more suitable for compression than RPP$PEE. But it isn’t possible to reconstruct PEPPER$ using $PPPEER whereas we can reconstruct it using RPP$PEE as we’ll see in the next subsection.
3.2. Inverse Transform
The first three iterations of addition and sorting to reconstruct the original string from the transformed string, RPP$PEE, are listed in the table below:
Addition1
Sorting1
Addition2
Sorting2
Addition3
Sorting3
R
$
R$
$P
R$P
$PE
P
E
PE
EP
PEP
EPP
P
E
PE
ER
PER
ER$
$
P
$P
PE
$PE
PEP
P
P
PP
PE
PPE
PER
E
P
EP
PP
EPP
PPE
E
R
ER
R$
ER$
R$P
We start by adding the transformed string, RPP$PEE, to an empty table. This corresponds to the Addition1 column in the table. Then, we sort this column and obtain the Sorting1 column. This column is the same as the first column in the BWT Table.
We continue by combining the two columns, Addition1 and Sorting1, in the Addition2 column. Then, we sort the Addition2 column and obtain the Sorting2 column. This column is the same as the first two columns in the BWT Table.
We start the third iteration by adding the Addition1 column to the Sorting2 column. Therefore, we obtain the Addition3 column. Then, we sort it to get the Sorting3 column. This column is the same as the first three columns in the BWT Table.
Continuing in this way, we finally construct the Sorting7 column, which is the same as the BWT table:
Addition4
Sorting4
Addition5
Sorting5
Addition6
Sorting6
Addition7
Sorting7
R$PE
$PEP
R$PEP
$PEPP
R$PEPP
$PEPPE
R$PEPPE
$PEPPER
PEPP
EPPE
PEPPE
EPPER
PEPPER
EPPER$
PEPPER$
EPPER$P
PER$
ER$P
PER$P
ER$PE
PER$PE
ER$PEP
PER$PEP
ER$PEPP
$PEP
PEPP
$PEPP
PEPPE
$PEPPE
PEPPER
$PEPPER
PEPPER$
PPER
PER$
PPER$
PER$P
PPER$P
PER$PE
PPER$PE
PER$PEP
EPPE
PPER
EPPER
PPER$
EPPER$
PPER$P
EPPER$P
PPER$PE
ER$P
R$PE
ER$PE
R$PEP
ER$PEP
R$PEPP
ER$PEPP
R$PEPPE
The original string is the string ending with $ in this table. This corresponds to the fourth string, PEPPER$. Hence, the original string is PEPPER.
4. Conclusion
In this article, we discussed how the BWT works.
Firstly, we learned that we need to add a character that doesn’t exist within the string we want to transform. Then, we construct all the circular shifts of the string and create a table using the rotated strings. Then, we sort the rows of the table alphabetically. The BWT of the input string is the last column of the sorted table.
Then, we learned how to reconstruct the original string from the transformed string. We saw that the original string can be reconstructed by N successive additions and sorts, where N is the BWT of the string.
Finally, we applied what we learned using an example.