Burrows-Wheeler变换如何工作？

1. Overview

The Burrows-Wheeler Transform (BWT) is an algorithm that rearranges the characters of a string to group similar characters. The transformed string is more suitable for compression. The transformation is reversible, i.e., the original string can be reconstructed back.

The well-known bzip2 utility uses the BWT for file compression.

In this tutorial, we’ll discuss how the BWT works. We’ll first describe the transform and the inverse transform. Then, we’ll see how it works using an example.

2. How Does the Transform Work?

In this section, we’ll see the transformation and its inverse.

2.1. Transform

To transform the input string, we first need to add a character. The additional character shouldn’t exist within the string; it can be added anywhere in the string. We prefer to add it to the end of the string to denote the end.

Then, we shift the string circularly to obtain all possible circular shifts. If the string has N characters including the added character, then the number of all circular shifts is also N.

Finally, we sort all the circular shifts of the input string in alphabetical order and create a table. The last column of the table is the BWT of the input string. We’ll refer to this sorted table as the BWT Table.

Here’s the pseudocode of the transformation:

algorithm BWT(str): 
    // INPUT 
    //    str = input string 
    // OUTPUT 
    //    bwt = the BWT of str
    str_with_eof <- append eof character to str
    create a table whose rows are all possible circular shifts of str_with_eof
    sort rows of the table alphabetically
    bwt <- last column of the alphabetically sorted table
    return bwt

We get the BWT table’s last column as the BWT of the input string since the same characters tend to group in this column. For example, when we transform a long text in English containing many instances of the “the” article, the sorted circular shifts beginning with “he“ will group together. Consequently, the circular shifts ending with the “t” character will also group together. It’s easier to compress strings with repeating characters grouped together.

The transformed string using the BWT may not be the maximum compressible cyclic shift. However, we can reconstruct the original string transformed by the BWT, but this isn’t possible using the other cyclic shifts.

2.2. Inverse Transform

We can reconstruct the original input string from the transformed string using a sequence of additions and sorting. We can reconstruct the first column of the BWT Table, which consists of the alphabetically sorted circular shifts of the input string, simply by sorting the last column. The last column is the BWT of the input string. Therefore, we obtain the first and last columns of the BWT Table.

Combining the first and last columns and sorting them alphabetically reconstructs the first and second columns in the BWT table. Then, appending the BWT of the input string once more as a column and sorting the strings consisting of three characters reconstructs the first three columns in the BWT table. Continuing in the same way, we finally reconstruct the whole BWT table. Naturally, the string in this table ending with the special character we’ve added before is the original input string.

Here’s the pseudocode of the inverse transformation:

algorithm inverseBWT(bwt): 
    // INPUT 
    //    bwt = a string transformed by the BWT() algorithm 
    // OUTPUT 
    //    str = the original string
    create an empty table
    n <- length of bwt
    for i <- 1 to n:
        append bwt as a column to the table from the left side
        sort the rows of the table alphabetically
    str <- the row that ends with the eof character
    return str without the eof character

3. An Example

In this section, we’ll first transform the string PEPPER using the BWT. Then, we’ll transform the result back to obtain the original string.

3.1. Transform

We add the dollar sign, $, to the end of the string PEPPER to indicate the end of the string. The added character can be any character that doesn’t exist in the string.

All circular shifts of PEPPER$ and the sorted counterparts, namely the BWT Table, are in the first and second columns of the table below:

Circular Shifts

Sorted Circular Shits

(The BWT Table)

BWT (Last Column)

PEPPER$

$PEPPER

EPPER$P

R$PEPPE

ER$PEPP

PEPPER$

PER$PEP

PPER$PE

EPPER$P

R$PEPPE

Therefore, the BWT of the input string PEPPER$ is RPP$PEE, the last column of the BWT Table. We assume that $ is alphabetically smaller than the other characters while sorting.

There are two groups consisting of the same characters in the transformed string, namely PP and EE. However, the only group in the original string is PP. Therefore, even for this short string, the transformed string is more suitable for compression.

Of course, an ordering such as $PPPEER is more suitable for compression than RPP$PEE. But it isn’t possible to reconstruct PEPPER$ using $PPPEER whereas we can reconstruct it using RPP$PEE as we’ll see in the next subsection.

3.2. Inverse Transform

The first three iterations of addition and sorting to reconstruct the original string from the transformed string, RPP$PEE, are listed in the table below:

Addition1

Sorting1

Addition2

Sorting2

Addition3

Sorting3

R$P

$PE

PEP

EPP

PER

ER$

$PE

PEP

PPE

PER

EPP

PPE

ER$

R$P

We start by adding the transformed string, RPP$PEE, to an empty table. This corresponds to the Addition1 column in the table. Then, we sort this column and obtain the Sorting1 column. This column is the same as the first column in the BWT Table.

We continue by combining the two columns, Addition1 and Sorting1, in the Addition2 column. Then, we sort the Addition2 column and obtain the Sorting2 column. This column is the same as the first two columns in the BWT Table.

We start the third iteration by adding the Addition1 column to the Sorting2 column. Therefore, we obtain the Addition3 column. Then, we sort it to get the Sorting3 column. This column is the same as the first three columns in the BWT Table.

Continuing in this way, we finally construct the Sorting7 column, which is the same as the BWT table:

Addition4

Sorting4

Addition5

Sorting5

Addition6

Sorting6

Addition7

Sorting7

R$PE

$PEP

R$PEP

$PEPP

R$PEPP

$PEPPE

R$PEPPE

$PEPPER

PEPP

EPPE

PEPPE

EPPER

PEPPER

EPPER$

PEPPER$

EPPER$P

PER$

ER$P

PER$P

ER$PE

PER$PE

ER$PEP

PER$PEP

ER$PEPP

$PEP

PEPP

$PEPP

PEPPE

$PEPPE

PEPPER

$PEPPER

PEPPER$

PPER

PER$

PPER$

PER$P

PPER$P

PER$PE

PPER$PE

PER$PEP

EPPE

PPER

EPPER

PPER$

EPPER$

PPER$P

EPPER$P

PPER$PE

ER$P

R$PE

ER$PE

R$PEP

ER$PEP

R$PEPP

ER$PEPP

R$PEPPE

The original string is the string ending with $ in this table. This corresponds to the fourth string, PEPPER$. Hence, the original string is PEPPER.

4. Conclusion

In this article, we discussed how the BWT works.

Firstly, we learned that we need to add a character that doesn’t exist within the string we want to transform. Then, we construct all the circular shifts of the string and create a table using the rotated strings. Then, we sort the rows of the table alphabetically. The BWT of the input string is the last column of the sorted table.

Then, we learned how to reconstruct the original string from the transformed string. We saw that the original string can be reconstructed by N successive additions and sorts, where N is the BWT of the string.

Finally, we applied what we learned using an example.

Persistence

REST

Security

1. Overview

2. How Does the Transform Work?

2.1. Transform

2.2. Inverse Transform

3. An Example

3.1. Transform

3.2. Inverse Transform

4. Conclusion