1. Introduction

In this tutorial, we’ll explore what a generalized suffix tree is and an example of how it can be used for finding the longest common substring. Finding the longest common substring is a special case of string similarity methods through finding common subsequences.  The ‘sequence’ in this class of methods are substrings.

We’ll illustrate an application of the generalized suffix tree with finding the longest common substring.

We’ll then build a generalized suffix tree by first building a suffix trie and a Patricia trie (using the terminology of the field) and then annotating these trees to form a generalized suffix tree.

2. Suffix Tries

To answer the first question, no, we did not misspell the word ‘tree’. In the language of suffix trees, a trie is an intermediate to building a full generalized suffix tree that can be used for our tasks. A suffix trie is a tree where the edges, namely the lines connections the nodes, are labeled with the letters of our suffixes.

We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.

This is best explained with an example. We’ll use the word nonsense. This word has 8 suffixes, plus an empty suffix (signified by $):

  1. $
  2. e
  3. se
  4. nse
  5. ense
  6. sense
  7. nsense
  8. onsense
  9. nonsense

We begin building the suffix trie with the start node and the blank suffix, labeled with $:

SuffixTreeTriesFig1

The suffix trie is then build-up, with the first letter of the suffixes connected to the start node. We see that the first three non-empty suffixes all start with a different letter, so we add three branches to the blank branch:

SuffixTreeTriesFig21

The next suffix, ense, begins with e, which already has a node from the start node. We add this as a branch to the e node by following the common part of the suffix, in this case just e, and adding the extra branch where the letters differ:

SuffixTreeTriesFig4

We continue this process by adding branches where the suffixes differ and adding new branches where necessary. When we’re done, we get:

SuffixTreeTriesFig5

Here, we can see there are three suffixes starting with n. We see that where there are common substring sections, they follow the same branch, such as nse for nse and nsense suffixes.

We also see that if the size of the string T is m, namely |T| = m, the suffix tree has exactly m + 1 leaf nodes. In the case of nonsense, we have built a tree with 9 nodes. The complexity of the algorithm depends on the number of nodes. So we can ask the next question:

Can we reduce the number of nodes in the trie?

The answer, explained in the next section, is to build what is called a ‘Patricia Trie’.

3. Patricia Trie

A Patricia trie is the suffix tree where all the ‘simple’ nodes having only one child (no branching) are grouped together. With our example, we get:

SuffixTreePatriciaTriesFig6

We see that the number of nodes is reduced, but the number of leaf nodes is the same, i.e., the number of suffixes. Building the Patricia trie is a step toward building the suffix tree (and the generalized suffix tree), which will be used to accomplish numerous tasks in substring recognition.

4. Suffix Tree

With the suffix tree, we’re one step closer to creating a data structure for facilitating our substring recognition tasks. A suffix tree for a string T is a Patricia trie of T where each leaf is labeled with the index where the corresponding suffix starts in T . This label gives us a direct reference to the original suffix:

SuffixTreeSingleWordFig7

5. Generalized Suffix Tree

A suffix tree has only a description of a single string, T. But, a generalized version of the data structure described above can be used to index multiple strings. In this case, the result of a search operation could be a reference to which string(s) contain a given input string.

A generalized suffix tree for T_1, ..., T_kis a suffix tree of T_1, ..., T_k, but the label on the leaf nodes has not only the position in the string but an index of which string it is referring to. The leafs are labeled with i:j, meaning “jth suffix of string T_{i}.

We’ll illustrate this with a generalized suffix tree with two strings, T_1 = nonsense and T_2=offense:

SuffixTreeTwoWordsFig8

6. Longest Common Substring

We can use the generalized suffix tree to facilitate substring recognition. The main idea is that every substring in one string is a prefix of some suffix of that string. Put another way, when we put every suffix in the tree, we also put every beginning character of suffix at the top (connected to the start node) of the tree. These beginning characters of the suffix are also the beginning of the prefix. Two substrings need only match partially down the tree.

The algorithm to find the longest common substring has three steps:

  1. Build a generalized suffix tree for T_1 and T_2.
  2. Annotate each internal node in the tree with whether that node has at least one leaf node from each of T_1 and T_2
  3. Run a depth-first search over the tree to find the marked node with the highest string depth.

Using our example, we build a generalized suffix tree with two strings, nonsense, and offense. Then we annotate each node with whether that branch has only string 1, only string 2 or both strings with 1,2:

CSuffixTreeTwoWordsAnnotatedFig91

Looking at the tree (depth-first search), we see the red node, representing ense is the longest common substring. The green nodes represent shorter string, nse and se. They appear at the same level because of the condensed Patricia tree.

7. Conclusion

This article outlined how to build a generalized suffix tree to solve a substring recognition problem. What we outlined here was a simple way to generate generalized suffix trees, but there are many advanced tricks to speed up the algorithm, such as on–line construction of suffix trees from the University of Helsinki.

There are several sources of code to build the tree in several languages, such as a Python implementation or Java implementation.