1. Introduction
In this tutorial, we’ll explore what a generalized suffix tree is and an example of how it can be used for finding the longest common substring. Finding the longest common substring is a special case of string similarity methods through finding common subsequences. The ‘sequence’ in this class of methods are substrings.
We’ll illustrate an application of the generalized suffix tree with finding the longest common substring.
We’ll then build a generalized suffix tree by first building a suffix trie and a Patricia trie (using the terminology of the field) and then annotating these trees to form a generalized suffix tree.
2. Suffix Tries
To answer the first question, no, we did not misspell the word ‘tree’. In the language of suffix trees, a trie is an intermediate to building a full generalized suffix tree that can be used for our tasks. A suffix trie is a tree where the edges, namely the lines connections the nodes, are labeled with the letters of our suffixes.
We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.
This is best explained with an example. We’ll use the word . This word has 8 suffixes, plus an empty suffix (signified by $):
- $
- e
- se
- nse
- ense
- sense
- nsense
- onsense
- nonsense
We begin building the suffix trie with the start node and the blank suffix, labeled with $:
The suffix trie is then build-up, with the first letter of the suffixes connected to the start node. We see that the first three non-empty suffixes all start with a different letter, so we add three branches to the blank branch:
The next suffix, , begins with , which already has a node from the start node. We add this as a branch to the node by following the common part of the suffix, in this case just , and adding the extra branch where the letters differ:
We continue this process by adding branches where the suffixes differ and adding new branches where necessary. When we’re done, we get:
Here, we can see there are three suffixes starting with . We see that where there are common substring sections, they follow the same branch, such as for and suffixes.
We also see that if the size of the string is , namely , the suffix tree has exactly leaf nodes. In the case of , we have built a tree with nodes. The complexity of the algorithm depends on the number of nodes. So we can ask the next question:
Can we reduce the number of nodes in the trie?
The answer, explained in the next section, is to build what is called a ‘Patricia Trie’.
3. Patricia Trie
A Patricia trie is the suffix tree where all the ‘simple’ nodes having only one child (no branching) are grouped together. With our example, we get:
We see that the number of nodes is reduced, but the number of leaf nodes is the same, i.e., the number of suffixes. Building the Patricia trie is a step toward building the suffix tree (and the generalized suffix tree), which will be used to accomplish numerous tasks in substring recognition.
4. Suffix Tree
With the suffix tree, we’re one step closer to creating a data structure for facilitating our substring recognition tasks. A suffix tree for a string is a Patricia trie of where each leaf is labeled with the index where the corresponding suffix starts in . This label gives us a direct reference to the original suffix:
5. Generalized Suffix Tree
A suffix tree has only a description of a single string, . But, a generalized version of the data structure described above can be used to index multiple strings. In this case, the result of a search operation could be a reference to which string(s) contain a given input string.
A generalized suffix tree for is a suffix tree of , but the label on the leaf nodes has not only the position in the string but an index of which string it is referring to. The leafs are labeled with , meaning “th suffix of string .
We’ll illustrate this with a generalized suffix tree with two strings, and :
6. Longest Common Substring
We can use the generalized suffix tree to facilitate substring recognition. The main idea is that every substring in one string is a prefix of some suffix of that string. Put another way, when we put every suffix in the tree, we also put every beginning character of suffix at the top (connected to the start node) of the tree. These beginning characters of the suffix are also the beginning of the prefix. Two substrings need only match partially down the tree.
The algorithm to find the longest common substring has three steps:
- Build a generalized suffix tree for and .
- Annotate each internal node in the tree with whether that node has at least one leaf node from each of and
- Run a depth-first search over the tree to find the marked node with the highest string depth.
Using our example, we build a generalized suffix tree with two strings, , and . Then we annotate each node with whether that branch has only string , only string or both strings with :
Looking at the tree (depth-first search), we see the red node, representing is the longest common substring. The green nodes represent shorter string, and . They appear at the same level because of the condensed Patricia tree.
7. Conclusion
This article outlined how to build a generalized suffix tree to solve a substring recognition problem. What we outlined here was a simple way to generate generalized suffix trees, but there are many advanced tricks to speed up the algorithm, such as on–line construction of suffix trees from the University of Helsinki.
There are several sources of code to build the tree in several languages, such as a Python implementation or Java implementation.