1. Introduction

In this tutorial, we’ll talk about the tree edit distance (TED) and how to calculate it. TED is similar to string edit distance but works on trees.

2. What Is a Tree Edit Distance (TED)?

TED is the minimal number of edit operations necessary to transform one tree (\boldsymbol{T_1}) into another (\boldsymbol{T_2}). Usually, those operations are: relabeling, deleting, and inserting a node. However, we can also define edge-based operations such as the removal and creation of an edge between two nodes. In either case, the operations act only on T_1.

We can also associate different costs to those operations. Then, TED will be the cost of the least expensive sequence of actions transforming \boldsymbol{T_1} to \boldsymbol{T_2}, where the sequence’s cost is the sum of the individual actions’ costs.

2.1. Example

Let’s take a look at these two trees:

ted example

We can define the transformation costs as follows:

    [cost(relabel) = 2 \quad cost(delete) = cost(insert) = 3]

The least expensive sequence is:

    [\begin{aligned} \text{delete } & c, e, g \\ \text{relabel } & f \rightarrow g, d \rightarrow f, b \rightarrow e, a \rightarrow x \\ \end{aligned}]

Its cost is 3 \cdot 3 + 4 \cdot 2 = 17: three deletions at cost 3 each and four relabels at cost 2 each. So, TED(T_1, T_2) = 17.
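We can double-check that total with a few lines of Python (a quick sketch; the operation names here are just labels for the arithmetic):

```python
# Costs from the example above.
cost = {"relabel": 2, "delete": 3, "insert": 3}

# The least expensive sequence: 3 deletions and 4 relabels.
sequence = ["delete", "delete", "delete",                 # remove c, e, g
            "relabel", "relabel", "relabel", "relabel"]   # f→g, d→f, b→e, a→x

total = sum(cost[op] for op in sequence)
print(total)  # → 17
```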

2.2. Interpretation of TED

Since it depends on the transforming operations and their costs, TED can be interpreted only with respect to those settings. Different operations and costs can yield different distances. Therefore, we should choose them in a way that makes sense for our application.

Anyhow, if TED fulfills the metric axioms, it induces a metric space of trees. That allows us to analyze them rigorously.

So, let’s now define a general TED and the way to calculate it.

3. General Recursive Calculation

Let \mathcal{A}(T) represent the set of operations we can perform on tree T , and let cost(a) be the cost of operation a \in \mathcal{A}(T). The costs can be negative, but we’ll focus only on the case where all are positive.

Our general recursive definition of TED is:

(1)   \begin{equation*}  TED(T_1, T_2) = \begin{cases} 0,& T_1 = T_2 \\ \min_{a \in \mathcal{A}(T_1)} \{cost(a) + TED(a(T_1), T_2)\}, & \text{otherwise} \end{cases} \end{equation*}

We can’t say more about its complexity other than that it may be exponential, depending on \mathcal{A} and the cost function.

3.1. Memoization

However, if we find an efficient way to hash trees, we may reduce the complexity. The first time we calculate TED(T, T_2), we hash T and insert the pair (hash(T), TED(T, T_2)) into a hash map. That way, if TED(T, T_2) appears again in the recursion, we don’t have to compute it again.

This particular technique is known as memoization. Some authors consider it to be a tool of dynamic programming.

4. A Node-Based TED

The recursion (1) covers a general case of TED. Here, we’ll focus on more specific operations and costs. In particular, on the node-based relabel, insert, and delete.

4.1. Operations

Let’s explain the operations in more detail. While deleting a node, we don’t leave its children hanging in the air. Instead, we add them as the new children of the node’s parent:

ted deletion
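This deletion step can be sketched in Python on a simple tree structure (the Node class here is our own illustration, not part of any library):

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def delete(parent, node):
    # Splice node's children into its former position among
    # parent's children, so they don't hang in the air.
    i = parent.children.index(node)
    parent.children[i:i + 1] = node.children

# Example: a(b(d, e), c) -- deleting b attaches d and e to a.
d, e = Node("d"), Node("e")
b = Node("b", [d, e])
c = Node("c")
a = Node("a", [b, c])
delete(a, b)
print([child.label for child in a.children])  # → ['d', 'e', 'c']
```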

Relabeling a node means changing its label. For example:

ted relabeling

Unlike deletion and relabeling, insertions have more than one argument. We specify the node to insert and the node that will be its parent:

ted insertion

4.2. Assumptions and Remarks

Let’s first state an important assumption: the nodes in \boldsymbol{T_1} and \boldsymbol{T_2} have a left-to-right ordering. So, for any two nodes in a tree, we can say which one is to the left of the other.

Another assumption is that we can insert, delete, or relabel a node at most once during the transformation. For instance, we can’t delete a node after inserting it. Otherwise, the space of possible transformation sequences to consider would become infinite.

The costs of the operations are constant. So, delete(T, u) has the same cost regardless of the node we’re deleting (u) and the state of the tree we’re deleting it from (T). The same goes for insertion and relabeling. Let c_{del}, c_{ins}, c_{rel} be their costs.

Since inserting a node to T_1 is the same as deleting it from T_2, we need not apply insertion to T_1 in our implementation. Instead, we can delete the node from T_2 and consider it inserted into T_1. Analogously, when relabeling u \in T_1 to match v \in T_2, we don’t need to actually change the label. We can delete both nodes from the trees and consider u relabeled to v. To ease the notation, we’ll denote deletions with a minus sign: T-u instead of delete(T, u).

Finally, we restrict the cost function in our algorithm to be a distance metric, which makes the TED it computes a distance metric too. The non-negative constant costs we opted for fulfill this condition.

4.3. Computing TED Using Forests

In its intermediate steps, the algorithm operates on forests. It does so by considering the post-order numbering of the nodes. Let T[i] be the i-th node of T in the post-order traversal. Then, T[i..j] is the forest consisting of the nodes numbered i, i+1, \ldots, j-1, j in the post-order traversal of T. For instance:

post order sub forest

Although it may seem that dealing with a more general structure will complicate things, it isn’t so. Instead, it makes the problem easier. By dividing the trees into sub-forests, we can solve the problem using dynamic programming.

4.4. The Recurrence

Let F_1 =T_1[i..j] be the post-order sub-forest of T_1 and let r_1 denote its rightmost root. Also, let R_1 be the rightmost tree of F_1 (the one rooted at r_1). The same notation holds for T_2. Then, we calculate TED as follows:

(2)  \begin{equation*}  \begin{aligned} TED(\emptyset, \emptyset) &= 0\\ TED(F_1, \emptyset) &= TED(F_1 - r_1, \emptyset) + cost_{del} \\ TED(\emptyset, F_2) &= TED(\emptyset, F_2 - r_2) + cost_{ins} \\ TED(F_1, F_2) &= \min \begin{cases} TED(F_1-r_1, F_2) + cost_{del} \\ TED(F_1, F_2-r_2) + cost_{ins} \\ TED(F_1-R_1, F_2-R_2) + cost_{rel} + TED(R_1-r_1, R_2-r_2) \end{cases} \end{aligned} \end{equation*}

We can calculate the recurrence with the memoization technique. For instance, we can identify each forest (tree) with a pair (i, j), where i and j are the post-order bounds of its nodes in the original tree. So, the first time we calculate the TED of \boldsymbol{T_1[i_1..j_1]} and \boldsymbol{T_2[i_2..j_2]}, we use \boldsymbol{(i_1, j_1, i_2, j_2)} as the key and insert the corresponding value into a map. Later, we reuse the stored result to avoid repeated calculation.

The forest and tree structures should be supplied with efficient methods for evaluating r_k, R_k, F_k-R_k, and R_k-r_k (k=1,2).

4.5. Tree Operations

First, let’s observe that the root of the rightmost tree of the forest \boldsymbol{F_1=T_1[i_1..j_1]} is the node numbered \boldsymbol{j_1}, so r_1 = j_1. From there, F_1 - r_1 amounts to T_1[i_1..(j_1-1)]. The difference of F_1 and R_1 is the forest consisting of all the nodes that are in F_1 but not in R_1. The same goes for the other tree, F_2=T_2[i_2..j_2].

Finding R_k (k=1,2), however, requires a more careful consideration. The right boundary of R_k is the post-order index of its root, that is, r_k. We can compute the left boundary on the fly by running a post-order traversal routine, but this approach will slow down the algorithm. Instead, it’s more efficient to preprocess the tree and compute each sub-tree’s post-order left boundary so that it’s available in O(1) time during the computation of TED. We can do it in linear time using a post-order traversal algorithm.

So, we’ll assume that \boldsymbol{\ell(i)}, the left post-order boundary of the tree whose root is \boldsymbol{i}, is available for the nodes in both trees. Then, R_k is T_k[\ell(r_k).. r_k], and R_k - r_k is T_k\left[\ell(r_k)..r_k-1\right]. If F_k \neq R_k (the forest doesn’t contain only one tree), then F_k - R_k is T_k[i_k..(\ell(r_k)-1)]. Otherwise, F_k-R_k = \emptyset.
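The preprocessing step can be sketched in Python, assuming a simple Node class with a children list (our own illustration): one post-order traversal numbers the nodes and records each subtree’s left boundary.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def postorder_bounds(root):
    """Return the nodes in post-order (1-based numbering) and,
    for each index i, the left boundary l(i): the smallest
    post-order index inside the subtree rooted at node i."""
    order, left = [], {}

    def visit(node):
        start = len(order) + 1       # first index that falls in this subtree
        for child in node.children:
            visit(child)
        order.append(node)           # node receives index len(order)
        left[len(order)] = start

    visit(root)
    return order, left

# Example: a(b(d, e), c) -- post-order is d, e, b, c, a.
tree = Node("a", [Node("b", [Node("d"), Node("e")]), Node("c")])
order, left = postorder_bounds(tree)
print([n.label for n in order])  # → ['d', 'e', 'b', 'c', 'a']
print(left)                      # → {1: 1, 2: 2, 3: 1, 4: 4, 5: 1}
```

Since every node is visited exactly once, the preprocessing runs in linear time, as required.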

4.6. The Memoization Algorithm

Here’s how we can calculate TED(T_1, T_2) with a memoization algorithm:

algorithm TED(T1, T2):
    // INPUT
    //    T1 = the (sub)forest of the tree to transform
    //    T2 = the (sub)forest of the tree to transform T1 into
    //    Both trees have been preprocessed to determine
    //      the left boundaries ℓ(·) for all the nodes
    //    results = the global storage for the results, initially empty
    // OUTPUT
    //    The tree edit distance between T1 and T2

    // Initially called as TED(T1[1 : n], T2[1 : m]),
    //    where n and m are the numbers of nodes in the two trees

    // Apply the memoization technique
    if results[T1, T2] is known:
        return results[T1, T2]

    // Cover the base cases
    if T1 = empty set and T2 = empty set:
        return 0
    if T1 != empty set and T2 = empty set:
        r1 <- find the rightmost root of T1
        return cost_del + TED(T1 - r1, empty set)
    if T1 = empty set and T2 != empty set:
        r2 <- find the rightmost root of T2
        return cost_ins + TED(empty set, T2 - r2)
    
    // Cover the recursive case (both T1, T2 != empty set)
    Determine r1, R1, r2, R2
    
    ted_del <- TED(T1 - r1, T2) + cost_del
    ted_ins <- TED(T1, T2 - r2) + cost_ins
    ted_rel <- TED(T1 - R1, T2 - R2) + cost_rel + TED(R1 - r1, R2 - r2)
    
    // Store and return the result
    results[T1, T2] <- min{ted_del, ted_ins, ted_rel}
    return results[T1, T2]

This algorithm is a memoized version of Zhang and Shasha’s iterative TED algorithm. So, its complexity is the same.
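For concreteness, here’s a minimal Python sketch of the same memoized recursion. The trees are supplied as 1-based post-order label lists with the precomputed left boundaries ℓ(·); as a common refinement (and so that the distance between identical trees is 0), matching two equally labeled nodes costs nothing:

```python
from functools import lru_cache

def ted(labels1, left1, labels2, left2, c_del=1, c_ins=1, c_rel=1):
    """Tree edit distance between trees given as 1-based post-order
    label lists plus the precomputed left boundaries l(i).
    A forest is an index pair (i, j); j < i means the empty forest."""
    n, m = len(labels1), len(labels2)

    @lru_cache(maxsize=None)                      # the memoization map
    def dist(i1, j1, i2, j2):
        if j1 < i1 and j2 < i2:                   # both forests empty
            return 0
        if j2 < i2:                               # delete the rightmost root r1 = j1
            return dist(i1, j1 - 1, i2, j2) + c_del
        if j1 < i1:                               # insert the rightmost root r2 = j2
            return dist(i1, j1, i2, j2 - 1) + c_ins
        r1, r2 = j1, j2
        # Matching identical labels is free, so that ted(T, T) = 0.
        rel = 0 if labels1[r1 - 1] == labels2[r2 - 1] else c_rel
        return min(
            dist(i1, j1 - 1, i2, j2) + c_del,                 # F1 - r1
            dist(i1, j1, i2, j2 - 1) + c_ins,                 # F2 - r2
            dist(i1, left1[r1] - 1, i2, left2[r2] - 1) + rel  # F1 - R1, F2 - R2
            + dist(left1[r1], r1 - 1, left2[r2], r2 - 1),     # R1 - r1, R2 - r2
        )

    return dist(1, n, 1, m)

# Example: a(b, c) vs. a(b, d) -- one relabel (c -> d), distance 1.
print(ted(["b", "c", "a"], {1: 1, 2: 2, 3: 1},
          ["b", "d", "a"], {1: 1, 2: 2, 3: 1}))  # → 1
```

The lru_cache decorator plays the role of the results map: each forest pair (i_1, j_1, i_2, j_2) is solved at most once.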

4.7. Complexity

We’ll derive it by considering the relevant (sub)forests of T_1 and T_2.

A forest is relevant if it appears in the recursive calculation of \boldsymbol{TED(T_1, T_2)}. Zhang and Shasha define a keyroot as a tree’s root or a node with a left sibling. We can prove that the post-order numbering of a relevant forest is a prefix of the post-order traversal of a keyroot’s tree.

Let the keyroot depth d_k of T_k (k=1,2) be the highest number of keyroot nodes on the path from the root to any other node in the tree. Then, Zhang and Shasha’s algorithm’s complexity (and consequently of ours too) is O(d_1d_2 nm).

We can improve the result even further. Let n_{\mathrm{height}} and n_{\mathrm{leaves}} denote the height and the number of leaves in T_1, and let m_{\mathrm{height}} and m_{\mathrm{leaves}} do the same for T_2. Zhang and Shasha prove that d_1 \leq \min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\} and that d_2 \leq \min\{m_{\mathrm{height}}, m_{\mathrm{leaves}}\}. So, the overall complexity is:

    [O\left( \min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\} \min\{m_{\mathrm{height}}, m_{\mathrm{leaves}}\} n m   \right)]

If the trees are balanced, then n_{\mathrm{height}} \in O(\log n) and m_{\mathrm{height}} \in O(\log m), so the overall complexity becomes:

    [O \left( (n \log n) (m \log m) \right)]

or O\left(n^2 \log^2 n \right) if m \in \Theta(n). However, if both the height and the number of leaves are in \Theta(n) for each tree (as in zigzag-shaped trees; a pure linked list is actually an easy case, since it has only one leaf), the complexity is O(n^2 m^2). Therefore, the worst-case complexity stays the same as in the version without the keyroots. In practice, however, we expect the improved algorithm to perform faster.

4.8. Tracing the Sequence

We can trace the minimal sequence of operations that transform T_1 into T_2 the same way we trace the paths in the search algorithms.

When we determine the minimum of ted_{ins}, ted_{del}, and ted_{rel}, we remember which operation leads to that result. Once we’ve computed the complete TED, we can follow the recorded operations in reverse order.

5. Conclusion

In this article, we talked about tree edit distance. We showed how to define and how to calculate it in polynomial time.

