1. Overview

In this tutorial, we’ll explain how to convert finite automata to regular expressions using the state elimination method.

Regular expressions are used in many applications, and most programming languages offer a way to implement them.

2. Problem Description

Let A be a finite automaton and \alpha_{pq} the transition label of the branch going from p to q, where they denote any two states of A. The conversion algorithm should return a regular expression defining the regular language that \boldsymbol{A} accepts.

For example, let’s say we have the following automaton:

Finite Automaton Example

The regular expression it’s equivalent to is b^*a(a+bb^*a)^*.

A use case for this is natural language processing (NLP). NLP engineers often design string-matching patterns, and for them, it’s much easier to read a regular expression than to look at a complex finite automaton.

3. The State Elimination Method

We first need to make \boldsymbol{A} uniform.** A uniform finite automaton has an initial state with no incoming transitions and a final state with no outgoing transitions.

Then, we repeatedly remove a randomly selected non-endpoint state and compute the regular expressions of the paths between the remaining pairs until we reduce \boldsymbol{A} to only the initial and final states.

There are other methods besides the state elimination method; these include Adren’s method, Brzozowski’s algebraic method and the transitive closure method. We’ll show this one because it’s simple and easy to understand and implement.

3.1. Pseudocode

Here’s the pseudocode:

algorithm StateEliminationMethod(A):
    // INPUT
    //    A = a uniform finite automaton
    // OUTPUT
    //    a regular expression defining the regular language that A accepts

    while A contains more than two states:
        Randomly choose a state k (not the initial or the final state)

        for each pair of states (p, q) connected via k:
            path_regex <- alpha[p,q] + alpha[p,k] alpha[k,k]* alpha[k,q]
            Simplify path_regex
            Set path_regex as the transition label alpha[p,q]
        
        Remove k from A

    return the regular expression defining the regular language that A accepts

As we see, the algorithm repeats two steps in each iteration.

The first one is to randomly choose a state k (not the initial or the final state) and remove it from A.

The second step is to compute the regular expressions for the paths between each pair (p, q) of the remaining states connected directly through k. Then, we change the transition labels \boldsymbol{\alpha_{pq}} to the computed regular expressions. The regular expressions of pairs of states that aren’t directly connected through k remain the same.

The final step is to remove k from A since all the paths going through it are now in the updated or newly added transition labels.

3.2. Computing Regular Expressions for Paths

Let k be the randomly chosen state.

We compute the new regular expressions for all possible pairs of states p and q in A (p, q \neq k) such that there’s a path p \rightarrow k \rightarrow q using the following equation:

    [\alpha'_{pq} = \alpha_{pq} + \alpha_{pk} \alpha_{kk}^* \alpha_{kq}]

where:

  • \alpha_{pq} is the transition label from state p to state q
  • \alpha_{pk} is the transition label from state p to state k
  • \alpha_{kk} is the transition label from state k to state k
  • \alpha_{kq} is the transition label from state k to state q

Afterward, we simplify the obtained regular expressions \alpha'_{pq} and set the corresponding transition labels to those expressions.

4. Example

Let’s work out an example from a paper on state elimination.

Our input finite-state machine is:

Finite Automata Example

First, we make it uniform:

Uniform Finite Automata

Now, we randomly choose a state. Let it be q_0. The next step is to compute the regular expressions for paths between the states connected directly via q_0.

The new regular expression for the transition from i to q_1 is:

    [\alpha'_{iq_1} = \alpha_{iq_1} + \alpha_{iq_0} \alpha_{q_0q_0}^* \alpha_{q_0q_1}]

Substituting the labels for their values, we get:

    [\alpha'_{iq_1} = \varepsilon b^*a]

Following this, we simplify the regular expression \varepsilon b^* a to b^*a. Then, we add a new transition between the states i and q_1:

Added Transition Label

Since there’s a path q_1 \rightarrow q_0 \rightarrow q_1, we need to update the transition label \alpha_{q_1q_1}. Following the same procedure, we get that the regular expression for the transition from q_1 to itself is a+bb^*a. Consequently, the transition label from state q_1 to state q_1 is replaced with the new regular expression:

Replaced Transition Label

The new regular expression for the transition from state q_1 to state f remains the same because the path between them doesn’t go through q_0.

Finally, we remove q_0:

Remove First State

Now, the only choice is q_1 since it’s the only non-initial and non-ending state.  Repeating the steps, we get:

Remove Second State

As a result, the automaton now contains only the initial and final states. Hence, we can no longer choose more states to remove. The label of the transition i \rightarrow f is the regular expression equivalent to the input automaton.

5. Complexity

In the worst-case scenario, all the states are directly connected in both directions (p \rightarrow q and q \rightarrow p), there’s a transition from each state to itself, and each state r connects all the pairs of the automaton’s states (p \rightarrow r \rightarrow q and p \rightarrow r \rightarrow p for any r and p, q \neq r).

Let n be the number of states (excluding the initial and the final states).

Since all the states are connected in both directions, there are \boldsymbol{(n-1)^2} possible pairs of states in the first iteration. Hence, we compute (n-1)^2 regular expressions.

In the second iteration, we compute \boldsymbol{(n-2)^2} expressions because there are n-1 states.

We keep removing states and computing the paths between states until there are no more states except the initial and the final ones. Hence, the last regular expression computation is for the transition from the initial to the final states.

Therefore, the worst-case time complexity of the state elimination method is \boldsymbol{O(n^3)} :

    [O((n-1)^2 + ... + 9 + 4 + 1)=O\left( \sum_{j=1}^{n-1}j^2 \right) = O\left( \frac{n(n+1)(2n+1)}{6} - n^2\right) = O(n^3)]

The assumption is that computing regular expressions for paths is an O(1) operation. Since the labels and expressions are strings, we can achieve that if we concatenate them in O(1) time. For example, we can treat strings as arrays that can end with a pointer to another array. That way, we can construct \alpha_{pq}{+}\alpha_{pk}\alpha_{kk}^*\alpha_{kq}  in five steps, so regex computation is O(1).

6. Conclusion

In this article, we showed how to convert a finite-state automaton to a regular expression. We explained the O(n^3) state elimination method. It works by eliminating states and changing the transition labels to the regular expressions of the three-state paths going over the eliminated states.


» 下一篇: 拉姆齐理论