如何将有限自动机转换为正则表达式？

1. Overview

In this tutorial, we’ll explain how to convert finite automata to regular expressions using the state elimination method.

Regular expressions are used in many applications, and most programming languages offer a way to implement them.

2. Problem Description

Let be a finite automaton and $\alpha_{pq}$ the transition label of the branch going from to , where they denote any two states of . The conversion algorithm should return a regular expression defining the regular language that $\boldsymbol{A}$ accepts.

For example, let’s say we have the following automaton:

The regular expression it’s equivalent to is b^*a(a+bb^*a)^* .

A use case for this is natural language processing (NLP). NLP engineers often design string-matching patterns, and for them, it’s much easier to read a regular expression than to look at a complex finite automaton.

3. The State Elimination Method

We first need to make $\boldsymbol{A}$ uniform.** A uniform finite automaton has an initial state with no incoming transitions and a final state with no outgoing transitions.

Then, we repeatedly remove a randomly selected non-endpoint state and compute the regular expressions of the paths between the remaining pairs until we reduce $\boldsymbol{A}$ to only the initial and final states.

There are other methods besides the state elimination method; these include Adren’s method, Brzozowski’s algebraic method and the transitive closure method. We’ll show this one because it’s simple and easy to understand and implement.

3.1. Pseudocode

Here’s the pseudocode:

algorithm StateEliminationMethod(A):
    // INPUT
    //    A = a uniform finite automaton
    // OUTPUT
    //    a regular expression defining the regular language that A accepts

    while A contains more than two states:
        Randomly choose a state k (not the initial or the final state)

        for each pair of states (p, q) connected via k:
            path_regex <- alpha[p,q] + alpha[p,k] alpha[k,k]* alpha[k,q]
            Simplify path_regex
            Set path_regex as the transition label alpha[p,q]
        
        Remove k from A

    return the regular expression defining the regular language that A accepts

As we see, the algorithm repeats two steps in each iteration.

The first one is to randomly choose a state (not the initial or the final state) and remove it from .

The second step is to compute the regular expressions for the paths between each pair (p, q) of the remaining states connected directly through . Then, we change the transition labels $\boldsymbol{\alpha_{pq}}$ to the computed regular expressions. The regular expressions of pairs of states that aren’t directly connected through remain the same.

The final step is to remove from since all the paths going through it are now in the updated or newly added transition labels.

3.2. Computing Regular Expressions for Paths

Let be the randomly chosen state.

We compute the new regular expressions for all possible pairs of states and in ( $p, q \neq k$ ) such that there’s a path $p \rightarrow k \rightarrow q$ using the following equation:

$[\alpha'_{pq} = \alpha_{pq} + \alpha_{pk} \alpha_{kk}^* \alpha_{kq}]$

where:

$\alpha_{pq}$ is the transition label from state to state
$\alpha_{pk}$ is the transition label from state to state
$\alpha_{kk}$ is the transition label from state to state
$\alpha_{kq}$ is the transition label from state to state

Afterward, we simplify the obtained regular expressions $\alpha'_{pq}$ and set the corresponding transition labels to those expressions.

4. Example

Let’s work out an example from a paper on state elimination.

Our input finite-state machine is:

First, we make it uniform:

Now, we randomly choose a state. Let it be q_0 . The next step is to compute the regular expressions for paths between the states connected directly via q_0 .

The new regular expression for the transition from to q_1 is:

$[\alpha'_{iq_1} = \alpha_{iq_1} + \alpha_{iq_0} \alpha_{q_0q_0}^* \alpha_{q_0q_1}]$

Substituting the labels for their values, we get:

$[\alpha'_{iq_1} = \varepsilon b^*a]$

Following this, we simplify the regular expression $\varepsilon b^* a$ to b^*a . Then, we add a new transition between the states and q_1 :

Added Transition Label

Since there’s a path $q_1 \rightarrow q_0 \rightarrow q_1$ , we need to update the transition label $\alpha_{q_1q_1}$ . Following the same procedure, we get that the regular expression for the transition from q_1 to itself is a+bb^*a . Consequently, the transition label from state q_1 to state q_1 is replaced with the new regular expression:

The new regular expression for the transition from state q_1 to state remains the same because the path between them doesn’t go through q_0 .

Finally, we remove q_0 :

Remove First State

Now, the only choice is q_1 since it’s the only non-initial and non-ending state. Repeating the steps, we get:

Remove Second State

As a result, the automaton now contains only the initial and final states. Hence, we can no longer choose more states to remove. The label of the transition $i \rightarrow f$ is the regular expression equivalent to the input automaton.

5. Complexity

In the worst-case scenario, all the states are directly connected in both directions ( $p \rightarrow q$ and $q \rightarrow p$ ), there’s a transition from each state to itself, and each state connects all the pairs of the automaton’s states ( $p \rightarrow r \rightarrow q$ and $p \rightarrow r \rightarrow p$ for any and $p, q \neq r$ ).

Let be the number of states (excluding the initial and the final states).

Since all the states are connected in both directions, there are $\boldsymbol{(n-1)^2}$ possible pairs of states in the first iteration. Hence, we compute (n-1)^2 regular expressions.

In the second iteration, we compute $\boldsymbol{(n-2)^2}$ expressions because there are n-1 states.

We keep removing states and computing the paths between states until there are no more states except the initial and the final ones. Hence, the last regular expression computation is for the transition from the initial to the final states.

Therefore, the worst-case time complexity of the state elimination method is $\boldsymbol{O(n^3)}$ :

$[O((n-1)^2 + ... + 9 + 4 + 1)=O\left( \sum_{j=1}^{n-1}j^2 \right) = O\left( \frac{n(n+1)(2n+1)}{6} - n^2\right) = O(n^3)]$

The assumption is that computing regular expressions for paths is an operation. Since the labels and expressions are strings, we can achieve that if we concatenate them in O(1) time. For example, we can treat strings as arrays that can end with a pointer to another array. That way, we can construct $\alpha_{pq}{+}\alpha_{pk}\alpha_{kk}^*\alpha_{kq}$ in five steps, so regex computation is O(1) .

6. Conclusion

In this article, we showed how to convert a finite-state automaton to a regular expression. We explained the O(n^3) state elimination method. It works by eliminating states and changing the transition labels to the regular expressions of the three-state paths going over the eliminated states.

Persistence

REST

Security