1. Overview

In this tutorial, we’ll explain how to determine whether a given English text expresses a certain word using the regular expression method.

For example, think of the scenario where we’ve several PDF text documents from our mentor, and we want to find all documents that don’t include the word “startup”.

We’d be able to write a small script that runs through all the documents and determines whether the document contains the word or not.

2. Problem Description

Let T be an English text and w be an English word. The regular expression algorithm should return true if w isn’t expressed in \boldsymbol{T} and false otherwise.

For example, let’s say we have the following English text (from the Wikipedia page on startups):

A startup or start-up is a company or project undertaken by an entrepreneur to seek, develop, and validate a scalable business model. Startups are new businesses that intend to grow large beyond the solo founder.

And we want to know if the text doesn’t include the word “founder” or not.

The regular expression algorithm should return false.

3. Regular Expression Method

A regular expression is a string of literals and metacharacters. A metacharacter is a character that guides the regular expression matching method. For example, the caret sign tells the matching method to place the searching pointer at the start of the given text.

We take as input an English text T and a word w. Then, we apply the regular expression \boldsymbol{\wedge((?!w).)^* \textdollar} to the text. The corresponding matching method proceeds as follows:

algorithm MatchTextWithoutWordMethod(T, w):
    // INPUT
    //    T = an English text
    //    w = a word
    // OUTPUT
    //    A Boolean value.

    // Initialize the pointer and the length of the word
    p <- 0
    l <- the length of the word w

    while p < the length of T:
        i <- 0
        a <- 0

        while i < l:
            if p + i < l:
                if w[i] = T[p + i]:
                    a <- a + 1
                i <- i + 1
            else:
                break

        if a = l:
            return false

        p <- p + 1

    return true

Since the regular expression begins with the caret sign, the first step is to move the search pointer to the position of the start of the text.

The next metacharacter is a capturing parenthesis, and the second step is to match a sequence of characters described by the capturing parentheses sub-pattern (?!w)..

The sub-pattern (?!w). is composed of a lookahead followed by the single character “.“. The lookahead \boldsymbol{(?!w)} tells us to match w with the next |\boldsymbol{w}| characters. We denote the matching sequence of characters with c. If it fails to match, the single character “.” (following the negative lookahead group) tells us to add the first character of c to the matching result and to increase the search pointer by one. Otherwise, we terminate the method and return false.

The metacharacter following the sub-pattern (?!w). is the greedy quantifier ^* , which tells us to keep repeating the previous steps until we reach the end of the text (hence, the dollar sign at the end of the regular expression).

4. Flowchart

The regular expression matching method flowchart:

Method Flowchart

5. Positive Example

Let’s work out an example. Our search word is “founder”, and our English text is as follows:

Franchise English Text

First, we move the search pointer to the start of the text:

Franchise Start Text

Now, we check whether the next seven characters match the word “founder”.

Since the phrase “A franc” doesn’t match the word “founder”, we add the character “A” to the matching result. Then, we move the search pointer to the following character:

Franchise Second Character

Repeating the above steps eight more times, we get the following:

Franchise Tenth Character

Accordingly, the matching result is now “A franchi”.

We keep repeating the steps until we reach the end of the text (hence, the dollar sign at the end of the given regular expression).

As a result, we successfully matched the given regular expression. The matching result is the full given text. Therefore, the algorithm returns true.

6. Negative Example

Let’s work out a negative example. Our search word is still “founder” and our English text is as follows:

Startup English Text

First, we move the search pointer to the start of the text:

Startup Start Text

Now, we check whether the next seven characters match the word “founder”.

Since the phrase “A start” doesn’t match the word “founder”, we add the character “A” to the matching result. Then, we move the search pointer to the following character:

Startup Second Character

Once again, we check whether the next seven characters (starting from the pointer) match the word “founder”.

Since the phrase ” startu” doesn’t match the word “founder”, we add the character ” ” to the matching result. The matching result is now “A “. Consequently, we move the search pointer:

Startup Third Character

Repeating the above steps eleven more times, we get:

Startup Forteenth Character

Consequently, the matching result is now “A startup or “.

Following this, we keep repeating the steps until we reach the sequence of characters “founder” that match the word “founder”:

Startup Word Founder

Finally, the negative lookahead doesn’t match. Therefore we fail to match the regular expression, and the method returns false.

7. Complexity

Let w_{1} w_{2} ... w_{n} be the given word, such that w_{1} is the first character, w_{2} is the second character and  w_{n} is the n^{th} character in the word. Furthermore, let’s assume that all the characters w_{1}, w_{2}, ..., w_{n} are syntactically the same (w_{1} = w_{2}= ... = w_{n}).

In the worst-case scenario, the text is composed of one word, the sequence of characters \boldsymbol{a_{1} a_{2} \ldots a_{n-1} a_{new}}. Let’s assume all the characters are syntactically equivalent except the last character in the text (a_{1} = a_{2}= ... = a_{n} and a_{1} \neq a_{new}). Moreover, all the identical characters are syntactically equivalent to the characters of the given word (a_{1} = w_{1}).

In the first iteration, the negative lookahead will match after checking n characters. In the second iteration, the negative lookahead matches after checking n-1 characters. In the n^{th} iteration, the negative lookahead will match after checking one character. Hence, the total computation of the negative lookahead operations is equal to n + (n -1) + (n -2) + \ldots + 1 = \frac{n(n-1)}{2} = \frac{n^2 - n}{2} .

Furthermore, the total computation of matching each character in the text with the single character “.” is n. And the total computation of appending each character in the text to the resulting match is n.

Therefore, the worst-case time complexity of the regular expression method is \boldsymbol{O(n^2)}.

8. Conclusion

In this tutorial, we showed how a regular expression could be used to match a text without a word. We described the corresponding steps to apply the regular expression to the text. The steps are a sequence of lookahead checks with a pointer incrementation.


« 上一篇: Ukkonen的后缀树算法
» 下一篇: 螺旋模型概述