匹配不包含某个单词的文本的正则表达式

1. Overview

In this tutorial, we’ll explain how to determine whether a given English text expresses a certain word using the regular expression method.

For example, think of the scenario where we’ve several PDF text documents from our mentor, and we want to find all documents that don’t include the word “startup”.

We’d be able to write a small script that runs through all the documents and determines whether the document contains the word or not.

2. Problem Description

Let be an English text and be an English word. The regular expression algorithm should return true if isn’t expressed in $\boldsymbol{T}$ and false otherwise.

For example, let’s say we have the following English text (from the Wikipedia page on startups):

A startup or start-up is a company or project undertaken by an entrepreneur to seek, develop, and validate a scalable business model. Startups are new businesses that intend to grow large beyond the solo founder.

And we want to know if the text doesn’t include the word “founder” or not.

The regular expression algorithm should return false.

3. Regular Expression Method

A regular expression is a string of literals and metacharacters. A metacharacter is a character that guides the regular expression matching method. For example, the caret sign tells the matching method to place the searching pointer at the start of the given text.

We take as input an English text and a word . Then, we apply the regular expression $\boldsymbol{\wedge((?!w).)^* \textdollar}$ to the text. The corresponding matching method proceeds as follows:

algorithm MatchTextWithoutWordMethod(T, w):
    // INPUT
    //    T = an English text
    //    w = a word
    // OUTPUT
    //    A Boolean value.

    // Initialize the pointer and the length of the word
    p <- 0
    l <- the length of the word w

    while p < the length of T:
        i <- 0
        a <- 0

        while i < l:
            if p + i < l:
                if w[i] = T[p + i]:
                    a <- a + 1
                i <- i + 1
            else:
                break

        if a = l:
            return false

        p <- p + 1

    return true

Since the regular expression begins with the caret sign, the first step is to move the search pointer to the position of the start of the text.

The next metacharacter is a capturing parenthesis, and the second step is to match a sequence of characters described by the capturing parentheses sub-pattern (?!w). .

The sub-pattern (?!w). is composed of a lookahead followed by the single character ““. The lookahead $\boldsymbol{(?!w)}$ tells us to match with the next | $\boldsymbol{w}$ | characters. We denote the matching sequence of characters with . If it fails to match, the single character “” (following the negative lookahead group) tells us to add the first character of to the matching result and to increase the search pointer by one. Otherwise, we terminate the method and return false.

The metacharacter following the sub-pattern (?!w). is the greedy quantifier , which tells us to keep repeating the previous steps until we reach the end of the text (hence, the dollar sign at the end of the regular expression).

4. Flowchart

The regular expression matching method flowchart:

5. Positive Example

Let’s work out an example. Our search word is “founder”, and our English text is as follows:

First, we move the search pointer to the start of the text:

Now, we check whether the next seven characters match the word “founder”.

Since the phrase “A franc” doesn’t match the word “founder”, we add the character “A” to the matching result. Then, we move the search pointer to the following character:

Repeating the above steps eight more times, we get the following:

Accordingly, the matching result is now “A franchi”.

We keep repeating the steps until we reach the end of the text (hence, the dollar sign at the end of the given regular expression).

As a result, we successfully matched the given regular expression. The matching result is the full given text. Therefore, the algorithm returns true.

6. Negative Example

Let’s work out a negative example. Our search word is still “founder” and our English text is as follows:

First, we move the search pointer to the start of the text:

Now, we check whether the next seven characters match the word “founder”.

Since the phrase “A start” doesn’t match the word “founder”, we add the character “A” to the matching result. Then, we move the search pointer to the following character:

Once again, we check whether the next seven characters (starting from the pointer) match the word “founder”.

Since the phrase ” startu” doesn’t match the word “founder”, we add the character ” ” to the matching result. The matching result is now “A “. Consequently, we move the search pointer:

Repeating the above steps eleven more times, we get:

Consequently, the matching result is now “A startup or “.

Following this, we keep repeating the steps until we reach the sequence of characters “founder” that match the word “founder”:

Finally, the negative lookahead doesn’t match. Therefore we fail to match the regular expression, and the method returns false.

7. Complexity

Let $w_{1} w_{2} ... w_{n}$ be the given word, such that $w_{1}$ is the first character, $w_{2}$ is the second character and $w_{n}$ is the $n^{th}$ character in the word. Furthermore, let’s assume that all the characters $w_{1}, w_{2}, ..., w_{n}$ are syntactically the same ( $w_{1} = w_{2}= ... = w_{n}$ ).

In the worst-case scenario, the text is composed of one word, the sequence of characters $\boldsymbol{a_{1} a_{2} \ldots a_{n-1} a_{new}}$ . Let’s assume all the characters are syntactically equivalent except the last character in the text ( $a_{1} = a_{2}= ... = a_{n}$ and $a_{1} \neq a_{new}$ ). Moreover, all the identical characters are syntactically equivalent to the characters of the given word ( $a_{1} = w_{1}$ ).

In the first iteration, the negative lookahead will match after checking characters. In the second iteration, the negative lookahead matches after checking n-1 characters. In the $n^{th}$ iteration, the negative lookahead will match after checking one character. Hence, the total computation of the negative lookahead operations is equal to $n + (n -1) + (n -2) + \ldots + 1 = \frac{n(n-1)}{2} = \frac{n^2 - n}{2}$ .

Furthermore, the total computation of matching each character in the text with the single character “.” is . And the total computation of appending each character in the text to the resulting match is .

Therefore, the worst-case time complexity of the regular expression method is $\boldsymbol{O(n^2)}$ .

8. Conclusion

In this tutorial, we showed how a regular expression could be used to match a text without a word. We described the corresponding steps to apply the regular expression to the text. The steps are a sequence of lookahead checks with a pointer incrementation.

Persistence

REST

Security