如何检查字符串的周期性

1. Introduction

In this tutorial, we’ll present three ways to check the periodicity of a string. In other words, we check if it’s a repetition of one of its substrings.

2. Problem Statement

We have a string $\boldsymbol{s=s_1s_2\ldots s_n}$ ( $\boldsymbol{n>1}$ ) and want to check if we can obtain it by concatenating one of its proper prefixes to itself multiple times. The substring of we repeatedly concatenate to get has to be its prefix. Otherwise, it wouldn’t be possible to form the whole string.

For example, if s=abcabcabc , then we get by concatenating z=abc two times to itself:

[abc + abc + abc]

which we write as s=z^3 . Since bca isn’t a prefix of , we can’t get abcabcabc from it.

So, formally, the prefix we want to find is of the form $\boldsymbol{z=s_{1:m}=s_1 s_2 \ldots s_m}$ where $\boldsymbol{1 \leq m < n}$ .

Let’s see how we can determine it.

3. The Brute-Force Approach

We can iterate over the prefixes of $\boldsymbol{s}$ and check if concatenating them gives us $\boldsymbol{s}$ . To test if $z=s_1 s_2 \ldots s_m$ concatenates to $s=s_1 s_2 \ldots s_n$ , we iterate over $i=1,2,\ldots, n$ and check if $s_{((i-1) \bmod m) + 1} = s_i$ for all . This way, we cyclicly iterate over :

$[\begin{aligned} i && 1 && 2 && \ldots && m && m + 1 && m + 2 && \ldots \\ s_i && s_1 && s_2 && \ldots && s_m && s_{m+1} && s_{m+2} && \ldots \\ s_{((i-1) \bmod m) + 1} && s_1 && s_2 && \ldots && s_m && s_1 && s_2 && \ldots \\ \end{aligned}]$

If we find an for which the test fails, we can stop the iteration. Otherwise, we return since we’ve proven that $s=z + z + \ldots + z \equiv z z \ldots z \equiv z^k$ for some integer .

Here’s the pseudocode:

algorithm BruteForceAlgorithm(s):
    // INPUT
    //    s = the input string (lenght = n, 1-based indexing)
    // OUTPUT
    //    If there is a proper prefix z = s[1 : m] (1 ≤ m < n) of s
    //    such that s = z^k for an integer k, returns z.
    //    Otherwise, returns an indication of failure.

    for m <- 1 to n - 1:
        // Test a new candidate
        z <- s[1 : m]
        
        i <- 1
        while i <= n and s[((i - 1) mod m) + 1] = s[i]:
            i <- i + 1
        
        if i > n:
            // The other condition was satisfied for all i ≤ n, so s = z + z +...+ z
            return z
    
    // If we've come this far, then no prefix of s can be repeated to obtain s
    return failure

3.1. Complexity

This algorithm’s worst-case time complexity is $\boldsymbol{O(n^2)}$ . Here’s why. The worst-case input is a string of the form $s_1 = s_2 = \ldots s_{n-1} \neq s_n$ . If given such an input, the algorithm runs symbol equality tests for each $m=1,2,\ldots, n-1$ of the outer loop. So, the total number of checks is $(n-1)n \in O(n^2)$ .

The complexity would stay quadratic even if we started checks from $s_{m+1}$ in the inner loop, skipping the unnecessary checks of s_1 = s_1 , s_2 = s_2 , $\ldots$ , s_m = s_m :

$[\sum_{m=1}^{n-1}(n-m)= \left(\sum_{m=1}^{n-1}n \right) - \left(\sum_{m=1}^{n-1}m \right) = n(n-1)-\frac{n(n-1)}{2} = \frac{1}{2}n^2 - \frac{1}{2}n \in O(n^2)]$

The problem with this approach is that it tries to get $\boldsymbol{s}$ even from the prefixes for which we can conclude in advance that they can’t form $\boldsymbol{s}$ . The key is to observe that $\boldsymbol{m}$ has to divide $\boldsymbol{n}$ without remainder for $\boldsymbol{s}$ to be a repetition of $\boldsymbol{s_{1:m}}$ . For instance, |ab|=2 and |abcabcabc|=9 . Even without trying to get the latter from the former, we can say that it’s an impossible task because $9 \mod 2 =1 \neq 0$ .

4. The Improved Algorithm

So, we start from $\boldsymbol{m=1}$ in the outer loop as before but don’t test whether $\boldsymbol{s=z^k}$ unless $\boldsymbol{m}$ is a divisor of $\boldsymbol{n}$ .

The iteration can stop at $\boldsymbol{\lfloor n/2 \rfloor}$ because no number greater than n/2 divides except , and has to be strictly lower than :

Here’s the pseudocode:

algorithm ImprovedAlgorithm(s):
    // INPUT
    //    s = the input string (lenght = n, 1-based indexing)
    // OUTPUT
    //    If there is a proper prefix z = s[1 : m] (1 ≤ m < n) of s
    //    such that s = z^k for an integer k, returns z.
    //    Otherwise, returns an indication of failure.

    for m <- 1, 2, ..., floor(n/2):
        if n mod m = 0:
            // Test a new candidate
            z <- s[1 : m]

            i <- 1
            while i ≤ n and s[((i-1) mod m) + 1] = s[i]:
                i <- i + 1

            if i > n:
                // The other condition was satisfied for all i, so s = z + z +...+ z
                return z

    // If we've come this far, then no prefix of s can be repeated to obtain s
    return failure

Another way would be to identify the divisors of in advance to skip unnecessary checks whether $n \mod m = 0$ .

4.1. The Complexity

Let’s derive the worst-case complexity of this approach. The worst-case input is of the same form as in the previous algorithm ( $s_1 = s_2 = \ldots s_{n-1} \neq s_n$ ). The iterations of the inner loop still perform checks but since we enter the inner loops only if $n \mod m = 0$ , the number of inner loops isn’t $\boldsymbol{n-1}$ but is equal to $\boldsymbol{d(n)-1}$ , where $\boldsymbol{d(n)}$ is the number of different divisors of $\boldsymbol{n}$ .

Since $\boldsymbol{d(n) \in n^{o(1)}}$ , the total complexity is $\boldsymbol{n^{1+o(1)}}$ , so the algorithm is definitely sub-quadratic. We can get a tighter bound if we use the result:

$[\limsup_{n \rightarrow \infty} \frac{ \log d(n) }{ \frac{\log n}{ \log \log n}} = 2]$

From there, we conclude that $d(n) \in O \left( \frac{ \log n }{\log \log n}\right)$ , so the algorithms’ complexity is:

$[\boldsymbol{O \left( n \frac{ \log n }{\log \log n} \right)}]$

That means that the algorithm’s worst-case performance is better than $O(n \log n)$ but worse than O(n) . Is there a way to solve the problem in the O(n) time?

5. The Rotation Algorithm

We can design a linear-time algorithm by using the theory of strings.

5.1. The Rotation Theorem

We’ll use the theorem which states that a string $\boldsymbol{s=s_1 s_2 \ldots s_n}$ is a repetition of one of its substrings if and only if $\boldsymbol{s}$ is a non-trivial rotation of itself. Let’s prove it.

If $s=z z \ldots z$ for some of length , then by removing the first symbols and appending them to the last occurrence of in , we get the same string.

To prove the theorem in the other direction, we assume that is a non-trivial rotation of itself. That means that there’s an such that we can remove the first m < n symbols, append them to the rest of in the same order, and obtain the starting string, :

$[s_1 s_2 \ldots s_m s_{m+1} s_{m+2} \ldots s_n = s_{m+1} s_{m+2} \ldots s_{2m} s_{2m+1} \ldots s_{n-1} s_{n} s_1 s_2 \ldots s_{m}]$

Since the LHS and RHS are equal, it must hold that $s_1 = s_{m+1}$ , $s_2 = s_{m+2}$ , $\ldots$ , $s_{m} = s_{2m}$ . So, we’ve proven that:

$[s = z z s_{2m:n} \quad \text{ for } z = s_{1:m}]$

However, since the rotation of is equal to , we can rotate it again by symbols to get the same string and conclude that $s_{m+1}=s_{2m+1}$ , $s_{m+2} = s_{2m+2}$ , $\ldots$ , $s_{2m}=s_{3m}$ , so $s=z z z s_{3m:n}$ . Following the approach, we get that $s = z^{ \frac{n}{m}}$ .

5.2. The Algorithm

So, the problem is now to check whether the input string $\boldsymbol{s}$ is a non-trivial rotation of itself. If that’s the case, $\boldsymbol{s}$ will be a proper substring of $\boldsymbol{s s}$ (not starting at the $\boldsymbol{1}$ st and the $\boldsymbol{n}$ th positions). Therefore, we reduce the original problem to a string search. We try to find the index of in s s different from 1 and . If successful, we’ll show that is periodic, and the building block will be $s_{1:(k-1)}$ .

We can use an efficient string matching algorithm that uses the substring index of s s (constructible in $\Theta(n)$ time) and then look for a non-trivial occurrence of in O(n) time. Therefore, we can solve the original problem in the linear worst-case time.

6. Conclusion

In this article, we presented three algorithms to check string periodicity. The first and the second ones are easier to understand and implement than the third one. However, the rotation algorithm is more efficient. Its worst-case complexity is O(n) , whereas the other two have super-linear time complexities.

Persistence

REST

Security