1. Introduction

In this tutorial, we analyze the worst-case, best-case, and average-case time complexity of QuickSelect, an algorithm for finding the k-th largest element in an n-element array (1\leq k \leq n). In honor of its inventor, it's also known as Hoare's Selection Algorithm.

2. QuickSelect

QuickSelect is similar to QuickSort. The main difference is that the sorting algorithm recurses on both subarrays after partitioning, whereas the selection algorithm recurses only on the subarray that provably contains the k-th largest element:

algorithm QuickSelect(a, ℓ, h, k):
    // INPUT
    //    a = an n-element array
    //    ℓ = the start index of the subarray (initially, ℓ = 1)
    //    h = the end index of the subarray (initially, h = n)
    //    k = the rank of the sought element (1 ≤ k ≤ n)
    // OUTPUT
    //    The k-th largest element in a

    if ℓ = h:
        return a[ℓ]
    else if ℓ < h:
        p <- choose the pivot, partition a[ℓ:h], and return the new index of the pivot

        if k = p:
            return a[p]
        else if k < p:
            return QuickSelect(a, ℓ, p-1, k) // Apply QuickSelect to the left subarray
        else:
            return QuickSelect(a, p+1, h, k) // Apply QuickSelect to the right subarray

We can nicely visualize the steps QuickSelect makes while searching:

[Figure: QuickSelect's search steps]

2.1. Partitioning

Algorithm 1 is a generic form of QuickSelect since it doesn’t specify how to partition the input array and select pivot elements. Several methods have appeared over the years. Since it’s the easiest to analyze, we’ll use Lomuto partitioning with random pivot selection in this tutorial. Other approaches have the same or a similar complexity or can be analyzed similarly.

In brief, Lomuto selects a random element in the input array a[\ell:h] as the pivot, swaps it with the last element a[h], and splits the array in a single loop iterating from a[\ell] to a[h-1]:

algorithm LomutoPartitioning(a, ℓ, h):
    // INPUT
    //    a = an n-element array
    //    a[ℓ:h] = the subarray a[ℓ], a[ℓ+1], ..., a[h] (ℓ < h)
    // OUTPUT
    //    The pivot's final index, which splits a[ℓ:h] into left and right subarrays

    Randomly select r from {ℓ, ℓ+1, ..., h}
    Swap a[h] with a[r]
    
    x <- a[h]
    k <- ℓ - 1
    
    for j <- ℓ to h - 1:
        if a[j] <= x:
            k <- k + 1
            Swap a[j] with a[k]
    
    Swap a[k + 1] with a[h]
    
    return k + 1
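The two routines above can be combined into a short runnable sketch in Python. This is our own minimal version (0-based indices, so the rank k in the text corresponds to index k-1); it returns the element that would occupy position k+1... more precisely, the element of 0-based rank k in the sorted array, mirroring how Algorithm 1 compares k with the pivot's index:

```python
import random

def lomuto_partition(a, lo, hi):
    """Lomuto partitioning of a[lo..hi] around a randomly chosen pivot."""
    r = random.randint(lo, hi)         # random pivot from the current subarray
    a[r], a[hi] = a[hi], a[r]          # move the pivot to the end
    x = a[hi]
    k = lo - 1
    for j in range(lo, hi):
        if a[j] <= x:
            k += 1
            a[j], a[k] = a[k], a[j]
    a[k + 1], a[hi] = a[hi], a[k + 1]  # put the pivot between the two parts
    return k + 1

def quickselect(a, lo, hi, k):
    """Return the element of 0-based rank k in a[lo..hi] (a is modified)."""
    if lo == hi:
        return a[lo]
    p = lomuto_partition(a, lo, hi)
    if k == p:
        return a[p]
    elif k < p:
        return quickselect(a, lo, p - 1, k)   # recurse on the left subarray
    else:
        return quickselect(a, p + 1, hi, k)   # recurse on the right subarray
```

For example, quickselect([7, 1, 5, 3], 0, 3, 1) returns 3, the element at position 2 of the sorted array [1, 3, 5, 7], regardless of the random pivot choices.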

3. The Worst-Case Complexity Analysis of QuickSelect

Before deriving the complexity bounds of the worst-case scenario, let’s first determine its structure.

3.1. The Structure of the Worst-Case Scenario

We see that QuickSelect discards a portion of the original array in each recursive call. The worst case corresponds to the longest possible execution of the algorithm: the one in which the k-th largest element is the last one standing, and there are n recursive calls. A necessary condition for that is that QuickSelect discards the smallest possible portion in each call. In other words, the subarray it ignores should be empty so that it discards only the current pivot. Additionally, the pivot shouldn't be the k-th largest element. Mathematically, each recursive call should fulfill one of the two conditions:

  • p = h and k < h
  • p = \ell and k > \ell

However, they aren’t sufficient for the worst case to occur. Let’s say that we approach the sought element always from the right. As a result, we make n-k recursive calls before we find it. On the other hand, we make k calls if we approach it only from the left. This is the worst-case scenario only if k=1 or k=n, respectively. If 1 < k < n, then to find the k-th largest element in the n-th call, the algorithm should select all the other elements as pivots before reaching it. So, QuickSelect approaches the element from both sides in the worst case if 1 < k < n.
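We can sketch this worst case concretely. In the following Python snippet (the adversarial instrumentation is our own illustration, not part of the original algorithm), we search for the largest element while an adversary always picks the subarray's minimum as the pivot, so the discarded subarray is always empty; the run then takes n calls and (n-1) + (n-2) + \ldots + 1 = n(n-1)/2 comparisons:

```python
def worst_case_steps(n):
    """QuickSelect searching for the largest element of [1..n] while an
    adversary always chooses the subarray's minimum as the pivot.
    Returns (number of calls, number of comparisons)."""
    a = list(range(1, n + 1))
    comparisons = calls = 0
    lo, hi = 0, n - 1
    target = n - 1                        # 0-based rank of the largest element
    while True:
        calls += 1
        if lo == hi:
            break
        # adversarial pivot: the minimum, so the left subarray stays empty
        r = min(range(lo, hi + 1), key=a.__getitem__)
        a[r], a[hi] = a[hi], a[r]
        x = a[hi]
        k = lo - 1
        for j in range(lo, hi):           # Lomuto's loop: one comparison each
            comparisons += 1
            if a[j] <= x:
                k += 1
                a[j], a[k] = a[k], a[j]
        a[k + 1], a[hi] = a[hi], a[k + 1]
        p = k + 1                         # the minimum lands at index lo
        if p == target:
            break
        lo = p + 1                        # only the pivot gets discarded
    return calls, comparisons
```

For example, worst_case_steps(10) returns (10, 45): ten calls and 9 + 8 + \ldots + 1 = 45 comparisons, in line with the quadratic bound derived next.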

3.2. The Analysis

Lomuto partitioning of an n-element array performs n-1 comparisons and at most n+1 swaps. So, its complexity is \Theta(n). Let T_n denote the number of steps (comparisons and swaps) QuickSelect performs when searching an n-element array. Since QuickSelect always reduces the size of the subarray to recurse on by 1 (the ignored subarray is empty, and the pivot isn't included in the other subarray), the following recurrence holds:

(1)   \begin{equation*} T_n = \Theta(n) + T_{n-1} \end{equation*}

Unfolding it, we get:

    \begin{equation*} \begin{aligned} T_n &= \Theta(n) + \Theta(n-1) + \ldots + \Theta(1) \\ &= \Theta\left(n+(n-1)+\ldots+1\right) \\ &=\Theta\left(\frac{n(n+1)}{2}\right) \\ &=\Theta(n^2) \end{aligned} \end{equation*}

So, QuickSelect is of quadratic complexity in the worst case.

4. The Best-Case Complexity Analysis of QuickSelect

The best-case occurs when QuickSelect chooses the k-th largest element as the pivot in the very first call. Then, the algorithm performs \Theta(n) steps during the first (and only) partitioning, after which it terminates. Therefore, QuickSelect is \Theta(n) in the best case.

5. The Average-Case Complexity Analysis of QuickSelect

Before determining the algorithm’s average-case complexity, let’s consider what it means to analyze the average case. To be more precise, by average, we mean expected, which is a term from probability theory.

5.1. The Expected (Average) Time Complexity

The expected complexity of an algorithm is the expectation of its complexity over the space of all possible inputs. That is, we regard the input as random and following a probability distribution, and we find the expectation under that distribution. Usually, we assume the distribution is uniform, which means we consider all the inputs equally likely.

Even though we don’t know the actual distribution of the inputs the algorithm encounters in the real world, the uniform model is a pretty good choice: the analyses based on it have proved accurate and robust in many cases so far. Moreover, we can introduce randomness through the algorithm to ensure the uniform model holds no matter the input’s underlying distribution. For example, we can randomly shuffle the input array before we call QuickSelect. Or, we can choose the pivots at random. However, if we introduce randomness through the algorithm, then the interpretation of the average case changes.

Under the assumption that the input arrays follow a uniform distribution, the average case describes the complexity averaged over all possible inputs. But, if we randomly shuffle the input array or randomly choose the pivot element, then the average case we analyze describes the complexity averaged over all possible executions of QuickSelect on an input array.

5.2. The Random Model for QuickSelect

In the case of QuickSelect, we'll consider only arrays whose elements are all distinct. Since a contains no duplicates, we can think of it as a permutation of \{1,2,\ldots,n\}: each element a[i] has a unique rank in \{1,2,\ldots,n\}, so we can work with the numbers 1,2,\ldots,n instead of the abstract elements a[1], a[2], \ldots, a[n]. We can disregard the case with duplicates since it has the same complexity.

Since in each call we choose the pivot randomly, the pivot's rank follows the uniform distribution over \boldsymbol{\{\ell,\ell+1,\ldots, h\}}.

We’ll express the expected complexity of QuickSelect in terms of the input size, n. We’ll track only the number of comparisons it makes because the number of swaps in each partitioning is bounded from above by the number of comparisons.
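Before the formal derivation, the quantity under analysis can be sketched empirically. The snippet below (our own instrumentation; function names are illustrative) counts the comparisons a randomized-pivot QuickSelect makes and averages them over random permutations of \{1,2,\ldots,n\}, which models the uniform input distribution:

```python
import random

def comparisons_in_run(a, k):
    """Run QuickSelect (random Lomuto pivots) for the 0-based rank k
    and return the number of comparisons made."""
    a = list(a)
    count = 0
    lo, hi = 0, len(a) - 1
    while lo < hi:
        r = random.randint(lo, hi)
        a[r], a[hi] = a[hi], a[r]
        x = a[hi]
        m = lo - 1
        for j in range(lo, hi):
            count += 1                    # one comparison per loop step
            if a[j] <= x:
                m += 1
                a[j], a[m] = a[m], a[j]
        a[m + 1], a[hi] = a[hi], a[m + 1]
        p = m + 1
        if p == k:
            break
        if k < p:
            hi = p - 1
        else:
            lo = p + 1
    return count

def average_comparisons(n, k, trials=300):
    """Average comparison count over random permutations of {1, ..., n}."""
    random.seed(42)                       # fixed seed for reproducibility
    total = 0
    for _ in range(trials):
        perm = random.sample(range(1, n + 1), n)
        total += comparisons_in_run(perm, k - 1)
    return total / trials
```

Comparing average_comparisons(200, 100) with average_comparisons(400, 200) shows the average roughly doubling when n doubles, which is what linear growth looks like.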

5.3. The Analysis

Let’s define random variables X_{i,j} (i=1,2,\ldots,n-1; j=i+1,i+2,\ldots,n):

(2)   \begin{equation*} X_{i,j} = \begin{cases} 1,& i \text{ and } j \text{ get compared at any time during the execution of QuickSelect}\\ 0,& \text{otherwise} \end{cases} \end{equation*}

The total number of comparisons QuickSelect makes when searching through an array of n numbers is then:

(3)   \begin{equation*} \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}X_{i,j} \end{equation*}

and the expected, that is, the average number of comparisons for an input of size n is:

(4)   \begin{equation*} \begin{aligned} C_n &= E\left[\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}X_{i,j}\right] \\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}E[X_{i,j}] \\ &= \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}q_{i,j} \end{aligned} \end{equation*}

where q_{i,j} is the probability that X_{i,j}=1. How can we compute q_{i,j}?

There are three cases to consider:

  • Case 1: k \leq i < j
  • Case 2: i < j \leq k
  • Case 3: i < k < j

[Figure: the three cases for the positions of i, j, and k]

QuickSelect doesn't compare i and j if it selects k or a number between them as a pivot. This holds in each of the three cases above. If k gets to be the pivot, the algorithm has found the number it's been looking for, so the search stops. If the pivot is between i and j, then i ends up in the left subarray and j in the right one during partitioning. The algorithm can compare only elements that end up in the same subarray in a call.

But QuickSelect does compare the pivot element to all the other numbers in a recursive call! So, for \boldsymbol{i} and \boldsymbol{j} to get compared, QuickSelect should choose \boldsymbol{i} or \boldsymbol{j} as the pivot before any other number from the range of the corresponding case.

Since the pivot is chosen uniformly at random, the selection probability of any integer from the range [\ell, h] is \frac{1}{h-\ell+1}. Since the selections of i and j as the first pivot from the corresponding range are mutually exclusive events, we can sum their probabilities to get q_{i,j} in each case:

(5)   \begin{equation*} q_{i,j} = \begin{cases} \frac{1}{j-k+1}+\frac{1}{j-k+1}=\frac{2}{j-k+1},& k \leq i < j \\ \frac{1}{k-i+1}+\frac{1}{k-i+1}=\frac{2}{k-i+1},& i < j \leq k \\ \frac{1}{j-i+1}+\frac{1}{j-i+1}=\frac{2}{j-i+1},& i < k < j \end{cases} \end{equation*}
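We can sanity-check (5) with a small Monte Carlo experiment. The sketch below (our own instrumentation; the helper names are illustrative) records every pair of values that get compared during a run on a random permutation of \{1,\ldots,n\} and estimates q_{i,j} by the fraction of runs in which i and j are compared:

```python
import random

def compared_pairs(a, k):
    """Run QuickSelect (random Lomuto pivots) for the 0-based rank k and
    return the set of value pairs compared at some point during the run."""
    a = list(a)
    pairs = set()
    lo, hi = 0, len(a) - 1
    while lo < hi:
        r = random.randint(lo, hi)
        a[r], a[hi] = a[hi], a[r]
        x = a[hi]
        m = lo - 1
        for j in range(lo, hi):
            pairs.add((min(a[j], x), max(a[j], x)))   # a[j] compared with pivot
            if a[j] <= x:
                m += 1
                a[j], a[m] = a[m], a[j]
        a[m + 1], a[hi] = a[hi], a[m + 1]
        p = m + 1
        if p == k:
            break
        if k < p:
            hi = p - 1
        else:
            lo = p + 1
    return pairs

def estimate_q(n, k, i, j, trials=20000):
    """Empirical probability that values i and j get compared (1-based k)."""
    random.seed(7)                                    # reproducible estimate
    hits = 0
    for _ in range(trials):
        perm = random.sample(range(1, n + 1), n)
        if (i, j) in compared_pairs(perm, k - 1):
            hits += 1
    return hits / trials
```

For n = 8, k = 4, i = 2, j = 6 (Case 3), the estimate comes out close to 2/(j-i+1) = 0.4, and the other two cases behave analogously.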

We can split C_n into summations by each case and then add together the obtained sums. In Case 1, we have:

(6)   \begin{equation*} \begin{aligned} \sum_{j=k+1}^{n}\sum_{i=k}^{j-1}\frac{2}{j-k+1} &= \sum_{j=k+1}^{n}(j-k)\cdot \frac{2}{j-k+1} \\ &=2\sum_{j=k+1}^{n} \frac{j-k}{j-k+1} \\ &\leq 2 (n -k) \qquad \forall(j>k)\left(\frac{j-k}{j-k+1}<1\right) \\ &\leq 2n \in O(n) \end{aligned} \end{equation*}

We get a similar result for Case 2:

(7)   \begin{equation*} \begin{aligned} \sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\frac{2}{k-i+1} &= \sum_{i=1}^{k-1}(k-i)\cdot \frac{2}{k-i+1} \\ &=2\sum_{i=1}^{k-1} \frac{k-i}{k-i+1} \\ &\leq 2(k-1) \qquad \forall(i<k)(\frac{k-i}{k-i+1}<1) \\ &\leq 2k \\ & \leq 2n \in O(n) \end{aligned} \end{equation*}

Case 3 will also be O(n):

(8)   \begin{equation*} \begin{aligned} \sum_{i=1}^{k-1}\sum_{j=k+1}^{n}\frac{2}{j-i+1} &=  2\sum_{j=k+1}^{n}\frac{1}{j}+2\sum_{j=k+1}^{n}\frac{1}{j-1} + \ldots + 2\sum_{j=k+1}^{n}\frac{1}{j-k+2} \\ &=2 \left(\frac{1}{k+1}+\frac{1}{k+2}+\ldots+\frac{1}{n}\right)+2\left(\frac{1}{k}+\frac{1}{k+1}+\ldots+ \frac{1}{n-1}\right)+\ldots + 2\left(\frac{1}{3}+\frac{1}{4}+\ldots + \frac{1}{n-k+2}\right)\\ &=2 (H_n-H_{k}) + 2(H_{n-1}-H_{k-1})+\ldots +2(H_{n-k+2}-H_2) \\ \end{aligned} \end{equation*}

We used the definition of the m-th harmonic number, H_m=\sum_{i=1}^{m}\frac{1}{i}. Since \ln m \leq H_m \leq \ln m + 1 for all m\geq 2, we have:

(9)   \begin{equation*} \begin{aligned} \sum_{i=1}^{k-1}\sum_{j=k+1}^{n}\frac{2}{j-i+1} &\leq 2(\ln n + 1 - \ln k) + 2(\ln(n-1)+1-\ln (k-1)) + \ldots + 2(\ln(n-k+2)+1-\ln 2)\\ &= 2(k-1) + 2\left(\ln \frac{n}{k} + \ln \frac{n-1}{k-1} + \ldots + \ln\frac{n-k+2}{2}\right) \\ &\leq 2(k-1) + 2\left(\ln \frac{n}{k} + \ln \frac{n-1}{k-1} + \ldots + \ln\frac{n-k+2}{2} + \ln\frac{n-k+1}{1}\right) \\ & = 2(k-1) + 2\ln \frac{n(n-1)\cdot\ldots\cdot (n-k+1)}{k(k-1)\cdot\ldots\cdot 2\cdot 1} \\ &= 2(k-1) + 2\ln \binom{n}{k} \\ &\leq 2n + 2\ln 2^n = 2(1+\ln 2)n \in O(n) \end{aligned} \end{equation*}
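Summing the probabilities from (5) directly also lets us check the bound numerically. The sketch below (our own helper; the name is illustrative) computes C_n exactly as the double sum over all pairs:

```python
def expected_comparisons(n, k):
    """C_n: the sum of q_{i,j} from (5) over all pairs 1 <= i < j <= n."""
    total = 0.0
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            if k <= i:                     # Case 1: k <= i < j
                total += 2 / (j - k + 1)
            elif j <= k:                   # Case 2: i < j <= k
                total += 2 / (k - i + 1)
            else:                          # Case 3: i < k < j
                total += 2 / (j - i + 1)
    return total
```

For k = n/2, the ratio expected_comparisons(n, n // 2) / n stays bounded by a small constant as n grows, in line with the O(n) result.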

Therefore, QuickSelect is O(n) in the average case.

6. Discussion of Other Partitioning Schemes and Pivot Selection Strategies

We analyzed QuickSelect that uses Lomuto partitioning, but there are other alternatives such as Hoare partitioning. It scans the input array from both ends. Although it performs slightly more comparisons than Lomuto partitioning, it makes fewer swaps on average. Nevertheless, QuickSelect has the same complexity no matter if we use Hoare or Lomuto partitioning.

We may choose the pivot deterministically in both partitioning schemes, always selecting the first, the last, or some other fixed element. But we may also pick it randomly, as we did here. Both strategies have the same asymptotic complexity, but random selection may result in faster runtime in practice.

However, there's a way to reduce the worst-case complexity to O(n): selecting pivots with the Median of Medians. It's an algorithm that returns an element guaranteed to be smaller than the top 30% and larger than the bottom 30% of the input array's elements. That means each recursive step shrinks the array to at most 0.7 of its size in the worst case, which gives the following recurrence:

(10)   \begin{equation*} T_n \leq \Theta(n) + T_{0.7n} \end{equation*}

and results in the linear worst-case complexity:

(11)   \begin{equation*} \begin{aligned} T_n &\leq \sum_{k=0}^{\log_{0.7}\frac{1}{n}}\Theta(0.7^kn) \\ &=\Theta\left(n\cdot \sum_{k=0}^{\log_{0.7}\frac{1}{n}}0.7^k\right) \\ &=\Theta\left(n\cdot \frac{1-\frac{0.7}{n}}{0.3}\right) \\ & \leq \Theta\left(\frac{n}{0.3}\right) \in O(n) \end{aligned} \end{equation*}
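We can also unfold recurrence (10) numerically. The sketch below (ours, purely illustrative) sums the per-level costs n, 0.7n, 0.7^2 n, \ldots and confirms the total never exceeds n/0.3:

```python
def mom_cost(n):
    """Unfold T(n) = n + T(0.7 n) down to subproblems of size below 1,
    returning the total cost across all levels."""
    total, size = 0.0, float(n)
    while size >= 1:
        total += size       # cost of the current level
        size *= 0.7         # the subproblem shrinks by the factor 0.7
    return total
```

For instance, mom_cost(10**6) / 10**6 is below 1/0.3 \approx 3.33, matching the geometric-series bound in (11).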

7. Conclusion

In this article, we analyzed the worst-case, best-case, and average-case time complexity of QuickSelect. The usual implementations of the algorithm, which use Hoare or Lomuto partitioning, are of quadratic complexity in the worst case but linear in the average and best cases, irrespective of the pivot selection strategy. However, using the Median of Medians to select pivots results in linear runtime even in the worst case.

