1. Introduction
In this tutorial, we’ll present two algorithms for calculating the median of a data stream.
2. The Median of a Stream
A median of a random variable X is a value that splits X’s range in half: one half of the values is lower than the median and the other is greater than it. Formally, m is a median of X if P(X <= m) >= 1/2 and P(X >= m) >= 1/2.
To determine the median of a sample x_1, x_2, ..., x_n drawn from X, we sort it to get the sample order statistics x_(1) <= x_(2) <= ... <= x_(n) and select the middle value as the sample’s median: x_((n+1)/2) if n is odd, and the average of x_(n/2) and x_(n/2 + 1) if n is even.
This formula, however, assumes that the sample is sortable. That means we can store it in memory and make multiple passes over it.
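As a quick illustration, the offline approach can be sketched in Python (a minimal sketch, not production code):

```python
def sample_median(sample):
    """Offline median: sort the whole sample and take the middle value."""
    xs = sorted(sample)          # requires the full sample in memory
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:               # odd size: the single middle order statistic
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # even size: average the two middle ones

print(sample_median([7, 1, 5, 3, 9]))    # 5
print(sample_median([4, 1, 3, 2]))       # 2.5
```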
2.1. Data Streams
That’s not the case when the sample is a big data stream. In such cases, we read the elements one by one as soon as they become available, and were we to keep them all in memory, we’d run out of it before the stream ended. So, we can’t store or sort the whole stream, and we can’t revisit an element after reading it unless we save it. For those reasons, we can’t determine the median in the usual way.
Instead, we use streaming algorithms specialized for data streams. By design, they run in low-memory environments that can store only a small portion of the stream at any point during execution. To respect those constraints, streaming algorithms sacrifice precision for low memory complexity: they return estimates instead of the exact values the offline algorithms would. The goal is to have low time and space complexities while retaining as much precision as possible.
3. Symmetric Data
If the data are distributed symmetrically, then their median is equal to their mean. So, we can find the median by calculating the streaming mean:
algorithm StreamingMeanAlgorithm(stream):
    // INPUT
    // stream = a data stream
    // OUTPUT
    // The mean of the stream

    n <- 0
    mean <- 0
    while stream is not empty:
        n <- n + 1
        x <- get the next (n-th) element of stream
        mean <- ((n - 1) * mean + x) / n
    return mean
This algorithm has O(1) memory complexity and runs in O(n) time. However, if we can’t guarantee that the data follow a symmetric distribution such as normal or uniform, we shouldn’t estimate the median with the streaming mean. Instead, we can use a quantile-estimation streaming algorithm such as P².
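Here’s a minimal Python sketch of the streaming mean; the incremental update `mean += (x - mean) / n` is algebraically equivalent to the formula above but a bit steadier numerically:

```python
def streaming_mean(stream):
    """O(1)-memory running mean: fold each element into the current mean."""
    n = 0
    mean = 0.0
    for x in stream:              # one pass; elements arrive one by one
        n += 1
        mean += (x - mean) / n    # same as ((n - 1) * mean + x) / n
    return mean

print(streaming_mean(iter([2, 4, 6, 8])))   # 5.0
```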
4. The P² Algorithm
A p-quantile of X (0 < p < 1) is the value q_p greater than 100p% of X’s distribution: P(X <= q_p) = p.
The median is a 0.5-quantile, so we can use any method for streaming quantile estimation, such as the P² algorithm. We’ll present it for the median case (p = 0.5).
4.1. Five Markers
To find the median, P² maintains five markers while reading the stream:
- q_1, the minimum
- q_2, the 0.25-quantile
- q_3, the 0.5-quantile (the median)
- q_4, the 0.75-quantile
- q_5, the maximum (the 1-quantile)
The q_2 and q_4 markers are called the middle markers because they’re between the median and the extreme values. The idea is to update the markers as we read the stream. In the end, we return the current value of q_3.
4.2. The Update Step
Each marker q_i is associated with n_i, the number of elements observed so far that are not greater than the current q_i. Ideally, n_i will be approximately equal to 1 + (n - 1) * p_i at any point, where n is the number of elements we’ve read from the stream, and p_i determines which quantile the marker represents (p_1 = 0, p_2 = 0.25, p_3 = 0.5, p_4 = 0.75, p_5 = 1). More precisely, the ideal position of the i-th marker is:

n'_i = 1 + (n - 1) * p_i
When reading a new element, x, we increase by 1 the n_i of all markers q_i that are greater than x. If the new n_i doesn’t differ from its ideal value n'_i by at least 1, we don’t update q_i. Otherwise, we update q_i using the piecewise parabolic (PP or P²) formula (which is how the algorithm got its name). If we adjust q_i, we also modify n_i, as explained in the next section.
4.3. Piecewise Parabolic Adjustment
We assume that the curve passing through any three adjacent markers is a parabola:
That is, we assume that a second-order function is a good enough approximation of the underlying CDF between any three markers.
The ideal positions n'_i are on the real scale, but the actual positions n_i are integers. If the difference between an updated n'_i and n_i is less than 1 in absolute value, we do nothing. Otherwise, we update q_i as follows:

q_i <- q_i + (d / (n_{i+1} - n_{i-1})) * [(n_i - n_{i-1} + d) * (q_{i+1} - q_i) / (n_{i+1} - n_i) + (n_{i+1} - n_i - d) * (q_i - q_{i-1}) / (n_i - n_{i-1})]    (1)

where d = sgn(n'_i - n_i), that is, d = 1 if the difference is >= 1, and d = -1 if the difference is <= -1.

If q_i is to be adjusted, we adjust n_i as well:

n_i <- n_i + d
4.4. Technical Remarks
The adjustment formula applies only to q_2, q_3, and q_4. The first marker q_1, which corresponds to the minimum, is replaced by x if x is the new minimum. The same goes for the maximum marker, q_5. We don’t adjust their values as we do for q_2, q_3, and q_4.
All the q_i must be different. So, if an adjustment would lead to any two of them being equal, we don’t apply it. Also, the q_i must stay ordered (q_1 < q_2 < q_3 < q_4 < q_5). If the parabolic adjustment violates the ordering, we use the linear alternative instead:

q_i <- q_i + d * (q_{i+d} - q_i) / (n_{i+d} - n_i)    (2)
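As an illustration, formulas (1) and (2) might look like this in Python, assuming `q` and `pos` are 0-indexed lists holding the five marker heights and positions, and `i` is the index of the marker being adjusted:

```python
def parabolic(q, pos, i, d):
    """P^2 parabolic adjustment, formula (1); d is +1 or -1."""
    return q[i] + d / (pos[i + 1] - pos[i - 1]) * (
        (pos[i] - pos[i - 1] + d) * (q[i + 1] - q[i]) / (pos[i + 1] - pos[i])
        + (pos[i + 1] - pos[i] - d) * (q[i] - q[i - 1]) / (pos[i] - pos[i - 1])
    )

def linear(q, pos, i, d):
    """Linear fallback, formula (2), used when (1) breaks the marker ordering."""
    return q[i] + d * (q[i + d] - q[i]) / (pos[i + d] - pos[i])
```

As a sanity check, when the five markers lie on a straight line with evenly spaced positions, both adjustments land back on that line.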
Last but not least, we use the first five elements of the stream to initialize the markers, so we don’t update or adjust anything while reading them. So, the main part of the algorithm starts with n = 6.
4.5. Pseudocode
Here, we present the pseudocode of P²:
algorithm P2Algorithm(stream):
    // INPUT
    // stream = a data stream
    // OUTPUT
    // The median of stream

    for n <- 1, 2, 3, 4, 5:
        x_n <- get the next element of stream
    Sort x_1, x_2, x_3, x_4, x_5
    Use the sorted five elements to initialize the markers q_i
    for i <- 1 to 5:
        Set n_i to i
    n <- 5

    while stream is not empty:
        n <- n + 1
        Update the ideal positions n'_i for each i = 2, 3, 4, 5
        x_n <- get the next element of stream
        if x_n < q_1:
            q_1 <- x_n, j <- 2
        else if x_n >= q_5:
            q_5 <- x_n, j <- 5
        else:
            Find the lowest j such that x_n < q_j
        for i <- j to 5:
            n_i <- n_i + 1
        for i <- 2, 3, 4:
            d_i <- n'_i - n_i
            if |d_i| >= 1:
                d <- sgn(d_i)
                Calculate the adjusted marker q'_i using the parabolic formula (1)
                if q_{i-1} < q'_i < q_{i+1}:
                    q_i <- q'_i
                else:
                    Use the linear formula (2) to calculate the adjusted marker q'_i
                    q_i <- q'_i
                n_i <- n_i + d

    return q_3
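To make the pseudocode concrete, here’s a compact Python sketch of P² for the median. It’s a teaching implementation, not a reference one: the class name is ours, and the adjustment condition adds the standard guard that keeps neighboring marker positions from colliding:

```python
class P2Median:
    """Sketch of the P^2 estimator for the running median (p = 0.5)."""

    def __init__(self):
        self.q = []                                 # marker heights q_i
        self.pos = [1, 2, 3, 4, 5]                  # actual positions n_i
        self.ideal = [1.0, 2.0, 3.0, 4.0, 5.0]      # ideal positions n'_i
        self.step = [0.0, 0.25, 0.5, 0.75, 1.0]     # per-element increments of n'_i
        self.n = 0

    def _parabolic(self, i, d):
        q, p = self.q, self.pos
        return q[i] + d / (p[i + 1] - p[i - 1]) * (
            (p[i] - p[i - 1] + d) * (q[i + 1] - q[i]) / (p[i + 1] - p[i])
            + (p[i + 1] - p[i] - d) * (q[i] - q[i - 1]) / (p[i] - p[i - 1]))

    def _linear(self, i, d):
        q, p = self.q, self.pos
        return q[i] + d * (q[i + d] - q[i]) / (p[i + d] - p[i])

    def add(self, x):
        self.n += 1
        if self.n <= 5:                             # initialization phase
            self.q.append(x)
            self.q.sort()
            return
        q, p = self.q, self.pos
        if x < q[0]:                                # new minimum
            q[0], j = x, 1
        elif x >= q[4]:                             # new maximum
            q[4], j = x, 4
        else:                                       # q[j - 1] <= x < q[j]
            j = next(i for i in range(1, 5) if x < q[i])
        for i in range(j, 5):                       # markers above x move right
            p[i] += 1
        for i in range(5):                          # advance ideal positions
            self.ideal[i] += self.step[i]
        for i in (1, 2, 3):                         # adjust the middle markers
            d = self.ideal[i] - p[i]
            if (d >= 1 and p[i + 1] - p[i] > 1) or (d <= -1 and p[i - 1] - p[i] < -1):
                d = 1 if d > 0 else -1
                cand = self._parabolic(i, d)
                if not (q[i - 1] < cand < q[i + 1]):    # ordering violated
                    cand = self._linear(i, d)           # fall back to (2)
                q[i] = cand
                p[i] += d

    def median(self):
        return self.q[2] if self.n >= 5 else sorted(self.q)[len(self.q) // 2]
```

Feeding it a shuffled range of distinct values yields an estimate close to the exact median while storing only five markers.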
4.6. Possible Improvements and Trade-Offs
It turns out that P² performs rather well. The reason is that second-degree curves are sufficiently precise piecewise approximations of the actual CDFs the data follow in practice. Higher-order curves may offer even better estimates, but improved precision may not be worth the increased mathematical complexity and computational burden.
Just as five markers work better than three (the minimum, the target quantile, and the maximum), seven markers would probably allow more precise estimation. However, using more markers would slow down the algorithm.
4.7. Complexity
The space complexity of P² is O(1) because it uses a constant number of markers. Its time complexity is O(n) because it reads the stream once and does a constant amount of work per element.
5. Example
Let’s demonstrate P² on the following stream:
5.1. Initialization
We read and sort the first five numbers to initialize the q_i and n_i (n_i = i):
Since n = 5, the ideal values are n'_1 = 1, n'_2 = 2, n'_3 = 3, n'_4 = 4, n'_5 = 5.
5.2. Reading the Sixth Element
Now, we read the sixth element, x_6. Since n = 6, the ideal values change (n'_i = 1 + 5 * p_i) to n'_1 = 1, n'_2 = 2.25, n'_3 = 3.5, n'_4 = 4.75, n'_5 = 6,
and so do the actual values:
No absolute difference |n'_i - n_i| is >= 1, so we don’t update the markers.
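The ideal positions in this example follow directly from n'_i = 1 + (n - 1) * p_i; here’s a quick Python check (the helper name is ours):

```python
def ideal_positions(n):
    """Ideal marker positions n'_i = 1 + (n - 1) * p_i for the median case."""
    return [1 + (n - 1) * p for p in (0, 0.25, 0.5, 0.75, 1)]

print(ideal_positions(5))   # [1, 2.0, 3.0, 4.0, 5]
print(ideal_positions(6))   # [1, 2.25, 3.5, 4.75, 6]
print(ideal_positions(7))   # [1, 2.5, 4.0, 5.5, 7]
```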
5.3. Reading the Seventh Element
Now, we read x_7. As before, the ideal values change to n'_1 = 1, n'_2 = 2.5, n'_3 = 4, n'_4 = 5.5, n'_5 = 7,
and so do the actual ones:
Since two of the differences d_i = n'_i - n_i are at least 1 in absolute value, we need to update the corresponding markers. Using the parabolic formula (1) for one of them (with d = sgn(d_i)), we get:
Since the result lies between the neighboring markers (q_{i-1} < q'_i < q_{i+1}), we accept the parabolic adjustment. We also update n_i <- n_i + d. The same logic applies to the other marker.
6. Conclusion
In this article, we presented two algorithms for finding the median of a big data stream. First, if the data are symmetric, we can run the streaming-mean algorithm because the median and the mean of a symmetric distribution are the same. Otherwise, we can use P², one of the many methods designed to estimate streaming quantiles, of which the median is a special case.