1. Introduction
In this tutorial, we’ll present two algorithms for calculating the median of a data stream.
2. The Median of a Stream
A median of a random variable X is a value that splits X’s range in half: one half of the values is lower than the median and the other is greater than it. Formally, m is a median of X if P(X <= m) >= 1/2 and P(X >= m) >= 1/2.
To determine the median of a sample x_1, x_2, ..., x_n drawn from X, we sort it to get the sample order statistics x_(1) <= x_(2) <= ... <= x_(n) and select the middle value as the sample’s median: x_((n+1)/2) if n is odd, and the average of x_(n/2) and x_(n/2 + 1) if n is even.
This formula, however, assumes that the sample is sortable. That means we can store it in memory and make multiple passes over it.
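As a quick illustration, the offline approach can be sketched in Python (a minimal sketch, not production code):

```python
def sample_median(sample):
    """Offline median: sort the whole sample and take the middle value."""
    xs = sorted(sample)          # requires the full sample in memory
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:               # odd size: the single middle order statistic
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # even size: average the two middle ones

print(sample_median([7, 1, 5, 3, 9]))    # 5
print(sample_median([4, 1, 3, 2]))       # 2.5
```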
2.1. Data Streams
That’s not the case when the sample is a big data stream. In such cases, we read the elements one by one as soon as they become available, and were we to keep them all in memory, we’d run out of it before the stream ended. So, we can’t store or sort the whole stream, and we can’t revisit an element after reading it unless we save it. For those reasons, we can’t determine the median in the usual way.
Instead, we use streaming algorithms specialized for data streams. By design, they run in low-memory environments that can store only a small portion of the stream at any point during execution. To respect those constraints, streaming algorithms sacrifice precision for low memory complexity: they return estimates instead of the exact values the offline algorithms would. The goal is to have low time and space complexities while retaining as much precision as possible.
3. Symmetric Data
If the data are distributed symmetrically, then their median is equal to their mean. So, we can find the median by calculating the streaming mean:
algorithm StreamingMeanAlgorithm(stream):
    // INPUT
    // stream = a data stream
    // OUTPUT
    // The mean of the stream

    n <- 0
    mean <- 0
    while stream is not empty:
        n <- n + 1
        x <- get the next (n-th) element of stream
        mean <- ((n - 1) * mean + x) / n
    return mean
This algorithm has O(1) memory complexity and runs in O(n) time. However, if we can’t guarantee that the data follow a symmetric distribution such as normal or uniform, we shouldn’t estimate the median with the streaming mean. Instead, we can use a quantile-estimation streaming algorithm such as P².
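Here’s a minimal Python sketch of the streaming mean; the incremental update `mean += (x - mean) / n` is algebraically equivalent to the formula above but a bit steadier numerically:

```python
def streaming_mean(stream):
    """O(1)-memory running mean: fold each element into the current mean."""
    n = 0
    mean = 0.0
    for x in stream:              # one pass; elements arrive one by one
        n += 1
        mean += (x - mean) / n    # same as ((n - 1) * mean + x) / n
    return mean

print(streaming_mean(iter([2, 4, 6, 8])))   # 5.0
```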
4. The P² Algorithm
A p-quantile of X (0 < p < 1) is the value q_p greater than 100p% of X’s distribution: P(X <= q_p) = p.
The median is a 0.5-quantile, so we can use any method for streaming quantile estimation, such as the P² algorithm. We’ll present it for the median case (p = 0.5).
4.1. Five Markers
To find the median, P² maintains five markers while reading the stream:
- q_1, the minimum
- q_2, the 0.25-quantile
- q_3, the 0.5-quantile (the median)
- q_4, the 0.75-quantile
- q_5, the maximum (the 1-quantile)
The q_2 and q_4 markers are called the middle markers because they’re between the median and the extreme values. The idea is to update the markers as we read the stream. In the end, we return the current value of q_3.
4.2. The Update Step
Each marker q_i is associated with n_i, the number of elements observed so far that are not greater than the current q_i. Ideally, n_i will be approximately equal to 1 + (n - 1) * p_i at any point, where n is the number of elements we’ve read from the stream, and p_i determines which quantile the marker represents (p_1 = 0, p_2 = 0.25, p_3 = 0.5, p_4 = 0.75, p_5 = 1). More precisely, the ideal position of the i-th marker is:

n'_i = 1 + (n - 1) * p_i
When reading a new element, x, we increase by 1 the n_i of all markers q_i that are greater than x. If the new n_i doesn’t differ from its ideal value n'_i by at least 1, we don’t update q_i. Otherwise, we update q_i using the piecewise parabolic (PP or P²) formula (which is how the algorithm got its name). If we adjust q_i, we also modify n_i, as explained in the next section.
4.3. Piecewise Parabolic Adjustment
We assume that the curve passing through any three adjacent markers is a parabola:
That is, we assume that a second-order function is a good enough approximation of the underlying CDF between any three markers.
The ideal positions n'_i are on the real scale, but the actual positions n_i are integers. If the difference between an updated n'_i and n_i is less than 1 in absolute value, we do nothing. Otherwise, we update q_i as follows:

q_i <- q_i + (d / (n_{i+1} - n_{i-1})) * [(n_i - n_{i-1} + d) * (q_{i+1} - q_i) / (n_{i+1} - n_i) + (n_{i+1} - n_i - d) * (q_i - q_{i-1}) / (n_i - n_{i-1})]    (1)

where d = sgn(n'_i - n_i), that is, d = 1 if the difference is >= 1, and d = -1 if the difference is <= -1.

If q_i is to be adjusted, we adjust n_i as well:

n_i <- n_i + d
4.4. Technical Remarks
The adjustment formula applies only to q_2, q_3, and q_4. The first marker q_1, which corresponds to the minimum, is replaced by x if x is the new minimum. The same goes for the maximum marker, q_5. We don’t adjust their values as we do for q_2, q_3, and q_4.
All the q_i must be different. So, if an adjustment would lead to any two of them being equal, we don’t apply it. Also, the q_i must stay ordered (q_1 < q_2 < q_3 < q_4 < q_5). If the parabolic adjustment violates the ordering, we use the linear alternative instead:

q_i <- q_i + d * (q_{i+d} - q_i) / (n_{i+d} - n_i)    (2)
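As an illustration, formulas (1) and (2) might look like this in Python, assuming `q` and `pos` are 0-indexed lists holding the five marker heights and positions, and `i` is the index of the marker being adjusted:

```python
def parabolic(q, pos, i, d):
    """P^2 parabolic adjustment, formula (1); d is +1 or -1."""
    return q[i] + d / (pos[i + 1] - pos[i - 1]) * (
        (pos[i] - pos[i - 1] + d) * (q[i + 1] - q[i]) / (pos[i + 1] - pos[i])
        + (pos[i + 1] - pos[i] - d) * (q[i] - q[i - 1]) / (pos[i] - pos[i - 1])
    )

def linear(q, pos, i, d):
    """Linear fallback, formula (2), used when (1) breaks the marker ordering."""
    return q[i] + d * (q[i + d] - q[i]) / (pos[i + d] - pos[i])
```

As a sanity check, when the five markers lie on a straight line with evenly spaced positions, both adjustments land back on that line.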
Last but not least, we use the first five elements of the stream to initialize the markers, so we don’t update or adjust anything while reading them. So, the main part of the algorithm starts with n = 6.
4.5. Pseudocode
Here, we present the pseudocode of P²:
algorithm P2Algorithm(stream):
    // INPUT
    // stream = a data stream
    // OUTPUT
    // The median of stream

    for n <- 1, 2, 3, 4, 5:
        x_n <- get the next element of stream
    Sort x_1, x_2, x_3, x_4, x_5
    Use the sorted five elements to initialize the markers q_i
    for i <- 1 to 5:
        Set n_i to i
    n <- 5

    while stream is not empty:
        n <- n + 1
        Update the ideal positions n'_i for each i = 2, 3, 4, 5
        x_n <- get the next element of stream
        if x_n < q_1:
            q_1 <- x_n, j <- 2
        else if x_n >= q_5:
            q_5 <- x_n, j <- 5
        else:
            Find the lowest j such that x_n < q_j
        for i <- j to 5:
            n_i <- n_i + 1
        for i <- 2, 3, 4:
            d_i <- n'_i - n_i
            if |d_i| >= 1:
                d <- sgn(d_i)
                Calculate the adjusted marker q'_i using the parabolic formula (1)
                if q_{i-1} < q'_i < q_{i+1}:
                    q_i <- q'_i
                else:
                    Use the linear formula (2) to calculate the adjusted marker q'_i
                    q_i <- q'_i
                n_i <- n_i + d

    return q_3
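To make the pseudocode concrete, here’s a compact Python sketch of P² for the median. It’s a teaching implementation, not a reference one: the class name is ours, and the adjustment condition adds the standard guard that keeps neighboring marker positions from colliding:

```python
class P2Median:
    """Sketch of the P^2 estimator for the running median (p = 0.5)."""

    def __init__(self):
        self.q = []                                 # marker heights q_i
        self.pos = [1, 2, 3, 4, 5]                  # actual positions n_i
        self.ideal = [1.0, 2.0, 3.0, 4.0, 5.0]      # ideal positions n'_i
        self.step = [0.0, 0.25, 0.5, 0.75, 1.0]     # per-element increments of n'_i
        self.n = 0

    def _parabolic(self, i, d):
        q, p = self.q, self.pos
        return q[i] + d / (p[i + 1] - p[i - 1]) * (
            (p[i] - p[i - 1] + d) * (q[i + 1] - q[i]) / (p[i + 1] - p[i])
            + (p[i + 1] - p[i] - d) * (q[i] - q[i - 1]) / (p[i] - p[i - 1]))

    def _linear(self, i, d):
        q, p = self.q, self.pos
        return q[i] + d * (q[i + d] - q[i]) / (p[i + d] - p[i])

    def add(self, x):
        self.n += 1
        if self.n <= 5:                             # initialization phase
            self.q.append(x)
            self.q.sort()
            return
        q, p = self.q, self.pos
        if x < q[0]:                                # new minimum
            q[0], j = x, 1
        elif x >= q[4]:                             # new maximum
            q[4], j = x, 4
        else:                                       # q[j - 1] <= x < q[j]
            j = next(i for i in range(1, 5) if x < q[i])
        for i in range(j, 5):                       # markers above x move right
            p[i] += 1
        for i in range(5):                          # advance ideal positions
            self.ideal[i] += self.step[i]
        for i in (1, 2, 3):                         # adjust the middle markers
            d = self.ideal[i] - p[i]
            if (d >= 1 and p[i + 1] - p[i] > 1) or (d <= -1 and p[i - 1] - p[i] < -1):
                d = 1 if d > 0 else -1
                cand = self._parabolic(i, d)
                if not (q[i - 1] < cand < q[i + 1]):    # ordering violated
                    cand = self._linear(i, d)           # fall back to (2)
                q[i] = cand
                p[i] += d

    def median(self):
        return self.q[2] if self.n >= 5 else sorted(self.q)[len(self.q) // 2]
```

Feeding it a shuffled range of distinct values yields an estimate close to the exact median while storing only five markers.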
4.6. Possible Improvements and Trade-Offs
It turns out that P² performs rather well. The reason is that second-degree curves are sufficiently precise piecewise approximations of the actual CDFs the data follow in practice. Higher-order curves may offer even better estimates, but improved precision may not be worth the increased mathematical complexity and computational burden.
Just as five markers work better than three (the minimum, the target quantile, and the maximum), seven markers would probably allow more precise estimation. However, using more markers would slow down the algorithm.
4.7. Complexity
The space complexity of P² is O(1) because it uses a constant number of markers. Its time complexity is O(n) because it reads the stream once and does a constant amount of work per element.
5. Example
Let’s demonstrate P² on the following stream:
5.1. Initialization
We read and sort the first five numbers to initialize the q_i and n_i (n_i = i):
Since n = 5, the ideal values are n'_1 = 1, n'_2 = 2, n'_3 = 3, n'_4 = 4, n'_5 = 5.
5.2. Reading the Sixth Element
Now, we read the sixth element, x_6. Since n = 6, the ideal values change (n'_i = 1 + 5 * p_i) to n'_1 = 1, n'_2 = 2.25, n'_3 = 3.5, n'_4 = 4.75, n'_5 = 6,
and so do the actual values:
No absolute difference |n'_i - n_i| is >= 1, so we don’t update the markers.
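The ideal positions in this example follow directly from n'_i = 1 + (n - 1) * p_i; here’s a quick Python check (the helper name is ours):

```python
def ideal_positions(n):
    """Ideal marker positions n'_i = 1 + (n - 1) * p_i for the median case."""
    return [1 + (n - 1) * p for p in (0, 0.25, 0.5, 0.75, 1)]

print(ideal_positions(5))   # [1, 2.0, 3.0, 4.0, 5]
print(ideal_positions(6))   # [1, 2.25, 3.5, 4.75, 6]
print(ideal_positions(7))   # [1, 2.5, 4.0, 5.5, 7]
```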
5.3. Reading the Seventh Element
Now, we read x_7. As before, the ideal values change to n'_1 = 1, n'_2 = 2.5, n'_3 = 4, n'_4 = 5.5, n'_5 = 7,
and so do the actual ones:
Since two of the differences d_i = n'_i - n_i are at least 1 in absolute value, we need to update the corresponding markers. Using the parabolic formula (1) for one of them (with d = sgn(d_i)), we get:
Since the result lies between the neighboring markers (q_{i-1} < q'_i < q_{i+1}), we accept the parabolic adjustment. We also update n_i <- n_i + d. The same logic applies to the other marker.
6. Conclusion
In this article, we presented two algorithms for finding the median of a big data stream. First, if the data are symmetric, we can run the streaming-mean algorithm because the median and the mean of a symmetric distribution are the same. Otherwise, we can use P², one of the many methods designed to estimate streaming quantiles, of which the median is a special case.