1. Introduction
In this tutorial, we’ll explore methods to create a random sample of a fixed size from Scala List. For example, if we need to select some five arbitrary elements from the list: List[Int] = List(0,1,2,3,4,5,6,7,8,9), the result might resemble List(8, 0, 3, 7, 5).
2. Certain Considerations Regarding Functional Approaches
In Scala, we often adhere to the functional style when working with collections, employing combinators like map and flatMap. Essentially, we traverse the collection and update each value. However, when it comes to accessing some random elements, traversing the list for updates may perform pretty weakly.
3. Solution Based on Tail Recursion
For the purpose of illustration, we can address the problem using tail recursion:
def getRandomSampleRec[T](list: List[T], size: Int): List[T] = {
@tailrec
def rec(xs: List[T], acc: List[T]): List[T] = {
if (acc.size == size) {
acc
} else {
val index = Random.nextInt(xs.size)
val (left, right) = xs.splitAt(index)
val (xsUpd, next) = if (right.nonEmpty) {
(left ::: right.tail, right.head)
} else {
(left.dropRight(1), left.tail.last)
}
rec(xsUpd, next :: acc)
}
}
if (size == 0) {
List.empty[T]
} else if (size > list.size) {
list
} else {
rec(list, List.empty[T])
}
}
Let’s review the key steps of the solution:
- We utilize splitAt(index) on the list xs to effectively create a new list (xsUpd) without the random element next.
- During each iteration, we prepend the element next to the accumulator acc. We terminate once the acc reaches the desired size.
It’s important to note that here, we impose a significant burden on performance. This is due to the fact that splitAt and list concatenation (:::) both have O(N) time complexity and, since we repeat these operations size times, the overall time complexity becomes O(size*N). Utilizing this method would be appropriate only for small samples.
4. Solution Based on Zipping Collection With Random Indexes
Among other purely functional solutions, we might consider the following approach:
def getRandomSampleZip[T](list: List[T], size: Int): List[T] =
list
.map(elem => (Random.nextInt(), elem))
.sortBy(_._1)
.map(_._2)
.take(size)
Let’s break down the process:
- We associate each element of the original List[T] with a random Int index using map. As a result, we obtain a List[(Int, T)].
- Next, we sort the List[(Int, T)] by these random indexes.
- We then convert the sorted collection back into a List[T] by omitting random indexes using map(_._2), and we limit the result with take(size).
By following this approach, the time complexity is O(N*log(N)), which corresponds to the complexity of the sorting operation.
5. Utilizing Random.shuffle from the Standard Library
We can leverage a standard Scala library method — shuffle* from *scala.util.Random — to mix elements of the List. Then, we take only the desired number of elements:
def getRandomSampleShuffle[T](list: List[T], size: Int): List[T] =
Random.shuffle(list).take(size)
Random.shuffle uses the mutable.ArrayBuffer data structure under the hood, which enhances the performance of element access by index.
This approach boasts a time complexity of O(N) since the shuffle operation traverses the list only once. In the worst case, the list will be traversed twice, if the size of the randomized sample matches the size of the initial list (as seen when take is called). Furthermore, this solution is notably elegant.
6. Testing
For testing purposes, we’ll extract samples from a sorted List of elements ranging from 0 to 99. Throughout the testing phase, our objectives include ensuring that the resulting samples’ elements are both shuffled (and hence not sorted) and non-repeating. Additionally, we must confirm that the sample sizes are accurate. As an appendix, we may also ensure which method performs the best:
"RandomFixedSizeSample" should {
"create a random sample out of the initial List" in {
val list = List.range(0, 100)
val sampleSize = 10
val list_0 = RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
val list_1 = RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
val list_2 = RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)
list_0.size shouldBe sampleSize
list_1.size shouldBe sampleSize
list_2.size shouldBe sampleSize
list_0.toSet.size shouldBe list_0.size
list_1.toSet.size shouldBe list_1.size
list_2.toSet.size shouldBe list_2.size
isSorted(list_0) shouldBe false
isSorted(list_1) shouldBe false
isSorted(list_2) shouldBe false
}
"ensure getRandomSampleShuffle is the most performant, then goes getRandomSampleZip and then getRandomSampleRec" in {
val list = List.range(0, 10_000)
val sampleSize = 100
val start_0 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
val end_0 = System.nanoTime()
val duration_0 = end_0 - start_0
val start_1 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
val end_1 = System.nanoTime()
val duration_1 = end_1 - start_1
val start_2 = System.nanoTime()
RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)
val end_2 = System.nanoTime()
val duration_2 = end_2 - start_2
duration_0 should be > duration_1
duration_1 should be > duration_2
}
7. Conclusion
As we can see, leveraging scala.util.Random method shuffle leads to the best performance and conciseness. Additionally, we’ve observed that zipping collections with random numbers is also pretty powerful.
As usual, these examples are available over on GitHub.