1. Introduction

In this tutorial, we’ll explore methods to create a random sample of a fixed size from Scala List. For example, if we need to select some five arbitrary elements from the list: List[Int] = List(0,1,2,3,4,5,6,7,8,9), the result might resemble List(8, 0, 3, 7, 5).

2. Certain Considerations Regarding Functional Approaches

In Scala, we often adhere to the functional style when working with collections, employing combinators like map and flatMap. Essentially, we traverse the collection and update each value. However, when it comes to accessing some random elements, traversing the list for updates may perform pretty weakly.

3. Solution Based on Tail Recursion

For the purpose of illustration, we can address the problem using tail recursion:

def getRandomSampleRec[T](list: List[T], size: Int): List[T] = {
    @tailrec
    def rec(xs: List[T], acc: List[T]): List[T] = {

      if (acc.size == size) {
        acc
      } else {
        val index = Random.nextInt(xs.size)
        val (left, right) = xs.splitAt(index)
        val (xsUpd, next) = if (right.nonEmpty) {
          (left ::: right.tail, right.head)
        } else {
          (left.dropRight(1), left.tail.last)
        }
        rec(xsUpd, next :: acc)
      }
    }

    if (size == 0) {
      List.empty[T]
    } else if (size > list.size) {
      list
    } else {
      rec(list, List.empty[T])
    }
  }

Let’s review the key steps of the solution:

  1. We utilize splitAt(index) on the list xs to effectively create a new list (xsUpd) without the random element next.
  2. During each iteration, we prepend the element next to the accumulator acc. We terminate once the acc reaches the desired size.

It’s important to note that here, we impose a significant burden on performance. This is due to the fact that splitAt and list concatenation (:::) both have O(N) time complexity and, since we repeat these operations size times, the overall time complexity becomes O(size*N). Utilizing this method would be appropriate only for small samples.

4. Solution Based on Zipping Collection With Random Indexes

Among other purely functional solutions, we might consider the following approach:

def getRandomSampleZip[T](list: List[T], size: Int): List[T] =
  list
    .map(elem => (Random.nextInt(), elem))
    .sortBy(_._1)
    .map(_._2)
    .take(size)

Let’s break down the process:

  1. We associate each element of the original List[T] with a random Int index using map. As a result, we obtain a List[(Int, T)].
  2. Next, we sort the List[(Int, T)] by these random indexes.
  3. We then convert the sorted collection back into a List[T] by omitting random indexes using map(_._2), and we limit the result with take(size).

By following this approach, the time complexity is O(N*log(N)), which corresponds to the complexity of the sorting operation.

5. Utilizing Random.shuffle from the Standard Library

We can leverage a standard Scala library method — shuffle* from *scala.util.Randomto mix elements of the List. Then, we take only the desired number of elements:

  def getRandomSampleShuffle[T](list: List[T], size: Int): List[T] =
    Random.shuffle(list).take(size)

Random.shuffle uses the mutable.ArrayBuffer data structure under the hood, which enhances the performance of element access by index.

This approach boasts a time complexity of O(N) since the shuffle operation traverses the list only once. In the worst case, the list will be traversed twice, if the size of the randomized sample matches the size of the initial list (as seen when take is called). Furthermore, this solution is notably elegant.

6. Testing

For testing purposes, we’ll extract samples from a sorted List of elements ranging from 0 to 99. Throughout the testing phase, our objectives include ensuring that the resulting samples’ elements are both shuffled (and hence not sorted) and non-repeating. Additionally, we must confirm that the sample sizes are accurate. As an appendix, we may also ensure which method performs the best:

"RandomFixedSizeSample" should {
  "create a random sample out of the initial List" in {
    val list = List.range(0, 100)
    val sampleSize = 10
    val list_0 = RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
    val list_1 = RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
    val list_2 = RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)

    list_0.size shouldBe sampleSize
    list_1.size shouldBe sampleSize
    list_2.size shouldBe sampleSize

    list_0.toSet.size shouldBe list_0.size
    list_1.toSet.size shouldBe list_1.size
    list_2.toSet.size shouldBe list_2.size

    isSorted(list_0) shouldBe false
    isSorted(list_1) shouldBe false
    isSorted(list_2) shouldBe false
  }
  "ensure getRandomSampleShuffle is the most performant, then goes getRandomSampleZip and then getRandomSampleRec" in {
    val list = List.range(0, 10_000)
    val sampleSize = 100

    val start_0 = System.nanoTime()
    RandomFixedSizeSample.getRandomSampleRec(list, sampleSize)
    val end_0 = System.nanoTime()
    val duration_0 = end_0 - start_0

    val start_1 = System.nanoTime()
    RandomFixedSizeSample.getRandomSampleZip(list, sampleSize)
    val end_1 = System.nanoTime()
    val duration_1 = end_1 - start_1

    val start_2 = System.nanoTime()
    RandomFixedSizeSample.getRandomSampleShuffle(list, sampleSize)
    val end_2 = System.nanoTime()
    val duration_2 = end_2 - start_2

    duration_0 should be > duration_1
    duration_1 should be > duration_2
  }

7. Conclusion

As we can see, leveraging scala.util.Random method shuffle leads to the best performance and conciseness. Additionally, we’ve observed that zipping collections with random numbers is also pretty powerful.

As usual, these examples are available over on GitHub.