1. Introduction

In this tutorial, we’ll look at how we can remove duplicate values from a Scala list.

This is a simple problem that can teach us some idiomatic constructs. It appears frequently, and its solution is not as obvious as it might seem at first sight.

2. Description of Solution

When thinking about a solution to this problem, we should consider what the expected output should be:

  • In the best-case scenario, a list with no duplicated elements, the resulting list should be identical to the input one
  • In the worst case, a list where all elements are identical, the result should be a list with a single element
  • And in the degenerate case of an empty list, the result should be an empty list too

There are some factors that contribute to this problem not being trivial.

2.1. Duplicity Is Ambiguous

For atomic values, assessing equality is trivial. For objects, that is not enough because equality might be based on a subset of the class’ fields.

2.2. The List Is Not an Efficient Collection

Lists are very intuitive collections, and they are great for many purposes. They are suitable for some operations (like inserting elements at the beginning) and for building recursive algorithms.

With that being said, a list is a collection that has no indexes. There is no random access to its elements, and there is no inherent order of any type. In practical terms, eliminating duplicates from a list is more complicated and more expensive than for other collections. There are, indeed, collections that just don’t allow duplicates (the set) and collections, like the map, that are naturally indexed (for example, by the identity).

If we want to remove duplicates from a list, we need to create an algorithm for that because the list collection doesn’t possess one. Additionally, we need to realize that it will not be lightning-fast because of the considerations listed above.

2.3. Laziness Cannot Be Used

One more caveat, although this can be said of any collection, not only lists, is that finding and removing duplicates from a list only works for finite lists. In the functional world, it’s frequent to find infinite collections that only calculate their elements as required in a lazy way. We need to be sure that the algorithm we create will be working with finite lists. Removing duplicate elements from infinite lists is equivalent to just filtering the elements from a stream, removing those that already have been received.

3. Testing De-duplication

Before we start coding the de-duplication algorithms, we need to have a way to test if what we write is correct.

We are going to test two cases:

  1. List of integers
  2. List of objects

3.1. De-duplication in Lists of Integers

This is the simplest case:

class DuplicatesRemoverSpec extends AnyWordSpec {
  // ...
  "DuplicatesRemover" should {
    "return a shorter integers list without duplicates" in {
      val withDuplicates = List(3, 7, 2, 7, 1, 3, 4)
      val withoutDuplicates = List(3, 7, 2, 1, 4)
      assertResult(withoutDuplicates)(
        DuplicatesRemover.removeDuplicatesRecursively(withDuplicates))
      assertResult(withoutDuplicates)(
        DuplicatesRemover.removeDuplicatesIteratively(withDuplicates))
    }
  // ...
}

3.2. De-duplication in Lists of Objects

The tests might start like this:

class DuplicatesRemoverSpec extends AnyWordSpec {

  "DuplicatesRemover" should {
    // ...
     "de-duplicate lists of objects" in {
      case class FullIdentityPerson(userId: String, firstName: String, lastName: String)

      // First, let's test full equivalence
      val withFullDuplicates = List(
        FullIdentityPerson(userId = "mm01", firstName = "Mickey", lastName = "Mouse"),
        FullIdentityPerson(userId = "jw04", firstName = "John", lastName = "Wayne"),
        FullIdentityPerson(userId = "mm01", firstName = "Marilyn", lastName = "Manson"),
        FullIdentityPerson(userId = "sh01", firstName = "Sherlock", lastName = "Holmes"),
        FullIdentityPerson(userId = "mm01", firstName = "Mickey", lastName = "Mouse"),
        FullIdentityPerson(userId = "jw04", firstName = "John", lastName = "Watson")
      )
      val withoutFullDuplicates = List(
        FullIdentityPerson(userId = "mm01", firstName = "Mickey", lastName = "Mouse"),
        FullIdentityPerson(userId = "jw04", firstName = "John", lastName = "Wayne"),
        FullIdentityPerson(userId = "mm01", firstName = "Marilyn", lastName = "Manson"),
        FullIdentityPerson(userId = "sh01", firstName = "Sherlock", lastName = "Holmes"),
        FullIdentityPerson(userId = "jw04", firstName = "John", lastName = "Watson")
      )
      assertResult(withoutFullDuplicates)(
        DuplicatesRemover.removeDuplicatesRecursively(withFullDuplicates))
      assertResult(withoutFullDuplicates)(
        DuplicatesRemover.removeDuplicatesIteratively(withFullDuplicates))
      // ...

4. Recursive Approach

The first approach we are going to illustrate is not the most efficient, but it has the advantage of being very intuitive and elegant and paves the way for more efficient ones.

The main idea is very simple: use a recursive function that will:

  1. Separately receive the last element of the list, and the list without its last element
  2. Find out whether the last element is already in the list
    • If it is, return the received list, with duplicate elements removed, and the last element attached to it
    • Otherwise, return just the received list, with duplicate elements removed

The key token of the above description is “with duplicate elements removed” because that is the recursive invocation. The code looks like this:

def removeDuplicatesRecursively[T](list: List[T]): List[T] = {
  def assessRemoval(init: List[T], last: T): List[T] =
    if (init.isEmpty) List(last)
    else {
      val recursiveList = assessRemoval(init.init, init.last)
      if (init.contains(last)) recursiveList
      else recursiveList :+ last
    }

  if (list.isEmpty) list
  else assessRemoval(list.init, list.last)
}

5. Iterative Approach

The recursive approach described above is elegant and easy to understand. Unfortunately, it has a big problem. Since the recursive function is not tail-recursive, each successive invocation consumes more heap memory. Therefore, for very long lists, there will be a point where memory exhaustion will happen.

To avoid that, we need to design a non-recursive approach. We will make use of folding, a very efficient and extremely versatile mechanism that all sequence-derived collections have.

Starting with an empty list, let’s scan each element of the original list and, if it is already in the new list, ignore it. Otherwise, append it to the new list:

def removeDuplicatesIteratively[T](list: List[T]): List[T] =
  list.foldLeft(List.empty[T]) { (partialResult, element) =>
    if (partialResult.contains(element)) partialResult
    else partialResult :+ element
  }

5.1. Possible Optimization

As stated before, the list is not the most efficient structure for performing comparisons. It’s not difficult to see that this problem offers an optimization opportunity in the form of memoization, using an additional structure to record whether elements are already in the list or not. If there are many duplicates, this can reduce the running time of the process at the expense of more memory and a more complex implementation.

6. Using Scala Library

As is commonly the case, Scala libraries implement efficient solutions for many of the problems we face in the field. In this particular case, the Seq base class, of which List is a specialization, has a method called distinct that does the job. Thus, our implementation might look as simple as this:

def removeDuplicatesWithLibrary[T](list: List[T]): List[T] =
  list.distinct

We could always directly use the distinct method, or the more flexible distinctBy, that accepts a function to transform the elements before the comparison is performed.

It is also possible to remove duplicate elements by temporarily transforming the List to a Set, then back to a List, because one of the basic features of a Set is that it doesn’t allow duplicate elements:

def removeDuplicatesViaSet[T](list: List[T]): List[T] = list.toSet.toList

However, this approach is not recommended because there is the danger that the elements will change their order in the resulting List, as a Set is an unordered collection.

7. Conclusion

Removing duplicate elements from a list is an interesting problem, algorithmically speaking. It frequently appears in academic exercises and in technical interviews. When it appears in the professional field, it usually does in controlled situations, where the number of elements of the list is not very big and where there are no hard performance constraints.

In a real-life situation, if a long list is likely to contain duplicates and if those are going to be removed, the best philosophy is to think of alternative structures to avoid facing the problem in the first place.

As usual, the full source code for the article is available on GitHub.


« 上一篇: Scala 中的嵌套函数