数组.sort()与Collections.sort()之间的区别

1. Overview

Sorting is a fundamental operation in computer science, essential for data organization and manipulation across various applications.

In this tutorial, we’ll compare Java’s commonly used sorting methods: Arrays.sort() and Collections.sort(). While serving the same primary function—sorting data—each method has its own features, caveats, and optimal use cases.

2. Basic Overview

Let’s begin by examining the fundamental differences between these two methods.

2.1. Arrays.sort()

The Arrays.sort() method is a utility function for sorting arrays in Java. It allows to sort arrays of primitive data types and objects. Whether we’re working with numerical data or alphabetical strings, Arrays.sort() can arrange the elements in ascending order. Additionally, we can modify the behavior with custom comparators for object arrays. This method is part of the java.util.Arrays class, which provides a suite of utilities for array manipulation.

2.2. Collections.sort()

On the other hand, Collections.sort() is designed for sorting instances of the List interface in Java’s Collection Framework. Unlike Arrays.sort(), which is limited to arrays, Collections.sort() can sort more dynamic data structures like ArrayList, LinkedList, and other classes that implement the List interface. Collections.sort() is a member of the java.util.Collections class, a utility class filled with static methods for operating on collections.

3. Stability

Let’s imagine that we have a collection of tasks:

tasks = new ArrayList<>();
tasks.add(new Task(1, 1, "2023-09-01"));
tasks.add(new Task(2, 2, "2023-08-30"));
tasks.add(new Task(3, 1, "2023-08-29"));
tasks.add(new Task(4, 2, "2023-09-02"));
tasks.add(new Task(5, 3, "2023-09-05"));
tasks.add(new Task(6, 1, "2023-09-03"));
tasks.add(new Task(7, 3, "2023-08-28"));
tasks.add(new Task(8, 2, "2023-09-01"));
tasks.add(new Task(9, 1, "2023-08-28"));
tasks.add(new Task(10, 2, "2023-09-04"));
tasks.add(new Task(11, 3, "2023-08-31"));
tasks.add(new Task(12, 1, "2023-08-30"));
tasks.add(new Task(13, 3, "2023-09-02"));

We would like to sort them first by their priority and then by their due date. We’ll be sorting them with two different approaches. In the first case, we’ll be using a stable algorithm provided by Collections:

final List<Task> tasks = Tasks.supplier.get();
Collections.sort(tasks, Comparator.comparingInt(Task::getPriority));
System.out.println("After sorting by priority:");
for (Task task : tasks) {
    System.out.println(task);
}
Collections.sort(tasks, Comparator.comparing(Task::getDueDate));
System.out.println("\nAfter sorting by due date:");
for (Task task : tasks) {
    System.out.println(task);
}

Also, we’ll sort the tasks using a non-stable algorithm. Because Java doesn’t offer the option to sort a List of reference types using a non-stable algorithm, we have a simple implementation of quicksort:

List<Task> tasks = Tasks.supplier.get();
quickSort(tasks, Comparator.comparingInt(Task::getPriority));
System.out.println("After sorting by priority:");
for (Task task : tasks) {
    System.out.println(task);
}
quickSort(tasks, Comparator.comparing(Task::getDueDate));
System.out.println("\nAfter sorting by due date:");
for (Task task : tasks) {
    System.out.println(task);
}

The code is overall the same. The only difference is the algorithms used. The sorting is happening in two steps. The first step sorts the tasks by priority and the second by due date.

The difference in the results is subtle but might significantly affect the code’s functionality and introduce hard-to-debug bugs. The stable version produces the following output:

After sorting by due date:
Task: #9  | Priority: 1 | Due Date: 2023-08-28
Task: #7  | Priority: 3 | Due Date: 2023-08-28
Task: #3  | Priority: 1 | Due Date: 2023-08-29
Task: #12 | Priority: 1 | Due Date: 2023-08-30
Task: #2  | Priority: 2 | Due Date: 2023-08-30
Task: #11 | Priority: 3 | Due Date: 2023-08-31
Task: #1  | Priority: 1 | Due Date: 2023-09-01
Task: #8  | Priority: 2 | Due Date: 2023-09-01
Task: #4  | Priority: 2 | Due Date: 2023-09-02
Task: #13 | Priority: 3 | Due Date: 2023-09-02
Task: #6  | Priority: 1 | Due Date: 2023-09-03
Task: #10 | Priority: 2 | Due Date: 2023-09-04
Task: #5  | Priority: 3 | Due Date: 2023-09-05

The tasks are sorted by date, and the previous ordering by priority will be saved when the dates are the same. Whereas the non-stable version gives us this:

After sorting by due date:
Task: #9  | Priority: 1 | Due Date: 2023-08-28
Task: #7  | Priority: 3 | Due Date: 2023-08-28
Task: #3  | Priority: 1 | Due Date: 2023-08-29
Task: #2  | Priority: 2 | Due Date: 2023-08-30
Task: #12 | Priority: 1 | Due Date: 2023-08-30
Task: #11 | Priority: 3 | Due Date: 2023-08-31
Task: #1  | Priority: 1 | Due Date: 2023-09-01
Task: #8  | Priority: 2 | Due Date: 2023-09-01
Task: #4  | Priority: 2 | Due Date: 2023-09-02
Task: #13 | Priority: 3 | Due Date: 2023-09-02
Task: #6  | Priority: 1 | Due Date: 2023-09-03
Task: #10 | Priority: 2 | Due Date: 2023-09-04
Task: #5  | Priority: 3 | Due Date: 2023-09-05

The tasks #2 and #12 have the same due date, but the priority is inversed. This is because a non-stable algorithm produces a non-deterministic behavior when it deals with equal but distinguishable items.

Because primitives don’t have identity or additional parameters, we can sort them using a non-stable algorithm. The only thing they have is a value, and that’s why we don’t care about the stability of the algorithms we’re using for primitives. As the example above shows, the stability feature is crucial for sorting objects.

That’s why Arrays.sort() uses the same implementation of a non-stable algorithm for primitives, such as Quicksort or Dual-Pivot Quicksort. When dealing with the collections of reference types, both Arrays.sort() and Collections.sort() use the same implementation, usually Merge Sort or TimSort.

4. Complexity

Let’s make a simple example comparing Merge Sort and Quicksort to show the difference between these algorithms and eventually between Collections.sort() and Arrays.sort(). We’ll be using simple implementations for both of them. This is done partially because Java doesn’t provide these algorithms, so we can pick them directly, and partly because the current algorithms have too many tweaks and improvements. Hence, it’s hard to develop similar test cases for both of them.

We will run the following tests to compare the throughput:

@Measurement(iterations = 2, time = 10, timeUnit = TimeUnit.MINUTES)
@Warmup(iterations = 5, time = 10)
public class PerformanceBenchmark {

    private static final Random RANDOM = new Random();
    private static final int ARRAY_SIZE = 10000;
    private static final int[] randomNumbers = RANDOM.ints(ARRAY_SIZE).toArray();
    private static final int[] sameNumbers = IntStream.generate(() -> 42).limit(ARRAY_SIZE).toArray();
    public static final Supplier<int[]> randomNumbersSupplier = randomNumbers::clone;
    public static final Supplier<int[]> sameNumbersSupplier = sameNumbers::clone;

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-quick-sort-same-number-%t.txt,filesize=900m -Xmx6gb -Xms6gb"})
    public void quickSortSameNumber() {
        Quicksort.sort(sameNumbersSupplier.get());
    }
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-quick-sort-random-number-%t.txt,filesize=900m -Xmx6gb -Xms6gb"})
    public void quickSortRandomNumber() {
        Quicksort.sort(randomNumbersSupplier.get());
    }
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-merge-sort-same-number-%t.txt,filesize=900m -Xmx6gb -Xms6gb"})
    public void mergeSortSameNumber() {
        MergeSort.sort(sameNumbersSupplier.get());
    }
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Fork(value = 1, jvmArgs = {"-Xlog:gc:file=gc-logs-merge-sort-random-number-%t.txt,filesize=900m -Xmx6gb -Xms6gb"})
    public void mergeSortRandomNumber() {
        MergeSort.sort(randomNumbersSupplier.get());
    }
}

After running these tests, we got two artifacts: one is the performance metrics, and the other one is the garbage collection log.

4.1. Time Complexity

Let’s review the performance metrics for the tests above:

Benchmark                                    Mode  Cnt     Score     Error  Units
PerformanceBenchmark.mergeSortRandomNumber  thrpt    4  1489.983 ± 401.330  ops/s
PerformanceBenchmark.quickSortRandomNumber  thrpt    4  2757.273 ±  29.606  ops/s

Firstly, using Quicksort to sort random numbers, in general, is almost twice as fast as the MergeSort. The fact that Quicksort happens in place reduces the space complexity affecting the performance, which we discuss in the next section.

Also, we can see that Quicksort may degrade quite rapidly in some cases:

Benchmark                                    Mode  Cnt     Score     Error  Units
PerformanceBenchmark.mergeSortSameNumber    thrpt    4  5295.502 ±  98.624  ops/s
PerformanceBenchmark.quickSortSameNumber    thrpt    4   118.211 ±   0.117  ops/s

For example, when all the numbers are the same, MergeSort performs much better and is blazingly fast. Although we’re using a simple implementation of Quicksort and MergeSort, the behavior is generally similar to their more optimized and complex counterparts.

Please keep in mind that the performance of an algorithm and its time complexity might not correlate as there are many things we should consider additional: space complexity, hidden constant factors, optimization, adaptivity, etc.

The upper bound for the Quicksort and Dual-Pivot Quicksort is higher than MergeSort or TimSort. However, due to a series of improvements and checks, performance issues are rendered negligible and, overall, can be ignored. Thus, we can assume that Merge Sort, TimSort, Quicksort, and Dual-Pivot Quicksort would have the same time complexity.

DualPivotQuicksort.sort() method, for example, considers many parameters, such as parallelization, array size, presorted runs, and even the recursion depth. Depending on the primitive types and the size of an array, Java can use different algorithms, like Insertion Sort or Counting Sort. That’s why it’s hard to compare highly optimized algorithms; they would generally produce similar results.

4.2. Space Complexity

As mentioned previously, while being non-stable, Quicksort and Dual-Pivot Quicksort algorithms come with a trade-off as they use less space. Depending on the implementation, they might have at most O(log*n) space complexity. This is a nice feature that might have a significant performance impact. In our case, let’s concentrate on sorting random numbers.

While the time complexity of these algorithms is considered roughly the same, we have dramatic differences in the performance:

Benchmark                                    Mode  Cnt     Score     Error  Units
PerformanceBenchmark.mergeSortRandomNumber  thrpt    4  1489.983 ± 401.330  ops/s
PerformanceBenchmark.quickSortRandomNumber  thrpt    4  2757.273 ±  29.606  ops/s

To investigate this difference, we could look at the garbage collection logs. We’ll be using IBM Garbage Collection and Memory Visualizer:

Variant

mergeSort

quickSort

Forced collection count

Full collections

GC Mode

Mean garbage collection pause (ms)

0.33

0.47

Number of collections triggered by allocation failure

26848

588

Proportion of time spent in garbage collection pauses (%)

0.72

0.02

Proportion of time spent in stop-the-world garbage collection pauses (%)

0.72

0.02

Proportion of time spent unpaused (%)

99.28

99.98

Young collections – Mean garbage collection pause (ms)

0.33

0.47

Young collections – Mean interval between collections (ms)

46.6

2124

As we can see, the number of garbage collection events is significantly higher for MergeSort (26848 vs. 588), and it’s understandable as this algorithm uses more space.

4.3. Optimization

Because Merge Sort and TimSort require more space, using a non-stable algorithm for primitives is the optimization step, assuming that Quicksort and Dual-Pivot Quicksort degradation to O(n^2) is negligible. Technically, it’s possible to use a non-stable sorting algorithm for a collection of reference types and get a boost in performance. This can be done if stability isn’t important or equal objects are non-distinguishable.

One of the improvements we can use for wrapper classes is to convert them into primitives, do the sorting, and convert them back. Let’s consider the following test:

@Measurement(iterations = 2, time = 10, timeUnit = TimeUnit.MINUTES)
@Warmup(iterations = 5, time = 10)
@Fork(value = 2)
public class ObjectOverheadBenchmark {

    private static final ThreadLocalRandom RANDOM = ThreadLocalRandom.current();

    @State(Scope.Benchmark)
    public static class Input {
        public Supplier<List<Integer>> randomNumbers = () -> RANDOM.ints().limit(10000).boxed().collect(Collectors.toList());
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingPrimitiveArray(Input input, Blackhole blackhole) {
        final int[] array = input.randomNumbers.get().stream().mapToInt(Integer::intValue).toArray();
        Arrays.sort(array);
        final List<Integer> result = Arrays.stream(array).boxed().collect(Collectors.toList());
        blackhole.consume(result);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingObjectArray(Input input, Blackhole blackhole) {
        final Integer[] array = input.randomNumbers.get().toArray(new Integer[0]);
        Arrays.sort(array);
        blackhole.consume(array);
    }
    
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void sortingObjects(Input input, Blackhole blackhole) {
        final List<Integer> list = input.randomNumbers.get();
        Collections.sort(list);
        blackhole.consume(list);
    }
}

The fact that we’re using Arrays.sort() here significantly boosts our performance Collection.sort():

Benchmark                                       Mode  Cnt     Score     Error  Units
ObjectOverheadBenchmark.sortingObjectArray     thrpt    4   982.849 ±  19.201  ops/s
ObjectOverheadBenchmark.sortingObjects         thrpt    4   976.778 ±  10.580  ops/s
ObjectOverheadBenchmark.sortingPrimitiveArray  thrpt    4  1998.818 ± 373.654  ops/s

Sorting an int[] produces more than a 100% increase in performance.

5. Implementation Details of Arrays.sort() and Collections.sort()

Please note that the algorithms and tests used in the previous sections. They don’t reflect the library implementation’s performance, as they have more complex processes that allow them to optimize specific cases. The tests are presented only to provide more visual information about the inner workings of the simple implementation for these two kinds of algorithms.

It’s virtually impossible to compare Collections.sort() and Arrays.sort() without their underlying algorithms.

The underlying algorithms are the crucial part contributing to these two methods’ complexity and performance. Because Collections.sort() is implemented with stability in mind, they use Merge Sort or TimSort. At the same time, sorting primitives doesn’t require this property and can use Quicksort or Dual-Pivot Quicksort.

To better understand these methods, we looked at the sorting algorithms they use directly.

6. Conclusion

Sorting is one of the cornerstone operations in computer science, enabling efficient data manipulation across many applications. By understanding the differences between algorithms, we can make more informed decisions when coding and optimizing for performance and functionality according to your specific requirements.

Thus, although this article aims to highlight the difference between Collections.sort() and Arrays.sort(). It is also a great guide to understanding the differences between different sorting algorithms. As always, the code for the implementation of this algorithm can be found over on GitHub.

Persistence

REST

Security