1. Introduction
In this tutorial, we’ll explain how to calculate the percentage improvement when the performance metric is time. For instance, we may be interested in such a metric when optimizing our code to run faster and comparing a new version’s performance to that of the old one.
Such an empirical approach complements a more rigorous mathematical analysis of the worst-case and average-case complexity. What’s more, if our code is too intricate for a formal complexity analysis and comparison, the empirical evaluation is our only option. If that’s the case, we’ll probably have to use some statistics to justify our conclusions.
2. What Is a Time Improvement?
Let’s say that $t_{new}$ is the time our new version of code takes to run some tasks and that $t_{old}$ is the time the previous version took. Our goal is to quantify, as a percentage, how much better $t_{new}$ is than $t_{old}$.
When comparing times, smaller numbers are better, but people intuitively associate small quantities with low performance. So, we need a metric that inverts $t_{new}$ and $t_{old}$ so that if $t_{new} < t_{old}$, it assigns a larger score to the new version.
Also, we want the score to express the improvement in relative, not absolute terms. Why? Well, most people would agree that reducing the duration from 1000 to 900 seconds isn’t as big an improvement as decreasing it from 200 to 100. Percentages help us communicate such differences effectively.
There are essentially two ways to achieve both goals.
3. The Improvement as the Time We Save
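First, we can say that the improvement is the reduction in time needed to complete the same task as before. So, we first find the difference between $t_{old}$ and $t_{new}$, and then divide it by $t_{old}$:
$$\mathrm{improvement} = \frac{t_{old} - t_{new}}{t_{old}} \times 100\% \qquad (1)$$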
For example, let’s say that we’re designing a new algorithm for training neural networks. To test it, we fit a network to some data. The first version of the algorithm needed 300 seconds to train the network. We try to do better, so after some clever optimization, we reduce the training time to 30 seconds. Using the above formula, we get 90%:
$$\frac{300 - 30}{300} \times 100\% = \frac{270}{300} \times 100\% = 90\%$$
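The calculation above can be sketched in a few lines of Python. The function name `time_saved_pct` is ours and purely illustrative; the 300 and 30 seconds are the times from the example.

```python
def time_saved_pct(t_old: float, t_new: float) -> float:
    """Percentage of the old running time saved by the new version (formula 1)."""
    if t_old <= 0:
        raise ValueError("t_old must be positive")
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return 100.0 * (t_old - t_new) / t_old


print(time_saved_pct(300, 30))  # 90.0
```

If the new version is slower, the same function returns a negative number, e.g., `time_saved_pct(100, 200)` gives -100.0.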
3.1. Interpretation
Defined like this, the improvement tells us what fraction of the old version’s execution time we can save by switching to the new version. If $t_{new} < t_{old}$, the score (1) is positive, which makes sense since we actually do save time. In contrast, if $t_{new} > t_{old}$, we get a negative percentage. We interpret it as an indicator that the old version is better, but negative percentages aren’t that intuitive for most people.
4. The Improvement as the New Amount of Work Done
Another way to compute the percentage improvement is to ask how much work we can do with the new version in the same amount of time the old version took to complete a unit of work (e.g., a task).
So, if the old version took $t_{old}$ seconds to complete 1 unit, and the new version completes 1 unit of work in $t_{new}$ seconds, then the fraction:
$$\mathrm{improvement} = \frac{t_{old}}{t_{new}} \times 100\% \qquad (2)$$
tells us precisely what we asked. If $t_{new} < t_{old}$, we get a number > 100%, which agrees with our intuition. Conversely, if $t_{new} > t_{old}$, we get a number lower than 100%, which means that the new version works slower.
4.1. Example
Using the same execution times as above, we get:
$$\frac{300}{30} \times 100\% = 1000\%$$
So, we can complete ten units of work with the new version for the same amount of time it takes the old version to finish only one.
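As a quick check of this second definition, here’s a minimal Python sketch. The name `work_ratio_pct` is our own, not part of any library.

```python
def work_ratio_pct(t_old: float, t_new: float) -> float:
    """Work the new version completes in the time the old one needs for
    one unit, expressed as a percentage (formula 2)."""
    if t_old <= 0 or t_new <= 0:
        raise ValueError("both times must be positive")
    return 100.0 * t_old / t_new


print(work_ratio_pct(300, 30))  # 1000.0, i.e., ten units instead of one
```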
4.2. Variation
A variation of this approach is:
$$\mathrm{improvement} = \left(\frac{t_{old}}{t_{new}} - 1\right) \times 100\% \qquad (3)$$
The only difference is that we subtract 100% from the result we get using the formula (2). That slightly changes the interpretation. Now, the score denotes the additional number of units the new version can complete in the time the old one needs to complete a single unit of work.
However, if $t_{new} > t_{old}$, we’ll get a negative percentage, which isn’t intuitive. Further, if we get a score between 0% and 100%, that means that the new version is faster but completes less than one additional unit of work in comparison to the old code.
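The three cases discussed for the variation can be illustrated with a short Python sketch. The function name `additional_work_pct` and the sample times other than 300 and 30 seconds are made up for illustration.

```python
def additional_work_pct(t_old: float, t_new: float) -> float:
    """Additional work the new version completes while the old one
    finishes one unit, as a percentage (formula 3)."""
    if t_old <= 0 or t_new <= 0:
        raise ValueError("both times must be positive")
    return 100.0 * t_old / t_new - 100.0


print(additional_work_pct(300, 30))   # 900.0: nine additional units
print(additional_work_pct(100, 80))   # 25.0: faster, but less than one extra unit
print(additional_work_pct(100, 125))  # -20.0: the new version is slower
```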
5. Statistical Evaluation of Improvement
Statistically speaking, we shouldn’t make conclusions about the relative performance of our two versions of code by comparing single runs. It may happen that the new version ran faster just because the CPUs were busy with other tasks running in parallel when we ran the old version.
Moreover, if the execution time can vary with units of work, we should measure the time for several randomly selected units that are representative of the tasks the code will encounter in practice.
So, to be methodologically correct, we should compare the distributions of measured times for different runs on various units.
5.1. Setup
Let $t_{old}^{(i,j)}$ be the execution time of the $j$-th run of the old version for the $i$-th unit. Similarly, let $t_{new}^{(i,j)}$ denote the $j$-th execution time of the new version for the same, $i$-th unit.
With $n$ units and $m$ runs per unit, we have two $n \times m$ matrices of execution times. To estimate the improvement, we should compare those matrices. Depending on whether the corresponding elements are matched or not, we can proceed in two ways.
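One way to collect the two matrices is a small timing harness like the one below. It’s only a sketch: `measure_times`, `old_version`, and `new_version` are hypothetical names, and the two workloads are toy stand-ins for real code. Each row of the result corresponds to a unit of work, and each column to a run.

```python
import time
from typing import Callable, List, Sequence


def measure_times(func: Callable, units: Sequence, runs: int = 5) -> List[List[float]]:
    """Return a matrix where times[i][j] is the duration in seconds
    of the j-th run of func on the i-th unit."""
    times = []
    for unit in units:
        row = []
        for _ in range(runs):
            start = time.perf_counter()
            func(unit)
            row.append(time.perf_counter() - start)
        times.append(row)
    return times


# Toy stand-ins for the two versions of our code (assumptions for illustration).
def old_version(n: int) -> int:
    return sum([i * i for i in range(n)])


def new_version(n: int) -> int:
    return sum(i * i for i in range(n))


units = [10_000, 50_000, 100_000]  # three units of work
old_times = measure_times(old_version, units, runs=4)
new_times = measure_times(new_version, units, runs=4)
```

We use `time.perf_counter` because it’s a high-resolution clock intended for measuring short durations. The measured values will differ from machine to machine and from run to run.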
5.2. Matched Runs
Ideally, we would perform all the runs under the same conditions. For example, that means the CPUs would have the same load due to other processes running in parallel. Or, if the runs depend on random number generation, we’d use the same seed in all the $j$-th runs. In such cases, we say that the runs are matched because there is a 1-to-1 correspondence between times ($t_{old}^{(i,j)}$ corresponds only to $t_{new}^{(i,j)}$, and vice versa).
To get the overall improvement, we can average the pairwise improvements calculated for the matched elements in the matrices:
$$\frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{improvement}\left(t_{old}^{(i,j)}, t_{new}^{(i,j)}\right) \qquad (4)$$
where $\mathrm{improvement}$ is one of the improvement scores we discussed in previous sections. Since a picture is worth a thousand words, we should accompany the average with a plot of the distribution of the scores. If most of the scores are high, that’s strong evidence that the new version is faster than the old one.
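Here’s a minimal Python sketch of the matched average. The function names are ours, the score is the work ratio from formula (2), and the two small matrices are invented solely to make the arithmetic easy to follow.

```python
def work_ratio_pct(t_old, t_new):
    """Formula (2): work done by the new version, as a percentage."""
    return 100.0 * t_old / t_new


def matched_improvement(old_times, new_times, score=work_ratio_pct):
    """Average the score over matched entries of two equally shaped matrices."""
    if len(old_times) != len(new_times):
        raise ValueError("matrices must have the same number of rows")
    scores = []
    for old_row, new_row in zip(old_times, new_times):
        if len(old_row) != len(new_row):
            raise ValueError("matched rows must have the same length")
        scores.extend(score(o, n) for o, n in zip(old_row, new_row))
    return sum(scores) / len(scores)


# Made-up times in seconds: 2 units (rows) x 2 runs (columns).
old_times = [[10.0, 12.0], [20.0, 30.0]]
new_times = [[5.0, 6.0], [10.0, 20.0]]

print(matched_improvement(old_times, new_times))  # 187.5
```

The individual scores here are 200%, 200%, 200%, and 150%, so their mean is 187.5%.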
5.3. What if the Runs Aren’t Matched?
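If there’s no natural correspondence between $t_{old}^{(i,j)}$ and $t_{new}^{(i,j)}$, we can match all the runs we performed on the same units. More specifically, instead of calculating $\mathrm{improvement}\left(t_{old}^{(i,j)}, t_{new}^{(i,j)}\right)$ for the same $i$ and $j$, we compare each $t_{old}^{(i,j)}$ to each $t_{new}^{(i,k)}$ (for all $k$):
$$\frac{1}{nm^2} \sum_{i=1}^{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \mathrm{improvement}\left(t_{old}^{(i,j)}, t_{new}^{(i,k)}\right) \qquad (5)$$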
In this case, we can have different numbers of runs for each version, but it’s generally a good idea to devote the same computational effort to both versions.
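The all-pairs comparison can be sketched as follows. The names and the sample times are illustrative assumptions. The function averages over every (old run, new run) pair within each unit, which matches the all-pairs average when every unit has the same number of runs per version.

```python
def work_ratio_pct(t_old, t_new):
    """Formula (2): work done by the new version, as a percentage."""
    return 100.0 * t_old / t_new


def all_pairs_improvement(old_times, new_times, score=work_ratio_pct):
    """Average the score over all pairs of old and new runs on the same unit."""
    if len(old_times) != len(new_times):
        raise ValueError("both versions must be timed on the same units")
    total, count = 0.0, 0
    for old_row, new_row in zip(old_times, new_times):
        for t_old in old_row:
            for t_new in new_row:
                total += score(t_old, t_new)
                count += 1
    return total / count


# Made-up times in seconds: 2 units, 2 unmatched runs per version.
old_times = [[10.0, 20.0], [30.0, 60.0]]
new_times = [[5.0, 10.0], [15.0, 30.0]]

print(all_pairs_improvement(old_times, new_times))  # 225.0
```

Each unit contributes four pairs with scores of 200%, 100%, 400%, and 200%, so the overall mean is 225%.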
5.4. Example
Let’s say that we ran our code on three tasks four times and that the times are matched:
Using the formula (4) with the improvement definition (2), we get the average improvement:
So, we conclude that the new code completes, on average, 1.42 tasks in the time the old one needs to finish a single task. Expressed with definition (2), that’s a score of 142%.
5.5. Statistical Testing
Additionally, we can run a statistical test to check whether the actual improvement, measured with definition (2), is > 100%. One way would be to count the individual scores that are > 100% and construct a confidence interval around the proportion of such scores. If the interval contains only large proportions, we can be fairly sure that the new code is faster and that our results aren’t due to chance alone.
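As a rough illustration, the sketch below computes a 95% confidence interval for the proportion of scores above 100%. The choice of the Wilson score interval is ours, as the text doesn’t prescribe a particular method, and the function name and the list of scores are made up. The interval also assumes the scores are independent, which may not hold exactly for repeated runs on the same unit.

```python
import math


def proportion_faster_interval(scores, threshold=100.0, z=1.96):
    """Wilson score interval for the proportion of scores above the threshold.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    p = sum(s > threshold for s in scores) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical scores from formula (2): 8 out of 10 exceed 100%.
scores = [120, 150, 90, 200, 110, 130, 95, 140, 160, 105]
low, high = proportion_faster_interval(scores)
print(f"95% CI for the proportion of faster runs: [{low:.3f}, {high:.3f}]")
```

With so few scores, the interval is wide, so more runs would be needed before drawing a firm conclusion.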
6. Conclusion
In this article, we talked about how to compute the percentage improvement when the performance metric is time. We presented two formulae and discussed the problem from a statistical perspective as well.