1. Introduction

This quick article focuses on JMH (the Java Microbenchmark Harness). First, we'll get familiar with the API and learn its basics. Then we'll look at a few best practices that we should consider when writing microbenchmarks.

Simply put, JMH takes care of things like JVM warm-up and code-optimization paths, making benchmarking as simple as possible.

2. Getting Started

To get started, we can keep working with Java 8 and simply define the dependencies:

<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-core</artifactId>
    <version>1.36</version>
</dependency>
<dependency>
    <groupId>org.openjdk.jmh</groupId>
    <artifactId>jmh-generator-annprocess</artifactId>
    <version>1.36</version>
</dependency>

The latest versions of the JMH Core and JMH Annotation Processor can be found in Maven Central.

Next, let's create a simple benchmark by utilizing the @Benchmark annotation (in any public class):

@Benchmark
public void init() {
    // Do nothing
}

Then we add the main class that starts the benchmarking process:

public class BenchmarkRunner {
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}

Now running BenchmarkRunner will execute our arguably somewhat useless benchmark. Once the run is complete, a summary table is presented:

# Run complete. Total time: 00:06:45
Benchmark        Mode  Cnt           Score          Error  Units
BenchMark.init  thrpt  200  3099210741.962 ± 17510507.589  ops/s
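
As an aside, instead of delegating to org.openjdk.jmh.Main, we can also configure and launch the run through JMH's programmatic API. Here's a minimal sketch, assuming our benchmark class matches the "BenchMark" pattern from the output above:

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ProgrammaticRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
          .include("BenchMark") // regex selecting which benchmarks to run
          .forks(1)             // override the default fork count
          .build();
        new Runner(options).run();
    }
}

This approach comes in handy when we want to tweak fork counts, iteration counts, or output formats without passing command-line arguments.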

3. Types of Benchmarks

JMH supports four benchmark types: Throughput, AverageTime, SampleTime, and SingleShotTime. These can be configured via the @BenchmarkMode annotation:

@Benchmark
@BenchmarkMode(Mode.AverageTime)
public void init() {
    // Do nothing
}

The resulting table will have an average time metric (instead of throughput):

# Run complete. Total time: 00:00:40
Benchmark       Mode  Cnt   Score   Error  Units
BenchMark.init  avgt   20  ≈ 10⁻⁹          s/op
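
Incidentally, @BenchmarkMode also accepts an array of modes, so a single method can be measured in several ways at once; here's a quick sketch reusing the same no-op benchmark:

@Benchmark
@BenchmarkMode({ Mode.Throughput, Mode.AverageTime })
public void init() {
    // Do nothing; measured both as throughput and as average time
}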

4. Configuring Warmup and Execution

By using the @Fork annotation, we can set up how the benchmark execution happens: the value parameter controls how many times the benchmark will be executed, and the warmups parameter controls how many times the benchmark will dry run before results are collected. For example:

@Benchmark
@Fork(value = 1, warmups = 2)
@BenchmarkMode(Mode.Throughput)
public void init() {
    // Do nothing
}

This instructs JMH to run two warm-up forks and discard the results before moving on to the real timed benchmarking.

Also, the @Warmup annotation can be used to control the number of warmup iterations. For example, @Warmup(iterations = 5) tells JMH that five warm-up iterations will suffice, overriding the default count.
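
Its counterpart, the @Measurement annotation, configures the measured iterations in the same way. Here's a brief sketch combining both; the iteration counts and times are arbitrary values for illustration:

@Benchmark
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.Throughput)
public void init() {
    // Do nothing
}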

5. State

Let’s now examine how a less trivial and more indicative task of benchmarking a hashing algorithm can be performed by utilizing State. Suppose we decide to add extra protection from dictionary attacks on a password database by hashing the password a few hundred times.

We can explore the performance impact by using a State object:

@State(Scope.Benchmark)
public class ExecutionPlan {

    @Param({ "100", "200", "300", "500", "1000" })
    public int iterations;

    public Hasher murmur3;

    public String password = "4v3rys3kur3p455w0rd";

    @Setup(Level.Invocation)
    public void setUp() {
        murmur3 = Hashing.murmur3_128().newHasher();
    }
}

Our benchmark method will then look like this:

@Fork(value = 1, warmups = 1)
@Benchmark
@BenchmarkMode(Mode.Throughput)
public void benchMurmur3_128(ExecutionPlan plan) {

    for (int i = plan.iterations; i > 0; i--) {
        plan.murmur3.putString(plan.password, Charset.defaultCharset());
    }

    plan.murmur3.hash();
}

Here, the field iterations will be populated with the appropriate values from the @Param annotation by JMH when the state object is passed to the benchmark method. The @Setup-annotated method is invoked before each invocation of the benchmark and creates a new Hasher, ensuring isolation.

When the execution is finished, we’ll get a result similar to the one below:

# Run complete. Total time: 00:06:47

Benchmark                   (iterations)   Mode  Cnt      Score      Error  Units
BenchMark.benchMurmur3_128           100  thrpt   20  92463.622 ± 1672.227  ops/s
BenchMark.benchMurmur3_128           200  thrpt   20  39737.532 ± 5294.200  ops/s
BenchMark.benchMurmur3_128           300  thrpt   20  30381.144 ±  614.500  ops/s
BenchMark.benchMurmur3_128           500  thrpt   20  18315.211 ±  222.534  ops/s
BenchMark.benchMurmur3_128          1000  thrpt   20   8960.008 ±  658.524  ops/s
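
As an aside, @Setup (and its counterpart @TearDown) accepts a Level argument that controls how often the fixture method runs. Here's a brief sketch of the available levels; the method bodies are placeholders:

@State(Scope.Benchmark)
public class LifecyclePlan {

    @Setup(Level.Trial)
    public void beforeTrial() {
        // runs once per fork, before any iterations
    }

    @Setup(Level.Iteration)
    public void beforeIteration() {
        // runs before every iteration
    }

    @TearDown(Level.Invocation)
    public void afterInvocation() {
        // runs after every single call to the benchmark method
    }
}

Note that Level.Invocation fixtures add per-call overhead, so they're best reserved for cases like ours where each invocation genuinely needs fresh state.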

6. Dead Code Elimination

When running microbenchmarks, it's very important to be aware of JVM optimizations, as they may affect the benchmark results in a very misleading way.

To make matters a bit more concrete, let’s consider an example:

@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public void doNothing() {
}

@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public void objectCreation() {
    new Object();
}

We expect object allocation to cost more than doing nothing at all. However, if we run the benchmarks:

Benchmark                 Mode  Cnt  Score   Error  Units
BenchMark.doNothing       avgt   40  0.609 ± 0.006  ns/op
BenchMark.objectCreation  avgt   40  0.613 ± 0.007  ns/op

Apparently, finding a place in the TLAB, then creating and initializing an object, is almost free! Just by looking at these numbers, we should know that something does not quite add up here.

Here, we're the victim of dead code elimination. Compilers are very good at optimizing away redundant code, and as a matter of fact, that's exactly what the JIT compiler did here.

In order to prevent this optimization, we should somehow trick the compiler and make it think that the code is used by some other component. One way to achieve this is just to return the created object:

@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public Object pillarsOfCreation() {
    return new Object();
}

Alternatively, we can let a Blackhole consume it:

@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
public void blackHole(Blackhole blackhole) {
    blackhole.consume(new Object());
}

Having Blackhole consume the object is a way to convince the JIT compiler not to apply the dead code elimination optimization. Either way, if we run these benchmarks again, the numbers make more sense:

Benchmark                    Mode  Cnt  Score   Error  Units
BenchMark.blackHole          avgt   20  4.126 ± 0.173  ns/op
BenchMark.doNothing          avgt   20  0.639 ± 0.012  ns/op
BenchMark.objectCreation     avgt   20  0.635 ± 0.011  ns/op
BenchMark.pillarsOfCreation  avgt   20  4.061 ± 0.037  ns/op
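
As a side note, Blackhole also offers a static consumeCPU(long) helper that burns a predictable number of CPU "tokens", which can be useful when simulating work of a known size; here's a small sketch with an arbitrary token count:

@Benchmark
public void burnCycles() {
    // burns roughly 4096 tokens' worth of CPU time per invocation
    Blackhole.consumeCPU(4096);
}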

7. Constant Folding

Let’s consider yet another example:

@Benchmark
public double foldedLog() {
    int x = 8;

    return Math.log(x);
}

Calculations based on constants may return the exact same output, regardless of the number of executions. Therefore, there is a pretty good chance that the JIT compiler will replace the logarithm function call with its result:

@Benchmark
public double foldedLog() {
    return 2.0794415416798357;
}

This form of partial evaluation is called constant folding. In this case, constant folding completely avoids the Math.log call, which was the whole point of the benchmark.

In order to prevent constant folding, we can encapsulate the constant state inside a state object:

@State(Scope.Benchmark)
public static class Log {
    public int x = 8;
}

@Benchmark
public double log(Log input) {
    return Math.log(input.x);
}

If we run these benchmarks against each other:

Benchmark             Mode  Cnt          Score          Error  Units
BenchMark.foldedLog  thrpt   20  449313097.433 ± 11850214.900  ops/s
BenchMark.log        thrpt   20   35317997.064 ±   604370.461  ops/s

Apparently, the log benchmark is doing some serious work compared to foldedLog, which makes sense.

8. Conclusion

This tutorial focused on and showcased JMH, Java's microbenchmarking harness.

As always, code examples can be found on GitHub.