optimization – Libera #java

Oh, shit, git! is a site that shows a number of common things git users say, as well as how to fix the situations that cause them. Note that the title indicates that the readers are expected to be of age. Some examples:
- Oh shit, I did something terribly wrong, please tell me git has a magic time machine!?!
- Oh shit, I committed and immediately realized I need to make one small change!
- Oh shit, I need to change the message on my last commit!
- Oh shit, I accidentally committed something to master that should have been on a brand new branch!
- Oh shit, I accidentally committed to the wrong branch!
- Oh shit, I tried to run a diff but nothing happened?!
- F*ck this noise, I give up.
  It’s worth noting that your editor has used at least a variant of all of these statements, most of them even in recent memory.
User whaley, in a moment when he wasn’t accompanied by roosters, cows, and goats, pointed out “Suffering-oriented programming“, saying that it convinced him that “make it work, make it pretty, make it fast” is the right approach, as compared to “make it work, make it fast, make it pretty.” Good read.
From Reddit: “Recaf is an open-source framework for authoring extensions (dialects) as libraries for Java. You can redefine every major syntactic element of the language, either add new ones or create your own flavor of Java that matches your needs. It can be used to give syntactic support to libraries, to generate and instrument code. Last but not least you can experiment with the design and implementation of Java extensions in plain Java.” Seems like another tool to compete (sort of) with Lombok and Autovalue.

Today someone asked about the size of java source files, because they wanted to know if they could unroll a loop without making javac throw up, I think. It’s an interesting question, but the main thing that caught the channel’s attention was the claim that an unrolled loop was orders of magnitude faster than the loop itself, such that the unrolled loop took 0.1 millisecond and the normal loop took 0.5 milliseconds – for a loop that simply ran 10000 times.
For one thing, the idea that a loop that runs 10000 times takes 0.1 ms sounds absurd; computers are fast. A Commodore 64 could run that loop in ten seconds. (Actually, after testing it ended up running in roughly twelve seconds. This is the first BASIC code I’ve written in close to thirty years.)

Let’s clear one thing out first: java methods – not classes – can be up to 64K. (if you want to know the actual details, see Limitations of the Java Virtual Machine.) I don’t know of a source limitation, although the idea of writing a 64K source file sounds horrifying and rather difficult to manage; I’m sure some enterprising soul can try to determine the largest java source file that can be written such that it still compiles to a valid .class file.
So let’s dig in for a bit, and see what Java does, from the standpoint of observation. Let’s write a loop and time it, with a number of iterations, to see what the JVM actually does to the code as it runs.

This was run on an eight-core Windows 8.1 machine, using the Java 8 virtual machine. Your results may vary based on your clock speed, memory, system load, operating system, and JVM. This should be obvious, but it’s worth remembering.

Here’s our source code, which contains two classes:

package org.javachannel;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.OptionalDouble;
import java.util.function.Consumer;
public class Thing {
  public static void main(String[] args) {
    Thing t = new Thing();
    for (int i = 0; i < 1500; i++) {
      t.measure(i);
    }
    t.display();
  }
  private void display() {
    SlidingAverage sa = new SlidingAverage();
    slide.stream().forEach(sa);
    sa.display();
    System.out.println("sum: " +
        artifact.stream().mapToInt(Integer::intValue).sum());
  }
  List slide = new ArrayList<>(2000);
  List artifact = new ArrayList<>(2000);
  private void measure(int i) {
    int sum = 0;
    long start = System.nanoTime();
    for (int j = 0; j < 10000; j++) {
      sum += j % 3;
    }
    long end = System.nanoTime();
    slide.add((end - start));
    artifact.add(sum);
  }
}
class SlidingAverage implements Consumer {
  List integers = new LinkedList<>();
  List averages = new ArrayList<>();
  @Override
  public void accept(Long value) {
    integers.add(value);
    if (integers.size() > 30) {
      integers.remove(0);
      OptionalDouble od = integers.stream().mapToLong(Long::longValue).average();
      if (od.isPresent()) {
        averages.add((long) od.getAsDouble());
      }
    }
  }
  public void display() {
    int count = 1;
    for (Long average : averages) {
      if (count % 30 == 0) {
        System.out.printf("%-5d %-7d%n", count, average);
      }
      count++;
    }
  }
}

The most important part of all of this is the loop itself:

  private void measure(int i) {
    int sum = 0;
    long start = System.nanoTime();
    for (int j = 0; j < 10000; j++) {
      sum += j % 3;
    }
    long end = System.nanoTime();
    slide.add((end - start));
    artifact.add(sum);
  }

This runs the inner block 10000 times, adding to the result based on the loop counter; this should prevent the JVM from optimizing the loop out of existence. It also adds a good bit to our runtime, as we'll see, but this is less of an impact than you might think.
The rest of the code is really me trying to be fancy, showing a sliding scale of averages; basically, every "average" is based on the prior thirty samples (where each sample is an iteration through the loop.)
So what did this show me when it ran? Here's some output:

30    56051
60    53855
90    54397
120   55338
150   54269
180   55965
210   55495
...
600   56820
630   64889
660   64304
690   50177
720   9536
750   9536
780   10448
810   9522
840   9536
870   10235
...
1440  9508
1470  9508
sum: 14998500

So what this is showing us is that the average of our first thirty times through the loop was ~56000 nanoseconds. A nanosecond is one billionth of a second, so there are 1000000 nanoseconds in a single millisecond; it’s saying that it took 0.056 milliseconds to run the loop the first time through, on average.
It stabilized for a while, until it ran the loop roughly 700 times; then the average dropped down to roughly 10000 nanoseconds.
If you remove the sum += j%3 from the loop, replacing it with sum += 1, the results are dramatic to say the very least: after 180 times through the loop, the average drops down to around 15 nanoseconds.

So what can we conclude?

I was actually surprised by the results; I expected the averages to fall, but not as quickly in the run as they actually did. I expected Hotspot to compile the code to machine language pessimistically, then run through two more levels of optimization before finally settling in on an optimal variation.
Instead, we saw three levels of optimization (as opposed to four). The first is difficult to see – it’s that very first value, which hides a number of calls (roughly ten) that actually took 100000 nanoseconds. We then stabilized around 56000 nanoseconds, and then dropped 80% of our runtime for each loop.
It’s safe to say that Hotspot did a good job on this code – maybe not perfect, but pretty darned close. I didn’t run this on a JVM with debugging enabled, so I don’t know what actual phases of compilation were used.
It’s also safe to say that these numbers are not reliable. Benchmarks are notoriously difficult to write, and even more notorious in Java (because it compiles and optimizes in multiple stages, as we see here), and even more notorious when it’s a microbenchmark like this one. ~10000 ns sounds like an incredibly long time for such a simple loop (“it’s only four or five lines!”) — but it really isn’t. (For reference, a normal blink of an eye takes around 300-400 milliseconds – or 300,000,000 nanoseconds.)

Tag: optimization

Interesting Links – 26 Sep 2016

Watching HotSpot in action, deductively

So what can we conclude?