Sunday, October 12, 2014

A case for Java on Desktop (with a benchmark)

When I started CodeRefractor, Windows 8 had just been released, and .Net looked like an unstable platform to invest in (for reasons like: the Xaml/WPF team was dissolved, the Metro runtime was a new way to get locked into Microsoft's garden, or Xamarin, but don't get me started there either).

In the last year many of these shortcomings were fixed, mostly by Xamarin's team and related projects: MonoGame runs nicely on most platforms, and Xamarin added some optimizations (which were part of CR :) ) that made the performance story better. Even Microsoft's side of the .Net world is much better now (RyuJIT, and an Ahead-of-Time strategy which, when finished, would give a performance profile similar to CR's).

So why run Java on the desktop? Here is my list of reasons why, today, you should consider Java over C++ or C# on the desktop:
- all major Java IDEs are free of charge (IntelliJ IDEA, Eclipse and NetBeans)
- the UI framework (JavaFX) is decent and has all the features you would expect from a top-notch UI framework, including CSS styling and GPU acceleration
- Java 8 fixes two of the biggest coding/behavior issues: lambdas make working with UI events, threads etc. sane! The language is too repetitive for my taste, but that makes it very readable; and it removes PermGen (a nifty bug-by-design in my view)
- Java 8 added the Stream API, whose biggest feature for me is a simple way to write parallel code
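
To illustrate that last point with a minimal sketch of mine (not code from the original post): with the Stream API, making a computation run on all cores is a one-word change.

```java
import java.util.stream.IntStream;

public class ParallelSum {
    public static void main(String[] args) {
        // Sum of squares of 0..999; the .parallel() call is the only
        // difference from the sequential version.
        long total = IntStream.range(0, 1000)
                              .parallel()
                              .mapToLong(i -> (long) i * i)
                              .sum();
        System.out.println(total); // 332833500
    }
}
```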

So fine, if you grant me that Java as a language and as a set of frameworks works about as well as almost any language of the year 2010, why not still pick one of the others? For me, cross-platform support and top-notch performance are really huge features. Shipping one Jar that is supported on all platforms is a huge deal.

So I made claims about performance; isn't Java slow, especially on the desktop? I would say: "it depends", and "if you know how Java works, Java runs fast".

So this weekend I wrote a simple Qt application to do something that compilers like C++'s know how to optimize: rotating a picture by 90 degrees (reference hardware: Core i5-540M, Win64, GCC MinGW 4.8.1 32-bit, Java 8u20 64-bit).

This is the "hottest of the hottest" loop that rotates a picture in C++ (without using assembly):
void MainWindow::rotatePixelData(int* __restrict src, int* __restrict dest, int width, int height){
    int newWidth = height;
    int newHeight = width;
    for(int x=0; x<width; x++) {
        for(int y=0; y<height; y++) {
            int newX = height-1-y;
            int newY = x;
            dest[newY*newWidth+newX] = src[y*width+ x];
        }
    }
}

The best time for rotating a 4K picture is 144 ms (the first run takes about 161 ms).

The Java code is the same, just without "__restrict":
static void rotatePixelData(int[] src, int[] dest, int width, int height) {
    int newWidth = height;
    int newHeight = width;
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            int newX = height - 1 - y;
            int newY = x;
            dest[newY * newWidth + newX] = src[y * width + x];
        }
    }
}

Typical time: 126 ms (the worst, though not the first, run was 146 ms).
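
The post doesn't show the timing harness; a minimal sketch of how such numbers can be taken (my assumption: System.nanoTime around the call, with several runs so the JIT gets to compile the hot loop) would be:

```java
public class RotateBench {
    // Same rotation as above: dest receives src rotated 90 degrees.
    static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        }
    }

    public static void main(String[] args) {
        int width = 3840, height = 2160;          // the 4K picture from the post
        int[] src = new int[width * height];
        int[] dest = new int[width * height];
        for (int run = 0; run < 10; run++) {      // early runs warm up the JIT
            long start = System.nanoTime();
            rotatePixelData(src, dest, width, height);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + elapsedMs + " ms");
        }
    }
}
```

The first one or two runs are interpreted or lightly compiled, which is why the first time is consistently the worst.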

Of course, not wanting to prove to myself that I'm showing bad benchmarks, or that I simply didn't tune the compiler enough, I wanted to make the difference on Java's side significant:

import java.util.ArrayList;
import java.util.List;

public class RotatorParallel {
    static List<Integer> buildSteps(int steps) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < steps; i++) {
            result.add(i);
        }
        return result;
    }

    public static void quickerRotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        List<Integer> steps = buildSteps(height);
        steps.parallelStream().forEach((Integer yBoxed) -> {
            int y = yBoxed;
            for (int x = 0; x < width; x++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        });
    }
}

This code is clearly longer and more contrived. What it does is split the computation per row and run it through the parallel Stream API I mentioned earlier: it will use all cores.

Most runs take 62-64 ms, and the slowest of the slow are 70 ms (so it is exactly a 2x speedup).

Some things to consider:
- The C++ code is 32-bit (yet I optimized it as much as I could without using assembly), while the Java code is the same Jar, running by default as 64-bit. This is a case where, with no setup, Java is guaranteed to give better numbers overall. If you compile the C++ for 64-bit, your binary will not run on 32-bit machines. I would expect the 64-bit C++ code to run close to Java, but not more than 5 ms faster (and it is also possible that it runs slower).
- Java optimizes based on hot code, and rotating a picture is hot code.
- The Stream API has overhead, so split the processing into bigger chunks if you parallelize. Most of the overhead comes from merging data (if you write map-reduce style code) and from "invisible" boxing; even in the sample shown, where the chunk is a whole row of the image, the boxing of a plain "int" is written out in explicit form.
- The code is not CPU-cache friendly: if you split the picture into chunks (like squares of 128x128) and rotate those, I would expect both Java and C++ to get faster by a factor of 2, but that would not change the relation of Java running slower/faster.
- Having multiple cores makes the code faster roughly by the number of cores. If you use C++, you can criticize me with: why didn't you use OpenMP? And I can say yes, except that OpenMP does not work on all of today's compilers (I'm talking here mostly about it missing from Clang on Mac OS X). So you would likely have to build your own mini multi-core framework to be cross-platform.
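
On the boxing overhead noted above: a hypothetical variant of the parallel rotation (my sketch, not from the post) using IntStream.range keeps the row index as a primitive int, so neither the List<Integer> of steps nor the explicit unboxing is needed:

```java
import java.util.stream.IntStream;

public class RotatorPrimitive {
    // Same per-row parallel split as RotatorParallel, but IntStream hands the
    // lambda a primitive int, avoiding Integer boxing and the index list.
    public static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        IntStream.range(0, height).parallel().forEach(y -> {
            for (int x = 0; x < width; x++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        });
    }
}
```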

Still, I think Java is not optimal to deploy (meaning: you have to ask users to install the Java runtime), and the developer experience has to improve (hopefully it will with the next Java, version 9, via "Project Jigsaw" and "sjavac"). For me, Jigsaw (which splits the Java runtime into modules) is what has to be added in the next release; at least in the public presentations they predicted that the startup of Java applications will improve further. And that is a big deal for the impression that Java is slow.

5 comments:

  1. I don't think there should be a speed difference, especially of 15%, in the single-threaded version.

    Afaik, Qt passes the -O2 flag to GCC. Could you please try specifying the -O3 flag in the QMAKE_CXXFLAGS variable?

    Also, I don't know how the JVM works, but is it possible that it generates code specific for the CPU at runtime, like the -march and -mtune flags do at compile time for GCC?

  2. Hi V, I changed as you suggested:
    -Ofast: (here are various outputs I'm getting)
    170 ms
    148 ms
    147 ms
    *135 ms (best outlier 135 ms)
    147 ms

    -O3:
    154 ms
    149 ms
    *145 ms

    So it looks like changing flags shows a variation within the margin of error.

    Also I've tried 64-bit, and it seems that the performance degrades a little:
    -O3 -m64
    163 ms
    *141 ms
    151 ms
    144 ms
    146 ms
    149 ms
    150 ms
    147 ms

    Also, it is not a Qt issue; the code is isolated from the Qt parts.

    But I recommend you do just what I've presented to you: make an array of 3840 x 2160 integers and use the code posted in this blog. Make it a console application (you can use Linux if you want). Then make a Java application that sets up an array the same way and calls the rotation code. Or, if you don't want to do that, I can give you the Jar and you just run this code on your machine. And if you find any (significant) difference, I will let you post a full entry to adjust the numbers and present your own differences. It would be an interesting post!

    "Also, I don't know how the JVM works, but is it possible that it generates code specific for the CPU at runtime, like the -march and -mtune flags do at compile time for GCC?"
    The short answer is yes, but there are some variables to talk about; if you want, I can write a blog entry on how the Java optimizer works.

  3. I just ran the Java and C++ versions on my Linux and Windows boxes. I really don't know exactly how to explain the results.

    Windows: AMD 6100 BE, 16GB RAM, Win 8.1, GCC 4.8.2, Qt 5.3
    Linux: AMD 6100 BE, 16GB RAM, Gentoo (Linux 3.14.14), GCC 4.7.3, Qt 5.3

    During all tests, the CPU governor was at performance mode.

    Both programs were just copy+paste of yours, with code reading clock values before and after.

    The Java version ran a bit slower than yours, averaging 132 ms on Windows and 137 ms on Linux.

    Now things get odd.

    C++ Windows (average: 151 ms)
    161 ms
    146 ms
    152 ms
    148 ms
    148 ms

    Command line: g++ -O3 rotate.cpp -o rotate

    C++ on Linux (average: 114 ms)
    114 ms
    115 ms
    114 ms
    114 ms
    115 ms

    Command line: g++ -m64 -mtune=generic -O3 rotate.cpp -o rotate

    Since there is a huge difference between the Windows and Linux versions, and apparently the only difference is the architecture, I decided to test in 32-bit mode on Linux. I didn't expect a difference at all, since all your code seems 32-bit aligned, or at most a really small speed gain for 64-bit due to the number of registers, but...

    g++ -m32 -mtune=generic -O3 rotate.cpp -o rotate
    average: 109 ms

    g++ -m32 -mtune=generic -O2 rotate.cpp -o rotate
    average: 114 ms

    Well... 32-bit with -O3 wins.
    But that is fine. More interesting, though, is the difference in performance between Windows and Linux. It looks like GCC generates better code on Linux, or Linux is a better platform for running CPU-intensive workloads.

    I would love to read about the JVM optimizer!

    Replies
    1. I will try to finish it tonight, at least to reference your results; in the worst case I will do a follow-up post.

      Can you try the multi-core code as well? I am curious whether your 3x2 modules will scale 6x (or close to it) or not. You may see the first loop as an outlier, but I don't know. You can run it on any OS (at least on the Java side).

      You will have to create a separate class (or import the two methods in your Java class).

    2. Sorry for not replying faster.

      The parallel versions did not scale with the processor count. I believe that on a true multi-processor/multi-core system it would scale. Despite being branded as a 6-core processor, the AMD FX 6100 BE is actually a 3-core processor with a very archaic SMT implementation: it basically replicates the integer unit and the decoding unit, afaik.

      I used OpenMP at the C++ version, and the code you provided for the Java version.

      For the OpenMP, I simply added the line:

      #pragma omp parallel for
      for (int x = 0; x < width; x++) {
      // ...

      The speedup in both cases was close to 4x. Java had a slight advantage here, giving 4.5x. The parallel Java test gave numbers very close to the C++ version: 29 ms on average, varying between 25 ms and 31 ms (the worst C++ case was 29 ms).

      I doubled the image size to verify whether thread creation was consuming too much CPU time. Terrible idea. The cache (I believe) became a huge problem. The speedup was only 3x in the best case (both for Java and C++) and the times varied a lot (between 420 ms and 540 ms).

      I would say don't rush the Java article, but it is already open in another Firefox tab...

      Thank you!
