Monday, October 13, 2014

Java Optimizer - a primer

After I posted the previous blog entry about Java running fast (at least for a mini benchmark: rotating a picture), some great feedback appeared in the comments from a user named V.

He exposed one problem with my methodology: GCC behaves better on Linux. He gets around 151 ms on Windows (also my testing platform) and a great time of 114 ms (on 32 bit) on Linux. I have to say: this is great news (at least for GCC users on Linux)! Java executed the same task in about 130 ms on his machine (regardless of the OS).

So, as an introduction: why does Java get such good numbers even though it compiles the code every time it reads the "bytecode" file and it "interprets" all the code?

Java, in a very simplified form, follows this diagram:

Don't worry about the many steps; the main thing to know is that when you "execute" Java, there is in fact a program, named the Java Virtual Machine, which reads all the instructions of your program (named Java bytecode), typically packed as a Jar file (or many .class files).

The diagram above also shows a part which doesn't run anything but makes sure your program is correctly formatted.

So what happens when you write something like Hello World in Java?
1. you write your program.java,
2. you compile it using javac; as a result you get one .class file (which you can pack into a .jar if you want) which describes your original program as stack operations (so instead of a = 2 + 3, it is rewritten as: "push int 2", "push int 3", "add int", "store int into a")
3. you tell java.exe (on Windows; just java on Linux/OS X) to "run" your .class file
4. java.exe first validates that your .class file is well formatted
5. Java interprets your code (running the .class instructions one by one) until a threshold is met (typically 10,000 "loop edges", meaning 10,000 times that the same sequence of instructions is interpreted)
6. if your hot code reaches this threshold, Java compiles it using the so-called C1 and C2 compilers (also known as the client and server compilers). The main differences between C1 and C2 (that I found presented publicly) are that C2 does more aggressive inlining and C2 uses a slower technique to allocate registers. On 64-bit Java, the C2 compiler is enabled by default, and from Java 8 "tiered mode" is enabled by default (it is not yet clear to me whether for both 32 and 64 bit or just 32 bit, I never tested it), which means that "hot" code is compiled first with C1 and later with C2.
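The warm-up behavior in steps 5-6 can be observed directly: HotSpot has a `-XX:+PrintCompilation` flag that logs each method as it gets compiled. A minimal sketch to try it with (class and method names are mine):

```java
// Run with: java -XX:+PrintCompilation HotLoop
// Once the interpreter's invocation/loop-edge thresholds are crossed,
// hotLoop should appear in the compilation log.
public class HotLoop {
    // A simple hot method: enough iterations to cross the JIT thresholds.
    public static long hotLoop(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i * 31L;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        // Call many times so the method itself (not only its inner loop) gets hot.
        for (int round = 0; round < 20_000; round++) {
            total += hotLoop(1_000);
        }
        System.out.println("total = " + total);
    }
}
```

The exact log format and thresholds vary between JVM versions, so treat the output as a guide, not a contract.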

So the "tricks" to get fast code in Java are:
- make as many repetitions of the hot code as possible: don't run many branches that get "warm" but not "hot enough" - they will not be compiled.
- write code that doesn't do useless work inside the hot loop
- if you need really better performance, use C2 (-server) or tiered mode. Both are enabled by default on JVMs from Java 8 on, so just updating your JVM gives you good performance defaults
- don't execute a lot of method dispatches (also known as virtual calls): try to make your hot loop use just one class - C2 will likely inline these calls correctly. Sometimes C2 inlines them speculatively, so you can get great performance even if you don't think about virtual calls.
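A minimal sketch of the last trick (all names are mine): the call site below is virtual in the bytecode, but since only one implementation is ever loaded, the JIT can devirtualize and inline it:

```java
// A call site that only ever sees one receiver class stays monomorphic,
// so the JIT can bind it early and inline the call.
interface Op {
    int apply(int x);
}

// The only implementation that is ever loaded.
final class AddOne implements Op {
    public int apply(int x) { return x + 1; }
}

public class Monomorphic {
    // op.apply() is an interface (virtual) call in the bytecode, but with a
    // single loaded implementation C2 can devirtualize and inline it.
    public static int run(Op op, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            acc = op.apply(acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(run(new AddOne(), 1_000_000)); // prints 1000000
    }
}
```

If a second implementation of `Op` were loaded and used at the same call site, the JIT would deoptimize and fall back to a real virtual dispatch.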

One big advantage Java has (which can give even more speedup than C++ can get) is inlining "over the object file boundaries": in some cases C/C++ code runs slower because the C++ compiler cannot inline from one object file to another (nowadays this is possible to some extent), and it is certainly impossible to inline "hot code" from a DLL/libSO. Even more important, Java always keeps the list of loaded classes at runtime, so even if you use virtual methods everywhere, if no loaded class overrides your class, the calls will always be dispatched as early-bound (non-virtual) calls.

One very interesting talk discussing the optimizations that Java does is this one by Gil Tene: it is a great way to understand what most optimizing compilers do, so it is a great talk even if you are using C++ (at least the first half shows typical optimizations that CR, .Net or any C++ compiler of the last 10 years will do for you!).

Last but not least, Java has some performance pitfalls, some of which happen to also be C++ pitfalls:
- CPU stalls: when your data does not fit nicely into the CPU cache, your code will typically run 10 times slower (or even more on CPUs with lower memory bandwidth, like ARMs). Yes, you read correctly: not 10% slower, but 10 times slower. Making your algorithm read and write consecutive addresses in memory will make it run much faster. So when you write your program, try to keep the data of the hottest of the hot code within 256 KB, for both reads and writes. Java makes this a bit harder as it doesn't have the equivalent of a "value type", so an array of 3D coordinates is in fact an array of individual pointers to separate memory structures. For this reason, I would recommend packing coordinates into primitive arrays and doing your processing with an index into them. But watch for this problem in C++ too: it can hurt your C++ performance as well!
- Java has startup lag and you still have to wait for Java to warm up: this means it is not ideal for games, but there are tricks (without fighting Java) to make it run well enough for games: start your game engine as the very first thing. To my knowledge this is what Minecraft does (I don't play it, sorry :( ). This lets the JVM learn in the first seconds what your code does and gives it a chance to compile most of what your game will run
- some Java APIs (mostly web services) tend to be "fat", meaning you have to process XML or JSON to read your settings, to communicate, and so on. I recommend using faster replacements for fetching remote data (without optimizing the dataset yourself): instead of sending raw XML (or even saving it on disk), use OpenEXI; and to send data over the wire, if you have a more complex dataset, I found KryoNet to run very fast. This is also true for C++ coding, but as far as I've noticed it is a more common problem in the "Java mindset"
- Java gives you mostly the same performance as fairly good C++ code (maybe a bit slower than on Linux, but not by enough to worry about, as V showed me), but you have to improve the other slow parts: don't be clever and write your own threading solution (it is fairly common in the C++ world to rewrite everything); use existing solutions as much as possible. JavaFX, for example, has GPU-accelerated UI graphics and is excellent for most UIs. If you find it slow because it parses a large XML for the UI, hand-create just the small UI parts that are expensive, but don't go overboard! For example, I found in jHeroes2 that the original game file, Heroes2.agg, is about 40 MB, but it is faster to read it once, compress every individual item (this big file is a collection of small files) into a Heroes2.opt of about 20 MB, and then reading + unpacking that is faster (at least on my machine) than reading the 40 MB. This is because of the huge gap between CPU and disk speeds
- last but not least, the GC: Java's allocator is much faster than the C/C++ heap allocator, but it is called for almost every object. C/C++ and C# can allocate objects on the stack (so they are "freed" automatically on function exit). So try not to allocate objects for trivial data, but also don't keep them in memory too long. The "long pause" (named full GC) in Java is always proportional to how much live data you have at the moment the GC triggers. If you find you have too many pauses, try doubling or tripling the memory of your application and the GC should run at most half as often, or use G1, an adaptive collector. If you write games, I would expect you not to allocate much (excluding a very small amount of very short-lived data such as key strokes), so the full GC is never called.
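The packing advice from the CPU-stalls bullet can be sketched like this (all names are mine): a structure-of-arrays layout keeps each coordinate in a flat primitive array, so a hot scan reads consecutive memory instead of chasing one pointer per point:

```java
// Structure-of-arrays sketch: instead of a Point3D[] (an array of pointers
// to scattered heap objects), keep each coordinate in its own primitive
// array, so linear scans touch consecutive memory.
public class Points {
    public final float[] xs, ys, zs;

    public Points(int n) {
        xs = new float[n];
        ys = new float[n];
        zs = new float[n];
    }

    // A hot scan reads xs[] sequentially: cache friendly.
    public float sumX() {
        float total = 0;
        for (float x : xs) {
            total += x;
        }
        return total;
    }
}
```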
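One way to follow the GC advice above (a sketch, all names are mine): reuse a single mutable scratch object across iterations of a hot loop, instead of allocating a fresh one per iteration, so the collector has less garbage to trace:

```java
// Reuse one mutable scratch object instead of allocating per iteration.
public class ReuseScratch {
    // Mutable scratch buffer, allocated once.
    static final class Vec {
        double x, y;
        void set(double x, double y) { this.x = x; this.y = y; }
    }

    public static double process(int n) {
        Vec scratch = new Vec();     // one allocation, outside the loop
        double acc = 0;
        for (int i = 0; i < n; i++) {
            scratch.set(i, i * 2.0); // reuse: no per-iteration garbage
            acc += scratch.x + scratch.y;
        }
        return acc;
    }
}
```

Note that for a trivial case like this, HotSpot's escape analysis may already avoid the per-iteration allocation on its own; the pattern matters most when the object would otherwise escape to the heap.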

I hope this blog entry was useful, and I hope (as I may not add many things to CR in the near future) to add some other entries like this. Would you (the reader) find it useful to see some YouTube "how to" programming tutorials? Would you follow them?

Sunday, October 12, 2014

A case for Java on Desktop (with a benchmark)

When I started CodeRefractor, Windows 8 had just been released, and .Net looked like an unstable platform to invest in (for reasons like: the Xaml/WPF team was dissolved, the Metro runtime being a new way to get locked into Microsoft's garden, or Xamarin - but don't get me started there either).

In the last year many of the shortcomings were fixed, mostly by Xamarin's team and related projects: MonoGame runs nicely on most platforms, Xamarin added some optimizations (which were part of CR :) ) and improved the performance story, but even Microsoft's side of the .Net world is much better now (RyuJIT, and an ahead-of-time strategy which, when finished, will give a performance profile similar to CR's).

So why run Java on the desktop? Here is my list of today's top reasons to prefer Java over C++ or C# on the desktop:
- all major Java IDEs are free of charge (IntelliJ IDEA, Eclipse and NetBeans)
- the UI framework (JavaFX) is decent and has all the features you would expect from a top-notch UI framework, including CSS styling and GPU acceleration
- Java 8 fixes two of the biggest coding/behavior issues: lambdas make working with UI events, threads etc. sane! (the language is very repetitive for my taste, but that also makes it very readable), and PermGen (a nifty bug-by-design in my view) was removed
- Java 8 added the Stream API, whose biggest feature for me is a simple way to write parallel code.
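To make the two Java 8 points above concrete, here is a tiny sketch (class name is mine) showing a lambda in place of an anonymous Runnable, and a one-line parallel reduction with the Stream API:

```java
import java.util.stream.IntStream;

public class Java8Features {
    public static void main(String[] args) {
        // A lambda instead of an anonymous Runnable class:
        Runnable task = () -> System.out.println("clicked");
        task.run();

        // A parallel sum of squares using the Stream API:
        long sum = IntStream.rangeClosed(1, 100)
                            .parallel()
                            .mapToLong(i -> (long) i * i)
                            .sum();
        System.out.println(sum); // 338350
    }
}
```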

So fine: if you grant me that as language/frameworks Java works mostly like almost any language of the year 2010, why not still pick any of the other languages? For me, cross-platform support and top-notch performance are really huge features. Shipping one Jar that is supported on all platforms is a huge deal.

So I made claims about performance - isn't Java slow, especially on the desktop? I would say: "it depends", and "if you know how Java works, Java runs fast".

So this weekend I made a simple Qt application, doing something that compilers like C++'s know how to optimize: rotating a picture by 90 degrees (reference hardware: Core i5-540M, Win64, GCC MinGW 4.8.1 32-bit, Java 8.0u20 64-bit).

This is the "hottest of the hottest" loop that rotates a picture in C++ (without using assembly):
void MainWindow::rotatePixelData(int* __restrict src, int* __restrict dest, int width, int height){
    int newWidth = height;
    int newHeight = width;
    for(int x=0; x<width; x++) {
        for(int y=0; y<height; y++) {
            int newX = height-1-y;
            int newY = x;
            dest[newY*newWidth+newX] = src[y*width+ x];
        }
    }
}

The best time for rotating a 4K picture is 144 ms (the first run takes about 161 ms).

The Java code is the same, just without "__restrict":
static void rotatePixelData(int[] src, int[] dest, int width, int height) {
    int newWidth = height;
    int newHeight = width;
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            int newX = height - 1 - y;
            int newY = x;
            dest[newY * newWidth + newX] = src[y * width + x];
        }
    }
}

Typical time: 126 ms (the worst, yet not the first, run: 146 ms).

Of course, not wanting to prove to myself that I show bad benchmarks, or that I simply didn't tune the compiler enough, I wanted to make a significant difference in Java's favor:

import java.util.ArrayList;
import java.util.List;

public class RotatorParallel {
    static List<Integer> buildSteps(int steps) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < steps; i++) {
            result.add(i);
        }
        return result;
    }

    public static void quickerRotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        List<Integer> steps = buildSteps(height);
        steps.parallelStream().forEach((Integer yBoxed) -> {
            int y = yBoxed;
            for (int x = 0; x < width; x++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];
            }
        });
    }
}

This code is clearly longer and more contrived; what it does is split the computation per row and use the parallel Stream API I mentioned earlier: it will run using all cores.

Most times are 62-64 ms and the slowest of the slow are 70 ms (so exactly a 2X speedup).

Some things to consider:
- the C++ code is 32-bit (yet I optimized it as much as I could without using assembly), while Java uses the same Jar and by default I was running 64-bit. This is a case where, with no setup, Java is guaranteed to give better numbers overall. If you compile C++ for 64 bit, your code will not run on machines with 32-bit CPUs. I would expect the 64-bit C++ code to run close to Java but not more than 5 ms faster (though it is also possible that it runs slower)
- Java optimizes based on hot code, and rotating a picture is hot code
- using the Stream API has an overhead, so split the processing into bigger chunks if you make it parallel. Most of the overhead is in merging data (if you code in a map-reduce style) and some "invisible" boxing; even in the shown sample, where the chunk is a whole row of the image, the boxing of a simple "int" is there, written in explicit form
- the code is not CPU-cache friendly: if you split the picture into chunks (like squares of 128x128) and rotate those, I would expect both Java and C++ to improve by a factor of 2 (though this would not change the relation of Java running slower/faster)
- having multiple cores makes the code faster in proportion to the number of cores. If you use C++, you can criticize me with: why didn't you use OpenMP? And I can say yes, except that OpenMP does not work on all of today's compilers (I'm talking here mostly about it missing in Clang/Mac OS X), so you would likely have to make your own mini multi-core framework cross-platform.
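On the boxing overhead mentioned above: since Java 8 the same per-row parallelism can be written with `IntStream.range`, which keeps the row index a primitive int and drops the `List<Integer>` entirely (a sketch, class name is mine):

```java
import java.util.stream.IntStream;

// Same per-row parallel rotation, but with a primitive int stream:
// no List<Integer> of row indices and no Integer boxing/unboxing.
public class RotatorPrimitive {
    public static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        IntStream.range(0, height).parallel().forEach(y -> {
            int newX = height - 1 - y;
            for (int x = 0; x < width; x++) {
                dest[x * newWidth + newX] = src[y * width + x];
            }
        });
    }
}
```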
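And the cache-blocking idea from the list can be sketched like this (class name and the tile size are mine; the tile size should be tuned per machine): rotate the picture tile by tile so both the reads and the writes stay cache-local:

```java
// Cache-blocked rotation sketch: process the picture in square tiles so
// reads from src and writes to dest both stay inside the CPU cache.
public class BlockedRotator {
    static final int TILE = 128; // assumed tile size, tune per machine

    public static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        for (int y0 = 0; y0 < height; y0 += TILE) {
            for (int x0 = 0; x0 < width; x0 += TILE) {
                int yEnd = Math.min(y0 + TILE, height);
                int xEnd = Math.min(x0 + TILE, width);
                // Rotate one tile; same formula as the plain loop.
                for (int y = y0; y < yEnd; y++) {
                    for (int x = x0; x < xEnd; x++) {
                        dest[x * newWidth + (height - 1 - y)] = src[y * width + x];
                    }
                }
            }
        }
    }
}
```

For pictures smaller than one tile this degenerates to the plain double loop, so the output is identical; the win only shows up on images much larger than the cache.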

Still, I think Java is not optimal for deployment (meaning: you have to ask users to install the Java runtime) and the developer experience has to improve (hopefully it will with the next Java, named 9, via "Project Jigsaw" and "sjavac"). For me, Jigsaw (which splits the Java runtime into modules) is what has to be added in the next release; at least in the public presentations they predicted it will further improve the startup of Java applications. And that is a big deal against the impression that Java is slow.

Friday, October 10, 2014

Don't trust (VM) benchmarks and how to spot bad benchmarks (II)

Here is a somewhat more realistic, yet very bad, benchmark:
http://youtu.be/C3-p-Txnr4Q

In fact, in a way, this is the best kind of benchmark you can get:
- the same source code is presented
- it is running on the same device

This benchmark is simply bad because it doesn't explain nicely what it compares.

So how can you run the same code in two environments (like 2 compilers) and get a 3x slowdown? When you don't compare the same thing. And here is why:
- the first contender, the Xamarin "VM", is in fact IKVM running on top of Mono, which runs on top of LLVM
- the second contender, RoboVM, runs on top of LLVM

So why such a great difference? There are several reasons: IKVM translates Java bytecode into translated CIL (MSIL) bytecode, and Mono's optimizer runs this bytecode through a few passes before handing it to LLVM. These translations add overhead, as the layers were not designed for each other. The biggest hit is in the IKVM-to-CIL layer, I would predict.

How RoboVM works: RoboVM takes Java bytecode and optimizes it using an academic framework (named Soot), which at the end outputs a faster version of the original Java bytecode, then emits the form that LLVM wants/expects (SSA form).

In simpler terms: Xamarin is a "quick translator" of Java bytecode, while RoboVM is an optimizing compiler of the Java bytecode. In the end, it is to be expected that Xamarin runs slower. It is not because Mono produces slow code (even though it runs a bit slower than .Net, it produces fairly decent code), but because it is very likely that IKVM does not produce 1:1 code for the Mono platform.

Understanding your compiler/platform is critical to making your code run fast - not at the assembly level, but in terms of what the platform expects - and then you can get really nice numbers.

To be clear: RoboVM is the best way to run Java on iOS by a wide margin, but I would say that if your decision is between, for example, RoboVM and Xamarin (and the licensing costs of Xamarin don't scare you :) ), and you pick RoboVM because it runs faster, you will likely be wrong. Write C# on Xamarin and you may well run (a bit) faster than on RoboVM.

I will make a future post on why I think Java has a great opportunity on the desktop in the (near) future, performance-wise, and also a post on how to write fast Java code and things like that.

NB. Is any reader running a Pi at home? Or any similar card-sized ARM-based desktop? Could you run a small benchmark if I write some code for you? Contact me on Google Plus if yes. I am really curious how it performs, for the next post.

Thursday, October 9, 2014

Announcing my next F/OSS project

And it is going to be written in Java.

If you are interested in either Heroes 2 or Java, please follow the updates of this new F/OSS project I've started to contribute to.

Tuesday, October 7, 2014

Don't trust (VM) benchmarks and how to spot bad benchmarks

Introduction

Around the year 2000 computing changed drastically, even more so with the launch of the multi-core era around 2005 and after. Right now we are in another "revolution" of "computing". And every time, you will see new benchmarks where a generation is 25% faster (like the A8 is faster than the A7) or 100% faster (like the A7 being "up to" 100% faster than the A6), and so on!

But, as with many things in real life, benchmarks are critical to show whether users can have a good experience:
- you want your image effect to finish fast. Even more when a movie has to be processed - and a movie is a huge collection of photos - you care about the language used and the time until your animation is finished
- you want all cores to be used, especially when 4 cores look to be the norm today (or at least 2 cores with hyperthreading), so you want all your hardware to be working
- you want to save battery on your mobile devices
- and many other reasons: when things go fast, you feel happier than when things go slow

What you should care about

But while you, as a user, may want all these, sometimes you miss the mark. I simply mean that when you do your tasks, you may look in the wrong direction:
- the disk is many, many times slower than memory. Let's say you have to apply Sepia tones, as fast as possible, to a picture you've just received on Skype from a friend. Starting Photoshop (or Gimp) can take literally many seconds, if not a full minute (if you have many plugins loading on startup). With Paint.Net the picture-editing program starts faster, you can select the Sepia tone faster, and you can eventually send the picture back faster. This is not to say that Photoshop (or any other program, for that matter) would not theoretically finish faster if you had 100 pictures to load, but often starting a smaller program makes your task finish sooner
- on a similar note: do you have enough RAM (system memory)? When your system feels slow, it is very often one of two factors: you have too little memory in your system (and memory is really one of the most disproportionately cheap components), or your CPU is really outdated (say you want to process big pictures with a Pentium 4 1.7 GHz from 2001). But even when the 2nd case is what makes your system slow, it is still likely that adding memory will make it faster and usable even 10 years later!
- if your application doesn't care much about RAM and disk, it is (very) probably a game, so the next question is: do you have the best video card in your budget? It can be an integrated video card, but it should be the fastest you can afford. Very few games are CPU-limited, and even if the CPU is too slow and you get just 30 frames per second (so you can "feel" gaming lag), you can increase the game detail and lose very little performance (if you have a fast enough video card) - everything drops only to, say, 28 FPS. So nothing to lose here!

What Internet says

OK, that said, I've noticed some very bad benchmarks lately around the Internet, made basically to show shocking numbers: that Android L will run basically slower than Android 4.4, or the reverse (these are Google's numbers in the attached picture), or the latest of them, a bit old, but with a big surprise.


What is funny in these 3 sources are basically these conclusions:

From source (1) you will "understand" that performance decreases slightly between Android K and L (the criticism being that it is JavaScript code); source (2) states that this is not true and that Android L runs a bit faster than Android K; and source (3) states that JavaScript is in fact the fastest, even faster than Java.

So why this mess? As a (somewhat) compiler writer, I understand the following: all compilers have tradeoffs, and sometimes (as in preview builds) debug information is enabled to get better diagnostics. Also, when you compare, you have to make sure you compare the same thing. That is the item that matters most.

Investigating the claims

So, is JavaScript faster or slower than Java? For people following this blog: I can say that CR is faster than .Net, but I have to add some qualifiers:
- CR doesn't handle exceptions at all, including bounds checking and NPEs, so if you have a lot of iterations you can get really awesome generated code, but it compares minimalist "C-like" code with .Net's full code
- CR anyway uses the C++ allocator, which is roughly an order of magnitude slower than .Net's GC, so if you allocate a lot of objects .Net can be faster even though .Net has exceptions (the same story with Mono). CR has optimizations to help here (mostly escape analysis, but users have to write code in a compiler-friendly way).

Similarly: JavaScript as of today is not faster than Java, unless you write Asm.js (where I would expect Asm.js to be slightly faster in very few cases, at the same abstraction level as Java); overall you should expect C/C++ code to run at least 50% faster than JavaScript with Asm.js, and even more without it (like 2-3x faster). This is on a desktop-class CPU only; on tablets/phones the difference is wider, because the CPUs are weaker, so the VMs there do fewer optimizations. Here is the source I am using, which is known for its impartiality.

Java most of the time (in low-level math) with hot loops reaches around 90% of C++ performance. In my experience, CodeRefractor was a bit faster (11%) in a very tight loop with no special hand-tuning of the code (but still tuning the compiler flags as much as possible), so I can confirm this as a consistent experience.

How can it still happen that Java is slower than JavaScript in the 3rd source? Simple: Dalvik is not a Java VM (in the strict sense), just as CodeRefractor is not .Net (or Mono). Dalvik made many design choices that are strange in the JVM space: it uses a register-based VM (like CodeRefractor), which means its interpreter is faster (even with no compilation you should expect around 60% faster interpretation - there are scientific papers finding that register-based VMs interpret faster than stack-based ones; the Google team claimed around 2x faster than the JVM, which is consistent, if a bit on the high side, with the papers), and it uses a JIT (dynamic compilation) just for small subsections of the program. I would expect the 2x figure also because Google documented that they pre-optimize the bytecode: the interpreter runs not the bytecode you shipped to clients, but a slightly simplified Dalvik bytecode.

The JavaScript VM on Android uses a different technique: it is the V8 JavaScript virtual machine, which uses a quick "direct to native" JIT and then a more advanced compiler for the hot code. This means all JS code (even though JS is intrinsically slower) gets compiled.

For the sake of correctness, V8 was not always faster than the Dalvik JIT (Android's "Java" implementation), but Dalvik's JIT was not improved after 2010; most improvements were GC-related, plus the fact that Android applications now use the GPU, so they use much less CPU to draw the screen, making applications feel faster without Dalvik becoming more CPU-efficient. Do you remember how I started this post? When applications need to feel faster, the best way is to improve all the components, not the CPU only: this is what the Google teams did, and I think it was the most sensible thing to do.

As a conclusion: Android L also introduces ART, an AOT compilation technique ("similar" to CR), meaning that (most of the) compilation happens at install time, making compilation at least program-wide, and I would expect that as the Google team improves the ART runtime it will very likely become faster than V8 (if it is not already).

So, how do you spot bad benchmarks?
Any benchmark not made for the domain of the user (typically your application) you should consider upfront a bad benchmark - even more so when it compares "runtimes" or VMs of different classes.

This is not to say that benchmarks are not useful, but you can get tricked so easily (and even experts can make mistakes - see the amazing talk of Gil Tene about GC latency and "almost"!) into asking your compiler vendor (or JS vendor) to optimize for SunSpider, instead of optimizing for your (web) application (like Facebook, GMail, or whatever you really use on the web).

Soon I will likely announce my next (not CR-related) FOSS project, and it will be in the world of Java. The reason is that I have understood why Java can give so much performance if written properly, plus the multitude of tools - but this is for another post. Also, I want to make clear that CR will likely not be updated for a while (unless someone is interested in some feature and needs assistance).

Sources:
1. http://www.reddit.com/r/Android/comments/2987ny/just_finished_doing_444_dalvik_444_art_and_l/
2. http://www.cnet.de/88132773/android-l-und-android-4-4-4-im-benchmark-test/
3. http://www.stefankrause.net/wp/?p=144

Monday, August 18, 2014

Some future plans

A few great contributions arrived in CR, mostly making the compiler work with more string operations (the single limitation being that they ignore the CultureInfo part); other than this, it is done.

Also, even though it is buggy, it is feature-wise the most complete CR version ever. You have supported cases of virtual methods, a much cleaner architecture that is incremental, and optimizations at various scopes, from block level to multi-function dead-code elimination.

So what I would expect to be done in the future:
- expect no future improvements other than bug fixes, unless someone has time to maintain the project
- I will refactor it slowly to support other runtimes (an old wish of mine), especially cases where CR can cover simple use-cases for your particular scenario. Let's say you want to write a runtime so CR can work with embedded C++ compilers, or you want to define a new class special to your use-case
- I will support people who want to contribute, and I want to take this occasion to thank the people who contributed already: Ronald Adonyo and Bogdan Mustiata, two great developers. Ronald made testing of the compiler much easier and more transparent and made fixes all around the runtime; Bogdan added support for most string operations.

Saturday, July 26, 2014

Why you cannot beat the compiler...

In the last weeks there was a big rewrite inside CodeRefractor. The main change is that we now build, in a multi-pass algorithm, the tree of code reachable from the Main method. The changes propagated from the frontend to the depths of the virtual-methods backend (where the code is generated) and simplified it.

using System;
class A
{
    public void F() { Console.WriteLine("A.F"); }
    public virtual void G() { Console.WriteLine("A.G"); }
}
class B : A
{
    public void F() { Console.WriteLine("B.F"); }
    public override void G() { Console.WriteLine("B.G"); }
}
class Test
{
    static void Main()
    {
        B b = new B();
        A a = b;
        a.F();
        a.G();
        b.G();
    }
}

Let's start with this code. What is CR's output for the Main method (C++ code)?

System_Void _Test_Main()
{
    _B_F();
    _B_G();
    _B_G();
    return;
}
 
CR doesn't have full inlining (yet!) but it has analysis steps that check whether variables are used across method boundaries, and the most important part: it does this safely!

Many people may say that this code is contrived and unrealistic, and I have to agree; it also takes a "lucky case" where many things line up:
- the methods devirtualize correctly
- the members of the called methods are not used
- the instance of B has no members that are used, so the compiler can remove it safely

But on the other hand, this is also the strength of today's compilers: one month ago CR would not have been able to produce this code, but today it can. So all you have to do to get a bit better performance is update the compiler (or wait for the next release) and voila, you're good to go. Even more: whenever in the future the compiler is able to devirtualize, it will devirtualize, across the whole program, and the most important bit: with no work on your side!

So, every time you write code that targets a compiler (any compiler, not only CR), if the compiler project is improving, all you have to do is wait, or report bugs if the compiler doesn't compile your code, or update your code minimally to avoid the compiler's issues, and the code will improve by itself.