Monday, October 13, 2014

Java Optimizer - a primer

As I've posted the previous blog entry about Java running fast (at least for a mini benchmark: to rotate a picture) a great feedback was appearing in comments from user named V.

He exposed one problem of my methodology, the fact that GCC behaves better on Linux: he gets like 151 ms on Windows (also my testing platform) and a great time of 114 ms (on 32 bit) on Linux. I have to say: this is great news (at least for GCC users on Linux)! Java gave a runtime of executing the same task like 130 ms on his machine (regardless on the OS).

So, as a introduction: why Java gets so good numbers even it compiles the code every time it reads the "bytecode" file and it "interprets" all the code?

Java in a very simpler form is having this diagram:

Don't worry of so many steps but the main thing to know is that when you "execute" Java, in fact there is a program, named Java Virtual Machine which reads all the instructions your program have (named Java bytecode) and is packed typically as a Jar file (or many .class files).

In the upper diagram is also shown some part which in fact doesn't run anything but makes sure your program is correctly formatted.

So what happens when you try to write something like Hello world in Java?
1. you write your,
2. you compile it using javac as a result will get one .class file (which you can pack as .jar if you want) which will describe your original program as stack operations (like instead of a = 2+3, will be rewritten as: "push int 2" "push int 3" "add int" "loadint into "a" "
3. you say to java.exe (on Windows, will be just java on Linux/OS X) to "run" your .class file
4. The java.exe will validate first if your .class is well formatted
5. Java will interpret (meaning will run all the .class instructions one by one) up to a threshold is met (typically is 10.000 "loop edges" meaning 10.000 times when the same sequence of instructions is interpreted)
6. if your hot code touches this threshold will compile it using something named like C1 and C2 compilers (also known as client and server compilers). The main differences (which I found presented publicly) between C1 and C2 is that C2 makes more aggressive inlining and C2 also uses a slower technique to allocate registers. On 64 bit Java, C2 compiler is enabled by default, and from Java 8 the "tiered mode" is enabled by default (for me is not clear yet if is for both 32 or 64 bit or just 32 bit, I never tested it) which it means that some "hottier" code is compiled with C1 then is compiled with C2.

So the "tricks" to get fast code in Java are:
- make as many repetitions as possible of hot code: don't run many branches that they get "warmy" but not "hot enough". They will not compile.
- write the code that doesn't do useless stuff in that loop
- if you need really better performance, use C2 (-server) or Tiered mode. They are enabled both by default on JVM from Java 8, so just updating your JVM you get good defaults on performance
- don't execute a lot of method dispatches (also known as virtual calls): try to make your hot loop to use just one class: C2 will likely inline these calls correctly. Sometimes C2 inlines them speculatively, so you can get great performance even you don't care about virtual calls.

Some big differences that Java get (which can show even more speedup than C++ can get) is "over the object file boundaries" meaning that in some cases C/C++ code runs slower because C++ cannot inline from one object file to another (right now is possible to some extent) but certainly is impossible to inline a "hot code" from an Dll/libSO. Even more important, is that the case where Java keeps at runtime always the list of loaded classes, and even you use virtual methods everywhere, if you don't have any class that overrides your class, the calls will be always dispatched as early binded (not-virtual) calls.

One very interesting talk which discusses the optimizations that Java does, is this Gil Tene's one: is a great way to understand what most optimizing compilers do, so is a great talk even you are using C++ (at least the first 1/2 half is showing typical optimizations that CR or .Net or any C++ compiler for the last 10 years will do for you!).

At last, but not at least, Java has some performance pitfalls, some which happen to be also C++ pitfalls:
- CPU stalls: when your data does not fit nicely into CPU cache your code will run typically 10 times slower (or even more for lower memory bandwidth CPUs like ARMs). Yes, you read correctly: not 10% slower, but 10 times slower. Making your algorithm to read and write into consecutive addresses in memory will run much faster. So when you write your program try to think like the hottest of the hottest code to keep data into 256 KB of data both for reads and writes. Java makes it a bit faster as it doesn't have the equivalent of "value type" so an array of 3D coordinates is in fact an array of individual pointers to separate memory structure. For this reason, I would recommend to pack coordinates into arrays and have an index as a separate class to do your processing. But look into this problem: this can hurt your C++ performance also!
- Java has a startup lag and you still have to wait for Java to warm up: this means that is not ideal for games, but there are tricks without tricking Java to run well enough for games: start your game engine as the first thing. This is in my knowledge what Minecraft does (I don't play it, sorry :( ). This let JVM to learn for the first seconds what is your code does and gives a chance to compile most of what your game will do
- some Java APIs (mostly web services) tend to be "fat" meaning you have to process XML or JSon to read your settings, or to communicate or things like this. I recommend to use faster replacements for calling remote data (without optimizing yourself the dataset). Instead of sending raw Xml (or even saving it on disk), use OpenEXI and to send data over the wire, if you have a more complex data-set, I found KryoNet to run very fast. This is also true for C++ coding, but is a more common problem in the "Java mindset" as far as I've noticed.
- Java gives to you mostly the same performance than a fairly good C++ code (maybe a bit slower on Linux but not as much as you should worry, as V showed to me) but you have to improve the other slow parts: don't be smart on writing your own threading solution (is fairly common in C++ world to rewrite everything) but use as much as possible existing solutions: JavaFx has accelerated graphics for UI for example and is excellent for most UIs. If you find it to be slow as it parses a large XML for UI, create just the small UI parts which are expensive by hand, but don't go over-board! I found for example in jHeroes2 that the original game file named Heroes2.agg is like 40 MB, but is it faster to read it, compress every individual item (this big file is a collection of small files), make it like: Heroes2.opt (of 20 MB file) and reading+unpacking is faster (at least on my machine) than reading the 40 MB. This is also because there is such of a huge difference between CPUs and disk speeds.
- The last but not least is the GC: Java's allocator is much faster than C/C++ heap allocator, but is called for almost every object. C/C++ and C# can allocate them on stack (so they "free" on exit of the function automatically) So try that most objects you allocate to not keep trivial data, but also don't keep them to long in memory. The "long pause" (named full GC) in Java is always proportional of how much data you have at the moment the GC triggered. If you find that you have to many pauses, try to double/triple the memory of your application and the GC should call at most half of the times, or use G1 an adaptive memory collector. If you run games, I would expect you will not allocate (excluding very trivial amount of very short-lived data as key strokes or, so the full GC is never called.

Hope this blog entry was useful and I hope (as I may not add many things to CR in the close future) to add some other items like this. Would you find (the reader) useful to see some Youtube entries about "how to" programming tutorials? Would you follow them?


  1. Thank your for the article!

    Never thought Java has 2 compilers. Very smart indeed.

    Java, as .Net, suffers from GC pressure? The same techniques that reduce the problem in .Net can be applied? (like reducing the object creation/deletion)

    Does Java have an AoT compiler? This would be able to improve the startup performance, I think...

    1. Java as default implementation (Oracle's or OpenJDK == reference implementation) suffers more than .Net of memory pressure, yet it has the advantage of more GC algorithms (which are in a way more mature than .Net's ones), but it doesn't support "struct type" (heterogenous data that can be stored into array without polluting all heap with references).

      As for AOT compiler: there are commercial (Excelsior's JET) and bad opensource ones (GCJ). As for me, I think that this mixed mode implementation is good enough, yet there are other ways (one of them supposedly being Java's 9 Jigsaw way) that Java's startup can be improved. As for me, I would say that Java's performance is good enough even on startup. And I'm mostly working with my dual-core 4 years old laptop (and it will only improve).

      The tricky part is that is certainly slower, so "if it doesn't feel fast, it isn't fast" which Java suffers. If your Java game processes huge XMLs and comsumes a lot of memory, the impression is that: "what a slow application".

      Also, maybe you know this mantra already: but memory is really cheap. I have 3 GB free of memory right now, and most of memory is used by a browser (which is mostly JS Heap, a lot of tab's pictures, and such).

      I can say that 2 GB memory system with Linux can run most Java applications properly (even the rotation of the 4K picture) and will not see big benefits by switching all code to C++ (if original code was written using the same C++ paradigms: read binary, not Xml).