Monday, April 27, 2015

Can RyuJIT beat Java -server?

The very short answer is always: depends.

RyuJIT is the new compiler architecture to compile .Net. It is supposedly branched out of „x86” branch of .Net and is modernized. And there are benchmarks and the startup performance got better, but did the throughput improved to beat Java?

The good part is that this month Microsoft opensourced many parts of the .Net stack as part of CoreCLR stack, and one of them is also the RyuJIT part of it. So we can look inside it. The code can be found here.
First of all, what RyuJIT seems to do is to give a fairly lightweight as high level optimizations which I think they are the minimal optimization set on Debug configuration:
- it builds SSA form (a form that improves precision of the compiler to remove aggressively data)
- it does fairly aggressive dead code elimination (based on liveness)
- it does Linear Scan Register Allocation

More optimizations can be enabled, which they mostly consist into common subexpression elimination, bounds check eliminations, a more aggressive dead code elimination (global liveness analysis).


Initially I was really surprised on how few optimizations look to be available inside RyuJIT, and looking a bit more into the code, some new gems appear, like it looks that there are in special a aggressive inlining and ”loop cloning” (which if I understood the code right, should make a loop to 1000 to be in fact split in 250 loops of 4 times repeating the iteration). This optimization I think is also important as RyuJIT supports SIMD intrinsics, so it can make a CPU specialized code.

Of course these optimizations all help and if you profile and tweak code, your code will be good enough, but still, it can beat Java -server?

At this stage, the answer is clearly no. In fact, if you write your code using Firefox's JIT for JavaScript, this optimizer has more advanced optimizations exposed, like Global Variable Numbering (GVN) and a better register allocator. I would not be surprised if you write "use asm" and this code to run much faster on Firefox JIT.

There are two items why RyuJIT should not run faster than Java and they are:
- it doesn't have many and more complex high level optimizations (I didn't even find Loop-invariant-code-motion, an optimization that even CodeRefractor has). Of course adding them will slow down the JIT time
- as RyuJIT will likely inline small functions/properties and duplicate parts of loops, will increase CPU registers (especially on x86) pressure and the LSRA allocator gives a fairly good performance, but is 10-20% slower than the full register allocator (used by Java server, still is the same with the warmup Java client register allocator)

Where RyuJIT can work faster is to allocate on stack faster than Java does, but eventually the code will get into tight loops and this code will run slower than Java by around 20%, if you don't make the mistake of making a hidden allocation on Java side. Also Dictionary<K,T> in .Net is much CPU cache friendly so if you run big dictionaries and you don't use Java optimized dictionaries like Google's Guava but the default JDK libraries, you will also run slower (even 10x slower), but why not use Guava, you will also have slowdowns for the wrong reasons.

At last, there is an area that even Java can generate 20% faster code, that you don't allocate memory in your tight loop, at last Java can still run slower, and this is when you call native libraries. This is not Java's JIT fault, but is simply that the .Net's mapping to "metal" is much clearer, including in-memory layout, automatic 1:1 marshaling (that is done just with one big memory copy of an array of structures for example) which is simply done better.

One last note about JIT and SIMD: Java doesn't have intrinsics because it does automatically rewrite your code to use SIMD and use proper instructions automatically. This in my mind is the best way to do stuff, so Java can run times faster just because a loop is vectorized, but certainly you have to write your loop SIMD-friendly. This is very similar with autovectorization promised in Visual Studio 2012.

No comments:

Post a Comment