Monday, April 27, 2015

Can RyuJIT beat Java -server?

The very short answer is, as always: it depends.

RyuJIT is the new JIT compiler architecture for .Net. It supposedly branched out of the x86 branch of .Net and was modernized. There are benchmarks showing that startup performance got better, but did throughput improve enough to beat Java?

The good part is that this month Microsoft open-sourced many parts of the .Net stack as CoreCLR, and one of them is RyuJIT itself. So we can look inside it. The code can be found here.
First of all, RyuJIT seems to apply a fairly lightweight set of high-level optimizations, which I think is the minimal optimization set used for the Debug configuration:
- it builds SSA form (a representation that improves the compiler's precision in aggressively removing dead data)
- it does fairly aggressive dead code elimination (based on liveness)
- it does Linear Scan Register Allocation (LSRA)
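
To make this concrete, here is a minimal C# sketch (my own illustration, not code from the RyuJIT repository) of what SSA form plus liveness-based dead code elimination buys:

```csharp
static int Compute(int x)
{
    int unused = x * 31;  // assigned but never read: liveness analysis
                          // marks it dead, so the multiply is removed
    int y = x + 1;        // in SSA this becomes y1 = x0 + 1, a single
                          // definition, which makes the analysis precise
    y = y * 2;            // ...and this becomes y2 = y1 * 2
    return y;             // only the chain feeding the return survives
}
```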

More optimizations can be enabled; they mostly consist of common subexpression elimination, bounds check elimination, and a more aggressive dead code elimination (global liveness analysis).
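
As an illustration of the first two (again my own sketch, not code from the repository):

```csharp
static int SumPlusCse(int[] a, int b, int c)
{
    int sum = 0;
    // The JIT can prove i always stays within [0, a.Length), so the
    // per-access array bounds check can be eliminated.
    for (int i = 0; i < a.Length; i++)
        sum += a[i];

    // 'b * c' is a common subexpression: CSE computes it once and
    // reuses the result instead of multiplying twice.
    int d = b * c + 1;
    int e = b * c - 1;
    return sum + d + e;
}
```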


Initially I was really surprised by how few optimizations seem to be available inside RyuJIT, but looking a bit more into the code, some gems appear: in particular there is aggressive inlining and "loop cloning" (which, if I understood the code right, should split a loop of 1000 iterations into 250 iterations that each repeat the body 4 times). This optimization I think is also important because RyuJIT supports SIMD intrinsics, so it can generate CPU-specialized code.
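
RyuJIT's SIMD support surfaces through the System.Numerics.Vectors types, which the JIT maps to SSE/AVX instructions for the actual host CPU. A hedged sketch of what a vectorized loop looks like (assuming the System.Numerics.Vectors package is referenced):

```csharp
using System.Numerics;

static void AddArrays(float[] a, float[] b, float[] result)
{
    int width = Vector<float>.Count;   // 4 with SSE, 8 with AVX: the JIT
                                       // specializes this for the host CPU
    int i = 0;
    for (; i <= a.Length - width; i += width)
    {
        var va = new Vector<float>(a, i);
        var vb = new Vector<float>(b, i);
        (va + vb).CopyTo(result, i);   // one SIMD add per 'width' elements
    }
    for (; i < a.Length; i++)          // scalar tail for leftover elements
        result[i] = a[i] + b[i];
}
```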

Of course these optimizations all help, and if you profile and tweak your code it will be good enough, but still: can it beat Java -server?

At this stage, the answer is clearly no. In fact, Firefox's JavaScript JIT has more advanced optimizations exposed, like Global Value Numbering (GVN) and a better register allocator. I would not be surprised if code written with "use asm" ran much faster on Firefox's JIT.

There are two reasons why RyuJIT should not run faster than Java:
- it doesn't have as many, or as complex, high-level optimizations (I didn't even find loop-invariant code motion, an optimization that even CodeRefractor has; see the sketch after this list). Of course, adding them would slow down JIT time
- as RyuJIT will likely inline small functions/properties and duplicate parts of loops, it will increase register pressure (especially on x86); the LSRA allocator gives fairly good performance, but is 10-20% slower than a full register allocator (as used by Java -server; it is still on par with the warmup-oriented Java client register allocator)
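
To show what loop-invariant code motion means (my own sketch; a JIT with LICM would do the equivalent hoisting automatically, here it is done by hand):

```csharp
static double ScaleSum(double[] values, double factor)
{
    double sum = 0;
    // Without LICM, Math.Sqrt(factor) would be recomputed on every
    // iteration even though it never changes inside the loop; LICM
    // hoists it out, as done manually here.
    double invariant = Math.Sqrt(factor);
    for (int i = 0; i < values.Length; i++)
        sum += values[i] * invariant;
    return sum;
}
```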

Where RyuJIT can work faster is stack allocation, which it does better than Java, but eventually the code will get into tight loops, where it will run around 20% slower than Java, assuming you don't make the mistake of a hidden allocation on the Java side. Also, Dictionary<K,T> in .Net is much more CPU-cache friendly, so if you use big dictionaries with the default JDK libraries instead of optimized Java dictionaries like Google's Guava, you will also run slower (even 10x slower); and if you don't use Guava, you will have slowdowns for the wrong reasons.
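
The stack allocation advantage comes from value types: a struct held in a local lives on the stack with no GC involvement, while Java (which has no value types as of Java 8) heap-allocates the equivalent object unless escape analysis saves you. A small sketch:

```csharp
struct Point   // value type: a local Point lives on the stack
{
    public double X, Y;
    public Point(double x, double y) { X = x; Y = y; }
}

static double SumDistances(int n)
{
    double total = 0;
    for (int i = 0; i < n; i++)
    {
        var p = new Point(i, i + 1);   // no heap allocation, no GC pressure;
                                       // a Java class here would be a heap
                                       // 'new' unless escape analysis kicks in
        total += Math.Sqrt(p.X * p.X + p.Y * p.Y);
    }
    return total;
}
```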

Lastly, even in the area where Java can generate 20% faster code (provided you don't allocate memory in your tight loop), Java can still run slower, and that is when you call native libraries. This is not the fault of Java's JIT; it is simply that .Net's mapping to the "metal" is much cleaner, including in-memory layout and automatic 1:1 marshaling (which can be done with just one big memory copy of an array of structures, for example), and this is simply done better.
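
In practice the 1:1 marshaling looks like this: a blittable struct array passed over P/Invoke needs no per-element conversion (the native library and function below are hypothetical, just to show the shape):

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]  // fixed, C-compatible memory layout
struct Particle
{
    public float X, Y, Z;
    public float Mass;
}

static class Native
{
    // Hypothetical native library and entry point, for illustration only.
    // Because Particle is blittable, the array crosses the boundary as one
    // block of memory (pinned, or copied with a single memcpy), not
    // element by element as a Java binding typically would.
    [DllImport("physics")]
    public static extern void step_simulation(Particle[] particles, int count);
}
```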

One last note about JIT and SIMD: Java doesn't have SIMD intrinsics because it automatically rewrites your code to use the proper SIMD instructions. To my mind this is the best way to do it, so Java can run several times faster just because a loop is vectorized, but you certainly have to write your loop in a SIMD-friendly way. This is very similar to the autovectorization promised in Visual Studio 2012.
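
What "SIMD-friendly" means is roughly the same whether HotSpot's autovectorizer or a hand-vectorized .Net loop is the consumer: a counted loop over contiguous arrays, a straight-line body, and no cross-iteration dependencies. A sketch of both shapes (C# here, but the same shapes apply to Java):

```csharp
// Vectorizer-friendly: counted loop, contiguous arrays, no branches in the
// body, and iteration i never reads a value written by iteration i - 1.
static void Saxpy(float[] y, float[] x, float a)
{
    for (int i = 0; i < y.Length; i++)
        y[i] = a * x[i] + y[i];
}

// Vectorizer-hostile: the loop-carried dependency on 'prev' forces the
// iterations to run in order, which defeats straightforward vectorization.
static void PrefixSum(float[] data)
{
    float prev = 0;
    for (int i = 0; i < data.Length; i++)
    {
        prev += data[i];
        data[i] = prev;
    }
}
```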

Friday, April 17, 2015

Rant (and offtopic): a 16-core AMD Zen?

No way! This piece of news on Fudzilla is silly at best.

At least not with everything combined. And this is not because I imagine a conspiracy or dislike AMD; in fact the last hardware I bought was AMD (a GPU, admittedly, but only because I haven't needed a new CPU for a long time).

Let's clarify why: there is no die area on a CPU, even with the dense libraries AMD uses for GPUs, to fit all of this. There are CPUs with very good transistor packing and a very similar specification to this future CPU, built on an even worse lithography (22 nm, compared with 14 nm FinFET in AMD's case), and that is a Xeon CPU.

But the worst part in my mind is that even if the specifications are within reach of the smaller transistors, the following points give me doubts that there will be a full 16-core CPU in 2016 (even in December, the supposed launch date):
- AMD has no experience with 16 cores; their previous designs were 8x2-core designs. Not that those are unimpressive, but AMD's tradition of late and underwhelming delivery (likely because it lost some key engineers when the company shrunk) makes me skeptical that they already have a good working prototype (as Zen is expected to launch a year from now, prototypes have to exist some time in advance; AMD Carrizo, for instance, had good CPUs sampling around 6 months ago and is still not launched)
- 14 nm FinFET is not necessarily that good compared with Intel's 14 nm, because some parts of the interconnect use a bigger process
- the design is an APU, and in general CPU caches and APUs require a lot of die real estate. You cannot scale a CPU infinitely in all directions, because heat, for instance, can break it really fast

Last but not least: who cares? Benchmarks and real numbers in applications are what matter. AMD Bulldozer is a great CPU, especially if you count the cores, yet the initial real-life delivery was bad, really bad. When Intel's Haswell CPUs launched, you could realistically assume that 2 AMD cores (one "module") ran about the same as 1 Intel core.

Even here on the blog, a desktop CPU (a 6-core AMD - or rather 3 modules of 2 cores each, see the comments) can run maybe a bit worse on 1 core, and will probably run very similarly to a dual-core laptop of a similar generation (first-gen i5M vs. an AMD 6150 BE desktop).

I am sure no AMD engineer is reading this, but it looks to me like the best architectures AMD has are probably Puma/Jaguar-based (which themselves, I think, are based on simplified Phenom cores) and run inside tablets, slow laptops, and consoles. They don't have HSA, but they do run damn well. If there were a low-power cluster of 2 x 8 Puma cores, I would likely want an APU like that: it would be starved on the memory front, but other than that, many CPU-intensive algorithms are CPU-cache friendly, so the CPU would do fine on those, and the non-CPU-intensive ones might run fine just because there are many cores to split the workload.

Good luck AMD!