Don't trust (VM) benchmarks and how to spot bad benchmarks

Introduction

Around the year 2000 computing changed drastically, and even more so with the start of the multi-core era around 2005 and after. Right now we are in another computing "revolution". And every time, you will see new benchmarks where one generation is 25% faster than the previous one (like the A8 versus the A7) or 100% faster (like the A7, which is "up to" 100% faster than the A6), and so on!

But in real life there are many situations where benchmarks are critical to show whether users can have a good experience:
- you want your image effect to finish fast. This matters even more when there is a movie to process, since a movie is a huge collection of pictures; you care about the language used and about how long it takes for your animation to finish
- you want all cores to be used, especially now that 4 cores (or at least 2 cores with hyperthreading) look to be the norm, so you want all of your hardware to be put to work
- you want to save battery for your mobile devices
- and many other reasons: when things go fast, you feel happier than when things go slow

What you should care about

But while you as a user may want all of these, sometimes you miss the mark. I simply mean that when you do your tasks, you may be looking in the wrong direction:
- the disk is many, many times slower than memory. Say a friend has just sent you a picture on Skype and asks you to apply sepia tones to it as fast as possible. Starting Photoshop (or Gimp) can literally take many seconds, if not a full minute (if you have many plugins loading on startup). Paint.Net will start faster, you can get to the sepia effect faster, and in the end you can send the picture back sooner. This is not to say that Photoshop (or any other program, for that matter) could not finish faster if you had 100 pictures to load, but often starting a smaller program gets your task done sooner
- on a similar note: do you have enough RAM (system memory)? When your system feels slow, very often it is one of two factors: you have too little memory (and memory is really one of the most disproportionately cheap components) or your CPU is really outdated (say you want to process big pictures with a Pentium 4 1.7 GHz from 2001). But even if the second case is what makes your system slow, it is still likely that adding memory will make it faster and keep it usable even 10 years later!
- if your application doesn't care much about RAM and disk, it is (very) probably a game, so the next question is: do you have the best video card that fits your budget? It can be an integrated video card, but it should be the fastest you can afford. Very few games are CPU limited. Even if the CPU is too slow and you get just 30 frames per second (so you can "feel" gaming lag), you can increase the game detail and lose very little performance as long as your video card has speed to spare: everything will drop to 28 FPS or so. So nothing to lose here!
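The first point (startup time against per-picture speed) can be put into numbers. This is only a sketch: the startup and per-picture timings below are made-up values for illustration, not measurements of Photoshop, Gimp or Paint.Net, and the function names are hypothetical.

```python
import math


def total_seconds(startup_s, per_picture_s, pictures):
    """Wall-clock time to start a program and then process `pictures` images."""
    return startup_s + per_picture_s * pictures


def break_even_pictures(heavy, light):
    """Smallest picture count at which the heavy program finishes first.

    `heavy` and `light` are (startup_s, per_picture_s) tuples.
    Returns None when the heavy program is not faster per picture,
    because then it can never catch up through volume.
    """
    heavy_startup, heavy_per = heavy
    light_startup, light_per = light
    if heavy_per >= light_per:
        return None
    # Solve heavy_startup + n * heavy_per < light_startup + n * light_per for n.
    n = (heavy_startup - light_startup) / (light_per - heavy_per)
    return max(0, math.floor(n) + 1)


# Assumed numbers: a heavy editor that starts in 40 s and needs 0.5 s per
# picture, against a light editor that starts in 3 s and needs 2 s per picture.
HEAVY = (40.0, 0.5)
LIGHT = (3.0, 2.0)
```

With these assumed numbers the light program wins for a single picture (5 s against 40.5 s), and the heavy one only pulls ahead from 25 pictures onward.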

What the Internet says

OK, that being said, I have lately noticed some very bad benchmarks on the Internet, and they are basically made to show shocking numbers: for example that Android L will run slower than Android 4.4, or the reverse (these are Google's numbers) in the attached picture, or the latest of them, a bit old, but with a big surprise.

What is funny about these 3 sources are basically their conclusions:

From source (1) you will "understand" that performance decreases slightly between Android K and L, although the criticism is that the test is basically JavaScript code. Source (2) states that this is not true and that Android L runs a bit faster than Android K. And source (3) states that JavaScript is in fact the fastest, even faster than Java.

So why this mess? As a (somewhat) compiler writer, I understand the following: all compilers have tradeoffs, and sometimes (as in preview builds) debug information is enabled to get better diagnostics. Also, when you compare, you have to make sure you compare the same thing. That last item is the one that matters.

Investigating the claims

So, is JavaScript faster or slower than Java? For people following this blog, I can say that CR (CodeRefractor) is faster than .Net, but I have to add some qualifiers:
- CR doesn't handle exceptions at all, including bounds checking and NPEs (null pointer exceptions), so if you have a lot of iterations you can get really awesome generated code, but this compares minimalist "C-like" code with .Net's full code
- CR, however, uses the C++ allocator, which is roughly an order of magnitude slower than .Net's GC, so if you allocate a lot of objects .Net can be faster even though .Net has exceptions (the same story holds for Mono). CR has optimizations to help with this (mostly escape analysis, but users have to write code in a compiler-friendly way).

Similarly: JavaScript as of today is not faster than Java, unless you write asm.js (in which case I would expect asm.js to be slightly faster in a very few cases, at the same abstraction level as Java). Overall you should expect C/C++ code to run at least 50% faster than JavaScript with asm.js, and faster still than JavaScript without it (like 2-3x faster). This holds on a desktop-class CPU only; on tablets and phones the difference is wider, because the CPUs are weaker and so the VMs there do fewer optimizations. Here is the source I am using, which is known for its impartiality.

Java, most of the time (in low-level math with hot loops), reaches about 90% of C++ performance. In my experience, CodeRefractor was a bit faster (11%) in a very tight loop with no special hand-tuning of the code (but still tuning the compiler flags as much as possible), so I can confirm this as a consistent experience.

How can it still happen that Java is slower than JavaScript in the 3rd source? Simple: Dalvik is not a Java VM (in the strict sense), just as CodeRefractor is not .Net (or Mono). Dalvik makes many design choices which are strange in the JVM space. It is a register-based VM (like CodeRefractor), which means that its interpreter is faster: even with no compilation you should expect around 60% faster interpretation. There are scientific papers which found that interpreters for register-based VMs run faster than those for stack-based VMs, and the Google team claimed that theirs is around 2x faster than the JVM's, which is consistent with the scientific papers, even if a bit on the high side. Dalvik also uses a JIT (dynamic compilation) just for small subsections of the program. I would expect that they reach the 2x figure because Google documented that the bytecode is pre-optimized: the interpreter does not run the bytecode you shipped to clients, but a slightly simplified Dalvik bytecode.
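To make the register-versus-stack point concrete, here is a toy sketch that evaluates result = a + b * c on two tiny interpreters. Both instruction sets are invented for this illustration and are neither JVM nor Dalvik bytecode. The only thing shown is that the register form needs fewer instruction dispatches for the same work, which is one of the usual explanations for a faster interpreter.

```python
def run_stack(program, variables):
    """Interpret a stack-based program; return (variables, dispatch count)."""
    env = dict(variables)
    stack = []
    dispatched = 0
    for instr in program:
        dispatched += 1  # one trip through the interpreter loop per instruction
        op = instr[0]
        if op == "load":
            stack.append(env[instr[1]])
        elif op == "store":
            env[instr[1]] = stack.pop()
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError("unknown stack instruction: %r" % (instr,))
    return env, dispatched


def run_register(program, variables):
    """Interpret a register-based program; return (variables, dispatch count)."""
    env = dict(variables)
    dispatched = 0
    for op, dest, src1, src2 in program:
        dispatched += 1  # operands are named in the instruction itself
        if op == "add":
            env[dest] = env[src1] + env[src2]
        elif op == "mul":
            env[dest] = env[src1] * env[src2]
        else:
            raise ValueError("unknown register instruction: %r" % (op,))
    return env, dispatched


# result = a + b * c, written for each toy machine
STACK_PROGRAM = [
    ("load", "a"), ("load", "b"), ("load", "c"),
    ("mul",), ("add",), ("store", "result"),
]
REGISTER_PROGRAM = [
    ("mul", "t", "b", "c"),
    ("add", "result", "a", "t"),
]
```

For a=2, b=3, c=4 both versions compute 14, with 6 dispatches on the stack machine and 2 on the register machine. Real register instructions are larger to decode, so the dispatch count alone does not tell you the final speedup.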

The JavaScript VM on Android uses a different technique: it is the V8 JavaScript virtual machine, which first uses a quick "direct to native" JIT and after that a more advanced compiler for the hot code. This means that all JS code gets compiled (even though JS is intrinsically slower).
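The "quick compile first, better compile for hot code" idea can be sketched with a call counter. This is only an illustration of the tiering principle, not how V8 decides what is hot: the class name, the threshold and the two example functions are all hypothetical, and both "tiers" here are ordinary Python functions rather than generated machine code.

```python
class TieredFunction:
    """Run a cheap baseline version first, switch to an optimized one when hot."""

    def __init__(self, baseline, optimized, hot_threshold=1000):
        self.baseline = baseline
        self.optimized = optimized
        self.hot_threshold = hot_threshold  # assumed value, not a real VM setting
        self.calls = 0
        self.tier = "baseline"

    def __call__(self, *args):
        self.calls += 1
        # Promote once the function has been called more often than the threshold.
        if self.tier == "baseline" and self.calls > self.hot_threshold:
            self.tier = "optimized"
        impl = self.optimized if self.tier == "optimized" else self.baseline
        return impl(*args)


def sum_to_baseline(n):
    """Straightforward loop: stands in for quickly generated, unoptimized code."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_optimized(n):
    """Closed form: stands in for what an optimizing compiler might produce."""
    return n * (n + 1) // 2


sum_to = TieredFunction(sum_to_baseline, sum_to_optimized, hot_threshold=3)
```

Calling sum_to(10) five times returns 55 every time; the first three calls run the baseline version and the later ones run the optimized version, so only code that is actually hot pays for the expensive compilation.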

For the sake of correctness, V8 was not always faster than the JIT of Dalvik, Android's "Java" implementation. But Dalvik has not been improved JIT-wise since 2010. Most of the improvements were GC related, or made Android applications use the GPU so that they need much less CPU to draw the screen, making applications feel faster without Dalvik becoming more CPU efficient. Do you remember how I started the post? When applications need to feel faster, the best way is to improve all the components, not the CPU only. This is what the Google teams did, and I think it was the most sensible thing to do.

As a conclusion, Android L also introduces ART, an AOT (ahead-of-time) compilation technique ("similar" to CR), meaning that (most of) the compilation happens at install time. This at least makes compilation program-wide, and I would expect that as the Google team improves the ART runtime, it will very likely become faster than V8 (if it is not already).

So, how to spot bad benchmarks?

Every benchmark that is not made for the domain of the user (typically your application) should be treated upfront as a bad benchmark. Even more so when it compares "runtimes" or VMs of different classes.

This is not to say that there are no useful benchmarks, but you can get tricked so easily (and even experts can make mistakes, as in this amazing talk by Gil Tene about GC latency and "almost"!) into asking your compiler vendor (or JS vendor) to optimize for SunSpider instead of optimizing for your (web) application (like Facebook, GMail, or whatever you really use on the web).
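One way to act on this is to time the workload you actually care about instead of reading someone else's synthetic score. The harness below is a minimal sketch: the function names are hypothetical, the workload is only a stand-in for "your application", and the weights in it are the commonly quoted sepia weights for the red channel, used here just to give the loop something to do.

```python
import statistics
import time


def benchmark(workload, warmup_runs=3, timed_runs=10):
    """Time `workload` after a warm-up phase and report robust statistics."""
    # Warm-up runs are discarded so caches (and, on a JIT-based VM,
    # compiled code) are in a steady state before measuring.
    for _ in range(warmup_runs):
        workload()
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return {
        "runs": len(samples),
        "median_s": statistics.median(samples),  # less sensitive to outliers than the mean
        "min_s": min(samples),
        "max_s": max(samples),
    }


def sepia_like_workload():
    """Stand-in for a domain-specific task: per-pixel arithmetic on fake pixels."""
    pixels = [(v, (v * 7) % 256, (v * 13) % 256) for v in range(256)]
    return [min(255, int(0.393 * r + 0.769 * g + 0.189 * b)) for r, g, b in pixels]


if __name__ == "__main__":
    print(benchmark(sepia_like_workload))
```

The same harness has to be used, with the same input and the same number of runs, for every runtime you compare; otherwise you are back to comparing different things.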

Soon I will likely announce my next (not CR related) FOSS project, and it will be in the world of Java. The reason is that I understood why Java can give so much performance if written properly, along with its multitude of tools, but this is for another post. Also, I want to make clear that CR will likely not be updated for a while (unless someone is interested in some feature and needs assistance).

Sources:
1. http://www.reddit.com/r/Android/comments/2987ny/just_finished_doing_444_dalvik_444_art_and_l/
2. http://www.cnet.de/88132773/android-l-und-android-4-4-4-im-benchmark-test/
3. http://www.stefankrause.net/wp/?p=144

Code Refractor - Virtual Machines/Compiler performance musings

Tuesday, October 7, 2014
