Tuesday, September 24, 2013

NBody: can C# be fast?

A question blog readers often ask is: "how fast would my C# code be if it were written in C++?" And in many cases people come with their "numbers", like:
- Java JIT is faster than C (or C++)
- C# is faster than Java because it has structs
- C++ is faster because everything in this world is written in C++ for performance, for example games, etc.

Of course this blog comes with its personal spin, so if you have a strong opinion about who you expect to win, please ignore what is written next.

A benchmark I looked to optimize is NBody, because it has many operations common to a computing kernel:
- mathematical operations
- math intrinsics
- somewhat complex array accesses and iterations
- it is written in a cache-friendly way (the reads are mostly sequential), so many memory indirections would hurt its performance
- it doesn't depend on complex library functionality (like Regex would)

Given this, the NBody source code, which is hardcoded to 5,000,000 iterations, will run in:
(Update: there is a last-minute fix; the .Net times were measured with non-release builds, and the best .Net time is now 1550 ms.)

Runtime                      Time (ms)
.Net 4.5, 64-bit             1550
MinGW 4.7, 32-bit            2860
MinGW 4.7, 64-bit            2840
Win JDK 6 -server, 32-bit    1500
Linux JDK 7 -server, 64-bit  1444
Linux G++ 4.7, 64-bit (-O3)  1494
Linux G++ 4.7, 64-bit (PGO)  1378

Some people will notice that I didn't list MS VC++ (on Windows) or Clang on Linux. In fact I did test them, but maybe my setups were wrong: MS VC++ was slower than MinGW on Windows, and Clang++ was slower than GCC on my up-to-date Mint Linux, running in about 2 seconds. The point is mostly to compare the (best) managed against the (best) native compilers. Note also that the test is not C# against hand-written C++, but C# against a C++ translation of this C# code made by CodeRefractor, which can be interpreted in any way a reader wants.

So what I found in this testing:
- if you do low-level math, Java may save the day; it is by far the safest choice, it is very easy to set up, and the conversion from the C# code is straightforward
- if you know what you're doing, you can get much faster code by picking the proper OS/compiler in the C++ world. Even though I didn't use the Intel compiler, and for now I only have access to MS VC++ 2010, using MinGW and Linux can get you somewhat faster code (at least when the code is written fairly low-level but neutral on optimizations)
- MinGW gives virtually the same NBody performance on 32 and 64 bit (which was a big surprise for me), at least on this code. Maybe it is a bug in my setup; I picked the best time for this test, and in general the 64-bit times were sometimes slower
- using PGO, 64 bit, GCC, -O3 and Linux gave the best performance, at least on my machine.

For people who want to reproduce these tests, my machine is:
i5 540M, 2 cores @ 2.4 GHz
6 GB RAM
Windows 7 64-bit
GCC 4.7.2 under MinGW (part of the Dev-C++ distribution)
VS 2010
.Net 4.5
For Linux: Mint 15 (with updates) / GCC 4.7.3
The source is under this revision (of the output C++ file), which is the NBody benchmark as generated by CodeRefractor.
Best running GCC arguments:  -std=c++11 -ldl -O3 -flto -mtune=native -march=native -fprofile-use
where:
-O3 = optimization level 3,
-flto = link-time (whole-program) optimizer,
-mtune=native = tune for my machine (I think it may not matter),
-march=native = use my machine's instruction set (I think it may not matter),
-fprofile-use = use the PGO profiles collected in a previous run.
In fact, without mtune and march (on 64 bit) the performance is basically the same, but I list these parameters so that readers of this blog who try to reproduce the results get a similar level of performance.

Sunday, September 22, 2013

Status update: Part 7

New instructions are implemented, though somewhat partially:
- the ldftn instruction is implemented (a bit hackish for now), and it is a crucial part of making delegates work
- similarly, the ldelema instruction is implemented, which is very important for making unsafe code (with pointers) work

At the same time, I reorganized the code around the type system; as a result, triggering these instructions will generate code that (basically) works for them, but the program as a whole may fail because partial code gets included.

In time they will work, but they are not in a proper state for now.

From the experience of these implementations I found that CR needs a lot of small fixes and testing, and as I add support for more instructions (or bigger features), development will likely slow down because I will have to make sure nothing regresses.

If you have time as a user and want to invest it, try the following:
- take the sources from: https://github.com/ciplogic/CodeRefractor (there is a "Download ZIP" option if you don't know how to clone, or don't want to clone it)
- make a small program inside the SimpleAdditions solution
- you will likely get some compiler errors (if you do this right away)
- report them either here (on the blog) or on the GitHub issues page: https://github.com/ciplogic/CodeRefractor/issues

It is very important for me to know which (small) programs you're running, so I can focus testing on them. If you know C#, even if you don't understand how a compiler works, you can help this project in ways small or large.

I make this post asking for help because CR is not financially backed by any entity (the only thing backing it is my free time), so just contributing to it can turn it into a project useful for users (for example, a developer could write OpenGL games in C# and later recompile them for a platform with no C# support, or avoid paying for a Xamarin Studio license).

Tuesday, September 10, 2013

Status update: Part 6

This update is critical for the performance of math code, and it also brings fixes in multiple areas, so I recommend playing with the Git source on GitHub:

- the logic of mapped types (bootstrap types) is in better shape. Strings are now implemented directly in .Net;
- a simple Class Hierarchy Analysis with default devirtualization support is done (CR doesn't support real virtual calls at all for now, but if .Net reports a virtual call in a sequence and CR can detect the concrete type, it will remove the virtual call and make it a direct call). This is done per instance;
This makes this program work:

            var charData = new[] {'H','e','l','l','o',' ','w','o','r','l','d'};
            var s = new string(charData);
            Console.WriteLine(s);

Sure, not impressive, but it requires many parts to be in place;
- all objects are constructed using make_shared, which means a speedup in loops that allocate multiple objects
As more parts fall into place, I expect that in 2-3 months from now I will make a first release milestone (maybe for the New Year!?). I hope to have StringBuilder, String and the File class basically working for some common operations.

A mini-roadmap for the next release:
- StringBuilder.Append(String) will work
- Bug fixes will be directed toward higher levels of optimization for programs using StringBuilder

Thursday, August 29, 2013

Status update: Part 5

This is a bit earlier update but noteworthy:
- optimizations are rewritten using a Usages & Definitions structure (although it is not a full Use-Def chain), which makes the optimization step more precise and removes duplicated code
- the speed is on par with the simple optimizations that existed before
- boolean operations and all detected simple expressions are evaluated at compile time
- similarly, fixes were added for pure function evaluation (functions with mathematical expressions and simple operations), so calls to them with constants are evaluated

So, all the things that were planned were done ahead of time.

There are two areas I plan to work on in the longer term; one is a regression and one is a new feature:
- strings have not been working since I removed the backing C library support; I will need to work around some C specifics using CR's OpenRuntime design
- Delegates, and Delegates from PInvoke

If you find some areas that you think are more important to work at, write me privately at ciprian (dot) mustiata (at) gmail com.

There is still no release (no 0.01), and if you know how to make installers and have time to package a release, that would be great, as I would publish it. For now, fixing bugs is more important to me than a release (not to say that a release is not important, but as critical features are still missing, some people would complain about its state, so I don't think it is releasable, unless someone wants a release in order to contribute back).

Thursday, August 22, 2013

Status Update: Part 4

This month was hectic for me, but either way, you can see some updates in CR.

* first and foremost, there was a bug fix so that PInvoke calls with conventions other than StdCall work. This fix is critical for a working PInvoke. With PInvoke fixed, the other big missing part is: delegates, and a way to call a callback using a delegate (a fairly common pattern in wrapper libraries like OpenTK)
* a simple purity checker for functions was written, so if a function contains only simple math operations, you may expect that calls to it with constants will be evaluated at compile time
* a somewhat improved framework for optimization analysis was added; the generated code is not yet as good as before, but it will catch up. The noteworthy optimization "remove dead store" is implemented, which makes dead code elimination more aggressive
* some bugs were discovered and they will be addressed in the next iteration
* I will define the optimization part as a thesis project (a thesis which can be read inside the Git repo, for those interested), which means that in many cases the optimizations (up to one year from now) will become more robust and more powerful

What I plan to do next:
* bug fixing
* catch up on optimization front
* make sure that (almost) all simple math expressions are evaluated at compile time. Right now there are small misses here and there; for example:
x = (double)2;
is not rewritten as:
x = 2.0;
which in turn disables other optimizations in some code

Sunday, August 4, 2013

Opinion: C/C++ is today's assembly when it comes to performance


The world relies on the C language, so people who care to interoperate with external components have to consume C. This makes C an easy target for generated output. But targeting C (or C++) gives you in many ways better performance than writing your own assembly: today's C compiler optimization pipelines include many advanced optimizations, including but not limited to:
- very good register allocation
- very good local and dataflow optimizations
- very good math performance
- inter-method optimizations, which are the fairly weak spot (unless you enable link-time optimization)

So writing a compiler that targets C (or, like CR, C++) means it can focus on other important parts before the code is fed to the C++ compiler:
- reducing redundant smart-pointer operations (assigning smart pointers is expensive, and the C++ compiler will simply not do this for you; the C++ output without any optimizations is about 8 times slower on NBody than the equivalent .Net time, but after removing the redundancies the C++ code becomes faster)
- simplifying some logic to remove some variables from the code, so the C++ compiler has less to tinker with
- doing some basic dead-code elimination and dataflow analysis, so that even if the target C++ compiler is not that good, the generated C++ code is not fully inefficient

There are cases when assembly was used for performance: when you didn't want to wait for the compiler to add support for the instructions you were missing, or worse, when the compiler would never generate code using those instructions. I'm talking here about SSEx or AVX. But an up-to-date compiler gives you this for free: Visual Studio 2012 supports AVX (and SSE2), GCC too, LLVM too, for the loops you make compiler-friendly. In fact, not writing them in assembly is really a great thing, because the compiler can inline your method, and most compilers will not inline assembly-only methods.

Similarly, writing things up-front in C++ makes your code work on platforms it was maybe never intended for in the first place, like Arm64, or at most with very small changes.

The last death stroke, in my view, is that today's CPUs are very different from those of, say, 20 years ago; the main difference is that today's processors are out-of-order, not in-order. Instructions are mostly executed speculatively, and most of the time your code is "waiting": for memory to be written, for memory to be read, and so on. This means that optimizations like "shift left by 2" and similarly minded tricks give no performance benefit, while shaping your data to fit into the L1, L2 and L3 caches can sometimes give much more speedup than writing SSEx code (look at this very interesting talk).

This is also why CodeRefractor, at least for the following months, will try to improve its capabilities with only a small focus on optimizations, and certainly they will be high-level ones. The feature I'm working on now is merging strings all over the project, so they give better cache locality. Will it speed up the code greatly? I'm not sure, but the performance that C++ gives from the get-go is good enough to start with.

Wednesday, July 31, 2013

Status Updates - Part 3

As I was on vacation, I only did smaller tasks in the free time, but there are some noteworthy updates, mostly on the optimization front:
- there is a [PureMethod] attribute that you can mark functions with. If this attribute is found, the function is considered pure and, as a consequence, if you call it with constants, the call is evaluated at compile time. It would be great if in the future functions were analyzed for purity automatically, but this is a longer-term plan (it is possible to do, but there are many cases)
- inlining is possible (at least for simple functions), but the optimization is disabled as it requires a lot of testing. Anyway, this opens a lot of possibilities on the performance front: if you have a call of a function with a constant, and this method is inlined, more optimizations can successfully occur. The medium-term plan is to bug-fix and test the inliner so it works with most small cases
- the compiler optimizer is split into parallel and serial optimizations. The good part of it is that, as functions are initially defined, all cores are used to compile the functions. The inliner (and the future purity computer) are serial optimizations. This reduces the time to generate the C++ code for NBody (on my first-gen i5) from 200 ms to 150 ms; the C++ compilation itself still takes longest
- function bodies are defined as a sequence of simple operations, so optimizations that delete one item at a time were rewritten to be much faster by doing the deletes in batches
- unit tests are a bit weaker right now, but they compile/run much faster. They test the capability of the compiler to give an output, not the execution output. They now run properly, so unit testing is working again

So in short, you will get the same code (if you don't mark it with [PureMethod] everywhere), faster.

I added code to reflect APIs; it will be needed to generate stubs for the OpenRuntime code. This code needs some love, and if any readers are interested, it would be great if someone could look into generating some empty library wrappers.

Future plans:
- enum support
- string type improvements (it partly depends on enums)
- string merging: all occurrences of the same string should be defined just once, in a string table
- (maybe) fixes to the inliner: at least the function-call overhead should be removed in cases that can be detected: empty functions, simple setters/getters
- (maybe) a purity checker: computing purity gives extensive speedups when the developer uses constants. So if functions can be analyzed for purity (without [PureMethod] annotations), calls made everywhere with constants will have zero overhead at execution time