Tuesday, October 22, 2013

Status update: Part 8

This post is fairly light on content, as I noticed some bugs were introduced in the newest bits, but there are also some improvements:
- the generated code no longer uses namespaces for generated types; some name-mangling issues were fixed as a result
- duplicate runtime methods no longer appear
- documentation is vastly improved: as CodeRefractor is the subject of my bachelor thesis, I have described all the optimizations and the various components of CR extensively. There are 40 written pages that may be of interest; if you want to learn (at least as a read) about the internals or how the code is put together, I recommend reading the Documentation folder
- generic class specialization works (at least partially), and name mangling handles generics better
- there are also small advances in delegate support, but it is still not finished. The good part is that the delegate code is isolated, so when I have time I will be able to finish it

Still, the main focus for now remains bug fixing and code cleanup.

With this post I want to thank JetBrains for offering a free ReSharper license to help advance CodeRefractor. I think R# is a great tool in itself, and if you are not already using it, I hope you will try it and see how it improves the quality of your code base.

Monday, October 7, 2013

NBody: can C# be *very* fast?

As you read in the previous blog post, Java and .Net performance can match (or even beat) compiled C++ code. This was a reason for me to look into the performance discrepancy and to implement three "remaining" optimizations that widen the performance gap considerably:
- escape analysis: objects that do not escape are allocated on the stack
- common subexpression elimination
- loop invariant code motion (LICM)
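To illustrate what escape analysis changes in the generated C++, here is a hand-written sketch (not CodeRefractor's actual output; the type and function names are mine): an object that never escapes its function can become a stack value instead of a reference-counted heap allocation.

```cpp
#include <memory>

struct Vec3 { double x, y, z; };

// Without escape analysis: every 'new Vec3()' becomes a
// reference-counted heap allocation in the generated C++.
double LengthSquaredHeap(double x, double y, double z) {
    std::shared_ptr<Vec3> v = std::make_shared<Vec3>(Vec3{x, y, z});
    return v->x * v->x + v->y * v->y + v->z * v->z;
}

// With escape analysis: the object never escapes, so it can
// live on the stack - no allocation, no refcount traffic.
double LengthSquaredStack(double x, double y, double z) {
    Vec3 v{x, y, z};
    return v.x * v.x + v.y * v.y + v.z * v.z;
}
```

Both versions compute the same value; the stack version simply skips the allocator and the reference-count updates entirely.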

So what does this mean in performance terms? I will not give exact numbers, but compared with the best C++ time, you now get a bit less than 1300 ms on Linux and around 1400 ms on Windows (on both VC++ and MinGW).

Why were these optimizations so important?
- escape analysis removes the need for increment/decrement reference counting, for example for objects passed as function parameters
- common subexpression elimination precomputes repeated small expressions once. It also works across function calls (if the functions are evaluated as pure). So if you build a rotation matrix from cos(alpha) and sin(alpha), you don't have to cache the sine and cosine yourself; the compiler will do it for you automatically
- LICM (see the Wikipedia article) works like common subexpression elimination, but for expressions that do not change across loop iterations: they are executed once before the loop instead of at every iteration. It also works with pure functions, so such a function call will be moved outside the loop as well
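The rotation example above can be sketched in C++ (my own illustration of the before/after shapes, not CR's literal output): sin(alpha) and cos(alpha) are pure, loop-invariant calls, so CSE and LICM together hoist them out of the loop.

```cpp
#include <cmath>
#include <vector>

// Before optimization: sin/cos are recomputed at every iteration.
void RotateNaive(std::vector<double>& xs, std::vector<double>& ys, double alpha) {
    for (std::size_t i = 0; i < xs.size(); ++i) {
        double x = xs[i], y = ys[i];
        xs[i] = x * std::cos(alpha) - y * std::sin(alpha);
        ys[i] = x * std::sin(alpha) + y * std::cos(alpha);
    }
}

// After CSE + LICM: the pure, loop-invariant calls are computed once.
void RotateOptimized(std::vector<double>& xs, std::vector<double>& ys, double alpha) {
    const double c = std::cos(alpha);  // hoisted out of the loop (LICM)
    const double s = std::sin(alpha);  // computed once, reused (CSE)
    for (std::size_t i = 0; i < xs.size(); ++i) {
        double x = xs[i], y = ys[i];
        xs[i] = x * c - y * s;
        ys[i] = x * s + y * c;
    }
}
```

The two functions produce identical results; the optimized form just trades 4*N transcendental calls for 2.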


This also means that, barring bugs, I will not work on optimizations for some time, but you may try to generate code; the result "should scream".

Tuesday, September 24, 2013

NBody: can C# be fast?

One question for blog readers is: "how fast could my C# code be if it were written in C++?" In many cases people come up with their own "numbers", like:
- Java JIT is faster than C (or C++)
- C# is faster than Java because it has structs
- C++ is faster because everything in this world is written in C++ for performance, for example games, etc.

Of course this blog comes with its own spin, so if you have a strong opinion about who should win, feel free to ignore what follows.

A benchmark I chose to optimize is NBody, because it has many operations common to a computing kernel:
- mathematical operations
- math intrinsics
- somewhat complex array accesses and iterations
- it is written in a cache-friendly way (reads are mostly sequential), so many memory indirections would hurt performance
- it doesn't depend on complex library functionality (as Regex would)

Given this, the NBody source code, hardcoded to 5,000,000 iterations, runs in:

(Update: a last-minute fix; the .Net times were measured with non-release builds, and the best time is now 1550 ms.)

Runtime                        Time (ms)
.Net 4.5, 64-bit               1550
MinGW 4.7, 32-bit              2860
MinGW 4.7, 64-bit              2840
Win JDK 6 -server, 32-bit      1500
Linux JDK 7 -server, 64-bit    1444
Linux G++ 4.7, 64-bit (-O3)    1494
Linux G++ 4.7, 64-bit (PGO)    1378

Some people will notice that I didn't test MS VC++ (on Windows) or Clang (on Linux). In fact I did, but perhaps my setups were wrong: MS VC++ was slower than MinGW on Windows, and Clang++ was also slower than GCC, running in about 2 seconds on my up-to-date Mint Linux. The point is mostly to compare the (best) managed against the (best) native compilers. Also, the test does not compare C# against C++, but C# against a C++ translation of this C# code made by CodeRefractor, which can be interpreted any way the reader wants.

So what I found in this testing:
- if you do low-level math, Java may save the day: it is by far the safest choice, and both setup and the conversion of the C# code are very easy
- if you know what you're doing, you can get much better performance by picking the right OS/compiler in the C++ world. Even though I didn't use the Intel compiler and currently only have access to MS VC++ 2010, using MinGW and Linux can get you somewhat faster code (at least given that the code is written fairly low-level but neutral with respect to optimizations)
- MinGW gives virtually the same NBody performance on 32-bit and 64-bit (a big surprise to me), at least on this code. Maybe it is a bug in my setup; I picked the best time for this test, and in general the 64-bit runs were sometimes slower
- PGO, 64-bit, GCC with -O3 on Linux gave the best performance, at least on my machine.

For people who want to reproduce these tests, my machine is:
i5 540M, 2 cores @ 2.4 GHz
6 GB RAM
Windows 7 64-bit
GCC 4.7.2 under MinGW (part of the Dev-C++ distribution)
VS 2010
.Net 4.5
For Linux: Mint 15 (with updates) / GCC 4.7.3
The source is under this revision (of the output C++ file), which is the NBody benchmark as produced by CodeRefractor.
Best-running GCC arguments: -std=c++11 -ldl -O3 -flto -mtune=native -march=native -fprofile-use
where:
-O3 = optimization level 3,
-flto = link-time (whole-program) optimizer,
-mtune=native = tune for my machine (I think it may not matter),
-march=native = use my machine's instruction set (I think it may not matter),
-fprofile-use = use the PGO profiles from earlier runs
In fact, without mtune and march (on 64-bit) the performance is basically the same, but I list these parameters so that readers who try to reproduce the results get a similar level of performance.

Sunday, September 22, 2013

Status update: Part 7

New instructions are implemented, though somewhat partially:
- the ldftn instruction is implemented (a bit hackish for now), a crucial part of making delegates work
- similarly, the ldelema instruction is implemented, which is very important for making unsafe code (with pointers) work
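For context, here is my sketch of the C++ shape that ldelema corresponds to (not CR's exact output; the function names are mine): the instruction pushes the address of an array element, which is exactly what unsafe pointer code needs.

```cpp
// ldelema pushes the address of array[index]; in C++ terms this is
// simply taking a pointer to the element, which can then be written
// through - the core of what 'unsafe' C# pointer code compiles to.
int* ElementAddress(int* array, int index) {
    return &array[index];  // ldelema array, index
}

void StoreThroughPointer(int* array, int index, int value) {
    int* p = ElementAddress(array, index);
    *p = value;  // equivalent of an indirect store (stind.i4)
}
```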

At the same time, I am reorganizing the code around the type system. This means that triggering these instructions will generate code that (basically) works for the instructions themselves, but the program will still fail because partial code gets included.

In time they will work, but they are not in a proper state for now.

While implementing these I found that CR needs a lot of small fixes and testing, and as I add support for more instructions (or bigger features), development will likely slow down because I will have to make sure nothing regresses.

If you have time as a user and want to invest it, try the following:
- get the sources from https://github.com/ciplogic/CodeRefractor (there is a "Download ZIP" option if you don't know how to clone the repository, or don't want to)
- make a small program inside the SimpleAdditions solution
- you will likely get some compiler errors (if you do this right away)
- report them either here (on the blog) or on the GitHub issues page: https://github.com/ciplogic/CodeRefractor/issues

It is very important for me to know which (small) programs you're running so I can focus testing on them. If you know C#, even if you don't understand how a compiler works, you can help this project in ways small or large.

I am asking for help because CR is not financially backed by any entity (the only thing backing it is my free time), so simply contributing may turn it into a project genuinely useful to users: for example, a developer could write OpenGL games in C# and later recompile them for a platform with no C# support, or avoid paying for a Xamarin Studio license.

Tuesday, September 10, 2013

Status update: Part 6

This update is critical for the performance of math code and also contains fixes in multiple areas, so I recommend playing with the Git source on GitHub:

- the logic of mapped types (bootstrap types) is in much better shape; strings are now implemented directly in .Net;
- a simple Class Hierarchy Analysis with default devirtualization support is done (CR doesn't support truly virtual calls at all for now, but if .Net reports a virtual call and CR can detect the concrete type, it will remove the virtual call and make it a direct call). This is done per instance;
This makes the following program work:

            var charData = new[] {'H','e','l','l','o',' ','w','o','r','l','d'};
            var s = new string(charData);
            Console.WriteLine(s);

Sure, not impressive, but it requires many parts to be in place;
- all objects are constructed using make_shared, which means a speedup in loops that allocate multiple objects
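The make_shared point matters because it fuses the object and the reference-count control block into a single allocation; this is standard C++ behavior, sketched below with a hypothetical Point type (not CR's generated code).

```cpp
#include <memory>

struct Point {
    double x, y;
    Point(double x_, double y_) : x(x_), y(y_) {}
};

// Two heap allocations: one for the Point, and a second one
// for the shared_ptr's reference-count control block.
std::shared_ptr<Point> MakeSlow(double x, double y) {
    return std::shared_ptr<Point>(new Point(x, y));
}

// One heap allocation: object and control block are placed
// together, which is why loops allocating many objects speed up.
std::shared_ptr<Point> MakeFast(double x, double y) {
    return std::make_shared<Point>(x, y);
}
```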
As more parts fall into place, I expect to reach a first release milestone in 2-3 months (maybe on the New Year!?). I hope to have StringBuilder, String and the File class basically working for some common operations.

A mini-roadmap for the next release:
- StringBuilder.Append(String) will work
- bug fixes will be directed at higher levels of optimization for programs using StringBuilder

Thursday, August 29, 2013

Status update: Part 5

This is a slightly early update, but noteworthy:
- the optimizations are rewritten using a Usages & Definitions analysis (although it is not a full Use-Def chain), which makes the optimization step more precise and removes duplicated code
- the speed is back on par with the simple optimizations that existed before
- boolean operations and all detected simple expressions are evaluated at compile time
- similar fixes were added for pure function evaluation (functions with mathematical expressions and simple operations), so calls to them with constant arguments are evaluated at compile time
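A rough analogue of this pure-function evaluation in plain C++ is constexpr (my illustration of the idea, not CR's mechanism): a function whose result depends only on its arguments can be folded completely when called with constants.

```cpp
// A pure function: the result depends only on its arguments,
// with no side effects.
constexpr int SumOfSquares(int a, int b) {
    return a * a + b * b;
}

// Because the arguments are constants, the compiler folds the
// whole call down to 25 - the same idea CR applies to pure
// functions in the IL it compiles.
constexpr int kFolded = SumOfSquares(3, 4);

static_assert(kFolded == 25, "evaluated at compile time");
```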

So, all the things that were planned are done ahead of time.

There are two areas I am planning to work on further out; one is a regression and one is a new feature:
- strings have not worked since I removed the backing C library support; I will need to work around some C specifics using CR's OpenRuntime design
- delegates, and delegates from PInvoke

If you find areas you think are more important to work on, write to me privately at ciprian (dot) mustiata (at) gmail com.

There is still no release (not even a 0.01). If you know how to build installers and have time to package a release, that would be great, and I will publish it. For now, fixing bugs matters more to me than a release (not that a release is unimportant, but critical features are still missing and some people would complain about its state, so I don't think it is releasable yet, unless someone wants a release in order to contribute back).

Thursday, August 22, 2013

Status Update: Part 4

This month was hectic for me, but even so, you can see some updates in CR.

* first and foremost, there was a bug fix so that PInvoke calls using conventions other than StdCall work. This fix is critical for a working PInvoke. With PInvoke fixed, the other big missing part is delegates, along with a way to invoke a callback through a delegate (a fairly common pattern in wrapper libraries like OpenTK)
* a simple purity checker for functions was written, so if a function contains only simple math operations, you can expect a call to it with constant arguments to be evaluated at compile time
* a slightly improved framework for optimization analysis was added; the generated code is not yet as good as before, but it will catch up. The noteworthy new optimization, "remove dead store", is implemented, which makes dead code elimination more aggressive
* some bugs were discovered and will be addressed in the next iteration
* I will use the optimization work as my thesis project (the thesis can be read inside the Git repo, for those interested), which means that over the next year the optimizations should become more robust and more powerful

What I plan to do next:
* bug fixing
* catch up on optimization front
* make sure that (almost) all simple math expressions are evaluated at compile time. Right now there are small misses here and there; for example:
x = (double)2;
is not rewritten as:
x = 2.0;
which in turn disables other optimizations in some code
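The miss above, sketched in C++ terms (my own illustration): once the cast (double)2 is folded to the literal 2.0, the surrounding expression becomes fully constant and further folding can kick in.

```cpp
// What CR currently emits (conceptually): the cast survives,
// so x is not yet recognized as a plain constant.
double Unfolded() {
    double x = (double)2;   // cast not folded to 2.0
    return x * 3.0;
}

// What constant folding should produce: the cast is rewritten
// to the literal 2.0, and then the multiplication folds too.
double Folded() {
    return 6.0;
}
```

Both functions return the same value; the point is that the second form is what cascading constant folding would leave behind.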