Tuesday, September 24, 2013

NBody: can C# be fast?

One question blog readers often ask is: "How fast would my C# code be if it were written in C++?" In many cases people come with their own "numbers", such as:
- Java JIT is faster than C (or C++)
- C# is faster than Java because it has structs
- C++ is faster because everything in this world that needs performance is written in C++, for example games

Of course this blog comes with its own personal spin, so if you have a strong opinion about who you expect to win, please ignore what is written next.

A benchmark I looked at optimizing is NBody, because it contains many operations that are common in a computing kernel:
- mathematical operations
- math intrinsics
- somewhat complex array accesses and iterations
- cache-friendly code (the reads are mostly sequential), so a version with many memory indirections would see its performance hurt
- no dependency on complex library functionality (as Regex would have)
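
To make these properties concrete, here is a minimal C++ sketch of the kind of inner loop an NBody step contains. It is not the benchmark source and not CodeRefractor's output: the Body struct and the advance and momentum_x names are hypothetical, chosen only for illustration. It shows plain arithmetic, one math intrinsic (sqrt) and mostly sequential reads over a contiguous array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical body layout for illustration; stored contiguously in a vector.
struct Body {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};

// One simulation step: pairwise velocity updates, then position updates.
void advance(std::vector<Body>& bodies, double dt) {
    assert(dt > 0.0);
    const std::size_t n = bodies.size();
    for (std::size_t i = 0; i < n; ++i) {
        Body& a = bodies[i];
        for (std::size_t j = i + 1; j < n; ++j) {
            Body& b = bodies[j];
            const double dx = a.x - b.x;
            const double dy = a.y - b.y;
            const double dz = a.z - b.z;
            const double dist2 = dx * dx + dy * dy + dz * dz;
            // dt / distance^3; sqrt is the math intrinsic of the kernel.
            const double mag = dt / (dist2 * std::sqrt(dist2));
            a.vx -= dx * b.mass * mag;
            a.vy -= dy * b.mass * mag;
            a.vz -= dz * b.mass * mag;
            b.vx += dx * a.mass * mag;
            b.vy += dy * a.mass * mag;
            b.vz += dz * a.mass * mag;
        }
    }
    for (Body& body : bodies) {
        body.x += dt * body.vx;
        body.y += dt * body.vy;
        body.z += dt * body.vz;
    }
}

// Total momentum along x; pairwise updates should leave it unchanged.
double momentum_x(const std::vector<Body>& bodies) {
    double total = 0.0;
    for (const Body& body : bodies) {
        total += body.vx * body.mass;
    }
    return total;
}
```

Calling advance in a loop over a small vector of bodies is essentially the whole benchmark, which is why the quality of the generated loop code dominates the timings.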

Given this, the NBody source code, which is hardcoded to 5,000,000 steps, runs in:

(Update: there was a last-minute fix. The .Net times were originally measured with non-release builds; the best time is now 1550 ms.)
Runtime                          Time (ms)
.Net 4.5 64 bit                  1550
MinGW 4.7 32 bit                 2860
MinGW 4.7 64 bit                 2840
Win JDK 6 -server 32 bit         1500
Linux JDK 7 -server 64 bit       1444
Linux G++ 4.7 64 bit (-O3)       1494
Linux G++ 4.7 64 bit (PGO)       1378

Some people will notice that I didn't list MS VC++ (on Windows) or Clang (on Linux). In fact I did test them, but, maybe because my setup was wrong, MS VC++ was slower than MinGW on Windows. Clang++ was also slower than GCC: it ran in about 2 seconds (on my up-to-date Mint Linux). The point is mostly to compare the (best) managed compilers against the (best) native ones. Also, this test does not compare C# with C++, but C# with a C++ translation of the same C# code made by CodeRefractor, which readers can interpret in any way they want.

So here is what I found in this testing:
- if you do low-level math, Java may save the day: it is by far the safest choice, it is very easy to set up, and converting the C# code is easy too
- if you know what you're doing, you can get faster code by using the proper OS/compiler combination in the C++ world. Even though I didn't use the Intel compiler, and for now I only have access to MS VC++ 2010, using MinGW or GCC on Linux can get you a tad faster code (at least when the code is written somewhat low level but is neutral on optimizations)
- MinGW gives virtually the same performance in NBody on 32 and 64 bit, at least on this code (which was a big surprise for me). Maybe it is a bug in my setup, but I picked the best time for this test, and in general I sometimes got slower times on 64 bit
- GCC on 64-bit Linux with -O3 and PGO gave the best performance, at least on my machine.

For people who want to reproduce these tests, this is my machine and software:
- i5 540M, 2 cores @ 2.4 GHz
- Windows 7 64 bit
- GCC 4.7.2 under MinGW (part of the DevC++ distribution)
- VS 2010
- .Net 4.5
- for Linux: Mint 15 (with updates), GCC 4.7.3
The source is under this revision (of the output C++ file), which reflects the NBody benchmark as translated by CodeRefractor.
Best running GCC arguments: -std=c++11 -ldl -O3 -flto -mtune=native -march=native -fprofile-use
-O3 = level 3 of optimizations
-flto = link-time (whole program) optimization
-mtune=native = tune the code for my machine (I think it may not matter)
-march=native = use the instructions of my machine (I think it may not matter)
-fprofile-use = PGO, using the profiles recorded by earlier runs of a build made with -fprofile-generate
In fact, without -mtune and -march (on 64 bit) the performance is basically the same, but I included these parameters so that readers of this blog who try to reproduce the test get a similar level of performance.

Sunday, September 22, 2013

Status update: Part 7

New instructions are implemented, though some of them only partially:
- the ldftn instruction is implemented (a bit hackishly for now); it is a crucial part of making delegates work
- similarly, the ldelema instruction is implemented, which is very important for making unsafe code (with pointers) work
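
As a rough illustration of what these two instructions mean once the code is in C++: ldftn pushes a pointer to a method's code, and ldelema pushes the address of an array element instead of its value. The sketch below is only an assumption about how that could look in hand-written C++; the IntFunc, twice, load_function and element_address names are hypothetical and this is not what CodeRefractor actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C# delegate taking and returning an int.
using IntFunc = int (*)(int);

static int twice(int value) { return value * 2; }

// Roughly what ldftn provides: a pointer to the code of a known method.
IntFunc load_function() { return &twice; }

// Roughly what ldelema provides: the address of an element, not its value.
int* element_address(std::vector<int>& items, std::size_t index) {
    assert(index < items.size());
    return &items[index];
}
```

A delegate call then becomes a call through the pointer, such as load_function()(21), and unsafe pointer code can write through the address returned by element_address.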

At the same time, I am reorganizing the code around the type system. As a result, code that triggers these instructions will produce output that (basically) works for the instructions themselves, but the program as a whole will fail because only partial code is included.

In time they will work, but for now they are not in a proper state.

From these implementations I learned that CR needs a lot of small fixes and testing, and as I add support for more instructions (or bigger features), development will likely slow down because I will have to make sure nothing regresses.

If you have time and want to invest it as a user, try the following:
- take the sources from: https://github.com/ciplogic/CodeRefractor (there is a "download zip" option if you don't know how to clone, or you don't want to clone it)
- make a small program inside the SimpleAdditions solution
- you will likely (if you do it right away) get some compiler errors
- report them either here (on the blog) or on the GitHub issues page: https://github.com/ciplogic/CodeRefractor/issues

It is very important for me to know which (small) programs you're running, so I can focus on testing them. If you know C#, even if you don't understand how a compiler works, you can help this project in small or large ways.

I am asking for help in this post because CR is not backed financially by any entity (the only thing backing it is my free time), so simply contributing to it may turn it into a project that is useful to users (for example, a developer could write OpenGL games in C# and later recompile them for a platform with no C# support, or avoid paying for a Xamarin Studio license).

Tuesday, September 10, 2013

Status update: Part 6

This update is critical for the performance of math code and it also has fixes in multiple areas, so I recommend you play with the Git source on GitHub:

- the logic of mapped types (bootstrap types) is in better shape. Strings are now implemented directly in .Net;
- a simple Class Hierarchy Analysis with default support for devirtualization is done (CR doesn't support virtual calls at all for now, but if you have a sequence where .Net reports a virtual call and CR can detect the type, it will remove the virtual call and make it a direct call). This is done on a per-instance basis;
This makes the following program work:

            var charData = new[] {'H','e','l','l','o',' ','w','o','r','l','d'};
            var s = new string(charData);

Sure, it is not impressive, but it requires many parts to be in place;
- all objects are constructed using make_shared, which means a speedup in loops that allocate multiple objects
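
The two C++-side points in this list (direct calls where the type is known, and make_shared allocation) can be sketched together. This is a minimal illustration under my own assumptions, not CR's generated code: the Shape and Triangle types and the function names are hypothetical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical class hierarchy, used only for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const { return 0; }
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

// Virtual dispatch: the target is looked up at run time.
int sides_virtual(const Shape& shape) { return shape.sides(); }

// Devirtualized form: the concrete type is known, so the qualified
// call goes straight to Triangle::sides without virtual dispatch.
int sides_direct(const Triangle& triangle) {
    return triangle.Triangle::sides();
}

// make_shared typically places the object and its reference count in
// one allocation, which helps loops that create many objects.
int count_sides(int count) {
    assert(count >= 0);
    int total = 0;
    for (int i = 0; i < count; ++i) {
        std::shared_ptr<Triangle> triangle = std::make_shared<Triangle>();
        total += sides_direct(*triangle);
    }
    return total;
}
```

Both call forms return the same value; the direct one simply gives the C++ compiler a chance to inline the call.
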
As more parts fall into place, I expect to reach a first release milestone in 2-3 months from now (maybe on New Year!?). I hope to have the StringBuilder, String and File classes basically working for some common operations.

A mini-roadmap for the next release:
- StringBuilder.Append(String) will work
- Bug fixes will focus on making the higher optimization levels work with programs that use StringBuilder