Sunday, August 4, 2013

Opinion: C/C++ is the today's assembly on matter of performance


The world relies on C language, so people that care to inter-operate with external components has to consume C. This makes it easier to target C as an output. But targeting C (or C++), gives to you in many ways better performance than writing your own assembly: the today's C compiler optimizer pipeline include many advanced optimizations including but not limited to:
- very good register allocation
- very good local and dataflow optimizations
- very good math performance
- fairly weak (unless you enable link time optimizations) inter-method optimizations

So writing a compiler that targets C (or like CR which targets C++), means that the code can focus on other important parts which are fed to the C++ compiler:
- reduce redundancies of smart-pointers (the assignment of smart pointers is expensive, and a compiler will not simply do it for you - the C++ output without any optimizations in NBody is like 8 times slower than the equivalent .Net time, but after removing the redundancies, the code in C++ gets faster)
- simplify some logic items to remove some variables in the code, so C++ compiler will have less to thinker about
- do some basic dead-code elimination and dataflow analysis, so at least if the target C++ compiler is not that good, the C++ code to be compiled to not be fully inefficient

There are some cases when assembly was used as a performance benefit, and you didn't want to wait the compiler to add support for the instructions you were missing, or worse, the compiler will never generate the code using these instructions. I'm talking here about SSEx or AVX. But using an up-to-date compiler gives you this: Visual Studio 2012 gives to you support for AVX (or SSE2), GCC too, LLVM too. For free, for the loops you made them compiler friendly. In fact, not writing them in assembly, is really a great thing, because the compiler can inline your method, and most compilers will not inline assembly only methods.

Similarly, writing the things up-front in C++, will make your code work on platforms that maybe were never intended to work in the first place, like Arm64, or maybe are very little changes to be made.

The last death stroke in my view, is that today's CPUs are so different than they were let's say 20 years ago, and the main difference, is that the processors today are out-of-order, not in-order. This means that instructions are mostly executed speculative and you're most time you spend in code is "waiting", for memory to be written, for memory to be read, etc. This makes that optimizations of "shift left by 2" or similar minded optimizations to not give any performance benefit, but optimizing your data to fit into the L1, L2 and L3 cache, can give much more speedup sometimes than writing SSEx codes (look to this very interesting talk).

This is also why CodeRefractor at least for the following months will try to improve its capabilities with just a small focus on optimizations, and certainly they will be on the high level. The feature I'm working now is to merge strings all over the project, so they will give a better cache locality. Will speed up greatly the code? I'm not sure, but the performance that C++ gives it from let-go is good enough to start with.

No comments:

Post a Comment