Code Refractor - Virtual Machines/Compiler performance musings: November 2013

Thursday, November 21, 2013

Target 0.0.2: Status 1

Three small albeit important updates are made in the GitHub repository:
- auto-properties:
    class Test
    {
        public int Value { get; set; }
    }
Properties were working in the previous releases, but auto-properties generate invalid names (for C++), so is

- static constructors were skipped:
    class Test
    {
        public static int Value { get; set; }
        static Test()
        {
            Value = 2;
        }
    }
In the past the static constructor was skipped. Right now the code is generated.

- constant fields are not defined as a part of the final code:
    class Test
    {
        public const int Total = 2;
    }
In the CR 0.0.1 the Total field was a part of the memory usage of the Test class instance. Right now constants are removed, reducing the memory usage and making it consistent with the .Net usages.

Monday, November 11, 2013

Roadmap for 0.0.2

As all development for now was made by me, and as much as I can see, CR needs improvement and I would go into future planned features, I see some parts that do need improvement but as always, the way I see to improve CR is to improve the quality of the project.

First of all let's go into CR misses (and are a target for 0.0.2):
- generics are very limited. I would like to see more cases with generics and to make them working. Some code with generic classes is there, but not all combinations do work;
- some commits were done just after 0.0.1 in the Git repository, and I think there will be a primitive delegate support;
- the actual optimizer is inter-procedural but there is no global variable pool. I would hope to add support for it and try to add simple optimizations regarding this. In short if you will declare a global static int/double, even you don't specify that is it const, CR will try to infer this. If a global variable is not used, CR should remove it (this is important as CR doesn't support reflection or things of this sort)
- handle fields better: if you will use a class that the base class names a field the same as inherited class, CR will "merge" them incorrectly.
- it would be great that instance objects that are created and not used, to be removed (this is in fact a trivial optimization, but it has to take extra care for cases when static constructors do initialize states)
- try to compile a target application: an SDL/OpenGL application and fix all the found blockers.

What is not yet a target, and I encourage anyone interested into more tasks that are useful but not intersting at least for now for me:
- better command line handling: CR supports switching runtime by changing the assembly, switching the C++ compiler, etc.It would be really great if someone will make a consistent and nice command line handling
- integrate CR with C# (or VB.Net) compiler: create a small tool that will invoke C# compiler first (CSC.exe for Windows or MCS.exe for Linux/OS X) and CR later transparently for the user.
- support Linux/OS X/ 32-64 bit differences, ARM/MIPS CPUs: "just" checking if CR works on other platforms including various compilers takes time which I basically don't have. If you feel inclined to support a platform and you need my (minimal) support to setup it, I will be glad to give this minimal help to make it running. I will be glad to put patches to support various compilers or configurations, so if you want to use boost::shared_ptr instead of std::shared_ptr and you write on your end the patches, I will be glad to include them upstream, but I'm not interested to support by myself anything else than my machine
- better support for the mapped runtime: add complete String class, or List<T> or Dictionary<K, T>, etc; this is
- VTable support, exceptions, Reflection support, Linq and lambdas, "if you will implement this I would use CR..." kind of stuff. The reason for that is simply: I either don't have time or interest to support this (or most likely both), and many times I've noticed that fixing some small items at a time make some parts to work by themselves, like support for properties is working now, and even there is no VTable, better static code analysis can remove (for at least some usages) the need of VTables. Would it be great to have them? Yes, as long as a developer (aka "you" - the community) adds support for it.

As a timeline I hope that 0.0.2 will be released around March or April 2014, but it may be earlier. From time to time (in the past was like a bi-monthly issued) I will write status reports with prefix: "Target 0.0.2: Status ..." where the Git tracked development is described. This is good for you to follow them if you are technically inclined to read an "internals" digest of them.

Friday, November 8, 2013

Why not NGen?

A legitimate question, and I want to answer at least in the way I view it:

An honest question: what is the value of your tool over NGEN?

First of all, as I wrote just one blog entry before, I don't believe in "Native vs Managed", because every time people take these words differently:

- if people think by managed as being safe, STL from C++, or Qt's Tulip and QObject is a "managed" subset that keeps most of what .Net offers (basically bounds checking, strict typing, generics, somewhat consistent auto-free-ing of memory, etc.);

- if people think as Native as being "close to metal", even .Net compile to native all methods as they are executed, so excluding the first seconds of startup, all .Net applications are native

Given that, I see a tool as CR useful still as it has a similar compilation profile as C# and there is a lot of tooling of C#, but an execution profile of C++ with LTO (link-time optimization). As far as I see how CR will advance in future, it will support always a subset of what .Net will be able to offer, so people thinking into removing .Net (or Mono) at least as a toolset of CR, I cannot see it as being done to soon.

In the same time, even I see CR as "never catching" .Net, it will be still a great tool, and some use-cases I can see them as getting beyond what .Net can offer.

Let's take a medium program that I think it will be great case for CR (as for a version like 0.5 or 1.0): a developer writes a C#/SDL/OpenGL game and he/she will want to run it on a slower device (let's say for now is a Rasberri Pi, but it doesn't matter that much). First of all, he or she will improve the OpenGL calls, second of all will try to improve the execution profile using Mono.

Using Mono will see first of all that the application starts a bit slow. Also, some math routines are suboptimal for Mono's JIT. It has two options: to run Mono in AOT mode or to use LLVM's JIT. Using LLVM JIT will see that the startup is even slower. Using AOT mode will decrease a lot of the performance (as Mono --aot mode uses one CPU register for generating PIC code). At the end, it will notice that the game will have small hiccups because the SGen GC makes from time to time to skip a frame.

Using CR, in fact the things will be a bit different: there is no need to setup anything else than the optimization level. Considering that a developer may want to wait even let's say half of minute for the final executable, will pick the highest level of optimizations. This will mean that many global operations will happen like: devirtualizations, inlining and removal of global code, constant merging over the entire program, etc. The code will not be PIC code, but will use all registers and the optimizer can be as good as C++ compilers will get to that moment. Because the code is using reference counting, it will mean that pauses are much smaller (no "freeze the world" needed) and there are optimizations (already as of today in CR's codebase) to mitigate the updating of reference counting (CR is using escape analysis).

Some problems will still remain for the user: as is using reference counting, the developer has to look for memory cycles, but on the other hand, these cycles are easier to find, not because CR does anything special, but because C++ tools today do find really easy memory leaks (and CR does name the functions the same as the C# original name is). In fact is it easier to find a leak using ref-counting than a GC leak: start Visual Studio using Debug configuration of the generated C++ code, and at exit of the program all leaks are shown in the console.

At last, CR can add as many things as the developer community will contribute because CR is written in C#, is easier to handle high level optimizations than would be to hack them into Mono runtime (which is C) or in .Net (which is impossible, as the components are not public for modification). Some optimizations that can be done explicit and require much less work from the coding standpoint is marshaling and PInvokes (an area I would really love that CR to improve). When you call a method in a DLL/libSO, in .Net (or in Java for that manner), there is some extra "magic" like some pointers are pinned, and some conversions between pointers occur. In contrast, is it possible that this marshaling (and certainly there is no need of pinning) to be removed all-together in some cases. For example if the developer knows that it uses OpenGL32.dll (or libGL.so), to link using -lOpenGL32, using a special compiler flag. This is not a big win for some libraries, but is big for others, because it doesn't need an indirect call.

So in short, think about CR as a C# only VM that takes its time to compile CIL. At the end it outputs some optimized C++ which can be optimized further by some modern compilers. It is easy to hack (according to Ohloh is only 17K lines for now and it has support for more than 100 IL instructions, it includes a lot of reflection code, more than 20 optimization passes, etc.)

Wednesday, November 6, 2013

Code Refractor 0.0.1 - First release is available

After close of 7 months of development (the project I started it at the start of April, before making it public), first release of CodeRefractor is here.

What it contains compared with the very first introduction? Many CIL instructions and some generics support is there. Compared with introduction of CodeRefractor which was basically able to set fields, arrays and math, and compared with this, I will try to summarize what happened in between:
- most of development is documented, look into Documentation folder (using the .docx format) about the main subsystems
- compiler optimizations are not naive implementations: they are more powerful and they have been using a use-def so the optimizations are understanding the code more precisely; similarly the compiler branching is based on usages and definitions so many intermediary variables are removed naturally by the compiler; an inliner implementation works intelligently
- the runtime is mostly written in C# and is possible to add more methods by writting C# and C++ annotations
- many CIL instructions are implemented, making C# only implementations (that do not depend on System runtime too much) to work. The biggest not-implemented I can say that are delegates. Some partial implementations of generics (very limited), unsafe code is done
- a primitive logic of class hierarchy analysis is done, and as implementation will mature, expect that many devirtualization would be done safely and correctly by the compiler
- unique (in my knowledge) for CIL implementations, the purity and escape analysis allow more optimizations to be really aggressive: calling pure functions with constants is the same as calling the result constant, so Math.Sin(0) is always evaluated as 0 (zero), or for a program taking care of the fact that CR does escape analysis, that objects are allocated on stack or the smart pointers are converted (safely) to raw pointers improving runtime performance. This can generate final code that faster than .Net programs.
- the optimizer works like a global optimizer which makes possible some inter procedural (entire program) operations: program wide merging of array data, strings and PInvoke methods makes your final program to be smaller

Even many things work, the release has many cut corners, and some parts were written very bug prone, so expect that the resulted C++ code to not compile, and as the runtime has no classes in themselves, also expect that no non-trivial program to compile. If it compiles, it should run fast.

After you extract this release, which is just a .zip file, you should copy a GCC distribution. For simplicity I'm using the great Orwell's DevC++ and I copy the C:\Dev-Cpp\MinGW64 as: <CodeRefractorPath>\Lib\Gcc Please notice that you will have to rename the folder at the end, but other than this, it should work just fine.

Anyway, many things are missing and everyone is encouraged to test it and to implement small stuff starting from the GitHub project: every small piece in place makes your program to be closer to working, or if it is working already to work better and more stable.

For questions and feedback you can use Google Groups page.

Friday, November 1, 2013

Opinion: Native and Managed, what it really means?

Microsoft and Virtual Machines world do use many definitions and based on emphasis they can say things that do make no sense to some extent: "native performance", "performance per watt" and in my view is all based on undefined clearly terms.

Based on this I notice that this emphasis changed even more with phones and I cannot clarify without using definitions which again will break the purpose of this entry, so I will go on technology side:
- native is in many people mind associated with Ahead Of Time compilers meaning that you write the code in the language of choice, and finally will create a final executable code that runs directly on the CPU machine
- managed/virtual machine is when applications are written in something intermediary, and before executing there is a runtime that reads this "bytecode" and compiles on the fly

Because of how compilers work, compilation is expensive, and it means that most virtual machines do compromises to make possible to have interactive applications possible. This is why the virtual machines are somewhat lightweight on compiling code by reducing the analysis steps that they perform on the code. This means two things: big applications typically start slower than their "native" counter part, and in many cases the compiled code quality is a bit weaker so the code will run some percent slower up-to some times slower (more in this later).

Based on this view, is it true that there is a managed application world that is so slow than a native performance world? Of course, but as many answers are in life it depends:
- most of the things you see on your screen depend on GPU (video card) to be drawn so even the slowness of a virtual machine is there, if it is done on a separate thread, the animations may work independently
- most of virtual machines do compile hottest of the hottest (including JavaScript VMs which today tend to use a separate thread/process to compile JS) so for simple/interactive applications you will get good (enough) performance
- some VMs do have parts of code compiled into "native" code, for example using NGen, and even the NGen is not a high quality code generator, is good enough, but also it makes your application to start fast
- VMs do allow to use native code directly so if a loop is not well optimized by the virtual machine, the developer can use native code that runs as fast as native code
- VMs tend to have a fast memory allocator, so an allocation heavy application may run faster than a native application, if the native application doesn't write memory pools or other caches to speedup the application

In this hybrid world, performance is less meaningful than it would be if we talk about full Java applications 10 years ago. Also, it is even less meaningful as GPUs and computation on GPU do matter a lot in computation.

This is why when Microsoft launched: "Going Native" campaign puzzled me... the easiest way to achieve this (in "managed" world) is to compile the bytecodes upfront using NGen. They are using this in Windows Phone 8 as your MSIL code is compiled in the cloud.

People were using C# not because performance was not good, but because it was good enough. C++ started to be used because Microsoft did not invest into improving their quality of generated code for a long long time, and in comparison, C++ always did this at least with work into Visual Studio's Phoenix backend, by GCC team and of course Clang/LLVM team.

The last issue of Managed vs Native is that people use it just to make it as a marketing pintch, like here: https://www.youtube.com/watch?v=3vGV4fF4KCM (minute 34:40), it is contrasted Web technologies like JavaScript with "True Native". And even we disregard the word "Scripted", which is the performance profile for a JS application? If you use Canvas, it will be hardware accelerated today, if you load code, most of it will be set as dead-code, and it will run like in 5x of the speed of fully optimized code, which if it is run like 1 time, to sum all items in a column is really instant.

Code Refractor is using some decisions found in other opensource projects like Vala or ObjectiveC (to use smart-pointers instead of GC) and from Java (escape analysis and Class Hierarchy Analysis) or even from GCC (pure function annotation) and Asm.JS (use a subset of code and optimize this properly, then add another feature, and optimize this new one properly) because sometimes performance matter and is done by design. The importance (in long term, as right now is just a very limited pre-alpha) of CR is in my view the approach: the "native" step is basically spending long times optimizing upfront.

What CR can bring is not always performance, but more capabilities: using Emscripted (a C++ to JS) or Duetto you can start from C#. As for myself I hope it will be used at least to migrate some Mono applications like Pinta or Tomboy and not use something like GNote (a C++ translation by hand of Tomboy) where Mono is not desired (Fedora Linux anyone!?).

Code Refractor - Virtual Machines/Compiler performance musings