Monday, July 6, 2015

Resharper 9 - a blast

Disclaimer: I've received an opensource license from JetBrains of Resharper for the second time. Thank you JetBrains!

I've been fairly critical sometimes with R# (Resharper) as is somewhat not accessible for some users, in the same time I've been using it. But I want to say why also code analysis in general and coding in particular is crucial with using today with a Resharper like tool.

So first of all, I want to make some criticism of Resharper and especially R# 9 as I've received:
- I've had a not updated R# 8 (it expired somewhere around October) and upgrading to 9.0 (which happen to be out of date because I didn't use R# for some time) made R# to report a lot of errors in code which were not there. Clearing the caches did fix all the known errors I had. But it was really strange (Google pointed me directly to the right place)
- Resharper doesn't default to use Solution Wide analysis. Maybe for low end machines is to be desired, or for very big projects, but as it is, at least for medium projects is a boon. I am sure that for big solutions (I'm thinking here programs like SharpDevelop or bigger) maybe Resharper runs slow to update the analysis (which in itself is a fair point) but the missing of the information that R# provides (like compilation errors you may have) by default, I found it as a big miss

Ok, so small bugs and not so great defaults. But in context of CodeRefractor's project it was so great feature because it made possible to make possible to big rewrites and right now it undergones the third rewrite. Every rewrite was justifiable for various reasons:
- the first and (as for me) very important one was that the internal representation was shaped very close to SSA form (or at least to LinearIL from Mono project). A subsequent almost as a full rewrite made the project to use an index of these instructions so optimizations will not do their job well, but they do it fast
- the second rewrite allowed a much refined way to find all methods (like virtual methods) so many more programs do run now (try it, it will do wonders)
- the third rewrite (that is currently going) that I will not write the details now

What I found great working features:
- creating property is automatic and fast with good defaults:
myValue.Width = 30;
//R# will suggest to create Width as an automatic property of int type
- creating automatic empty class taking into account of constrains:
BaseClass a = new MyNotDefinedClass();
//R# will suggest to create MyNotDefinedClass as BaseClass and will also implement some required data
- the Solution Wide analysis which takes into account when your code compiles. This feature is so awesome because you can combine it with two features: "Code cleanup" (which removes for example a lot of redundancies and reformats nicely the whole code) and "Find Code Issues".
- a R# 9.0 feature: code completion filters with various criteria (like: "properties only" or "extension methods only").
- unused parameters and the refactor to remove them globally is really a huge time saver of developer time

So in short, I have to say that if you start with Resharper from scratch, or you do want to use productively C#, I warmly recommend it to you. Also, don't forget the first thing after you open your solution to enable by default the Solution wide analysis (you have a "gray circle" on bottom-right: double click on it and click "OK" to the dialog it appears").

Also, please note that I tried to be as unbiased as I can, so I didn't point things that I'm sure that are invaluable for other projects like MVC3 or Xaml features (CR usage of Xaml is very limited), so here is only what I used (and enjoyed!) but some features may be for you closer to heart .

Improve performance for your selects in Linq

A think I learned inside CodeRefractor is how loops do work inside .Net. One thing I learned fairly quick is that the fastest loop is by far on arrays. It is documented also by Microsoft.

In short, especially using .Net on 64 bit, you will see high performance code over arrays so I strongly recommend if you have data that you read it often out of it (for example for using Linq), you should use ToArray() function.

So let's say you need out of your "tradeData" variable your names out of it.
The code may look like this:
return tradeData.Select(it => it.Id).ToArray();
What's wrong with this code? Let's say "tradeData" variable can have 1.000.000 items and tradeData can be itself an array or a List<T> and when you profile, you can see that iteration takes little time, but most of the time you will see like 16-18 allocations inside of ToArray(), the reason being that ToArray itself keeps an internal array which is resized for more times.

So it should be possible to write a "SelectToArray" method that will have much lower overhead:
     public static class UtilsLinq
        public static TResult[] SelectToArray<TValue, TResult>(this IList<TValue> items, Func<TValue, TResult> func)
            var count = items.Count;
            var result = new TResult[count];
            for (var i = 0; i < result.Length; i++)
                result[i] = func(items[i]);
            return result;

As T[] implements IList<T> makes this code to work for both arrays and List<T>. This code will run as fast as possible and there are no hidden allocations.

And you code becomes:
return tradeData.SelectToArray(it => it.Id);

Strong recommendation for fast(er) code: when you use Select or SelectToArray do NEVER allocate inside it "class" objects but struct objects. If you want to keep a result with multiple data fields, create "struct" types which incapsulate them.

How fast is it? It it fairly fast.

For this code:
            var sz = 10000000;
            var randData = new int[sz];
            var random = new Random();
            for(var i = 0; i<sz; i++)
                randData[i] = random.Next(1, 10);
            var sw = Stopwatch.StartNew();
            for(int t = 0; t<5;t++){
                var arr = randData.SelectToArray(i => (double)i);
            var time1 = sw.ElapsedMilliseconds;
            for(int t = 0; t<5;t++){
                var arr = randData.Select(i => (double)i).ToArray();
            var time2 = sw.ElapsedMilliseconds;
You have
 time1 = 798 ms vs time2 = 1357 (Debug configuration)
 time1 =  574 ms vs time2 = 1003 (Release configuration)

Not sure about you, but this is significant and also it is crucial of you have multiple Linq/Select statements and you want also the resulting items to be fast iterable. Similarly, you will have bigger speedup if you don't do the cast to double, but I wanted to show a more realistic code where the Linq it is doing something light (like typically happens as sometimes there is an indexer involved, or a field access).

NB/ This test is artificial, and use these results at your own risk.

Friday, May 15, 2015

Calcium - a Mirah like language for .Net

Hi readers, not sure if anyone is following my GitHub page, but I did fix some of bugs with Calcium language. What is Calcium? Calcium is a Mirah like language (which itself is a Ruby like language) for .Net platform. If you write your code in Ruby using mostly IronRuby conventions (and as much as the minimal features are working), you should get at the end a C# file without any other overhead (excluding the .Net one). For now a simple program is supported, the Mandelbrot fractal generator but the more types/fixes are included. The slowest part of the fractal generator is in fact writing to console.

Want to have quick a C# program that writes to screen and is compiled with Ruby syntax? This mini-compiler could help you.

A code like this one does what you would expeect: writes 10 times "Hello from Calcium" then it counts the time in milliseconds that was required to do so:

def run
   print "Hello from Calcium"

start = Environment.TickCount
i = 0
while i < 10
  i += 1

endTime = (Environment.TickCount - start) / 1000.0
print "Time: "
puts endTime

The generated C# is the following:

using System;
using System.Collections.Generic;
using Cal.Runtime;
public class _Global {

static public void Main ()
Int32 start;
Int32 i;
Double endTime;
start = Environment . TickCount;
i = 0;
i += 1;
endTime = (Environment . TickCount-start)/1000.0;
Console.Write("Time: ");;
static public void run ()
Console.Write("Hello from Calcium");;

As you can see it could be a time saver, and if it will be extended enough, it can replace in future some cases where you used IronRuby and you quit because it felt to slow. I plan to fix and extend this transpiler to make it functional enough to support very common cases.

If you are interested, please take a look and try to extend or report as minimal bug reports as possible.

Monday, April 27, 2015

Can RyuJIT beat Java -server?

The very short answer is always: depends.

RyuJIT is the new compiler architecture to compile .Net. It is supposedly branched out of „x86” branch of .Net and is modernized. And there are benchmarks and the startup performance got better, but did the throughput improved to beat Java?

The good part is that this month Microsoft opensourced many parts of the .Net stack as part of CoreCLR stack, and one of them is also the RyuJIT part of it. So we can look inside it. The code can be found here.
First of all, what RyuJIT seems to do is to give a fairly lightweight as high level optimizations which I think they are the minimal optimization set on Debug configuration:
- it builds SSA form (a form that improves precision of the compiler to remove aggressively data)
- it does fairly aggressive dead code elimination (based on liveness)
- it does Linear Scan Register Allocation

More optimizations can be enabled, which they mostly consist into common subexpression elimination, bounds check eliminations, a more aggressive dead code elimination (global liveness analysis).

Initially I was really surprised on how few optimizations look to be available inside RyuJIT, and looking a bit more into the code, some new gems appear, like it looks that there are in special a aggressive inlining and ”loop cloning” (which if I understood the code right, should make a loop to 1000 to be in fact split in 250 loops of 4 times repeating the iteration). This optimization I think is also important as RyuJIT supports SIMD intrinsics, so it can make a CPU specialized code.

Of course these optimizations all help and if you profile and tweak code, your code will be good enough, but still, it can beat Java -server?

At this stage, the answer is clearly no. In fact, if you write your code using Firefox's JIT for JavaScript, this optimizer has more advanced optimizations exposed, like Global Variable Numbering (GVN) and a better register allocator. I would not be surprised if you write "use asm" and this code to run much faster on Firefox JIT.

There are two items why RyuJIT should not run faster than Java and they are:
- it doesn't have many and more complex high level optimizations (I didn't even find Loop-invariant-code-motion, an optimization that even CodeRefractor has). Of course adding them will slow down the JIT time
- as RyuJIT will likely inline small functions/properties and duplicate parts of loops, will increase CPU registers (especially on x86) pressure and the LSRA allocator gives a fairly good performance, but is 10-20% slower than the full register allocator (used by Java server, still is the same with the warmup Java client register allocator)

Where RyuJIT can work faster is to allocate on stack faster than Java does, but eventually the code will get into tight loops and this code will run slower than Java by around 20%, if you don't make the mistake of making a hidden allocation on Java side. Also Dictionary<K,T> in .Net is much CPU cache friendly so if you run big dictionaries and you don't use Java optimized dictionaries like Google's Guava but the default JDK libraries, you will also run slower (even 10x slower), but why not use Guava, you will also have slowdowns for the wrong reasons.

At last, there is an area that even Java can generate 20% faster code, that you don't allocate memory in your tight loop, at last Java can still run slower, and this is when you call native libraries. This is not Java's JIT fault, but is simply that the .Net's mapping to "metal" is much clearer, including in-memory layout, automatic 1:1 marshaling (that is done just with one big memory copy of an array of structures for example) which is simply done better.

One last note about JIT and SIMD: Java doesn't have intrinsics because it does automatically rewrite your code to use SIMD and use proper instructions automatically. This in my mind is the best way to do stuff, so Java can run times faster just because a loop is vectorized, but certainly you have to write your loop SIMD-friendly. This is very similar with autovectorization promised in Visual Studio 2012.

Friday, April 17, 2015

Rant (and offtopic) AMD 16 way Zen core?

No way! This piece of news on Fudzilla is silly at best.

At least not with all combined. And this is not because I dream a conspiracy or I don't like AMD. In fact my last hardware I bought was AMD (yet a GPU, but it was only because I didn't need a CPU for a long time)

Let's clarify why, there is no area in itself on CPU even with AMD dense libraries that is used for GPU to fit all. There are CPUs with very good packing of transistors and have a very similar specification with this new future CPU, it is using even a worse litography (22 nm compared with the 14 nm FinFET for AMD case) and this is a Xeon CPU.

But the worse part in my mind is that even the specifications are in the reachability of the smaller transistors the following parts give to me doubts that there will be in 2016 (even in December, the launch date) a full 16 core CPU:
- AMD has no experience with 16 core, their previous designs were 8x2 cores designs, not to say that they are not impressive, but maybe the tradition of late and underwhelming delivery of AMD (likely because it lost some of key engineers when the company shrunk) makes me skeptical that they have a good working prototype already (as Zen is expected to be launched in 1 year from now, it requires at least some prototypes to be made with some time before, AMD Carizzo for instance had good CPUs sampling around 6 months ago and is still not launched)
- 14 nm FinFET is not necessarily that good compared with Intel's 14 nm, because some parts of the interconnect are using a bigger process.
- the design is an APU and in general CPU cache and APUs do require a lot of CPU real estate. You cannot scale infinitely an CPU to all directions because the heat for instance can break it really fast

At last, but not at least, is: who cares? The benchmarks and real numbers in applications are what matter. AMD Bulldozer is a great CPU, especially if you count the core count, and the initial real life delivery was bad, really bad. When Intel Haswell CPUs were launched, you can assume realistically that 2 AMD cores (or one "module") of AMD runs basically in the same as 1 Intel Core.

Even here on the blog, a desktop CPU (AMD 6 core - or 3 cores with 2 modules - read into comments) can run maybe a bit worse with 1 core and probably it will run very similar with a  dual core laptop with similar generations (i5M first gen vs AMD 6150 BE desktop),

I am sure that no AMD engineer is here, but what it looks to me is that the best architectures AMD have are probably Puma/Jaguar based (which themselves I think are based to a simplified Phenom cores) which run inside tablets/slow laptops/consoles. They don't have HSA, but they do run damn well. If there would be a low-power cluster of 2 x 8 cores Pumas, I would likely want a APU like this: it is starved on memory front, but all than this, many algorithms that are CPU intensive are CPU cache friendly, so the CPU will fine on those, and the non-CPU intensives maybe will run fine just because there are many cores to split the workload.

Good luck AMD!

Monday, February 9, 2015

Reviving CR

There is some interest into CR, and this is mostly regarding improving and making a stable compiler for embedded. More for this will follow but is good to know that if you will take the latest Git implementation some harder to find bugs were addressed:
- (major) strings do marshal automatically for PInvokes to wchar_t*. This basically means that if you map methods from Dlls/libSO and they use strings on .Net side, they will call (correctly) the underlying library, and it works also as it should (and Mono or .Net does it)
- (minor) Ldflda instruction is working correctly (this is used often when you use structs)

- (minor) Escape Analysis will work reliable with virtual methods' return value. It made the code to fail otherwise for some trivial programs
- (medium) Bugs in devirtualization and remove methods optimizations were also fixed
- (major) try/finally blocks do work now: CR does not support exceptions and will likely never will, but the code of "happy path" will work. This also makes that code using IDisposable to work - also known as "using block".

Feel free to take the code and work on it and of course, if you have any fixes, help us by making them upstream.

CR has a new home also, please redirect your links to here:

Monday, December 29, 2014

CR is fully stalled, new things to come in 2015

As New Year resolutions appear I want to state publicly my interests and my willing to work into my free/OSS projects.

I had working for two projects in the last year: CodeRefractor and at the end of the year to a more interesting (details will follow later) Java game.

This other project is JHeroes 2 a port of FHeroes 2 to Java. FHeroes 2 is a C++ project on top of SDL library and the main advantage of using Java for JHeroes 2 was to make it run in high resolutions laptops with weak CPUs (like mine :) ). For this reason I had as advantage Java's JavaFX technology, which is similar with WPF from .Net world.

I succeed to load a level, graphics from original Heroes 2 game (by porting FHeroes 2 core code to Java) but when it comes to displaying, the frames are visibily higher (FHeroes 2 in 1920x1080P has like 1-2 FPS on Windows, when JHeroes 2 has in my specific testing around 20 FPS) but the CPU is to high! I mean it uses more than 100% of one CPU even I don't do any separate background thread. Profiling I found that Java does create a separate thread in JavaFx and all commands go to this thread and I think this thread creates OpenGL commands which are executed back on the main thread. The bad part is that CPU usage is really high, and there is no way to put sleeps or anything to reduce this CPU usage. Or at least not obvious way to do it.

So I did what every other developer would do if it doesn't want to spend another 1 year of his free time (to get here I spent my 2 months of free time) to see that it doesn't work as intended at the end, I did test similar technologies using a VM and the single two technologies that do seem to work well enough (on my machine) are:
- OpenGL using OpenTK
- Mono + Cairo on Linux (on Windows it runs at around the 10-20 FPS, when on Linux I get between 30-60 FPS but also I have the possibility to put sleeps as all code runs on the main thread)

So, based on this results as I don't want to rewrite all my code to use OpenGL and from Java to rewrite it into C# (maybe in future I will have this will, but for now not so much), I also suspended my work on JHeroes 2.

Also I want to thank for all contributors to CR and for JetBrains offering a free license of Resharper for supporting OpenSource.

So as New Year resolution, I hope to do something else, maybe I will look into small items to improve on CR (very unlikely) or I will want to continue a Mirah like language implementation for .Net (named Calcium, but more on this I will write later).

But what can I say about 2014 is that I found some reassuring things which make work on CR to not be necessary:
- Microsoft did opensource a lot of its stack
- it has a "native compiler" which will be always better supported than CR will ever be
- Mono matured also a lot in the last year. As for me it still doesn't have full optimizations (CR still have in my understanding some optimizations that neither .Net or Mono has: especially Escape Analysis, "pure function evaluation", and this like that) but using LLVM backend, especially on Linux is really a great platform