Sunday, July 19, 2015

Using .Net for Developing Games, a 2015 review

Before talking about game development, a disclaimer: I'm not a game developer, even though I do have some (working) experience with older versions of OpenGL and DirectX and hands-on experience with C++ and C#. I have also kept track of current technologies (as much as time allows).

First of all, let's clarify the terms: there are obviously games you can write and run in C#; I'm thinking of most board games like Chess or Go, strategy games, and similar. You can do even more than these, and to the best of my knowledge the best-known game written in C# is Magicka. But people will scoff and say: this game doesn't use Havok (the physics engine), and if a C# game did use it, they would say: but Havok is not written in C#, it is written in C++.

Given this, I want to make as fair a review as possible of the .Net platform as a game-development tool.

Here are some really great pluses:
+ C#'s peak performance (after the application starts up) is adequate, typically around 70-90% of equivalent C++ code, and the match is even better on 64-bit .Net, especially if you avoid strings like the plague and work mostly with arrays and integer/double types
+ C# allows the hottest code to be written in C++, and it also lets you skip bounds checking with "unsafe" code. So if a specific loop needs to be autovectorized and you notice that the C++ compiler does it but the C# one does not (and you don't want to use Mono.SIMD to write your own matrix-multiply code), that code can still be very highly optimized (see the sketch after this list)
+ the call overhead of PInvoke is adequate, as .Net "natively" maps COM calls and C calls, meaning that whether you use DirectX or OpenGL, you are covered
+ complex game logic can be written more easily in C# than in C++, especially as some C++ game engines use Lua as a scripting backend; writing that logic in C# should sometimes give real speedups
+ you can use struct types to reduce how often garbage collection happens
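
As a minimal sketch of the "unsafe" point above (the method name is illustrative, and it assumes the project is compiled with /unsafe), summing an array through a fixed pointer avoids the per-element bounds check:

// Minimal sketch: pointer access instead of indexed array access.
// Assumes compilation with /unsafe; the name is illustrative only.
static unsafe double SumUnsafe(double[] values)
{
    double sum = 0;
    fixed (double* p = values)
    {
        for (var i = 0; i < values.Length; i++)
            sum += p[i]; // raw pointer read, no per-access bounds check
    }
    return sum;
}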

Here are really bad minuses:
- coding recklessly will create a lot of garbage in memory, putting pressure on the GC. A collection can sometimes take seconds (for huge, multi-GB heaps), which is unacceptable even in a board game
- allocation is by default on the heap, meaning that if you create a List<T>, it will in fact always create 2 objects on the heap: the List<T> itself and the internal array which stores the actual data. This is really bad because when you add items to a List<T>, the internal array is "resized", which in the .Net (or Mono or CodeRefractor) implementations means that a new array is allocated, producing even more GC pressure. In C++, by default, objects are allocated on the stack with no hidden costs. If you use std::vector<T>, the internal array is on the heap, but the vector itself is on the stack.
- Linq can create a lot of objects without you noticing, especially when you use ".ToArray()" or ".ToList()", or in a statement that wants to return a pair of values.
This code:
var playerAndLifes = players.Select(player => new Tuple<Player, int>(player, player.Life)).ToArray();
It looks really innocent, but in fact Tuple is a class, so it is allocated on the heap, and ToArray also grows its buffer in powers of two up to the length of your "players" collection. So for 1300 players there will be around 8 reallocations, around 9 for 2600 players, and so on.
For the previous code, define a struct STuple in your codebase and use it instead. Also, if you know the size of players in advance, do not forget to read the "Improve performance for your selects in Linq" article below.
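A minimal sketch of such a value-type tuple (the STuple name and its layout are just one possibility):

// A struct tuple: no per-element heap allocation for the pair itself.
public struct STuple<T1, T2>
{
    public T1 Item1;
    public T2 Item2;

    public STuple(T1 item1, T2 item2)
    {
        Item1 = item1;
        Item2 = item2;
    }
}

// Usage for the example above:
var playerAndLifes = players.Select(player => new STuple<Player, int>(player, player.Life)).ToArray();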
- objects in .Net are big, so if you only keep a single byte or integer index (even one with more complex associated logic), consider using struct or enum types. The reason objects in .Net are big is that the object header carries extra information, including the type id and some space used for locking on the object. A class which stores 1 integer takes 12 bytes on 32-bit .Net but 24 bytes on a 64-bit machine, so every single object allocation wastes an extra 8 or 20 bytes. In C++, if you don't use virtual calls, the per-object overhead is zero (though it can be bigger if the memory allocator is not efficient); for classes with virtual methods, the overhead is typically the size of a pointer (4 bytes on 32-bit machines and 8 bytes on 64-bit machines).
- texts are UTF16, which is very often a good thing, but when you want high(er) performance it means that writing them to disk occupies twice the space. Even worse, they increase memory usage and again put pressure on the GC. Try to work with UTF8-encoded strings internally and do interning (meaning: merge identical strings across your application), so at least when a GC happens it has less work to do
- even if it is not strictly an issue of .Net itself: an easy way to support Save/Load inside games is to use a serializer that stores or restores your entities on disk. The default DataContract serializer and even BinaryFormatter are slow. Use protobuf-net (Protocol Buffers), as it is a very easy-to-use library for this part and it can run many times faster (see the sketch after this list). Similarly, try not to use xml/json or the like for levels where many entities of any kind are expected
- the JIT (just-in-time) compiler sometimes makes things ugly! The JIT time is typically very small, but it happens every time a new method is hit. If you have big methods and/or a lot of logic, you may see frame skips, especially under the "tyranny" of 16.6 ms per frame. Keeping methods small and removing duplicate code means that when the player gets a new item or sees a new enemy with new game logic, which .Net then has to compile, the work is done faster. But the even better way is simply to NGen your application.
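
As a minimal protobuf-net sketch (the type and field names are illustrative), saving and loading a game state looks roughly like this:

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class SaveGame
{
    [ProtoMember(1)] public int Level;
    [ProtoMember(2)] public List<int> PlayerLifes;
}

public static class SaveSystem
{
    public static void Save(string path, SaveGame state)
    {
        // Serializes the whole object graph to a compact binary format.
        using (var file = File.Create(path))
            Serializer.Serialize(file, state);
    }

    public static SaveGame Load(string path)
    {
        using (var file = File.OpenRead(path))
            return Serializer.Deserialize<SaveGame>(file);
    }
}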

What is weird to me is that the biggest factor in responsive games is not the quality of the compiled code itself (which .Net has had right since around 2009, with .Net 3.5 SP1, I would say), but the hidden overhead(s) of GC. You can get burned many times, and the ugly part of GC is that you don't know when it will hit you; even worse, you may not know which code creates classes (like System.Tuple or Linq's ToArray/ToList).

To wrap up, it looks to me that GC is the biggest reason users see freezes, and as .Net improved its generated code (with initiatives like RyuJIT or CoreCLR), what remains is mostly to work with structs and to use an efficient serializer. Such code can very often be improved by other means, typically by forcing a full GC at points where the user is already waiting. After the game loads a full level into memory, the developer can force a full GC; after a round is finished and "Victory" is shown, another full GC can be forced. This style of coding is fine, but of course, if the game was expected to have a round end in 10 minutes and it finished in 40 minutes, and the user gets, let's say, a full 3-second GC in the middle of minute 35, it will ruin the experience.
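
Forcing a full collection at such a "safe" point is just a few lines (a minimal sketch):

// Call this right after a level load or on the "Victory" screen,
// while the player is already waiting and a pause goes unnoticed.
static void ForceFullCollection()
{
    GC.Collect();                   // collect all generations
    GC.WaitForPendingFinalizers();  // let pending finalizers run
    GC.Collect();                   // reclaim objects freed by those finalizers
}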

Monday, July 6, 2015

Resharper 9 - a blast


Disclaimer: I've received an opensource license from JetBrains of Resharper for the second time. Thank you JetBrains!

I've sometimes been fairly critical of R# (Resharper), as it is somewhat inaccessible for some users, yet at the same time I've kept using it. Here I want to explain why code analysis in general, and day-to-day coding in particular, benefits so much from a Resharper-like tool today.

So first of all, some criticism of Resharper, and especially of R# 9 as I received it:
- I had an outdated R# 8 (it expired somewhere around October), and upgrading to 9.0 (which itself happened to be out of date, because I hadn't used R# for some time) made R# report a lot of errors in code that were not there. Clearing the caches fixed all of these phantom errors, but it was really strange (Google pointed me directly to the right place)
- Resharper doesn't enable Solution Wide Analysis by default. Maybe that is desirable for low-end machines, or for very big projects, but at least for medium projects it is a boon. I am sure that for big solutions (I'm thinking of programs like SharpDevelop or bigger) Resharper may be slow to update the analysis (which in itself is a fair point), but missing by default the information that R# provides (like compilation errors you may have) strikes me as a big loss

Ok, so small bugs and not-so-great defaults. But in the context of the CodeRefractor project it was a great feature set, because it made big rewrites possible, and right now the project is undergoing its third rewrite. Every rewrite was justified for various reasons:
- the first and (for me) very important one was that the internal representation was reshaped very close to SSA form (or at least to the LinearIL from the Mono project). A subsequent, almost full rewrite made the project use an index over these instructions, so optimizations not only do their job well, they do it fast
- the second rewrite allowed a much more refined way to find all methods (including virtual methods), so many more programs run now (try it, it will do wonders)
- the third rewrite (currently in progress), whose details I will not go into now

Features I found to work great:
- creating a property is automatic and fast, with good defaults:
myValue.Width = 30;
//R# will suggest creating Width as an automatic property of type int
- creating an empty class automatically, taking constraints into account:
BaseClass a = new MyNotDefinedClass();
//R# will suggest creating MyNotDefinedClass derived from BaseClass and will also stub out the required members
- the Solution Wide Analysis, which takes into account whether your code compiles. This feature is awesome because you can combine it with two others: "Code cleanup" (which, for example, removes a lot of redundancies and nicely reformats the whole code base) and "Find Code Issues"
- an R# 9.0 feature: code completion filters with various criteria (like "properties only" or "extension methods only")
- unused parameters and the refactoring to remove them globally, which is a huge saver of developer time

So in short, I have to say that if you are starting with Resharper from scratch, or you want to use C# productively, I warmly recommend it. Also, as the first thing after you open your solution, don't forget to enable Solution Wide Analysis (there is a gray circle at the bottom-right: double click it and click "OK" in the dialog that appears).

Also, please note that I tried to be as unbiased as I can, so I didn't cover things that I'm sure are invaluable for other projects, like the MVC3 or Xaml features (CR's usage of Xaml is very limited); here is only what I used (and enjoyed!), but other features may be closer to your heart.

Improve performance for your selects in Linq

A thing I learned inside CodeRefractor is how loops work in .Net. One thing I learned fairly quickly is that the fastest loop is by far the one over arrays. This is also documented by Microsoft.

In short, especially using .Net on 64 bit, you will see high-performance code over arrays, so I strongly recommend that if you have data you read out of often (for example with Linq), you use the ToArray() function.

So let's say you need the Ids out of your "tradeData" variable.
The code may look like this:
return tradeData.Select(it => it.Id).ToArray();
What's wrong with this code? Let's say the "tradeData" variable can have 1,000,000 items, and tradeData itself can be an array or a List<T>. When you profile, you see that the iteration takes little time, but most of the time goes into the 16-18 allocations inside ToArray(), the reason being that ToArray keeps an internal array which is resized several times.


So it should be possible to write a "SelectToArray" method that will have much lower overhead:
public static class UtilsLinq
{
    public static TResult[] SelectToArray<TValue, TResult>(this IList<TValue> items, Func<TValue, TResult> func)
    {
        var count = items.Count;
        var result = new TResult[count];
        for (var i = 0; i < result.Length; i++)
        {
            result[i] = func(items[i]);
        }
        return result;
    }
}

Because T[] implements IList<T>, this code works for both arrays and List<T>. It runs as fast as possible and there are no hidden allocations.

And your code becomes:
return tradeData.SelectToArray(it => it.Id);

A strong recommendation for fast(er) code: when you use Select or SelectToArray, NEVER allocate "class" objects inside it, only struct objects. If you want to keep a result with multiple data fields, create "struct" types which encapsulate them, as in the sketch below.
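For example (a minimal sketch; the TradeView type is illustrative and it assumes the trade items also expose a Price):

// Illustrative struct result: the array holds the values inline,
// so the projection allocates nothing beyond the result array itself.
public struct TradeView
{
    public int Id;
    public double Price;
}

var views = tradeData.SelectToArray(it => new TradeView { Id = it.Id, Price = it.Price });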

How fast is it? It is fairly fast.

For this code:
var sz = 10000000;
var randData = new int[sz];
var random = new Random();
for (var i = 0; i < sz; i++)
{
    randData[i] = random.Next(1, 10);
}

var sw = Stopwatch.StartNew();
for (int t = 0; t < 5; t++)
{
    var arr = randData.SelectToArray(i => (double)i);
}
var time1 = sw.ElapsedMilliseconds;

sw.Restart();
for (int t = 0; t < 5; t++)
{
    var arr = randData.Select(i => (double)i).ToArray();
}
var time2 = sw.ElapsedMilliseconds;
You have:
time1 = 798 ms vs time2 = 1357 ms (Debug configuration)
time1 = 574 ms vs time2 = 1003 ms (Release configuration)

Not sure about you, but to me this is significant, and it is crucial if you have multiple Linq/Select statements and you also want the resulting items to be fast to iterate. Similarly, you would see a bigger speedup without the cast to double, but I wanted to show more realistic code where the Linq lambda does something light (as typically happens when an indexer or a field access is involved).

NB. This test is artificial; use these results at your own risk.
Later, I found there is a method, Array.ConvertAll, which has very similar internals to this extension method (the limitation is that it doesn't work with non-array implementations, but if this is not a big inconvenience for you, it is better to use the BCL method).

public static TResult[] SelectToArray<TValue, TResult>(this TValue[] items, Func<TValue, TResult> func)
{
    return Array.ConvertAll(items, it => func(it));
}

With the method changed to this, it is even a bit faster, because the iteration over the items variable is a bit faster this time.

Friday, May 15, 2015

Calcium - a Mirah like language for .Net

Hi readers, not sure if anyone is following my GitHub page, but I did fix some bugs in the Calcium language. What is Calcium? Calcium is a Mirah-like language (Mirah itself being a Ruby-like language) for the .Net platform. If you write your code in Ruby using mostly IronRuby conventions (and stay within the minimal features that work), you should get at the end a C# file without any other overhead (excluding the .Net one). For now one simple program is supported, the Mandelbrot fractal generator, but more types/fixes are being added. The slowest part of the fractal generator is in fact writing to the console.

Want to quickly have a C# program that writes to the screen and is compiled from Ruby syntax? This mini-compiler could help you.

Code like the following does what you would expect: it writes "Hello from Calcium" 10 times, then it prints the time in seconds that was required to do so:

def run
   print "Hello from Calcium"
end

start = Environment.TickCount
i = 0
while i < 10
  run
  i += 1
end

endTime = (Environment.TickCount - start) / 1000.0
print "Time: "
puts endTime

The generated C# is the following:

using System;
using System.Collections.Generic;
using Cal.Runtime;
public class _Global {

static public void Main ()
{
Int32 start;
Int32 i;
Double endTime;
start = Environment . TickCount;
i = 0;
while(i<10)
{
run();
i += 1;
}
endTime = (Environment . TickCount-start)/1000.0;
Console.Write("Time: ");;
Console.WriteLine(endTime);;
}
static public void run ()
{
Console.Write("Hello from Calcium");;
}
}

As you can see, it could be a time saver, and if it is extended enough it could in the future replace some cases where you used IronRuby and gave up because it felt too slow. I plan to fix and extend this transpiler to make it functional enough to support the most common cases.

If you are interested, please take a look and try to extend it, or report bugs with reproductions as minimal as possible.

Monday, April 27, 2015

Can RyuJIT beat Java -server?

The very short answer is always: depends.

RyuJIT is the new compiler architecture for compiling .Net code. It is supposedly branched out of the "x86" branch of .Net and modernized. There are benchmarks showing that startup performance got better, but did throughput improve enough to beat Java?

The good part is that this month Microsoft open-sourced many parts of the .Net stack as the CoreCLR, and one of them is RyuJIT itself, so we can look inside it. The code can be found here.
First of all, RyuJIT seems to apply a fairly lightweight set of high-level optimizations, which I think is the minimal optimization set used in the Debug configuration:
- it builds SSA form (a form that improves the precision of the compiler's analyses, so it can remove data more aggressively)
- it does fairly aggressive dead-code elimination (based on liveness)
- it does Linear Scan Register Allocation (LSRA)

More optimizations can be enabled; they mostly consist of common subexpression elimination, bounds-check elimination, and a more aggressive dead-code elimination (global liveness analysis).


Initially I was really surprised by how few optimizations seem to be available inside RyuJIT, but looking a bit more into the code, some new gems appear: in particular there is aggressive inlining and "loop cloning" (which, if I understood the code right, turns a loop of 1000 iterations into 250 iterations of a body repeated 4 times). This matters because RyuJIT also supports SIMD intrinsics, so it can generate CPU-specialized code.
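As a minimal sketch of what SIMD-friendly C# looks like with the System.Numerics.Vectors types that RyuJIT can map onto SIMD registers (the method and array names are illustrative):

using System.Numerics;

static void AddArrays(float[] a, float[] b, float[] result)
{
    var width = Vector<float>.Count; // number of lanes RyuJIT can use on this CPU
    var i = 0;
    for (; i <= a.Length - width; i += width)
    {
        var va = new Vector<float>(a, i);
        var vb = new Vector<float>(b, i);
        (va + vb).CopyTo(result, i);  // one vector add per 'width' elements
    }
    for (; i < a.Length; i++)         // scalar tail for the remaining elements
        result[i] = a[i] + b[i];
}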

Of course all these optimizations help, and if you profile and tweak your code it will be good enough; but still, can it beat Java -server?

At this stage, the answer is clearly no. In fact, Firefox's JavaScript JIT exposes more advanced optimizations, like Global Value Numbering (GVN) and a better register allocator. I would not be surprised if code marked with "use asm" ran much faster on Firefox's JIT.

There are two reasons why RyuJIT should not run faster than Java:
- it doesn't have many of the more complex high-level optimizations (I didn't even find loop-invariant code motion, an optimization that even CodeRefractor has). Of course, adding them would slow down JIT time
- as RyuJIT will likely inline small functions/properties and duplicate parts of loops, it increases register pressure (especially on x86), and the LSRA allocator gives fairly good performance but is 10-20% slower than a full register allocator (used by Java -server; it is still on par with the warm-up Java client register allocator)

Where RyuJIT can be faster is stack allocation, which is cheaper than in Java, but eventually the code gets into tight loops and runs around 20% slower than Java, provided you don't make the mistake of introducing a hidden allocation on the Java side. Also, Dictionary<K,T> in .Net is much more CPU-cache friendly, so if you use big dictionaries with the default JDK libraries instead of optimized Java dictionaries like Google's Guava, you will also run slower (even 10x slower); but then, by not using Guava, you are getting slowdowns for the wrong reasons.

Finally, there is an area where, even if Java generated 20% faster code and you never allocated memory in your tight loops, Java can still end up slower: calling native libraries. This is not the fault of Java's JIT; it is simply that .Net's mapping to the "metal" is much cleaner, including the in-memory layout and automatic 1:1 marshaling (which for an array of structures, for example, can be done with one big memory copy); it is simply done better. A sketch of what this looks like follows below.
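A minimal sketch of that 1:1 marshaling (the native library and the function are hypothetical):

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Particle // blittable: same layout as the matching C struct on the native side
{
    public float X, Y, Z;
    public float Velocity;
}

static class NativePhysics
{
    // Hypothetical native entry point; a blittable struct array is passed
    // as a single block of memory, with no per-element conversion needed.
    [DllImport("physics_native")]
    public static extern void UpdateParticles(Particle[] particles, int count);
}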

One last note about JITs and SIMD: Java doesn't have SIMD intrinsics because it automatically rewrites your code to use SIMD and picks the proper instructions by itself. In my mind this is the best way to do it, so Java can run several times faster just because a loop was vectorized, but you certainly have to write your loop in a SIMD-friendly way. This is very similar to the autovectorization promised in Visual Studio 2012.

Friday, April 17, 2015

Rant (and offtopic) AMD 16 way Zen core?

No way! This piece of news on Fudzilla is silly at best.

At least not with everything combined. And this is not because I imagine a conspiracy or dislike AMD; in fact the last hardware I bought was AMD (a GPU, admittedly, but only because I hadn't needed a new CPU for a long time).

Let's clarify why: there is simply not enough die area, even with AMD's dense libraries, to fit all of this next to a GPU. There is a CPU with very good transistor packing and a very similar specification to this future chip, built on an even worse lithography (22 nm, compared with 14 nm FinFET in AMD's case), and that is a Xeon CPU.

But the worse part in my mind is that even if the specifications are within reach of the smaller transistors, the following points give me doubts that there will be a full 16-core CPU in 2016 (even in December, the supposed launch date):
- AMD has no experience with 16 cores; their previous designs were 8x2-core designs, which is not to say they are not impressive, but the tradition of late and underwhelming delivery from AMD (likely because it lost some key engineers when the company shrunk) makes me skeptical that they already have a good working prototype (as Zen is expected to launch about a year from now, prototypes need to exist well in advance; AMD Carrizo, for instance, had good CPU samples around 6 months ago and is still not launched)
- 14 nm FinFET is not necessarily as good as Intel's 14 nm, because some parts of the interconnect use a bigger process
- the design is an APU, and in general CPU caches and APUs require a lot of die real estate. You cannot scale a CPU infinitely in all directions, because the heat, for instance, can break it really fast

Last but not least: who cares? Benchmarks and real numbers in applications are what matter. AMD Bulldozer is a great CPU, especially if you count cores, but its initial real-life delivery was bad, really bad. When the Intel Haswell CPUs launched, you could realistically assume that 2 AMD cores (one AMD "module") performed roughly the same as 1 Intel core.

Even here on the blog, a desktop CPU (a 6-core AMD, that is 3 modules of 2 cores, read the comments) can run maybe a bit worse per core, and will probably run very similarly to a dual-core laptop of a similar generation (first-gen i5M vs the AMD 6150 BE desktop).

I am sure no AMD engineer is reading this, but it looks to me that the best architectures AMD has are probably Puma/Jaguar based (which I think are themselves based on simplified Phenom cores), and they run inside tablets, slow laptops and consoles. They don't have HSA, but they do run damn well. If there were a low-power cluster of 2 x 8 Puma cores, I would likely want an APU like that: it would be starved on the memory front, but other than that, many CPU-intensive algorithms are CPU-cache friendly, so those would run fine on it, and the non-CPU-intensive ones would perhaps run fine simply because there are many cores to split the workload across.

Good luck AMD!

Monday, February 9, 2015

Reviving CR

There is some interest in CR, mostly around improving it and making a stable compiler for embedded targets. More on this will follow, but it is good to know that if you take the latest Git code, some harder-to-find bugs have been addressed:
- (major) strings are now marshaled automatically to wchar_t* for PInvokes. This basically means that if you map methods from DLLs/libSOs and they use strings on the .Net side, they will call the underlying library correctly, working the way it should (and the way Mono or .Net does it); see the sketch after this list
- (minor) the Ldflda instruction works correctly (this is used often when you use structs)

- (minor) escape analysis now works reliably with virtual methods' return values; it previously made the code fail for some trivial programs
- (medium) bugs in the devirtualization and method-removal optimizations were also fixed
- (major) try/finally blocks work now: CR does not support exceptions and likely never will, but the code on the "happy path" will work. This also makes code using IDisposable work, also known as the "using" block.
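
A minimal sketch of the kind of mapping this string-marshaling fix covers (the native library and function names are hypothetical):

using System.Runtime.InteropServices;

static class NativeText
{
    // CharSet.Unicode makes the System.String argument arrive as a wchar_t*
    // on the native side; this is the marshaling CR now performs automatically.
    [DllImport("nativetext", CharSet = CharSet.Unicode)]
    public static extern int PrintLine(string text);
}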

Feel free to take the code and work on it, and of course, if you have any fixes, help us by sending them upstream.

CR has a new home also, please redirect your links to here:
http://coderefractor.ciplogic.com/index.php/blog/