Friday, May 15, 2015

Calcium - a Mirah like language for .Net

Hi readers, not sure if anyone is following my GitHub page, but I did fix some of bugs with Calcium language. What is Calcium? Calcium is a Mirah like language (which itself is a Ruby like language) for .Net platform. If you write your code in Ruby using mostly IronRuby conventions (and as much as the minimal features are working), you should get at the end a C# file without any other overhead (excluding the .Net one). For now a simple program is supported, the Mandelbrot fractal generator but the more types/fixes are included. The slowest part of the fractal generator is in fact writing to console.

Want to have quick a C# program that writes to screen and is compiled with Ruby syntax? This mini-compiler could help you.

A code like this one does what you would expeect: writes 10 times "Hello from Calcium" then it counts the time in milliseconds that was required to do so:

def run
   print "Hello from Calcium"
end

start = Environment.TickCount
i = 0
while i < 10
  run
  i += 1
end

endTime = (Environment.TickCount - start) / 1000.0
print "Time: "
puts endTime

The generated C# is the following:

using System;
using System.Collections.Generic;
using Cal.Runtime;
public class _Global {

static public void Main ()
{
Int32 start;
Int32 i;
Double endTime;
start = Environment . TickCount;
i = 0;
while(i<10)
{
run();
i += 1;
}
endTime = (Environment . TickCount-start)/1000.0;
Console.Write("Time: ");;
Console.WriteLine(endTime);;
}
static public void run ()
{
Console.Write("Hello from Calcium");;
}
}

As you can see it could be a time saver, and if it will be extended enough, it can replace in future some cases where you used IronRuby and you quit because it felt to slow. I plan to fix and extend this transpiler to make it functional enough to support very common cases.

If you are interested, please take a look and try to extend or report as minimal bug reports as possible.

Monday, April 27, 2015

Can RyuJIT beat Java -server?

The very short answer is always: depends.

RyuJIT is the new compiler architecture to compile .Net. It is supposedly branched out of „x86” branch of .Net and is modernized. And there are benchmarks and the startup performance got better, but did the throughput improved to beat Java?

The good part is that this month Microsoft opensourced many parts of the .Net stack as part of CoreCLR stack, and one of them is also the RyuJIT part of it. So we can look inside it. The code can be found here.
First of all, what RyuJIT seems to do is to give a fairly lightweight as high level optimizations which I think they are the minimal optimization set on Debug configuration:
- it builds SSA form (a form that improves precision of the compiler to remove aggressively data)
- it does fairly aggressive dead code elimination (based on liveness)
- it does Linear Scan Register Allocation

More optimizations can be enabled, which they mostly consist into common subexpression elimination, bounds check eliminations, a more aggressive dead code elimination (global liveness analysis).


Initially I was really surprised on how few optimizations look to be available inside RyuJIT, and looking a bit more into the code, some new gems appear, like it looks that there are in special a aggressive inlining and ”loop cloning” (which if I understood the code right, should make a loop to 1000 to be in fact split in 250 loops of 4 times repeating the iteration). This optimization I think is also important as RyuJIT supports SIMD intrinsics, so it can make a CPU specialized code.

Of course these optimizations all help and if you profile and tweak code, your code will be good enough, but still, it can beat Java -server?

At this stage, the answer is clearly no. In fact, if you write your code using Firefox's JIT for JavaScript, this optimizer has more advanced optimizations exposed, like Global Variable Numbering (GVN) and a better register allocator. I would not be surprised if you write "use asm" and this code to run much faster on Firefox JIT.

There are two items why RyuJIT should not run faster than Java and they are:
- it doesn't have many and more complex high level optimizations (I didn't even find Loop-invariant-code-motion, an optimization that even CodeRefractor has). Of course adding them will slow down the JIT time
- as RyuJIT will likely inline small functions/properties and duplicate parts of loops, will increase CPU registers (especially on x86) pressure and the LSRA allocator gives a fairly good performance, but is 10-20% slower than the full register allocator (used by Java server, still is the same with the warmup Java client register allocator)

Where RyuJIT can work faster is to allocate on stack faster than Java does, but eventually the code will get into tight loops and this code will run slower than Java by around 20%, if you don't make the mistake of making a hidden allocation on Java side. Also Dictionary<K,T> in .Net is much CPU cache friendly so if you run big dictionaries and you don't use Java optimized dictionaries like Google's Guava but the default JDK libraries, you will also run slower (even 10x slower), but why not use Guava, you will also have slowdowns for the wrong reasons.

At last, there is an area that even Java can generate 20% faster code, that you don't allocate memory in your tight loop, at last Java can still run slower, and this is when you call native libraries. This is not Java's JIT fault, but is simply that the .Net's mapping to "metal" is much clearer, including in-memory layout, automatic 1:1 marshaling (that is done just with one big memory copy of an array of structures for example) which is simply done better.

One last note about JIT and SIMD: Java doesn't have intrinsics because it does automatically rewrite your code to use SIMD and use proper instructions automatically. This in my mind is the best way to do stuff, so Java can run times faster just because a loop is vectorized, but certainly you have to write your loop SIMD-friendly. This is very similar with autovectorization promised in Visual Studio 2012.

Friday, April 17, 2015

Rant (and offtopic) AMD 16 way Zen core?

No way! This piece of news on Fudzilla is silly at best.

At least not with all combined. And this is not because I dream a conspiracy or I don't like AMD. In fact my last hardware I bought was AMD (yet a GPU, but it was only because I didn't need a CPU for a long time)

Let's clarify why, there is no area in itself on CPU even with AMD dense libraries that is used for GPU to fit all. There are CPUs with very good packing of transistors and have a very similar specification with this new future CPU, it is using even a worse litography (22 nm compared with the 14 nm FinFET for AMD case) and this is a Xeon CPU.

But the worse part in my mind is that even the specifications are in the reachability of the smaller transistors the following parts give to me doubts that there will be in 2016 (even in December, the launch date) a full 16 core CPU:
- AMD has no experience with 16 core, their previous designs were 8x2 cores designs, not to say that they are not impressive, but maybe the tradition of late and underwhelming delivery of AMD (likely because it lost some of key engineers when the company shrunk) makes me skeptical that they have a good working prototype already (as Zen is expected to be launched in 1 year from now, it requires at least some prototypes to be made with some time before, AMD Carizzo for instance had good CPUs sampling around 6 months ago and is still not launched)
- 14 nm FinFET is not necessarily that good compared with Intel's 14 nm, because some parts of the interconnect are using a bigger process.
- the design is an APU and in general CPU cache and APUs do require a lot of CPU real estate. You cannot scale infinitely an CPU to all directions because the heat for instance can break it really fast

At last, but not at least, is: who cares? The benchmarks and real numbers in applications are what matter. AMD Bulldozer is a great CPU, especially if you count the core count, and the initial real life delivery was bad, really bad. When Intel Haswell CPUs were launched, you can assume realistically that 2 AMD cores (or one "module") of AMD runs basically in the same as 1 Intel Core.

Even here on the blog, a desktop CPU (AMD 6 core - or 3 cores with 2 modules - read into comments) can run maybe a bit worse with 1 core and probably it will run very similar with a  dual core laptop with similar generations (i5M first gen vs AMD 6150 BE desktop),

I am sure that no AMD engineer is here, but what it looks to me is that the best architectures AMD have are probably Puma/Jaguar based (which themselves I think are based to a simplified Phenom cores) which run inside tablets/slow laptops/consoles. They don't have HSA, but they do run damn well. If there would be a low-power cluster of 2 x 8 cores Pumas, I would likely want a APU like this: it is starved on memory front, but all than this, many algorithms that are CPU intensive are CPU cache friendly, so the CPU will fine on those, and the non-CPU intensives maybe will run fine just because there are many cores to split the workload.

Good luck AMD!

Monday, February 9, 2015

Reviving CR

There is some interest into CR, and this is mostly regarding improving and making a stable compiler for embedded. More for this will follow but is good to know that if you will take the latest Git implementation some harder to find bugs were addressed:
- (major) strings do marshal automatically for PInvokes to wchar_t*. This basically means that if you map methods from Dlls/libSO and they use strings on .Net side, they will call (correctly) the underlying library, and it works also as it should (and Mono or .Net does it)
- (minor) Ldflda instruction is working correctly (this is used often when you use structs)

- (minor) Escape Analysis will work reliable with virtual methods' return value. It made the code to fail otherwise for some trivial programs
- (medium) Bugs in devirtualization and remove methods optimizations were also fixed
- (major) try/finally blocks do work now: CR does not support exceptions and will likely never will, but the code of "happy path" will work. This also makes that code using IDisposable to work - also known as "using block".

Feel free to take the code and work on it and of course, if you have any fixes, help us by making them upstream.

CR has a new home also, please redirect your links to here:
http://coderefractor.ciplogic.com/index.php/blog/

Monday, December 29, 2014

CR is fully stalled, new things to come in 2015

As New Year resolutions appear I want to state publicly my interests and my willing to work into my free/OSS projects.

I had working for two projects in the last year: CodeRefractor and at the end of the year to a more interesting (details will follow later) Java game.

This other project is JHeroes 2 a port of FHeroes 2 to Java. FHeroes 2 is a C++ project on top of SDL library and the main advantage of using Java for JHeroes 2 was to make it run in high resolutions laptops with weak CPUs (like mine :) ). For this reason I had as advantage Java's JavaFX technology, which is similar with WPF from .Net world.

I succeed to load a level, graphics from original Heroes 2 game (by porting FHeroes 2 core code to Java) but when it comes to displaying, the frames are visibily higher (FHeroes 2 in 1920x1080P has like 1-2 FPS on Windows, when JHeroes 2 has in my specific testing around 20 FPS) but the CPU is to high! I mean it uses more than 100% of one CPU even I don't do any separate background thread. Profiling I found that Java does create a separate thread in JavaFx and all commands go to this thread and I think this thread creates OpenGL commands which are executed back on the main thread. The bad part is that CPU usage is really high, and there is no way to put sleeps or anything to reduce this CPU usage. Or at least not obvious way to do it.

So I did what every other developer would do if it doesn't want to spend another 1 year of his free time (to get here I spent my 2 months of free time) to see that it doesn't work as intended at the end, I did test similar technologies using a VM and the single two technologies that do seem to work well enough (on my machine) are:
- OpenGL using OpenTK
- Mono + Cairo on Linux (on Windows it runs at around the 10-20 FPS, when on Linux I get between 30-60 FPS but also I have the possibility to put sleeps as all code runs on the main thread)

So, based on this results as I don't want to rewrite all my code to use OpenGL and from Java to rewrite it into C# (maybe in future I will have this will, but for now not so much), I also suspended my work on JHeroes 2.

Also I want to thank for all contributors to CR and for JetBrains offering a free license of Resharper for supporting OpenSource.

So as New Year resolution, I hope to do something else, maybe I will look into small items to improve on CR (very unlikely) or I will want to continue a Mirah like language implementation for .Net (named Calcium, but more on this I will write later).

But what can I say about 2014 is that I found some reassuring things which make work on CR to not be necessary:
- Microsoft did opensource a lot of its stack
- it has a "native compiler" which will be always better supported than CR will ever be
- Mono matured also a lot in the last year. As for me it still doesn't have full optimizations (CR still have in my understanding some optimizations that neither .Net or Mono has: especially Escape Analysis, "pure function evaluation", and this like that) but using LLVM backend, especially on Linux is really a great platform

Saturday, December 6, 2014

How to speedup your cold application startup?

This entry is only about how to do practical solutions to improve startup especially on virtual machines code.

One of the heaviest things that start with your application is the virtual machine itself. Your VM will extract JAR or Assemblies and will scan them for integrity, will compile the hot parts and so on. So the easiest thing to do is either to: put a splash as fast as possible and let in the background time to download all these files. Also, you can use NGen and in future Java's Jigsaw or Excelsior JET or for very courageous to use CodeRefractor.

But this entry is not about this still. Let's say your application starts (at least most of the code is in memory), but users still notice lag because you process XMLs or you have some UI, and processing all the log is really expensive, or sending data over the wire is expensive. So, in short; your application seems not to be responsive.

So right now I want to stress what to do to make your application start faster.

I. "Index and merge everything"

I. 1. Merge XML data

In my experience you have very often data that repeats itself. Especially if you serialize a lot of items (like a database table that you send through the wire over a web service, but is also true on disk), many items repeat.

Even for cases when you unlikely see a line and repeated patterns, data repeats itself, for example in this file: http://upload.wikimedia.org/wikipedia/commons/f/fd/Ghostscript_Tiger.svg

This is a section of the SVG code (around the big green eye):

<g id="g242" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
<g id="g246" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
(...)
<g id="g262" fill="#e5e5b2"></g>
(...)
<g id="g290" fill="#FFF"></g>
<g id="g294" fill="#CCC"></g>
<g id="g298" fill="#000">
    <path id="path300" d="m94.401,0.6s-29.001,12.2-39.001,11.8c0,0-16.4-4.6-24.8,10,0,…0,0,4-11.2,4-13.2s21.2-7.6,22.801-8c1.6-0.4,8.2-4.6,7.8-7.8z"></path>
</g>
<g id="g302" fill="#99cc32">
    <path id="path304" d="m47,36.514c-6.872,0-15.245-3.865-15.245-10.114,0-6.248,8.373….446,5.065,12.446,11.313,0,6.249-5.572,11.314-12.446,11.314z"></path>
</g>

You can see that a lot of "g" tags, a lot of "id" flags, and a lot of "fill" elements and attributes are all over the XML. If you make your tree data-structure that keeps a list of strings (and a dictionary from these strings to the index of data), will make a lot of data smaller.

So instead of a:
class XmlNode { Name, InnerText: String;
Attributes: Dictionary<string, string>();
Children: List<XmlNode>();
}
Define:
class Texts { Data : List<string>();
QuickIndex: Dictionary<string, int>();
}
class XmlNode {
  Name, InnerText: Integer;
  Attributes: Dictionary<int, int> (); //it can be just an array of integers, which every 2 items, means key/value
  Children: List<XmlNode>();
}

And to get any text, you will do soemthing like:
string name = texts.Data[xmlNode.Name];

It sounds probably silly at first and overly verbose (of course it can be wrapped with a lot of design patters, here is to show the principle at work!).

Someome may think that: "it doesn't matter, "g" as a string is just 1 character long so we waste 2 bytes because a string is Unicode", but in reality the memory usage for small strings is much much bigger. A string contains the length (4 bytes), the hash (+ 4 bytes), and some implementations add an extra null finishing character. So minimum string of 1 character length is in fact around 12 bytes. Even more, the object ovherhead is if I recall right of 16 bytes. In Java the memory is padded at 8 bytes, so even if you have 12 + 16 = 28 bytes, in fact you will use 32 bytes. If you use an index, for every duplicate you trade an integer (4 bytes) for 32 bytes.
Look for EXI format and OpenEXI implementation.

I. 2. Merge tree like data

This is in a way a similar construction, and you can use the same tricks using a tree of data, but this can be useful not so much on startup but minimizing the memory usage later,

The solution is to use Reflection and to use a shared dictionary of strings, and every time you find a string that is already used, to replace with the previous one. This patch improves hugely a parse tree of NRefactory, but the code can be changed minimally for any other data structures: https://github.com/icsharpcode/NRefactory/issues/448
(NB: the patch was not accepted upstream, but the idea remains true).

I.3. Merge pixels using palettes

20 years ago the pictures were using a palette of 256 colors (sometimes just 16). This by default merges a lot of data, in fact just if you have more than 20x20 pixels (which is virtually every picture you may use) you will save bytes of memory and disk.

II. Pack everything


Packing of data is know for taking a long long time and is used in special for big data, but in fact, when did you do the last measurement?

As I've wrote some months ago, I'm trying to improve my Java skills porting a C++ application to Java (a FHeroes 2 OSS clone).

Heroes 2 has two files which are basically 2 big indexed files, one named "Heroes2.agg" and "heroes2x.agg" where there are included all the assets. The original "Heroes2.agg" is like 40 MB on disk, and after I compressed using Gzip (which is included in almost any virtual machine, and I found it included also with Qt), it reduces to 20 MB.

So, how much does it take to read from the SSD the unpacked file, and how much would it take to read the index, and unpack all indivudual items which are packed?

On my machine (an fairly old i5M first gen) will take 140 ms for reading a 40 MB file (so, at least in this "benchmark" the disk would read theoretically 285 MB/s, but using the packed file and unpacking all the items (the logic of the game allows to unpack them only when needed, so it may be faster, but I wanted to compare as much as possible apples with apples) is 100 ms. This is almost a 30% time reduction to load all the assets that may be needed at least in this old game. Or counting it differently, it reads the original data at a speed of 399 MB/s.

III. Combine them

When you give an application to your customers or in general when you have your own applications that people want to think about them that they are snappy, make sure you optimize startup by increasing the CPU load (by calculating what data is common) and read fast this merged data and a big chunk of indexes to the original data. If you have a big data set, use any compression at save time and at startup do decompression.

The same principle happens to be the same for 3D objects (where most of 3D coordinates are shared: vertex, normals, texture coordinates, etc.).

In fact the more you merge the data and use binary data you will be surprised how fast will your application go. Mostly because even on SSD (but it is even worse on "traditional" hard drives) the disk is the slowest component by far, so optimizing it, at expense of CPU is the best way to wait less when you run your code.

IV. Conclusions

These solutions are practical but they should be used only when you measure at least before and after to see if there is a speedup. Maybe before optimizing your image loading algorithm, maybe you should not load the picture at all! But if you load it and you don't care in absolute terms if your picture uses just 256 colors.

It is also important to not forget about hashing if you have a huge repository of assets. This is a different talk altogether, but sometimes not the loading of data is expensive but the disk seek of it. If this is the case, try to split your data using hashing and dictionaries. In simpler terms, if you never created hashing do just this: create 3 level directories based on the first three letters. So: "asbest.pic" should be under folder: a/s/b/asbest.pi This can reduce greately the search overhead of your application, as the OS will have to search much less if a file/folder really exists.

Saturday, November 29, 2014

Why Static Code Analysis is important... and why we don't care

Static code analysis as of today is crucial as the software today is much bigger than ever. It is even more important as you have many people considering better code based on their previous experience, but they don't agree what means better.

So, based on this, is very likely that a static code analysis tool will help you a lot to see these errors, and these errors are found early, sometimes in a live session (like: Resharper tool - also written as R# - a very great extension of C# from JetBrains).

But what is the main issue of companies not using them (or not at a big scale)? I found very often (in my usage of R#, but I think is similar for many other companies and tools) more systemic causes:
I - (some) developers don't like to be nagged with a lot of errors/warnings when they want to finish the task fast. Having variables not named consistent or the namespace doesn't follow the location on disk is less of a concern. Even more, some libraries are for example with deprecated fields, and tools like R# will warn you that the code is broken (or at least using obsolete fields) which is a distraction given that users will not change libraries which are "legacy"
- these tools offer solutions which sometimes are at best strange or even harder to be understood for the user who doesn't understand/accept the technology. For example, Linq syntax in C# is weird for some users who don't know it, and even applying best coding practices, it makes your code less prone with side-effects errors, but the developers need to know Linq
- there is a cost upfront you have to make that these tools do not complain (much), but some companies focus on: let's implement this feature, then let's implement the other feature, but when you say: "I want to clean the code based on a tool" is a very hard sell. Is more likely a question of: "is this code working better/faster after this?" And answering honest (like: "no, it will make the code just more readable and less likely to have bugs in feature") will not change the decision.
- they are fairly costly - price is always relative, but if you take a license for every developer, things get expensive really fast. And there is no visibile advantage. Sometimes the tool is an easier sell if you say: "we need a better refactoring tool", and management may understand it (as R# is a very good refactoring tool), but otherwise is hard sell for management.
- small codebases very unlikely need these tools, the bigger the codebase is, the likely you will need these kinds of checks. The reason is that in small codebases you can put all the tool does in memory, and even you can forget a null pointer, this will be found early. If you need a tool that for example processes a batch of pictures and makes the a document, you may need very little refactors, but most likely is enough to use just default Visual Studio's facilities.

As for me, when I was a single freelance developer and I was paid (mostly) per finished feature, a license of R# or Visual Studio was very justified, as full navigation of Resharper and not wanting to look for all weird bugs in my code, was a understandable price I have to justify (for myself).

But later, when I worked in bigger companies, I did not see people using R#. The closest to R# is NRefactory, which brings a similar (but much buggier) experience and with many many limitations. So if you don't mind to work with a much buggier experience of Xamarin Studio or SharpDevelop, please go for it, at least to know what is this static code analysis about!

What I suggest to you anyway is to use a static code analysis, not at least because is easier to let a tool to look if you forget a null pointer check, than to you to do it, or even worse the user to notice it on their machine.