Sunday, December 20, 2015

Fx2C - small tweaks

The Fx2C transpiler got a bit smarter, and in a meaningful way. If you plan to mix Kotlin with FXML (a combination which I think neither Oracle nor JetBrains planned to support) you will see that there is no straightforward way to support @FXML fields.

Do not worry: if you have a Kotlin controller class, you can write something like this inside the FXML file:
<!--   Flags: IsKotlin-->

And the Kotlin code will work seamlessly:
package desktool.views.kotlinView

import javafx.event.ActionEvent
import javafx.fxml.FXML
import javafx.scene.control.Button
import javafx.scene.control.Label

class LineToolControler {
    var thickness = 1

    var label: Label? = null

    var button: Button? = null

    fun handleButtonAction(event: ActionEvent) {
        println("You clicked me!")
        label?.setText("Hello World!")
    }
}

This is not to say that it is better to use Kotlin right now; the idea is that in the future the code may be extended to support various other languages with JavaFX (Scala is another language which comes to mind).

Another important addition is that the compiler now generates a preloader class (Fx2CPreloader) which can load all the dialog classes into memory, so on the second run the JVM will start the dialogs in a fraction of a second. In my unscientific testing on a slow Atom (Baytrail) machine, first-time loading of a medium-sized dialog's classes into the JVM could take something like 600 ms, but with preloading the time drops to 2-5 ms.

So, is it important to use this preloader? I think yes: especially if you have many UI elements, running this preloader behind a splash screen will warm up the JVM and make your dialogs appear (close to) instantly on the second run.
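
The post does not show how Fx2CPreloader works internally, so here is only the general JVM warm-up idea, sketched with my own hypothetical class names (the class list would be the generated dialog classes in a real JavaFX app; JDK classes are used here so the sketch is runnable anywhere):

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of class preloading: force the JVM to load (and
// optionally initialize) classes ahead of time, e.g. behind a splash
// screen, so the first dialog shown does not pay the class-loading cost.
public class WarmUpSketch {
    // In a JavaFX app this list would name the generated dialog classes;
    // JDK classes are used here so the sketch runs anywhere.
    static final List<String> CLASSES_TO_PRELOAD = Arrays.asList(
            "java.util.ArrayList",
            "java.util.HashMap",
            "java.text.SimpleDateFormat");

    public static int preload() {
        int loaded = 0;
        for (String name : CLASSES_TO_PRELOAD) {
            try {
                // 'true' also runs static initializers, like first real use would
                Class.forName(name, true, WarmUpSketch.class.getClassLoader());
                loaded++;
            } catch (ClassNotFoundException e) {
                System.err.println("Could not preload " + name);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println("Preloaded " + preload() + " classes");
    }
}
```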

Fx2C is now mature enough for me to put it into maintenance mode: it works well enough for my taste and usage, and I will likely use it to make more JavaFX applications that simply feel responsive, something I missed coming from the .NET WPF environment.

Wednesday, December 16, 2015

Vampire Logic - my team's entries

I participated in a zombie-themed competition (hackathon), creating 3 apps using the Microsoft Universal Apps APIs. My ad-hoc team (Vampire Logic) did really well there.

The source code for these 3 working apps (in the Productivity/Games/Educational categories) was created in 18 hours combined (around 6 hours per app, give or take).

Productivity: an image editor with camera support, Modern UI, live preview and infinite Undo levels:

Game: zombie ship fights in a top-down shooter with an animated background. It makes efficient use of a Canvas control and fluid animations using only standard Universal Apps code.

Educational: an interactive math game where harder and harder math problems are given in a limited time. How long will you survive?

The coding practices may be spotty at times, but considering that each application was written in only about 6 hours (and this was our first ever Universal App coding experience), all applications had no known bugs and no shortcuts in how the code was written (no crashes, no errors hidden behind big try-catches, or similar).

Coding language: C#

Team members: Dāvis Sparinskis, myself, Linda Legzdiņa, Rudolf Petrov

Some photos with my team:

Friday, December 11, 2015

Finding non-duplicates in an array (using Java)

I had a job interview (and I will do a lot more Java in my day-to-day coding, yay!), and one part of it was about finding non-duplicated values in an array. There is a textbook solution (which I will not divulge) with O(n) complexity (meaning that, at worst, it scales linearly with the size of the array), but can it run faster?

Complexity-wise, faster than O(n) is basically impossible, because you have to traverse the array at least once. But as we've seen in an older post, if you can get the constant down, you get a faster algorithm. So, what about searching the data naively, keeping the counts in a plain int array?

This is not my best benchmark (and I didn't try all combinations), but up to around 100 elements, a native int[] (int array) structure runs faster than the O(n) solution.

The code that searches for the first non-repeating value (returning Integer.MIN_VALUE if every value is duplicated) is the following:

    public static int GetFirstNonRepeatingIntArray(int[] values) {
        if (values == null) {
            throw new NullPointerException("values");
        }
        // pairs of (value, count), stored flat in one array
        int[] counters = new int[values.length * 2];
        int counterSize = 0;
        for (int index = 0; index < values.length; index++) {
            int toFind = values[index];
            int foundIndexValue = getIndexOf(counters, counterSize, toFind);
            if (foundIndexValue == -1) {
                counters[counterSize * 2] = toFind;
                counters[counterSize * 2 + 1] = 1;
                counterSize++;
            } else {
                counters[foundIndexValue * 2 + 1]++;
            }
        }
        for (int index = 0; index < counterSize; index++) {
            if (counters[index * 2 + 1] == 1) {
                return counters[index * 2];
            }
        }
        return Integer.MIN_VALUE;
    }

    public static int getIndexOf(int[] counters, int counterSize, int toFind) {
        for (int foundIndex = 0; foundIndex < counterSize; foundIndex++) {
            if (counters[foundIndex * 2] == toFind) {
                return foundIndex;
            }
        }
        return -1;
    }

For example, for 10,000 repetitions of the algorithm with arrays of 50 items (randomly generated in the range 1 to 25), it gives the following output:
Total execution time: 100ms
Total IntArray execution time: 31ms
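
The post deliberately does not show the O(n) solution it benchmarks against, so for completeness here is my own reconstruction of the well-known hash-based approach it presumably refers to (an assumption, not the interview code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NonRepeating {
    // O(n) on average: one pass to count occurrences, then one pass over the
    // insertion-ordered map to find the first value seen exactly once.
    public static int getFirstNonRepeatingHashed(int[] values) {
        if (values == null) {
            throw new NullPointerException("values");
        }
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == 1) {
                return entry.getKey();
            }
        }
        return Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(getFirstNonRepeatingHashed(new int[]{2, 5, 2, 7, 5})); // prints 7
    }
}
```

For small arrays the flat int[] version wins on constants (no boxing, no hashing, cache-friendly memory access), which is exactly the point of the benchmark above.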

Does this kind of coding have any implication for your day-to-day work? I would say yes: when you work with typical "Settings" kinds of classes, you are better off using arrays/lists than dictionaries, even if it is counter-intuitive. It is very unlikely you will have hundreds of settings, and both the debugging experience and the performance may be better. Memory usage (especially compared with Java's HashMap implementation) is also much better.
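
As an illustration of the settings idea (my own hypothetical example, not code from the post): a handful of keys in two parallel arrays, scanned linearly, instead of a HashMap:

```java
// Hypothetical settings store: for a few dozen entries, a linear scan over
// parallel arrays avoids hashing and boxing, and is trivial to inspect in
// a debugger (the point made above).
public class ArraySettings {
    private final String[] keys;
    private final String[] values;

    public ArraySettings(String[] keys, String[] values) {
        this.keys = keys;
        this.values = values;
    }

    public String get(String key, String defaultValue) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        ArraySettings settings = new ArraySettings(
                new String[]{"theme", "fontSize"},
                new String[]{"dark", "12"});
        System.out.println(settings.get("theme", "light")); // prints dark
    }
}
```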

Monday, November 30, 2015

Fx2C - Bug Fixes

JavaFX seems to be kind of a black sheep for some Java developers, but I don't feel so negative about it. To some extent I understand the idea: many applications migrate to an online environment where the server side basically exposes a JavaScript/HTML5/CSS UI, but at the same time I would expect there to be a desire to use offline applications at least some of the time.

And here I see that JavaFX can shine. A few things still need to be done by the Java runtime in my opinion, but once deployed, Java is a great offline runtime which runs on the 3 major OSes, and this is not a small deal. You can literally write one UI once and, with no recompilation, it will run on all (desktop) platforms; if you consider JavaFXPorts, you can add Android to the list.

So, based on this, I fixed some bugs in the handling of JavaFX types and made the tool use Java only (removing the Kotlin language) so the code is easier to fix for non-Kotlin developers.

Code is here:

Monday, November 23, 2015

New useful Kotlin tool of the day: JavaFx 2 FXML Compiler

Do you, by any chance, write Java? Do you use a Swing desktop client and avoid JavaFX because it is too slow, or do you use JavaFX and it "feels" slow even for very simple dialogs?

There is a tool now with a demo:

If you want to contribute to it, please install IntelliJ IDEA (Community is just fine), clone the repository: and after this load the project.

Legal notice: use it at your own risk. I'm not responsible for your problems. But in all seriousness, the code is simple to maintain.

Thursday, November 19, 2015

Tricks of the trade for quick sorting in .Net

I rarely do huge dataset sorting, but I want to share some tricks for sorting huge datasets. I am surprised that very few people learned these things, even at university.

The reason is that as students you are taught to think that CPUs are ideal, and you learn big-O notation. Sorting is in most cases O(N*log N), meaning that if a list of 1,000,000 items takes, let's say, 1 second, then for 4,000,000 items you will see a fairly small growth, to something like 4.1 seconds.

So how to improve the sorting speed? Reduce the constants.

The point of big-O notation is that it captures the "scale" of growth of the running time, but it does not take the individual constants into account.

So, let's say you have 3,000,000 items to sort inside a List<String>. This is hypothetical but not that hypothetical, in the sense that there are huge lists of items you may need to sort in a few milliseconds, and increasing the item count shows much more clearly where the speedups are.

Let's say you add those items to a list, and you use List.Sort(). I generated the items semi-randomly (most of the ASCII characters are random). On 64-bit .NET (4.6) on an oldish Intel i5-2500K, it runs in a fairly short time: 13067 ms, or about 13 seconds.

Can you speed it up more? The simplest trick is to compare the first char separately, and only if the first chars are equal fall back to a full string comparison. This runs in 10189 ms. Another small improvement is to sort over arrays. This is a fairly small speedup, but it is still quicker (9940 ms).

The comparing sort key class looks like the following:
     struct CompareStr : IComparable<CompareStr>
     {
        public char FirstChar;
        public string Text;

        public CompareStr(string text)
        {
            if (string.IsNullOrEmpty(text))
                FirstChar = '\0';
            else
                FirstChar = text[0];
            Text = text;
        }

        public int CompareTo(CompareStr other)
        {
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
     }

And the sort routine for those texts is:

            var combinedKeys = texts2.Select(text => new CompareStr(text)).ToArray();
            Array.Sort(combinedKeys);
            var resultList = combinedKeys.Select(combined => combined.Text).ToList();

But can we do better? I think so: let's change FirstChar to pack the first two chars into a 32-bit unsigned int (char itself is roughly equivalent to UInt16). The time again improves greatly (6220 ms), which is less than half of the original:

     struct CompareStr2 : IComparable<CompareStr2>
     {
        public uint FirstChar;
        public string Text;

        public CompareStr2(string text)
        {
            if (text.Length < 2)
                FirstChar = 0;
            else
                FirstChar = (uint)((text[0] << 16) + text[1]);
            Text = text;
        }

        public int CompareTo(CompareStr2 other)
        {
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
     }
And the sorting routine is very similar to the first one:
            var combinedKeys = new List<CompareStr2>(texts2.Count);
            combinedKeys.AddRange(texts2.Select(text => new CompareStr2(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts2.Count);
            resultList.AddRange(items.Select(combined => combined.Text));

Can it be written to be even faster?
    struct CompareStrDouble : IComparable<CompareStrDouble>
    {
        double _firstChar;
        public string Text;

        static double ComputeKey(string text)
        {
            var basePow = 1.0;
            var powItem = 1.0 / (1 << 16);
            var result = 0.0;
            foreach (char ch in text)
            {
                result += basePow * ch;
                basePow *= powItem;
            }
            return result;
        }

        public CompareStrDouble(string text)
        {
            _firstChar = ComputeKey(text);
            Text = text;
        }

        public int CompareTo(CompareStrDouble other)
        {
            if (_firstChar != other._firstChar)
                return _firstChar.CompareTo(other._firstChar);
            return Text.CompareTo(other.Text);
        }
    }
For reference, this is the sorting code:
         static List<string> SortSpecialDouble(List<string> texts)
         {
            var combinedKeys = new List<CompareStrDouble>(texts.Count);
            combinedKeys.AddRange(texts.Select(text => new CompareStrDouble(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts.Count);
            resultList.AddRange(items.Select(combined => combined.Text));
            return resultList;
         }

This sorting key is really, really fast: 2292 ms, which is over 5 times quicker than the original List.Sort (for strings).
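
To see what the double key actually preserves, here is the ComputeKey math transcribed to Java (Java for consistency with the blog's other examples; the precision claims below are my own analysis, not from the post): roughly the first three UTF-16 chars fit into a double's 52-bit mantissa, and anything far past that is rounded away, which is why the full string comparison is still needed as a tie-break.

```java
// The double-key trick transcribed to Java: each char is weighted by
// (2^-16)^position, so the key orders strings by their first few chars;
// chars beyond the mantissa's precision are rounded away and the keys
// collide, leaving the tie to the full string comparison.
public class DoubleKey {
    static double computeKey(String text) {
        double basePow = 1.0;
        double powItem = 1.0 / (1 << 16);
        double result = 0.0;
        for (int i = 0; i < text.length(); i++) {
            result += basePow * text.charAt(i);
            basePow *= powItem;
        }
        return result;
    }

    public static void main(String[] args) {
        // Differences in the first ~3 chars survive in the key:
        System.out.println(computeKey("abc") < computeKey("abd")); // prints true
        // Differences far past the mantissa do not - the keys collide,
        // so the full comparison must break the tie:
        System.out.println(computeKey("abcdeX") == computeKey("abcdeY")); // prints true
    }
}
```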

Some things to consider:
- Sorting huge datasets may reveal a design flaw in your application: you should always filter data before sorting it. The use-case may still matter when you sort local data, though, and this sorting-key approach can also be used with Linq's OrderBy and friends. If your data comes from a database, the database can sort it for you, so sometimes there is no point in sorting it yourself.
- This approach assumes you sort only when you need to: I am considering cases where you have a grid with many thousands of items and you sort by clicking a column header. If you have 300,000 items (a huge grid, BTW), sorting with List.Sort takes around 1030 ms vs around 200 ms for the fastest key used here.
- Lastly, this algorithm is not culture-aware, but it is damn quick! This is a tradeoff to consider, because it may break sorting for some people's languages. For instance, in Romanian (my native language) S sorts near the Ș char, but using this algorithm Ș will sort after Z. The same happens with Russian and other alphabets. So if you care about this kind of sorting, make sure you can afford slightly worse correctness for non-English words.
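
The same tradeoff exists on the JVM (Java here, to match the blog's other examples; this is my illustration, not the post's C# code): plain code-point comparison versus a locale-aware Collator.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class OrdinalVsCollator {
    public static void main(String[] args) {
        String[] words = { "Z", "S", "Ș" }; // Ș is U+0218, S with comma below

        // Ordinal (code-point) order, the same tradeoff as the fast keys above:
        String[] ordinal = words.clone();
        Arrays.sort(ordinal);
        System.out.println(Arrays.toString(ordinal)); // prints [S, Z, Ș]

        // Locale-aware order, slower but linguistically correct for Romanian,
        // which should place Ș right after S:
        String[] cultural = words.clone();
        Arrays.sort(cultural, Collator.getInstance(new Locale("ro")));
        System.out.println(Arrays.toString(cultural));
    }
}
```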

Running on Mono (I was really curious, weren't you?) gave really horrendous times (for only 30,000 items):
.Net times:
List.Sort: Time: 51
Sort by first char: Time: 65
Sort by first two chars: Time: 33
Sort by double key: Time: 30

Mono time:
List.Sort: Time: 411
Sort by first char: Time: 329
Sort by first two chars: Time: 106
Sort by double key: Time: 39

The fastest implementation was consistently faster, but things got worse very quickly. For 300,000 items:
.Net times:
List.Sort: Time: 1073
Sort by first char: Time: 709
Sort by first two chars: Time: 420
Sort by double key: Time: 193

Mono times:
List.Sort: Time: 5825
Sort by first char: Time: 4917
Sort by first two chars: Time: 2268
Sort by double key: Time: 409

And for 3,000,000 items the times are really huge:
.Net times:
List.Sort: Time: 12939
Sort by first char: Time: 9851
Sort by first two chars: Time: 6143
Sort by double key: Time: 2259

Mono times:
List.Sort: Time: 81496
Sort by first char: Time: 70736
Sort by first two chars: Time: 38233
Sort by double key: Time: 5886

So, using the last algorithm presented here, Mono at least seems to be around 10 times faster with this primitive comparison between strings. I assume that string comparison under Mono is not well optimized (comparing two strings in Mono looks to be around 6-8x slower than in .NET), so the tricks used in this algorithm can give your users a huge speedup. If you sort by string comparison on a Xamarin-supported platform, you may really want to use techniques similar to the ones described here.

Saturday, November 14, 2015

PC Hardware is Boring...

I am in fact amazed by the power of the typical computer: you can buy quad-core laptops that can run Crysis. Yes, you spend some money out of your pocket, but you can run Crysis. Even so, the hardware remains boring. You have the same display, the same OS, the same software, mostly centered around browsing, some video playback, some Office, or image processing.

Based on this, I want to recommend that your next computer not be an Intel i7 or the AMD Zen equivalent. Not because they are not fast enough, but because they are boring. Really, the computer you bought 5 years ago could do almost anything you do now, except maybe play Crysis. In 2009 I had a quad-core PC with 4 GB of RAM, a 10,000 RPM drive (no SSD, though) and a good dedicated video card, priced around 1000 EUR (which should be around 1000 USD in the US). And if I dared, I could use it as a development machine even now.

Today I have a laptop with similar specifications in the same price range, a bit more powerful and consuming around 1/5 of the wattage, but other than that, it is basically the same hardware. Sure, the mobility of a laptop is desirable, but I think you can see the point: you can buy hardware with basically the same specs, and unless you play high-end games, you are really throwing your money out the window. And no, Civilization V is not a high-end game, and neither is Dota 2 (unless you play it professionally).

So, can fun be found in a hardware landscape that is geeky enough but doesn't require buying an overly pricey device? For now I found two kinds of devices, which I bought myself (I will point to similar products, to avoid direct advertisements): the 150-200 USD NUC PC and sub-200 USD (Android) tablets.

As tablets are not really the topic of this post, I'll just say they are interesting, especially because tablets make it easy to test your software, and you can also write programs and push them onto the device. Even if you don't use them for anything else, they will show you notifications (a friend sending a YouTube clip) and you can view it properly. Where I live now (in a Baltic country, where prices are a bit higher than in the rest of the EU), I could find a 10-inch Atom-based tablet which can play 720p video, has a quad core, is really more than responsive, easy to program and, in short, decent.

I see the NUC PC as the most compelling of all: look at a device like this. You add 2 or 4 GB of DDR3L and a hard drive, and you put Linux on it. It starts decently fast, it runs a browser, it is programmable; I could run all the software I was curious about. I tried (just for fun) running Windows 10 (with an 8 GB module, though) and it ran not that far behind my quad-core laptop. I could see some lag, for example when navigating, but nothing aggravating.

What is so compelling about these devices:
- with a zero-cost OS (I recommend Linux Mint, Ubuntu, etc.) you get a very low-cost, fully legal machine that can do most of the things you would do with a computer anyway: email, YouTube, etc., and videos will play (on Linux, at least, in Full HD)
- if you are a developer who doesn't require Windows (you can use Mono if you want C#), you can test really everything you want. I am not sure about OpenCL (there is a library named POCL, but I don't know how stable it is), but if you want to test how code scales over 4 cores, you're right at home; if you want to build a small web server using any technology stack, you can do it
- if you care about simulating most users' computers, again, you are safe: most users do not have super high-end computers at home, so if your software targets these Atom-class CPU machines, it will in fact run well on a huge number of other machines. I say "Atom-class" because sometimes you can find AMD Kabini CPUs instead
- a less talked-about but important item: the full system, even under full load, draws much less power than a typical laptop. I estimate that, excluding the display, the machine under full load uses something like 15W, making it friendly for overnight processing or for being a server in its own right. I know we talk watts in a marketing way, but let's be pragmatic: if you leave an expensive computer running overnight as a web server in your organization, you have two risks: the power draw adds to your electricity bill, and a power surge can burn your pricier PC. Losing a 220 USD (estimated) PC is less risky than losing a PC more than 2x pricier.
- and lastly for me, this machine is powerful enough and compatible enough: you can run full Windows on it (not sure about XP, but definitely Vista, 7, 8 and 10) as well as Linux.

The single part which is a bit strange is that the raw CPU power of this 10W part is around what a 2009 dual-core CPU could do at 65W (if all 4 cores are used, and most software today supports all cores). This means that if you use Gimp (or Photoshop), given enough memory, it will finish in reasonable time (if you are not a professional video editor). And all this with a cool (both in temperature and in status) device!

Monday, November 2, 2015

Vulkan, DirectX 12 and the Low Level API craze

A bit of history

There has been a long-time competition between PCs and consoles over which gives the best visuals and experience in games. Typically the consoles had high-end specifications when they launched, but they aged fairly quickly because the PC market had bigger competition; still, consoles offered consistent and often higher frame rates. How was that possible? Mostly two factors: there was no fragmentation, so programmers could fully use the hardware without coding workarounds for a specific component that lacks hardware acceleration; and the hardware was open to developers (after signing an NDA) with lower-level access than the classical OpenGL/DirectX APIs.

Enter Mantle

Mantle was AMD's idea of offering this low-level access for their hardware, and they worked with a game developer (DICE) to make it more usable for "mere mortals". Mantle had a fairly small impact on games overall, but a big impact on the industry, with big (theoretical) potential. Later, Mantle was offered as the starting point for Vulkan, with Microsoft's DirectX 12 and Apple's Metal following suit to offer similar functionality on their (proprietary) OSes.

So what is so special about these low-level APIs? (I will base my analysis mostly on Vulkan presentations/documentation and my (limited) understanding of DirectX 12, assuming that many things are similar.)

Three items are the most important:
- don't do most of "rendering" on the main thread
- move part of driver code in user-space
- don't do (any) validation in "release mode"

Don't render on the main thread

Typically, rendering in a classical OpenGL/DirectX application means issuing drawing commands against a driver, and these commands are processed in a pipeline. There are also pixel/vertex shaders, which do pre/post-processing of pixels and geometry. For historical reasons most developers are used to drawing from the main thread, so drawing means basically waiting on the driver to finish.

The drawing commands are now called command buffers, and these command buffers can be built on separate CPU threads, and they can be reused! Follow this presentation for more (high-level) details.

VK_CMD_BUFFER_BEGIN_INFO info = { ... };
vkBeginCommandBuffer(cmdBuf, &info);
vkCmdDoThisThing(cmdBuf, ...);
vkCmdDoSomeOtherThing(cmdBuf, ...);

This in itself can scale horizontally, both on higher-spec machines and on lower-spec (yet multi-core) machines such as ARM or Atom CPUs, which is a really great thing for machines with many not-so-fast cores.

Moving the driver code in user space

These command buffers are combined into rendering pipelines. The rendering pipelines, which include the pixel/vertex shaders, can themselves be set up on separate threads. Pixel/vertex shaders are now compiled from a bytecode (named SPIR-V), which makes shader loading and processing faster. This item matters less in the DirectX world, because as far as I understand Microsoft has been doing it since DirectX 10; so if you think your game (Dota 2, cough, cough) will benefit because it has a lot of pixel shaders to precompile, it's not gonna happen.

Moving most of the processing into user space means both good and bad things. The good part is that good developers will not have to wait for a driver developer to optimize a specific code path their game needs. Another good part is that with most code in user space, the code should run faster, as many drivers do "ring" switches (jumping into kernel mode), which are very expensive calls (low microseconds, but still significant if they happen tens or hundreds of times per frame, when the rendering budget per frame is around 16 ms). The ugliest part I can imagine: very often the driver developers of the main video card vendors do a good job, and in this scenario driver developers will have fewer ways to improve all games.

Don't do validation

This is why you will hear things like: even using one core, processing is still 30% faster with DirectX 12 (or Vulkan). This is of course a double-edged sword: you can get very weird things happening, and no one can tell the developers what went wrong.

The good thing is that Vulkan comes with many validation tools in "debug" mode, so you can check for weird mismatches in your code.

Should you install Windows 10 or find a Vulkan driver?

If you are a developer working with graphics, the answer may be yes; otherwise, I'm not sure. Try not to get hyped! Windows 10 had huge problems at launch with some older NVidia cards (like the 500 series or lower). Having DirectX 12, which theoretically would run a future game launching a year from now, means very little for your usage of your computer today.

If you don't play a lot, the situation is even worse: for most UIs I'm aware of, the processing time mostly goes into font metrics calculation, styling, layout and the like, and sadly none of these are GPU-taxing.

Would Vulkan or DirectX 12 have a big impact? I would expect so in 2-3 years from now, not because anything changed for the user, but because the industry will naturally upgrade the software stack.

Wednesday, October 28, 2015

The Monkey mastering .Net!?

Readers of this blog may have noticed that I am kind of a big fan of Java's optimizer and environment, but in my day-to-day tasks I'm still using .NET. To do something interesting, I looked into loading big data sets, and a good source of such data is the Heroes 2 game data together with the FHeroes2 algorithms for processing it.

The previous post was about how it is possible to read a few compressed MB of game data in a few seconds. But the original algorithm extracts all the data from Heroes2.Agg (a kind of "global file system" for graphics), compresses it into a zip, and the full extraction of this zip is what was benchmarked.

But how long does it take to run from the command line?
Time: 18349 ms
So it takes 9 times longer to extract the graphics of the original game using .NET than to extract them from a plain zip in Java. As the algorithm was so slow, I suspected something went wrong. Obviously I checked that "Optimize code" was set in the assembly's properties. Checked...

After digging, I found the smoking gun: by default, Visual Studio 2015 sets "Prefer 32-bit".

Choosing 64-bit code, the result changed drastically:
Time: 13396 ms.
I also tried switching from 4.5.2 to 4.6 (maybe it is related to RyuJIT), but the times were fairly consistent.

The last surprise? I tried Mono, even knowing that it is a 32-bit-only environment on Windows.

The result? Partly shocking: 8554 ms, so clearly faster than 64-bit .NET (and more than 2 times faster than the default 32-bit .NET time). I measured twice; the results are consistent. Also, Mono comes with many options, which is in fact amazing if you ask me, but they had almost no impact on performance:
- mono --llvm: Time: 8297 ms
- mono -O=unsafe: Time: 8325 ms

Disclaimers: these tests may show a pathological case for .NET. To reproduce the test, get the Heroes 2 demo, copy heroes2.agg into the "data" folder, and run the revision of NHeroes2 on GitHub committed around the time this entry was written.

But some rules to keep in mind when you run your application in .Net environment: 
- make sure you have "Optimize code" checked and "Any CPU" (without "Prefer 32-bit") or an x64 binary in Release. Otherwise you can lose 25% in bitmap-processing code.
- if your code runs on Mono, try it; it may run more than 2x faster, and maybe that is what you need
- try Mono for other reasons too: it makes your code more future-proof, as you can migrate at least some sections of it to a Linux server or to OS X. Even more, if you can afford it, you can buy tools to build .NET applications with Xamarin. To me they look a bit overpriced, but if you need to turn C# code into an iPhone application, why not pay Xamarin
- lastly: I found some functionality I was using that does not work well with the latest Mono distribution, but there are workarounds: the default .NET zip compression library is not supported and crashed on my machine under Mono. This was not a real issue, as there is ICSharpCode.SharpZipLib, which runs on Mono just fine. Xamarin is fairly good at catching up, so I would not be surprised if Mono gained zip support compatible with the .NET compression framework within a few months

Ah, and I forgot to say: I don't fully know why the performance was so bad. I would expect it to be somewhat related to GC behavior; it is possible that Mono's GC has a bit higher throughput but a worse worst-case than .NET's. This may explain the difference from 13.4 seconds to 8.5 seconds. Or maybe there is a bounds check that the .NET optimizer does not optimize nicely and Mono does... I really don't know. If someone wants to investigate with a profiler and make the code much faster than 8 seconds on .NET, so be it; just don't use this as "definitive proof that Mono is faster than .Net". Similarly, don't forget that most of the time, well-optimized code runs faster simply by having better-organized data: .NET would likely extract all the data in around 6 seconds (estimated, not measured, as I never wrote the .NET code to extract from the zip as in the previous Java post), which would be faster than Mono's actual 8.3 seconds. And Mono could not run that code anyway, because for now it crashes on the zip format.

Sunday, September 20, 2015

Optimizations on bigger data sets

Around one year ago I started jHeroes2, meant to be a port of FHeroes2 from C++ to Java. As I had the code around, I also tried to write part of it in C#. The ugly part is that with a standard toolkit (I'm talking about JavaFX, or WPF in C#) the drawing is fairly slow. It is faster than, let's say, Windows Forms at displaying multiple bitmaps, but it is still fairly slow (or I don't know how to optimize the code for either toolkit myself).

But one interesting part of the exercise is that I learned some of the tricks the original Heroes 2 team used when they built their "big archive with all content" file. They made a full index, and the file content is encoded with a simple "painting" algorithm. The painting algorithm is really nebulous, and I'm impressed that the FHeroes2 guys managed to decompress all the pictures.

So, based on this and on my work experience, I thought it would be handy to take all the pictures in the Heroes 2 main AGG file, decode all the images, and (re)compress them as full bitmap data. The result is a zip with all the pictures inside Heroes 2, decoded and repacked. I did not use indexed colors (even though they would reduce the bitmap size) and I did not save the images as native bitmaps, because I wanted to check some things I will elaborate on in this entry.

So what the compressor does:
- iterates over all pictures in the heroes2.agg file (a 43 MB file)
- extracts them in memory as bitmaps
- saves every bitmap either in a text format, where every pixel is a hex value, or as a binary integer array
- compresses every text/byte array into a zip entry
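The packing step above can be sketched roughly like this in Java (a hypothetical helper; the entry layout of width, height, then raw pixel ints is my assumption, as the post does not show its exact format):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class BitmapPacker {
    // Packs one decoded bitmap (width, height, ARGB pixels) as a binary zip entry.
    public static byte[] pack(String name, int width, int height, int[] pixels) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            DataOutputStream data = new DataOutputStream(zip);
            data.writeInt(width);
            data.writeInt(height);
            for (int pixel : pixels) {
                data.writeInt(pixel); // raw 4-byte pixel; the zip layer does the compressing
            }
            data.flush();
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }
}
```

The real compressor loops this over all ~15,000 pictures of the agg file; the sketch handles a single one.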

What the benchmark test does:
- takes the zip entries one by one
- decodes width/height first, then creates an integer array and reads/decodes the hex strings or byte arrays
- converts this int array into an Image
- outputs the time
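The decode path can be sketched under the same assumed entry layout (width and height first, then raw ints fed into a BufferedImage; hypothetical helper, not the benchmark's actual code):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.zip.ZipInputStream;

public class BitmapUnpacker {
    // Reads the first binary entry back (width, height, raw ints) and rebuilds an image.
    public static BufferedImage unpackFirst(byte[] zipped) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            zip.getNextEntry();
            DataInputStream data = new DataInputStream(zip);
            int width = data.readInt();
            int height = data.readInt();
            BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                    image.setRGB(x, y, data.readInt()); // one int per pixel, no string parsing
            return image;
        }
    }
}
```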

First of all, the zip file for the binary data looks like the following:

In short, there are 15,000 files (pictures) which, extracted, amount to around 250 MB of data, and decoded to disk they look like this:

Given this much data, I was curious to see how quickly all the pictures could be decoded.

So, having these two zips, I wanted a baseline. I started with .Net code that extracts all the zip's micro-files into byte arrays in memory. The timings were the following:
Time: 2604 ms
Time text: 3276 ms
In short, with the latest .Net on a 64-bit machine on a fairly quick laptop, it takes around 2.6 seconds to uncompress the binary-packed data and around 3.3 seconds to uncompress the text data.

I ran the equivalent extraction code in Java, and it ran in around half the time. So using hex data, the decompression time alone is closer to 1.5 seconds; the full times (in milliseconds) are the following.
Time (extract text): 1678
Time with string split: 7012
Time no alloc: 4474

Time (binary data): 1685
Time (extract data): 943

A graph (shorter bars are better):

What I eventually found was that in Java you can write a quicker conversion from binary data to image — extracting 15,000 files into memory, turning them into int arrays, then converting them to pictures — in under 1.7 seconds on my machine, which is less than the time .Net needs just to extract the pictures.

This is great, but I also wanted to look at a more real-life use case: if the data is compressed as hex text files, extraction code written even in a fairly GC-friendly style — splitting the text into lines, splitting the lines into tokens, then using Java's string-to-hex parsing — runs very slowly, around 7 seconds. More interestingly, instead of splitting strings per row, most of the code can be written, even over plain text, with zero allocations in the pixel loop (or close to zero; there are still allocations for the image itself, the integer array and so on, but none while processing individual pixels), and with this you get into the 4.5-second range.
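To illustrate the gap, here are the two parsing styles side by side — the split-heavy version and an allocation-free character walk. This is a simplified sketch with hypothetical names, not the benchmark's actual code:

```java
public class HexPixels {
    // Split-based version: allocates a String object per pixel token.
    public static int[] parseWithSplit(String text) {
        String[] tokens = text.trim().split("\\s+");
        int[] pixels = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++)
            pixels[i] = (int) Long.parseLong(tokens[i], 16);
        return pixels;
    }

    // Allocation-free inner loop: walks the characters directly, no substrings.
    public static int[] parseNoAlloc(String text, int pixelCount) {
        int[] pixels = new int[pixelCount];
        int index = 0, value = 0;
        boolean inToken = false;
        for (int i = 0; i < text.length(); i++) {
            int digit = Character.digit(text.charAt(i), 16);
            if (digit >= 0) {
                value = (value << 4) | digit; // accumulate the hex value in place
                inToken = true;
            } else if (inToken) {
                pixels[index++] = value;      // whitespace ends the current pixel
                value = 0;
                inToken = false;
            }
        }
        if (inToken) pixels[index] = value;   // flush a trailing token
        return pixels;
    }
}
```

Both produce the same int array; only the allocation profile per pixel differs.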

So, was .Net really that slow? Yes and no. In this code Java was faster for several reasons, the simplest being that the Java optimizer is more advanced than the .Net one, so on the tight loops of the extraction code Java was shining. Also, in the zero-allocation code and the binary-processing code I was using lambdas, knowing that Java 8 was designed to optimize such code well.

Could the time be reduced below 1.7 seconds? Yes, but not by much — with one big exception, and here is the main point: Heroes 2 uses a 256-color palette. Reducing the full bitmap data to palette indices would shrink the 250 MB to around 85 MB, meaning extraction would take around a third of the time, and decompressing the data would also be much friendlier to the memory allocator. I would expect that extracting 85 MB of compressed data (very likely under the 10 MB mark on disk) would take a bit less than one second.
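A rough sketch of the palette idea (hypothetical helper; the real Heroes 2 palette handling is more involved): each 4-byte pixel becomes a 1-byte index into the 256-color palette, roughly quartering the raw data before zipping.

```java
import java.util.HashMap;
import java.util.Map;

public class PaletteEncoder {
    // Replaces each 32-bit pixel with its 8-bit index in the palette.
    public static byte[] encode(int[] pixels, int[] palette) {
        Map<Integer, Byte> indexOf = new HashMap<>();
        for (int i = 0; i < palette.length; i++)
            indexOf.put(palette[i], (byte) i);
        byte[] indices = new byte[pixels.length];
        for (int i = 0; i < pixels.length; i++)
            indices[i] = indexOf.get(pixels[i]); // assumes every pixel is in the palette
        return indices;
    }
}
```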

So here is what I learned, for myself and for people curious how to improve their application's performance:
- if your application loads big projects/documents, compact the data into binary form. If you need to save plain-text data, save it, but keep a binary copy in a temporary folder together with a hash file so you can tell whether the original was modified. If it was not modified, load the cached binary file.
- prefer Java over .Net if you want very big batch processing
- reduce memory allocation/reallocation. Even for a text-based format, this alone can bring you to just 2.5 times slower than the full binary format.
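The first point — cache a binary copy and validate it against a hash of the plain-text original — might look like this (a hypothetical helper using SHA-256; any stable hash would do):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class BinaryCache {
    // Decides whether the cached binary copy is still valid by comparing
    // the stored hash against a fresh hash of the plain-text original.
    public static boolean cacheIsFresh(byte[] originalText, byte[] storedHash)
            throws NoSuchAlgorithmException {
        byte[] currentHash = MessageDigest.getInstance("SHA-256").digest(originalText);
        return Arrays.equals(currentHash, storedHash);
    }
}
```

On load: if cacheIsFresh returns true, read the binary copy; otherwise parse the text, rewrite the binary copy and the hash file.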

Thursday, September 10, 2015

Apple's September Keynote CPU claims review: Why You Should Not Buy iPad Pro

First of all, you may buy your Apple products for reasons other than this post, but the single reason I am writing it is simple: Apple definitely misled customers in the keynote, and if you care about honesty, I would not buy the iPad Pro at least.

So, first of all, watch the keynote if you haven't already! After that, we have to define our terms. This is the second time (after the launch of the iPhone 5S) that they state the CPU inside an Apple product is a "desktop class CPU", with "console level graphics" and the like. If you want to be "sold", take their message at face value.

So why is Apple dishonest? Let's take it claim by claim:

- the CPU is claimed to be 1.8x faster than their last tablet (iPad Air 2) and faster than the CPUs in 80% of mobile devices (like laptops) shipping today. This is hugely dishonest at best. Their lowest-cost model is 800 USD, and at this price it is definitely among the lowest-performing, lowest-spec devices. For 800 USD (or in the EU, 800+ EUR) you can buy a laptop with far more than 32 GB of SSD storage and fairly beefy specs. I bought a previous-year Lenovo Y50 with a quad-core CPU (versus a very likely 3-core iPad), 8 GB RAM, 1 TB storage (SSHD, though I would opt for a 250 GB SSD) and a UHD ("4K") screen for 950 USD.

So let's benchmark a single CPU core, using a benchmark that is not optimized for Apple (nor for Intel): the Kraken benchmark. The Apple iPad Air 2 (their fastest iPad so far) scores around 4000 ms. Let's say the new chip really achieves a 1.8x speedup (not "up to" but a true 1.8x). This would mean the newest tablet's single-thread performance finishes the Kraken JS benchmark in around 2200 ms.

Running it today on this Lenovo laptop? 1134.2 ms in Google Chrome and 1118.3 ms in Firefox. In my book this means a typical powerful laptop in the same price class as the iPad Pro should be at least 2x faster in single-core performance; in a multicore scenario it would be 2x * 4/3 (4 laptop cores vs 3 iPad cores), making the iPad more than 2.5x slower.

2.5 times CPU maybe is a small inconvenience, but what about the rest? 

GPU benchmarks?
3DMark IceStorm gives around 210K points on the Lenovo Y50 laptop thanks to its powerful NVidia 860M. The iPad Air 2 scores around 22K. Even if the iPad Pro doubles that, it will still be around 5 times slower than a mobile GPU.

2 GB of LPDDR3 versus 8 GB (low-power DDR3L) of memory. Not sure about you, but having less memory in a productivity device is a big differentiating factor for me.


4K (3840x2160) is more than the iPad Pro's resolution (2732x2048), and the DPI is comparable (the Lenovo has a 15-inch screen versus close to 13 inches on the iPad Pro).

Other factors?
Maybe there is another reason you would want a laptop: being able to later expand your storage or memory. I did this in fact — I switched from the HDD to an SSD and from 8 GB to 16 GB, and I could do it myself; it was easy enough.

I am sure that if you care about your money and are not out to show Apple elitism, a quad-core laptop gets you more at the high end.

But who knows, maybe it is a desktop-class CPU — if you compare against the weakest CPUs on the market. For example, granting Apple its 1.8x speedup again, it would match most integrated AMD APUs; but this comparison is again dishonest, because even those systems' integrated GPUs should be at least in the same class as half an 860M, and the entire AMD system costs around half the price — as do Intel's lower-cost laptops.

In fact, a system comparable with the iPad Pro would be one like this one — not in screen size, of course, but in "Pro" computing specifications.

Of course, you probably made up your mind before reading this article, and a rant of a blog entry should not change your view. Still, if you care about a company being honest, and you don't mind getting roughly Atom-quad-core-level performance in a tablet form factor sold at the price of an entry-level quad-core i7 laptop, then go ahead...

Wednesday, August 12, 2015

Write Your Desktop Application in a VM for Your Security!

Today when I arrived at my computer at work I received the Windows updates. 1 GB... 1 GB! Most of them are of course security patches covering Windows, Office and the like. If you look deeper, you can see the fixes are not only in regular code but all over the place. The same happens on Windows 10, as ZDNet confirms.

The updates touch the .Net framework, graphics drivers, mounting devices (and Office, as mentioned) and so on.

These components are, we can guess, mostly written in C or C++ — partly because it is hard to audit every buffer overflow across the entire Windows codebase, and partly because lower-level languages tax the developer's brain harder, making it difficult to get these things fully right without very deep code review.

I hope most readers understand this, and I expect most of you already write code in .Net (or Java, or JavaScript), but I want to express one idea: security is hard enough in itself, and adding the concerns of low-level bounds checking makes it very hard to achieve. So it is more economical (and logical) to externalize those risks to other companies (the OS vendor, the VM creators and so on).

But the final reason I think it is important to use a VM is the simple fact that it is visibly easier to patch your code. If it is JavaScript or Flash, you upload the new application to the site and you're already patched — users just have to refresh the browser.

If you run your code on Java or .Net and the vulnerability is at the VM level, you ask users to upgrade; if it is in your application, you have the update functionality more or less built in: downloading files and extracting them from a zip is very easy in either Java or .Net.
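For illustration, the zip-extraction half of such an updater is only a few lines in Java (a sketch with hypothetical names, returning an in-memory map; a real updater would stream from URL.openStream() to disk):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UpdateExtractor {
    // Extracts every entry of an update package into memory.
    public static Map<String, byte[]> extract(InputStream updateStream) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(updateStream)) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1)
                    out.write(buffer, 0, read);
                files.put(entry.getName(), out.toByteArray());
            }
        }
        return files;
    }
}
```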

But if you use C++ you have to compile per platform, write a somewhat awkward updater (there are Windows APIs that are supposed to help with this), make sure it matches the right machine (x86 or x64), and then "you're good to go".

In the world of app stores there is an argument that C++ can be deployed just as easily, but I still mostly disagree, for a reason or two: if you deploy Android Java code, you don't care which CPU the tablet has — MIPS32 or MIPS64, for example. For iOS you have to support basically two platforms because Apple's environment is tightly controlled, and for Windows the easiest way to work is still by far C#. One could also argue that the iOS environment itself now behaves like a virtual machine.

Tuesday, August 4, 2015

Premature Optimization Is (Almost) Mandatory

"Premature optimization is the root of all evil" was told by Donald Knuth, right? Right, but he was misquoted. He said in full: that for 97% the premature optimization is not necessary. You can access the full paper where the quote is taken from here. Even more, he said so in context of using ugly constructs (he was refering on GOTO statements). And more, he did point out that statistically the hottest of the code is in 3% of the code, and the full statement of him was: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.". So, he doesn't say about stop to optimize (don't forget, that "premature" is a loaded word, having already a negative connotation) but the reverse, to optimize the code that is hot.

Based on this, here are the misconceptions I keep running into regarding optimization, at least in my view (from the most important to the weakest):

1 - "You should not optimize in your game/application the loading time, this happen just once, after this application runs fast/properly". There is some truth to this statement, for example if you watch a movie, you should not care if the movie player starts in 0.1 seconds or 2 seconds. But what if the movie player starts in 30 seconds? Would you want to watch using this movie player to watch a 2 minutes clip? Many developers have quite powerful machines, like they have SSDs, 4 cores with at least 8 GB of RAM, but their application will arrive to users that do not have these cool components and the application will run visibily slower

2 - "The redesigning of our architecture will bring the application performance by 4x, and optimizing the current design will give to us only 2x speedup, so there is no need of small optimizations" - but very often this 2x speedup would mostly be transferred in many places to the new redesign, and the architecture redesigns are very expensive (as developer time) to do them. Let's say the company switches from any SQL kind of servers to NoSQL but the application logic is the same. Let's assume that in that scenario jumping from let's say MySQL to MongoDB will give a 4x speedup for queries, and let's say the queries were using 2/3 of the time of the entire system. The final speedup will be exactly 2x. But let's say that the company optimized only the part which is not related with SQL and the speedup is 2x and for the entire system the speedup is 33%. Even it is slow, customers will have a tangible benefit next week/month not after some months when maybe they quit already as potential clients. Even more, when the migration from MySQL to Mongo happens, the system will work 3x faster. But as real life happens, sometimes can be a pathological case that for a customer Mongo runs in fact 20% slower when migrated, but because the optimizations are done on the system level, the system would run still more than 10% faster. There is a lot of math here. done on the back of the envelope, but it simply states that it never hurts to have small optimizations done.

3 - "When I develop I don't need optimized workflow, my machine is really fast": this is kinda true, but sometimes is not that true. Many big applications take long time to start which is again kind of normal, for example it needs to get updates from server, and you as a developer you can pay (because you have SSD and 8 GB of RAM) at least when you are developing to wait for 10 seconds to get the real case data. But if you have to reproduce a bug, imagine that every second counts. It counts because it annoys development, it interrupts it. Especially if your build system takes minutes you will notice that you go to a blog and you read the news and rants (like this blog) but you lose the focus of which bug you were really working on. This is not a fault necessarily on your organization, but this is how human mind works. This is kind of the first point ("you should not optimize loading time") but is directed to developers. You as a developer have to make 1 step build if possible and focus to not do crazily much stuff.

4 - "You don't need to optimize your code because compiler knows better" - I am sure that micro-optimizations like using minimal numbers of variables a compiler will do always a better job than you, but this is blatantly false. Compiler can optimize your code but most code is not run within the compiled code, especially in managed languages (C# or .Net ones, Java, JavaScript) you will see that the compiler runs a lot of code with libraries. Most compilers cannot optimize string concatenation, even though Java will use StringBuilder when you use + to concatenate strings. And the reason it does it in this way it is because compilers don't work well with strings. Every time your code does read from files, a compiler will not know a lot about your file format, duplicates of data, or the fact that you could read less data and rebuild the information. No compiler cannot know if you load 2 times the same image, that it should load it once and cache it, and so on. Even worse, is that even we allow to think that your environment is well optimized, it means that only your code remains the slow part.

5 - "I should not speedup my web service, I will put it on Azure (replace this word with your Cloud solution)" Not sure about you, but having a faster web service means that you have a simpler administration as you need to spawn fewer instances, smaller costs, even the improvements of code could be a bigger upfront cost.

6 - "You don't need to optimize allocation, GC does it fast(er)" Did you measure this? GC definetly has quicker allocator than let's say C++ one, but every time when you do a "new" for a heap object, the object has to zeroed, it also moves the allocation pointer and it means that it makes the CPU cache line "dirty". If you have some code that reads from a file line by line and you have your own "read line method" (I'm saying especially if you want to improve the load time performance, see point 1), you may make a reactive interface, and instead of allocating a new buffer, it looks to me a fair design to just recycle the buffer. The speedups on .Net side are fairly significant, and I would expect the same on Java. Allocating more seldom these small objects will make the GC to be called less frequently.

A wider note on architecture redesigns: most companies I work with today use an Agile methodology, which is inherently incremental. This makes big architecture refactors in big systems almost impossible, and even when they happen, they are done by the most senior team members, the ones who know "the core" well. An architecture refactor can therefore take not months but sometimes years, because you cannot risk breaking existing iterations, so the code is prepared in small, small steps toward the redesign.

In conclusion, this post is not an invitation to use GOTOs — both Knuth and I would disagree with that — but the idea is this: every time you can isolate some slow code in a profiler, optimize it now, not tomorrow. The later you do it, the more you will suffer for it in testing and in a bad application experience (which users will very often feel too!).

Monday, August 3, 2015

Visual Studio 2015/C# 6/.Net 4.6 (RyuJIT) review

Maybe it is the context of the Windows 10 launch — which left the impression of a somewhat buggy release (with some families of video cards not even supported, like NVidia 4xx cards or older) — but Visual Studio got much less attention.

I think that is fine for most users, but on the other hand this release feels extraordinarily... strange. It is outstanding in some features, like including profiling tools even in the Community Edition (the profiling tools are limited, but still much better than the previously not-included ones).

The first impression I had was many fold positive:
- C# 6 looks to me like a streamlined version of the language, which had basically stood still since the times of .Net 3.5/VS 2008 (which brought Linq and the var keyword). Making code less repetitive is amazing stuff. If you have an hour to spare, this presentation is excellent. Please push your company to use C# 6 — apart from string interpolation, it doesn't require any new .Net runtime support. I'm not a VB.Net guy and cannot comment much there, but I expect good stuff on that side too.
- Roslyn — even though the idea existed as part of NRefactory for years — is really well implemented, at least in that as you type you see very reliably whether your code has errors; no full build needed to find failures. That alone is a huge timesaver. This "language service", exposed as an open API, should also mean C# completion stops behaving strangely, especially in future versions of CodeRush or JustCode. I love Resharper, but it is still great to know that Roslyn will be part of future SharpDevelop and MonoDevelop releases
- .Net 4.6 comes with awesome improvements; I expect future releases of Paint.Net, photo manipulation programs, or entry-level video games to adopt the SIMD libraries. They come for free, but there is a caveat for now: RyuJIT still has some obscure bugs (which, to be fair, are to be expected), especially if you run F#. The reason F# appears to be the one affected is partly natural: F# relies on "tail call optimization", which turns recursive calls into loops; without it, many F# programs either crash with a stack overflow or have a very ugly performance profile. So don't rush to run it on your production server yet, or do so only for your VB.Net/C# code
- even though I'm not a C++ developer, Visual Studio now supports the C++ standards very well, which again is a great achievement: you can target basically all platforms (iOS, Android and Windows) with one C++ codebase, without strange #defines

As a .Net developer I am still disappointed with .Net, which today — excluding the web stacks (and even there the solution was mostly a response to NodeJS and the small web servers of the Ruby and Java worlds) — looks incoherent as a desktop tool. I honestly don't know a Microsoft stack with which I can support more than one platform, even within Microsoft's own ecosystem. WPF is decent and they patch it, but it feels to me like an MFC running on top of DirectX 9. Not DirectX 12.

Stranger still, Visual Studio installs with no package for developing .Net on other platforms (like Mono), so until NRefactory is stable enough, your C# 6 code will run only on Windows, or on Linux as a CoreCLR .Net distribution, but not on Mono. That is kind of a bummer if you ask me.

Even more — and this is not actually a rant against WPF — as they improved VB and C# (and C++ and F# for that matter), why didn't they improve Xaml? Xaml is a horrible language, if you can even call it one. It has various framework conventions that are almost always broken. Add to this that the WPF platform, without (and even with) custom controls, runs slowly with more than a few thousand items. The reason is not missing GPU acceleration, or GPU driver faults, or DirectX 9 drivers not being up to snuff: when you profile a WPF application, you see that the internal layouting hogs the CPU.

Adding issue upon issue, it seems that if you want to write a cross-platform application, you mostly have Xamarin solutions (MonoGame, Xwt, Gtk#, Xamarin.Forms and so on), which is, at least to me, a bit strange.

What I would hope that the VS+1 will support in no particular order:
- polish the software more: it looks to me that Microsoft currently has quality issues across its products. Complex software is hard, but working little by little and releasing with two fewer features would make the environment nicer. Not sure about other users, but at least under Windows 10 with the latest updates I had fairly many freezes and crashes. There are definitely fewer in recent builds, but from time to time I still get a "blue screen ;)" in Windows, or VS hangs — especially while debugging
- give a clear vision of which frameworks Microsoft supports. I'm talking about WPF in particular, but many other frameworks (including WCF, Silverlight, even the original WinRT code) are either not well exposed or unclear on when and how they are supported. This makes it very hard for a developer like me who might want to start a two-year startup project on Windows. Java, even though technically worse in many ways, at least doesn't quietly freeze features, and most of it evolves in the open. Visual Studio ships tools from HTML editing to C++ coding for Android; it looks like a dinosaur to me, but maybe that is my limited judgement
- stop trying to put every language/platform under one IDE. VS is not an open platform like Eclipse; people will not extend it into a CAD modeller. Even though it lets you deselect them, too many things are included by default. Features don't matter by count alone, but by making a sane experience for users. Use NuGet for adding language services.
- this is easier said than done: start from TypeScript and make a .Net language resembling it. Make a very light language, similar to Swift, that works for both the "desktop" and "web" worlds. C# is in my view really better than Java (its original competitor), but to be fair, JetBrains' Kotlin is definitely more usable, and Ruby (even though not statically typed) is again more usable than C#. The "static" style of Mozilla's Rust also looks really promising and is clearly high-performance. Maybe the starting point should be Visual Basic.Net with the legacy removed — or, similarly, a C# without the legacy: force iteration only through IEnumerable, for example, and require separate code (similar to what C# developers write with "unsafe") for people who still want today's C#.

Sunday, July 19, 2015

Using .Net for Developing Games, a 2015 review

Before talking about game development, a disclaimer: I'm not a game developer, though I have some (working) experience with older versions of OpenGL and DirectX and hands-on experience with C++ and C#. I also keep track of current technologies (as much as time allows).

First of all, let's clarify the terms: there are obviously games you can write and run in C# — most board games like Chess or Go, strategy games and the like. You can do more than that, too; the best game written in C# that I know of is Magicka. But people will sneer and say: that game doesn't use Havok (the physics engine) — and if a C# game did use it, they would say: but Havok is not written in C#, it is written in C++.

Given this, I want to make as fair as possible review of .Net platform as a game development tool.

Here are some really great pluses:
+ C#'s peak performance (after the application starts up) — especially if you avoid strings like the plague and work mostly with arrays and integer/double types — makes your code run adequately (typically around 70-90% of C++ speed, with an even better matchup on 64-bit .Net)
+ C# lets you write the hottest code in C++, and also lets you skip bounds checking with "unsafe" code. So if you need a specific piece of code to be autovectorized and you notice the C++ compiler does it while the C# one does not (and you don't want to use Mono.SIMD to write your own matrix multiply), it can still be very highly optimized
+ PInvoke call speed is adequate, and .Net "natively" maps COM calls and C calls, so whether you use DirectX or OpenGL, you are covered
+ complex game logic is easier to write in C# than in C++; given that some C++ game engines use Lua as a scripting backend, writing that logic in C# instead should sometimes give speedups
+ you can use struct types to reduce how often memory collection happens

Here are the really bad minuses:
- coding recklessly creates a lot of garbage in memory, putting pressure on the GC. A collection can sometimes take seconds (for huge, multi-GB heaps), which is unacceptable even in a board game
- allocation is on the heap by default, meaning that creating a List&lt;T&gt; in fact always creates 2 heap objects: the List&lt;T&gt; itself and the internal array that stores the actual data. This is really bad because as you add items, the internal array is "resized", which in the .Net (or Mono, or CodeRefractor) implementation means a new array is allocated — a lot more GC pressure. In C++, objects are allocated on the stack by default with no hidden costs; with std::vector&lt;T&gt; the internal array is on the heap, but the vector itself is on the stack.
- Linq can create a lot of objects without you noticing, especially with ".ToArray()", ".ToList()", or any statement that returns a pair of values.
This code:
var playerAndLifes = players.Select(player => new Tuple<Player, int>(player, player.Life)).ToArray();
looks really innocent, but in fact Tuple is a class, so it is allocated on the heap, and ToArray grows its buffer in powers of two relative to the length of your "players" collection. So for 1300 players there will be around 8 reallocations, for 2600 players 9 reallocations, and so on.
For the previous code, add a struct STuple to your codebase and use it instead. Also, if you know the size of players up front, don't forget to read the ways to improve your Linq performance article.
- objects in .Net are big, so if you keep a single byte or integer index (even with its own more complex associated logic), consider a struct or enum type. Objects in .Net are big because the object header carries extra information, including a type id and space for locking. A class that stores 1 integer takes 12 bytes on 32-bit .Net, but 24 bytes on a 64-bit machine — so every single object allocation wastes an extra 8 or 20 bytes. In C++, if you don't use virtual calls, the overhead for object internals is zero (though it can grow if the memory allocator is inefficient); for classes with virtual methods the overhead is typically one pointer (4 bytes on 32-bit machines, 8 bytes on 64-bit ones).
- text is UTF-16, which is very often a good thing, but when you want high(er) performance it means strings written to disk occupy twice the space. Worse, they increase memory usage and again put pressure on the GC. Try working with UTF-8 encoded strings internally, and do interning (merging identical strings across your application) so that at least when a GC happens it has less work to do
- not strictly a .Net issue: an easy way to support save/load in a game is a serializer that stores and restores your entities on disk. The default "DataContract" serializer and even BinarySerializer are slow; use protobuf-net (Protocol Buffers), a very easy-to-use library that can run many times faster. Similarly, avoid XML/JSON and the like for levels where many entities of any kind are expected
- the JIT (just-in-time) compiler sometimes makes things ugly! JIT time is typically very small, but it is paid every time a new method is first hit. With big methods and/or bigger logic you can expect frame skips, especially under the per-frame "tyranny" of 16.6 ms. Keeping methods small and removing duplicated code means that when the player picks up a new item or meets a new enemy whose game logic .Net has to analyze for the first time, the JIT finishes faster. But the even better way is simply to NGen your application.
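The resizing trap from the allocation point above is not C#-specific; Java's ArrayList grows the same way, and the fix is identical in both worlds — pass the final capacity up front (an illustrative sketch with names of my own; List&lt;T&gt; in .Net has an equivalent List&lt;T&gt;(int capacity) constructor):

```java
import java.util.ArrayList;
import java.util.List;

public class Presizing {
    // Growing from the default capacity forces repeated internal array copies;
    // when the final size is known, one allocation covers everything.
    public static List<Integer> fill(int count) {
        List<Integer> values = new ArrayList<>(count); // single backing array
        for (int i = 0; i < count; i++)
            values.add(i);
        return values;
    }
}
```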

What is weird, to me, is that the biggest factor in game responsiveness is not compilation performance (which .Net has had right since around 2009, with .Net 3.5 SP1) but the hidden overhead(s) of the GC. It can hurt you many times over, and the ugly part is that you don't know when it will hit — worse, you may not even know which code creates objects (like System.Tuple, or Linq's ToArray/ToList).

To wrap up: the GC looks to me like the biggest cause of user-visible freezes, and as .Net's generated code has improved (with initiatives like RyuJIT and CoreCLR), what remains is mostly to work with structs and to use an efficient serializer. Things can often be improved further by forcing a full GC at moments the user is already waiting: after the game loads a level into memory, force a full GC; after a round finishes and "Victory" is shown, force another. This style of coding works fine — but of course, if a round expected to last 10 minutes runs for 40, and the user gets, say, a 3-second full GC in the middle of minute 35, the experience is ruined.

Monday, July 6, 2015

Resharper 9 - a blast

Disclaimer: I've received an opensource license from JetBrains of Resharper for the second time. Thank you JetBrains!

I've sometimes been fairly critical of R# (Resharper), as it is somewhat inaccessible for some users, yet at the same time I've been using it. But I also want to say why code analysis in general, and coding in particular, benefits so much today from a Resharper-like tool.

So first of all, some criticism of Resharper, and especially of R# 9 as I received it:
- I had an outdated R# 8 (it expired somewhere around October) and upgrading to 9.0 (which also happened to be out of date, because I didn't use R# for some time) made R# report a lot of errors in code which were not there. Clearing the caches fixed all the known errors I had, but it was really strange (Google pointed me directly to the right place)
- Resharper doesn't enable Solution-Wide Analysis by default. Maybe that is desirable for low-end machines, or for very big projects, but at least for medium projects it is a boon. I am sure that for big solutions (I'm thinking here of programs like SharpDevelop or bigger) Resharper may be slow to update the analysis (which in itself is a fair point), but missing by default the information that R# provides (like compilation errors you may have) I found to be a big miss

Ok, so small bugs and not-so-great defaults. But in the context of the CodeRefractor project it was such a great feature, because it made big rewrites possible, and right now the project is undergoing its third rewrite. Every rewrite was justified for various reasons:
- the first and (to me) most important one was that the internal representation was shaped very close to SSA form (or at least to LinearIL from the Mono project). A subsequent, almost full rewrite made the project use an index of these instructions, so optimizations not only do their job well, but they do it fast
- the second rewrite allowed a much more refined way to find all methods (like virtual methods), so many more programs run now (try it, it will do wonders)
- the third rewrite (currently ongoing), whose details I will not write about now

Features I found to work great:
- creating property is automatic and fast with good defaults:
myValue.Width = 30;
//R# will suggest to create Width as an automatic property of int type
- creating an empty class automatically, taking constraints into account:
BaseClass a = new MyNotDefinedClass();
//R# will suggest to create MyNotDefinedClass as BaseClass and will also implement some required data
- the Solution-Wide Analysis, which takes into account whether your code compiles. This feature is so awesome because you can combine it with two features: "Code cleanup" (which removes, for example, a lot of redundancies and nicely reformats the whole code) and "Find Code Issues".
- a R# 9.0 feature: code completion filters with various criteria (like: "properties only" or "extension methods only").
- unused parameters, and the refactoring to remove them globally, is a really huge saver of developer time

So in short, I have to say that if you are starting with Resharper from scratch, or you want to use C# productively, I warmly recommend it to you. Also, don't forget, as the first thing after you open your solution, to enable Solution-Wide Analysis by default (there is a "gray circle" at the bottom-right: double-click on it and click "OK" in the dialog that appears).

Also, please note that I tried to be as unbiased as I can, so I didn't cover things that I'm sure are invaluable for other projects, like MVC3 or Xaml features (CR's usage of Xaml is very limited); here is only what I used (and enjoyed!), but some other features may be closer to your heart.

Improve performance for your selects in Linq

A thing I learned inside CodeRefractor is how loops work inside .Net. One thing I learned fairly quickly is that the fastest loop is by far the one over arrays. This is also documented by Microsoft.

In short, especially using .Net on 64 bit, you will see high-performance code over arrays, so I strongly recommend that if you have data you often read out of (for example using Linq), you should use the ToArray() function.

So let's say you need the Ids out of your "tradeData" variable.
The code may look like this:
return tradeData.Select(it => it.Id).ToArray();
What's wrong with this code? Let's say the "tradeData" variable can have 1,000,000 items, and tradeData can itself be an array or a List<T>. When you profile, you can see that iteration takes little time, but you will see something like 16-18 allocations inside ToArray(), the reason being that ToArray itself keeps an internal array which is resized multiple times as it grows.
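To see where those allocations come from, here is a small sketch that counts buffer reallocations while growing to a million items; List<T> doubles its capacity as it grows, much like the internal buffer ToArray() uses when the source length is not known up front:

```csharp
using System;
using System.Collections.Generic;

// Count how many times the backing buffer is reallocated while
// adding one million items one by one.
var list = new List<int>();
int resizes = 0, lastCapacity = 0;
for (var i = 0; i < 1000000; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        resizes++;                    // the buffer was reallocated and copied
        lastCapacity = list.Capacity;
    }
}
Console.WriteLine(resizes);           // roughly 18-19 reallocations
```

Each reallocation also copies everything added so far, which is where the hidden cost goes.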

So it should be possible to write a "SelectToArray" method that will have much lower overhead:
    public static class UtilsLinq
    {
        public static TResult[] SelectToArray<TValue, TResult>(this IList<TValue> items, Func<TValue, TResult> func)
        {
            var count = items.Count;
            var result = new TResult[count];
            for (var i = 0; i < result.Length; i++)
                result[i] = func(items[i]);
            return result;
        }
    }

As T[] implements IList<T>, this code works for both arrays and List<T>. It will run as fast as possible and there are no hidden allocations.

And your code becomes:
return tradeData.SelectToArray(it => it.Id);

A strong recommendation for fast(er) code: when you use Select or SelectToArray, NEVER allocate "class" objects inside it, only "struct" objects. If you want to keep a result with multiple data fields, create "struct" types which encapsulate them.
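A brief sketch of that advice (the TradeSummary type and its field names are illustrative): projecting into an array of structs means the only heap allocation is the result array itself, with no per-item object.

```csharp
using System;

// A small value type holding multiple projected fields. A class here
// would cost one heap allocation per element of the result array.
public struct TradeSummary
{
    public int Id;
    public double Amount;
}

public static class TradeDemo
{
    // Project two parallel arrays into one array of structs:
    // a single allocation for 'result', nothing per item.
    public static TradeSummary[] Summarize(int[] ids, double[] amounts)
    {
        var result = new TradeSummary[ids.Length];
        for (var i = 0; i < result.Length; i++)
            result[i] = new TradeSummary { Id = ids[i], Amount = amounts[i] };
        return result;
    }
}
```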

How fast is it? It is fairly fast.

For this code:
            var sz = 10000000;
            var randData = new int[sz];
            var random = new Random();
            for (var i = 0; i < sz; i++)
                randData[i] = random.Next(1, 10);
            var sw = Stopwatch.StartNew();
            for (int t = 0; t < 5; t++)
            {
                var arr = randData.SelectToArray(i => (double)i);
            }
            var time1 = sw.ElapsedMilliseconds;
            sw.Restart();
            for (int t = 0; t < 5; t++)
            {
                var arr = randData.Select(i => (double)i).ToArray();
            }
            var time2 = sw.ElapsedMilliseconds;
You have
 time1 = 798 ms vs time2 = 1357 (Debug configuration)
 time1 =  574 ms vs time2 = 1003 (Release configuration)

Not sure about you, but to me this is significant, and it also matters if you have multiple Linq/Select statements and you want the resulting items to be fast to iterate. Similarly, you will get a bigger speedup if you don't do the cast to double, but I wanted to show more realistic code, where the Linq is doing something light (as typically happens when there is an indexer involved, or a field access).

NB. This test is artificial, and use these results at your own risk.
Later, I found there is a method, Array.ConvertAll, which has very similar internals to this extension method (the limitation is that it doesn't work with non-array implementations, but if this is not a big inconvenience for you, it is better to use the BCL classes).

    public static TResult[] SelectToArray<TValue, TResult>(this TValue[] items, Func<TValue, TResult> func)
    {
        return Array.ConvertAll(items, it => func(it));
    }

With the method changed to this, it is even a bit faster, because the iteration of the items variable is a bit faster this time.