Code Refractor - Virtual Machines/Compiler performance musings: November 2015

Monday, November 30, 2015

Fx2C - Bug Fixes

JavaFX seems to be kind of a black sheep for some Java developers but I wouldn't feel so negative. The reason is that to some extend I feel the idea that many applications migrate to an online environment and for them server side exposes basically a JavaScript/Html5/CSS UI but on the same time I would expect to be a a kind of desire of using offline applications at least at times.

And here I see that JavaFx can shine. Few things still need to be done by Java runtime in my opinion, but Java if deployed, it is a great offline runtime which runs on the 3 major OSes and this is not a small deal. You can literally write once one UI and with no recompilation will run on all (desktop) platforms and if you consider JavaFxPorts, you can add Android to it.

So based on this I fixed some bugs with JavaFx types handling and use Java only (remove the Kotlin language) so it is easier to fix the code for non-Kotlin developers.

Code is here:
https://github.com/ciplogic/Fx2C

Monday, November 23, 2015

New useful Kotlin tool of the day: JavaFx 2 FXML Compiler

Do you write by any change Java? Do you use a Desktop client that is Swing and you didn't use JavaFx because is to slow or you use JavaFX and it "feels" slow even for very simple dialogs?

There is a tool now with a demo:

If you want to contribute to it, please install IntelliJ Idea (Community is just fine), clone the repository: https://github.com/ciplogic/Fx2C and after this load the project.

Legal notice: use it at your own risk. I'm not responsible for your problems. But with all seriousness, the code is simple to be maintained.

Thursday, November 19, 2015

Tricks of the trade for quick sorting in .Net

I rarely do huge dataset sorting, but I want to share some tricks to sort huge datasets. I am surprised that very few people learned things like it even in university.

The reason is that as students you are thought to think that the CPUs are ideal and you learn big-O notation. This sorting stuff is in the most cases O(N*logN) meaning that for a list of 1,000,000 items you will have let's say 1 second, but for 4,000,000 items, you will have a very small growth to something like 4.1 seconds.

So how to improve the sorting speed? Reduce the constants.

The point of big-O notation is that it considers the "scale" of growing the time, but it doesn't take into account the individual constants.

So, let's say you have 3.000.000 items to sort inside a List<String>. This is a hypothetical but not so hypothetical, in the sense that are huge lists of items that you may need to sort in few milliseconds, and increasing the item count can show much clear where you have speedup.

Let's say you add those items inside a list, and you use List.Sort(). I created the items semi-random distributed (most of ASCII characters are setup random) of these items. On a 64 bit .Net (4.6) on an oldish Intel i5-2500K, it would run in fairly short time (13067 ms) or 13 seconds.

Can you speedup more? The simplest trick to check first char first, as a separate check. If the first char is the same, you use full string comparison. This should go in 10189 ms. Another small improvement is to sort over arrays. This is fairly small speedup, but is still quicker (9940 ms).

The comparing sorting key class would be like following:
struct CompareStr : IComparable<CompareStr>
    {
        public char FirstChar;
        public string Text;
        public CompareStr(string text)
        {
            if (string.IsNullOrEmpty(text))
                FirstChar = '\0';
            else
                FirstChar = text[0];
            Text = text;
        }

        public int CompareTo(CompareStr other)
        {
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
    }

And the sort routine of those texts is:

var combinedKeys = texts2.Select(text => new CompareStr(text)).ToArray();
            Array.Sort(combinedKeys);
            var resultList = combinedKeys.Select(combined=>combined.Text).ToList();

But can we do it better? I think that yes, so let's change FirstChar to pack the first two cars padded as a 32 unsigned int (char itself is kind of equivalent with UInt16). The times also improve greatly (6220 ms) which is less than half of original time:

struct CompareStr2 : IComparable<CompareStr2>
    {
        public uint FirstChar;
        public string Text;
        public CompareStr2(string text)
        {
            if (text.Length <= 2)
                FirstChar = 0;
            else
                FirstChar = (uint)((text[0] << 16) + (text[1]));
            Text = text;
        }

        public int CompareTo(CompareStr2 other)
        {
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
    }
And the sorting routine is very similar with first code:
var combinedKeys = new List<CompareStr2>(texts2.Count);
            combinedKeys.AddRange(texts2.Select(text => new CompareStr2(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts2.Count);
            resultList.AddRange(items.Select(combined => combined.Text));

Can be written to be really faster?

    struct CompareStrDouble : IComparable<CompareStrDouble>
    {
        double _firstChar;
        public string Text;

        static double ComputeKey(string text)
        {
            var basePow = 1.0;
            var powItem = 1.0 / (1 << 16);
            var result = 0.0;
            foreach (char ch in text)
            {
                result += basePow * ch;
                basePow *= powItem;
            }

            return result;
        }
        public CompareStrDouble(string text)
        {
            _firstChar = ComputeKey(text);
            Text = text;
        }

        public int CompareTo(CompareStrDouble other)
        {
            if (_firstChar != other._firstChar)
                return _firstChar.CompareTo(other._firstChar);
            return Text.CompareTo(other.Text);
        }
    }

For reference this is the sorting code:
List<string> SortSpecialDouble(List<string> texts)
        {
            var combinedKeys = new List<CompareStrDouble>(texts.Count);
            combinedKeys.AddRange(texts.Select(text => new CompareStrDouble(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts.Count);
            resultList.AddRange(items.Select(combined => combined.Text));
return resultList;
        }

This sorting key is really, really fast, 2292 ms which is over 5 times quicker than the original List.Sort (for strings).

Some things to consider:
- Sorting huge datasets may show a design flaw in your application: You should always filter data before sorting it. This use-case may be important though if you get local data to sort. This algorithm of sorting can be used though with Linq's OrderBy or friends. If you have data from database as a source, the database can sort the data for you, so it is no point sometimes to sort yourself the data if you get it from your database.
- This algorithm considers that sorting only when you need it: I am considering here cases when you can have a grid with many thousands of items and you can sort over the header of the column. If you have 300,000 items (this is a huge grid BTW), the sorting using List.Sort is around 1030 ms vs the fastest key used here which is 200 ms
- At last, this algorithm is not culture aware, but is damn quick! This is a tradeoff you have to consider that may break some people's language. For instance, in Romanian (my native language) S is near Ș char, but using this sorting algorithm, Ș char will be after Z. The same will happen with Russian, or all other chars. So if you care this kind of sorting, make sure you can afford a bit worse precision for non English words.

Running on Mono (I was really curious, weren't you?) gave really horrendous times (for only 30,000 items):
.Net times:
List.Sort: Time: 51
Sort by first char: Time: 65
Sort by first two chars: Time: 33
Sort by double key: Time: 30

Mono time:

List.Sort: Time: 411

Sort by first char: Time: 329

Sort by first two chars: Time: 106

Sort by double key: Time: 39

The fastest implementation was consistently faster, but the things got very quickly worse. For 300.000 items:
.Net times:
List.Sort: Time: 1073
Sort by first char: Time: 709
Sort by first two chars: Time: 420
Sort by double key: Time: 193

Mono times:
List.Sort: Time: 5825
Sort by first char: Time: 4917
Sort by first two chars: Time: 2268
Sort by double key: Time: 409

And for 3000000 items the times are really huge:
.Net times:
List.Sort: Time: 12939
Sort by first char: Time: 9851
Sort by first two chars: Time: 6143
Sort by double key: Time: 2259

Mono times:
List.Sort: Time: 81496
Sort by first char: Time: 70736
Sort by first two chars: Time: 38233
Sort by double key: Time: 5886

So, using latest algorithm presented here, on Mono at least seem to be 10 times faster using this primitive comparison between various strings. I can assume that the string comparison under Mono is not well optimized (like to compare two strings in Mono looks to be around 6-8x slower than .Net) and the tricks used in this algorithm can give to you the user a huge speedup. So if you use string comparison on a Xamarin supported platform for sorting purposes, you may really use a similar techniques as described here.

Saturday, November 14, 2015

PC Hardware is Boring...

I am in fact amazed for the power of your typical computer, you can buy quad core laptops which can run Crysis. Yes, you spin some money out of your pocket but you can run Crysis. But even so, the hardware remains boring. You have the same display, the same OS, the same software mostly centered around browsing, some video playback or some Office or image processing.

Based on this, I want to recommend for your next computer not to buy your next Intel i7, or AMD Zen equivalent. And this is not because they are not fast enough, but only because they are boring. Really, your computer which you bought 5 years ago could do almost anything your can do, excluding maybe to play Crysis. I had an 4 GB PC with quad core CPU, a 10000 RPM (no SSD though) and a good dedicated video card in 2009 and it was priced around 1000 EUR (should be around 1000 USD in US). And if I would dare I could use it as a development machine even now.

Today I have a laptop with similar specifications in the same price range, a but more powerful and consuming around 1/5 of wattage, but other than this, is basically the same hardware. Sure, mobility of laptops is more desirable, but still, I think you can see the point: you can buy some hardware that have basically the same specs that excluding you don't play high end games, you really throw your money out of your window. And no, Civilisation V is not a high end game, neither Dota 2 (excluding you play it professionally).

So, can be found fun in the hardware landscape which is geeky enough but doesn't involve you to buy an overly pricey device? As for now I found two devices which I bough myself (I will point to similar products, to not be direct advertisments): 150-200 USD NUC PC and sub 200 USD (Android) tablets.

As tablets are not necessarily the topic of this post, still they are interesting, especially as tablets are easy to test your software, or also you an make programs and push on them. And even if you don't use for anything else, they will push you notifications (from friends to see an Youtube clip) and you can see it properly. Where I live now (in a Baltic country, where prices are a bit higher than the rest of EU), I could find an 10 inch Atom based tablet which can play 720p video, it has quad core and it is really more than responsive, easy to program and in short, decent.

The NUC PC I see it the most compelling of all: let's look to a device like this. You add to it 2 or 4 GB DDR3L and an hard drive and you put Linux on it. It starts decently fast, it runs browser, it is programable, you can run all software I was curious to run. I tried (just for fun) to run Windows 10 (I use though a 8 GB module) and it ran not that far from my quad core laptop. I could see the lag for example when navigating, but nothing aggravating.

What is so compelling about these devices:
- with a zero cost OS (I am recommending Linux Mint, Ubuntu, etc. ) you can have a very low cost legal machine that can do most of things you would do it anyway with the computer: like emails, youtube, etc., and videos will play (on Linux at least as Full HD)
- if you are a developer which doesn't require Windows (you can use Mono if you want C#) you can use really everything you want to test. I am not sure about OpenCL (there is a library named POCL, but I don't know how stable is it), but if you want to test how to code using 4 cores and check the scalability, you're right at home; if you want to check how to make a small web server using any technology stack, you can do it
- if you care about simulating most user's computers, again, you are safe: most users do not live with super high-end computers at home, so targeting your software to run on these Atom-class CPU machines, you will in fact make it run on a huge number of other machines. I used "Atom-class", because sometimes you can find AMD Kabini CPUs
- a less talked item, which is important, the full system, even in full load will require much less than a typical laptop. I estimate that excluding the display in full-load the machine would use something like 15W, making it more friendly to do even processing over night or to be a server in it's own right. I know we talk watts in a marking way, but let's be pragmatic about it: if you let it your expensive computer over night as a web server in your organization, you have two risks: the power spikes can add to your electricity bill, the second is that having an electricity power shock can burn your pricier PC. Losing a 220 USD (estimated) PC is less risky compared with a full more than 2x times pricier PC.
- kind of a last for me, this machine is powerful enough and compatible enough: you can run full Windows on it (not sure about XP, but definetly Vista, 7, 8 and 10) and Linux.

The single part which is a bit strange is that the raw CPU power of a 10W part is it around what in 2009 a dual-core CPU could do at 65W (if all 4 cores are used, and most software today supports all cores). This means that if you use Gimp (or Photoshop), given memory it would finish in reasonable time (if you are not professional video editor). And this with a cool (both as temperatures and as status) device!

Monday, November 2, 2015

Vulkan, DirecX 12 and the Low Level API craze

A bit of history

There was a long time competition between PCs and consoles for giving the best visual and experience inside games. Also, typically the consoles had high end specifications when they were launched, but later they age fairly quickly because the PC market had bigger competition, but still they offered a consistent and higher frame rates. How was it possible? In part there were two factors: there was no fragmentation so programmers could fully use the hardware components without coding workarounds if a specific hardware component which does no offer hardware acceleration and the second part: the hardware was open to developers (after you sign an NDA) with lower access than classical OpenGL/DirectX API.

Enter Mantle

Mantle was the idea that AMD had to offer this low level for their hardware, and they did work with a game developer (Dice) to make it more usable for "mere mortals". Mantle had a fairly small impact overall for games but a big impact for industry as big (theoretical) potential. Later Mantle was offered as starting API for Vulkan, and Microsoft's DirectX 12, Apple's Metal following suit to offer similar functionality on their (propertary) OSes.

So what is it so special about these low level APIs? (I will do my analysis based on mostly Vulkan based presentations/documentation and my (limited) understanding of DirectX 12 (and assuming that many things are similar)).

Three items are the most important:
- don't do most of "rendering" on the main thread
- move part of driver code in user-space
- don't do (any) validation in "release mode"

Don't render on the main thread

Typically rendering in a classical OpenGL/DirectX application is basically issuing drawing commands against a driver and these commands are processed on a pipeline. Also, there are pixel/vertex shaders which they do pre-post processing of pixels and geometry. For historical reason most of developers are used to draw using main thread, so the drawing has to be done waiting basically on drivers to finish all drawing.

Right now the drawing commands are right now named: Command Buffers and these command buffers can be processed in separate CPU threads, and they can be reused! Follow this presentation for more (high-level) details.

VK_CMD_BUFFER_BEGIN_INFO info = { ... };

vkBeginCommandBuffer(cmdBuf &info);

vkCmdDoThisThing(cmdBuf, ...);

vkCmdDoSomeOtherThing(cmdBuf, ...);

vkEndCommandBuffer(cmdBuf);

This thing in itself can scale horizontally on both higher spec machines but also on lower (yet multi-core) machines as ARM or Atom CPUs which is really great thing for many cores which are not that fast.

Moving the driver code in user space

These command buffers are combined in rendering pipelines. These rendering pipelines which include the pixel/vertex shaders are prepared themselves can be setup on separate threads. Pixel/vertex shaders are right now compiled from a bytecode (named SPIR-V), which makes the scripts loading and processing faster. This item is not for importance in DirectX world because Microsoft was doing it as far as I understood from DirectX 10, so if you think that your game (Dota 2, chough, chough) because it has a lot of pixel shaders to precompile, it would not gonna happen.

Moving most of processing in userspace means both good and bad things. The good thing is that good developers will not have to wait for a driver developer to optimize a specific code path which the game needs. Another good part is that having most code in user-space the code should run faster as many drivers do "Ring" switches (jumping into kernel mode) which is a very expensive call (low microseconds level, but still significant if happens tens or hundreds of times per frame draw, as a rendering time per frame should be around 16 ms). The ugliest thing I can imagine is that very often driver developers for the main video card vendors do a good job. So in this scenario I would expect that driver developers will have fewer ways to improve all games.

Don't do validation

This is why you will hear things like: even if is using one core, the processing is still 30% faster using DirectX 12 (or Vulkan). This is of course a double edged sword: you can get very weird things happening and no one can assist the developers of what went wrong.

The good thing is that Vulkan come with many validation tools in "debug" mode, so you can check the weird mismatches in the code.

Should you install Windows 10 or find a Vulkan driver?

If you are a developer working with graphics, the answer may be yes, otherwise, not sure. Try not to get hyped!! Windows 10 had huge problems at launch with some older NVidia cards (like series 500 or lower). Having DirectX 12 which theoretically would run your future unlaunched game in one year from now means very little for your today usage of your computer.

If you don't play a lot, the situation is even worse, as for most interfaces I'm aware the most time in processing is mostly: font metrics calculation, styling, layouting and like it and sadly none of them are to GPU taxing.

Would Vulkan or DirectX 12 have a big impact? I would expect that in 2-3 yers from now yes, but not because anything changed for the user, but only because the industry will upgrade naturally the software stack.

Code Refractor - Virtual Machines/Compiler performance musings