As I was writing this allocation-free parser, I ported the code to C++ (about 90% faithfully, in the sense that I did not use smart pointers), hoping that the lack of bounds checking or other hidden wins would show up.
The one problem is that C++ is very tricky to optimize. I tried my best: I did not use any bounds checking (so I skipped the STL altogether), I passed everything that was not an integer but a data buffer as a const reference, and so on. In short, I applied all the low-level optimizations I knew while keeping the code at the same level of abstraction as the Java version. For the very curious, on request I will gladly share it as a zip file (the code leaks memory, but the main loop executes with zero memory allocation, exactly like the Java version).
But the biggest bummer: the C++ version ran slower than the Java one.
Most of the time the Java code would achieve a bit more than 800 iterations, rarely 900, and occasionally around 770 (the fluctuations come from the CPU's Turbo, which is very aggressive on a laptop: mine is rated at 2.5 GHz but runs at 3.5 GHz when using one core). With C++ I could iterate over QuickFix's whole test suite in the 700 to 800 range. This was with MinGW GCC 4.9 (32-bit) using -Ofast -flto (the fastest configuration I have found so far). The part where C++ wins hands down compared with Java is memory usage: the C++ implementation used just over 5 MB, while the Java implementation used 60 MB. So there are differences, but still, Java ran visibly faster. I also tried GCC on Ubuntu, but Ubuntu ships GCC 4.8 (64-bit), and at least this code does not seem to optimize well there: I got just 440 iterations.
But you know what? The Java code was really straightforward: no configuration, no runtime optimization settings. Everything simply ran faster. There is not even a debug/release distinction. Java runs at full speed (roughly equivalent to GCC -O3) up to the point where it hits a breakpoint; only then does it fall back to interpreter mode.
Even if it seems kind of stupid, I think some conclusions can be drawn from this. If Java can run this smoothly in so many situations, an office suite like, say, LibreOffice would have been better off gradually rewritten in Java, instead of removing Java because it starts a bit slower. I can imagine a hypothetical future where JavaFX handled the dialogs, later the canvas, and the suite would work on almost every platform where JavaFX runs, including but not limited to iPhone (it would require RoboVM though, which today is proprietary) and Android (Gluon), and it would have support for common databases (because of JDBC, which has very wide support) to fill data into the "Excel" (tm) component of the suite.
Lastly, let's not forget the tooling and build times. Java compilation takes a fraction of the time; most of the build time is spent copying JARs.
But as it is, if you have high volume and require high throughput from your program, try Java: you may really break records.
Thursday, February 18, 2016
Tuesday, February 16, 2016
Scanning FIX at 3 Gbps
Have you heard about the FIX protocol? It is a financial information exchange protocol, used extensively as a de-facto format in many areas, and the format itself is basically a series of dictionary-like key=value pairs (for example "8=FIX.4.2", "35=D", "55=MSFT", each field terminated by an SOH byte).
So, can you make a quick parser to process FIX files? I wrote a mini FIX parser in Java that uses FlatCollections for tokenizing, and the final numbers are really great. But let's clear the ground: most of the ideas are in fact not mine; they are based on the "Mechanical Sympathy" talks (I recommend Martin Thompson's presentations), meaning that if you understand the hardware (or at least the compilers and their internal costs), you can achieve really high numbers.
So I looked at the QuickFix library, a standard, open-source, and complete implementation of the FIX protocol. It has some problems in how its code runs, so I took all of its FIX protocol sample files: around 450 files, combined around 475 KB of ASCII. I set up my internal benchmark as follows: given that the files are already in memory, how quickly can I parse them and hand every tag to the user, with enough information to recreate the data? Since the code for one file should be really quick (given there is no allocation in row splitting, which I had already achieved), I made the following "benchmark": how many times per second can I iterate over these files (already in memory), split them into rows, and tokenize them? The short answer: between 700 and 895 iterations (using one core of an Intel Core i7-4710HQ CPU @ 2.50 GHz). The variation, I think, is related to the CPU's Turbo. I am not aware of the code having any hidden allocations (so it is allocation-free). With a few allocations (as the code was before using FlatCollections) you get into the 500-700 iteration range (around 2.5 Gbps processing speed).
So, if you average 800 iterations per second, you can parse around 380 MB/s of FIX messages (475 KB x 800, or around 3 Gbps) using just one core of a laptop in Java (Java 8u61/Windows). For another statistic: most messages are a few tens of bytes, so it is safe to assume this parsing code scans about 20 million messages per second.
I don't endorse replacing your QuickFix with this minimal FIX implementation, but who knows: if you need a good starting point (and, who knows, support ;) ) for writing a very fast FIX parser, this is a good place to start.
So, if you want to look inside the implementation:
https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniFix
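To make the approach concrete, here is a minimal sketch of allocation-free FIX tokenizing. This is my own illustration, not the MiniFix code; the class and handler names are assumptions. The idea is one pass over the byte buffer, reporting each tag and the value's offsets instead of materializing String objects:

```java
// A minimal sketch of allocation-free FIX tokenizing: walk the byte buffer
// once and report each tag/value pair as integer offsets instead of
// building String objects per field.
final class FixScanner {

    // Callback receiving each field; value bytes live in [valueStart, valueEnd)
    public interface TagHandler {
        void onTag(int tag, byte[] buffer, int valueStart, int valueEnd);
    }

    private static final byte SOH = 0x01; // FIX field delimiter

    public static void scan(byte[] buffer, int length, TagHandler handler) {
        int pos = 0;
        while (pos < length) {
            // parse the numeric tag up to '='
            int tag = 0;
            while (pos < length && buffer[pos] != '=') {
                tag = tag * 10 + (buffer[pos] - '0');
                pos++;
            }
            pos++; // skip '='
            int valueStart = pos;
            while (pos < length && buffer[pos] != SOH) {
                pos++; // scan the value up to the SOH delimiter
            }
            handler.onTag(tag, buffer, valueStart, pos);
            pos++; // skip SOH
        }
    }
}
```

Nothing is allocated per field; the handler decides whether a value is worth copying, which is essentially why a parser like this stays allocation-free in the hot loop.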
Saturday, February 13, 2016
Java's Flat Collections - what's the deal? (Part II)
I thought about the cases where people would want to use flat collections. The most obvious are things like a "point array" or a "tuple array", but thinking about it more, I found another interesting and fairly common case: "rectangle", "triangle", and similar constructs.
Typically, when people define a circle, for instance, they would build it as:
class Circle {
    Point2f center = new Point2f();
    float radius;
}
Maybe without noticing, if you have to store one hundred circles on a 32-bit machine, you will in fact store much more data than just center.x, center.y, radius: 3 x 4 bytes = 12 bytes per circle, so about 1.2 KB for 100 circles. Instead, you get:
- 100 entries for the reference table: 400 bytes
- 100 headers of Circle object: 800 bytes
- 100 references to Point: 400 bytes
- 100 headers of Circle.center (Point2F): 800 bytes
- 100 x 3 floats: 1200 bytes
So instead of your 1.2 KB payload you are at 3.6 KB; flattening therefore gives a 3X reduction in memory usage.
If you have 100 Line instances, each holding 2 instances of Point2f, then instead of 1600 B you have: (ref table) 400 + (object headers) 2400 + (references to the internal points) 800 + (payload) 1600 = 5200 B, a 3.25X memory reduction.
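To make the flattening concrete, here is a hand-written sketch of the idea for Line. This is my own illustration, not FlatCollection's generated code, and all names are assumptions: every line becomes four consecutive floats in a single array, so the headers and references counted above simply disappear.

```java
// Hand-written sketch of flattening: instead of 100 Line objects, each
// pointing at 2 Point2f objects, keep every line's 4 floats consecutively
// in one float[]. One header, one reference, no per-element objects.
final class FlatLineList {
    private static final int FLOATS_PER_LINE = 4; // x1, y1, x2, y2
    private float[] data;
    private int size;

    FlatLineList(int capacity) {
        data = new float[capacity * FLOATS_PER_LINE];
    }

    void add(float x1, float y1, float x2, float y2) {
        int base = size * FLOATS_PER_LINE;
        if (base + FLOATS_PER_LINE > data.length) {
            // grow like ArrayList does, doubling the backing array
            data = java.util.Arrays.copyOf(data,
                    Math.max(FLOATS_PER_LINE, data.length * 2));
        }
        data[base] = x1;
        data[base + 1] = y1;
        data[base + 2] = x2;
        data[base + 3] = y2;
        size++;
    }

    float x1(int i) { return data[i * FLOATS_PER_LINE]; }
    float y1(int i) { return data[i * FLOATS_PER_LINE + 1]; }
    float x2(int i) { return data[i * FLOATS_PER_LINE + 2]; }
    float y2(int i) { return data[i * FLOATS_PER_LINE + 3]; }
    int size() { return size; }
}
```

The payload is exactly 16 bytes per line plus one array header for the whole collection, which is where the 3.25X saving above comes from.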
A simple benchmark shows that not only memory is saved, but also performance. If you use Line (with its 2 internal points) and populate flat collections instead of plain Java objects, you get the following numbers:
Setup values 1:12983 ms.
Read time 1:5086 ms. sum: 2085865984
If you use plain Java objects instead, you get a big slowdown on both reading and writing:
Setup values 2:62346 ms.
Read time 2:18781 ms. sum: 2085865984
So you get more than a 4x speedup on writes (i.e. populating the collections) and over a 3x speedup on reads by flattening most types.
The last improvement? Reflection works, but it is sometimes ugly to create a type, reflect over it, and only then use it with this flat-type code generator. So right now the input config is JSON-based, and you can create your own "layouts" (meaning a "flat object") on the fly.
Here is the formatted input of the code generator, from a file named flatcfg.json:
{ "typeName": "Point3D", "fields": ["X", "Y", "Z"], "fieldType": "float"}
This would create a flat class Point3D with 3 floats in it named X, Y, Z (meaning the cursor will expose getX/setX and so on).
Wednesday, January 27, 2016
Java's Flat Collections - what's the deal? (Part I)
I've been moving my job to a Java environment, and as my previous interest in performance is still there, I thought about why Java sometimes runs slow (even typically, after you have waited for your code to warm up), and here I describe some solutions that exist around Java's ecosystem.
There are, for now, three solutions competing to give high-performance Java code, by enabling fast execution or OS integration:
- PackedObjects: a way to remove object headers from object collections. Sadly it works for now only with IBM JVMs. It should primarily be used by JNI-like code, speeding it up by removing per-element copying. It requires medium changes to the compiler and garbage collector, but no (or minimal) language changes.
- ObjectLayout: a way to give the JVM hints to allocate arrays contiguously, in a structured manner. It requires GC changes and very few annotations, but no language changes.
- Arrays 2.0 (or Project Panama): the project that basically plans to bring .NET's struct type to Java. This is the most extensive of all, because it has to change bytecodes, compiler internals, and the GC.
So, I'm here to present a new solution which I found handy. It is at a very early stage, but it requires no language changes (still, to take advantage of it you need a few code changes of your own). It should work with any Java newer than 5.0 (maybe even Java 1.2, but I'm not 100% sure), and where it does not fully work, it will be very easy to patch.
Enter FlatCollection, a code generator that flattens your collections and makes it easy to write high(er)-performance code for many common cases.
How does it work:
- you find types whose fields are all of the same type (for now I think the code supports only primitive types; the fully working prototype uses a Point with x, y integer fields, but very likely by the time you read this, the generator will work with any field type)
- you add all those types, with their full namespace, to the input.flat file
- you run the project to create two flat classes per type: an ArrayListOfYourType and a CursorOfYourType
- you copy the generated files into a "flatcollections" package that you add to your project
Look at the "realistic" benchmark code in RunBench.java, which compares the same code using an array of Point and this mapped ArrayList.
In this kind of real-life use, the memory consumption of this collection is around half that of a full array of Points, and both populating and reading it are at least 2x-4x faster.
How does it work? It merges all fields into one continuous array of primitive types, basically removing one indirection and many allocation costs.
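As a hedged illustration of that idea (my own sketch, not the generated code; all names here are assumptions), a flat list of Point with a reusable cursor might look like this. The cursor is the one object you allocate; iterating over the whole collection allocates nothing per element:

```java
// Sketch of the "flat list + cursor" pattern: one float[] holds the x,y
// pairs, and a single reusable Cursor reads fields at the current row,
// so iteration allocates nothing per element.
final class FlatPointList {
    final float[] data; // x0, y0, x1, y1, ... (fixed capacity for brevity)
    int size;

    FlatPointList(int capacity) {
        data = new float[capacity * 2];
    }

    void add(float x, float y) {
        data[size * 2] = x;
        data[size * 2 + 1] = y;
        size++;
    }

    // Reusable view over one row at a time, in the spirit of the
    // generated CursorOfYourType classes.
    final class Cursor {
        private int row = -1;
        boolean next() { return ++row < size; }
        float getX() { return data[row * 2]; }
        float getY() { return data[row * 2 + 1]; }
    }

    Cursor cursor() { return new Cursor(); }
}
```

A typical read loop would be `Cursor c = points.cursor(); while (c.next()) { use(c.getX(), c.getY()); }`, which is the single indirection the flattened layout leaves in place.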
In the future I will extend the samples to show parsing of CSV files and similar operations. If you reuse the collection via a .clear() call, no reallocation is needed, unless the new per-row code allocates more memory than the previous rows did.
Why is it important to flatten the data layout? Basically, you reduce GC hits, and you can naturally map code that would otherwise be ugly (say, having a Tuple class). Also, the full-GC cost (which involves visiting all the small objects in your heap) is very low for these collections. So I would assume that at least for batch processing, or maybe for games written in Java, it could be a really friendly tool of the trade.
What should be done:
- it should be tested to work with collections of any type, and it should support specializations
- it should work with POJOs whose fields are not exposed, including Tuple classes
- not mandatory, but it should support iterators or other friendly structures
Sunday, December 20, 2015
Fx2C - small tweaks
The Fx2C transpiler got a bit smarter, and in a meaningful way. If you plan to mix Kotlin with FXML (a combination I think neither Oracle nor JetBrains thought of supporting), you will find there is no straightforward way to support @FXML fields.
Do not worry: if you have a Kotlin controller class, you can write something like this inside the fxml file:
<!-- Flags: IsKotlin-->
And the Kotlin code will work seamlessly:
package desktool.views.kotlinView

import javafx.event.ActionEvent
import javafx.fxml.FXML
import javafx.scene.control.Button
import javafx.scene.control.Label

class LineToolControler {
    var thickness = 1

    @FXML
    var label: Label? = null

    @FXML
    var button: Button? = null

    @FXML
    fun handleButtonAction(event: ActionEvent) {
        println("You clicked me!")
        label?.setText("Hello World!")
    }
}
This is not to say that it is better to use Kotlin right now, but the idea is that the code may be extended in the future to support various other languages (Scala comes to mind) with JavaFX.
Another important part: the compiler now generates a preloader class (Fx2CPreloader) which loads all classes into memory, so the second time around the JVM starts the dialogs in a fraction of a second. In my unscientific testing on a slow Atom (Bay Trail) machine, loading the classes of a medium-sized dialog into the JVM for the first time could take around 600 ms; with preloading, it drops to 2-5 ms.
So, is it important to use this preloader? I think so, especially if you have many UI elements: running the preloader under a splash screen will warm up the JVM and make your dialogs appear (close to) instantly on the second run.
This Fx2C tool is mature enough for me to put it into maintenance mode for now. It works well enough for my taste and usage, and I will likely use it to make more JavaFX applications that simply feel responsive, something I missed coming from the .NET WPF environment.
Wednesday, December 16, 2015
Vampire Logic - my team's entries
I participated in a zombie-themed hackathon, creating 3 apps using the Microsoft Universal Apps APIs. My ad-hoc team (Vampire Logic) did really well there.
The source code for these 3 working apps (in the categories Productivity/Games/Educational) was created in 18 hours combined (around 6 hours per app, give or take).
Productivity: an image editor with camera support, Modern UI, live preview and infinite Undo levels:
https://github.com/ciplogic/zombie_productivity
Game: a zombie ship fight in a top-down shooter with an animated background. It uses a Canvas control efficiently, with fluent animations, using only standard Universal Apps code:
https://github.com/ciplogic/zombie_game
Educational: an interactive math game where harder and harder math problems are given in a limited time. How long will you survive?
https://github.com/ciplogic/zombiedu
The coding practices may be spotty at times, but given that each application was written in only 6 hours (and this was our first ever Universal Apps coding experience), all applications had no known bugs in the way they were coded (no crashing, no hiding errors behind big try-catches, or similar).
Coding language: C#
Team members: Dāvis Sparinskis, myself, Linda Legzdiņa, Rudolf Petrov
Some photos with my team:
Friday, December 11, 2015
Finding non-duplicates in an array (using Java)
I had a job interview (and I will do a lot more Java in day-to-day coding, yay!), and one part of the interview was about finding the non-duplicated values in an array. There is a well-known solution (which I will not divulge) with O(n) complexity (meaning it should scale at most linearly with the size of the array), but can it run faster?
Beating O(n) complexity-wise is pretty much impossible, because you have to traverse the array of data at least once. But as we've seen in an older post, if you can get the constant down, you get a faster algorithm. So, what about naively scanning the chunk of data and keeping the counts in an int array?
This is not my best benchmark (and I didn't try all combinations), but up to around 100 elements, a native int[] structure runs faster than the O(n) solution.
The code that searches for the first such value (returning Integer.MIN_VALUE if every value is duplicated) is the following:
public static int GetFirstNonRepeatingIntArray(int[] values) {
if (values == null) {
throw new NullPointerException("values");
}
int[] counters = new int[values.length * 2];
int counterSize = 0;
for (int index = 0; index < values.length; index++) {
int toFind = values[index];
int foundIndexValue = getIndexOf(counters, counterSize, toFind);
if (foundIndexValue == -1) {
counters[counterSize * 2] = toFind;
counters[counterSize * 2 + 1] = 1;
counterSize++;
} else {
counters[foundIndexValue * 2 + 1]++;
}
}
for (int index = 0; index < counterSize; index++) {
if (counters[index * 2 + 1] == 1) {
return counters[index * 2];
}
}
return Integer.MIN_VALUE;
}
public static int getIndexOf(int[] counters, int counterSize, int toFind) {
for (int foundIndex = 0; foundIndex < counterSize; foundIndex++) {
if (counters[foundIndex * 2] == toFind) {
return foundIndex;
}
}
return -1;
}
For example, with 10,000 repetitions of the algorithm on arrays of 50 items (randomly generated in the range 1 to 25), it gives the following output:
Total execution time: 100ms
Total IntArray execution time: 31ms
Does this kind of coding have implications for your day-to-day work? I would say yes: for typical "Settings" kinds of classes you are better off using arrays/lists than dictionaries, even if it is counter-intuitive. You are very unlikely to have hundreds of settings, and both the debugging experience and the performance may be better. Memory usage (especially compared with Java's HashMap implementation) is also much better.
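As a hedged sketch of that advice (my own illustration; the class name and API are assumptions, not from any library), a tiny settings holder backed by parallel arrays might look like this. For a handful of keys, the linear scan is competitive with a HashMap and is trivial to inspect in a debugger:

```java
// Small-settings sketch: parallel key/value arrays with linear scan.
// For a few dozen entries this beats a HashMap's hashing, node objects,
// and memory overhead, and every entry is plainly visible in a debugger.
final class SmallSettings {
    private final String[] keys;
    private final String[] values;
    private int size;

    SmallSettings(int capacity) {
        keys = new String[capacity];
        values = new String[capacity];
    }

    void put(String key, String value) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) { // overwrite an existing key
                values[i] = value;
                return;
            }
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    String get(String key) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null; // key not present
    }
}
```

The fixed capacity is a deliberate simplification; a real version would grow the arrays, but the point is the data layout, not the resizing policy.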