Tuesday, April 26, 2016

Fx2C Updates - handling loading Fxml 3D objects

The Fxml-to-Java compiler speeds up the showing of controls on low-spec machines, and a very nice contributor fixed support for adding CSS styles. I had never tested that feature, and I noticed that some other edge cases were not supported either.

The main use case is this one: you want to use Fxml to import JavaFX 3D objects, which require the inner text of XML tags to be handled separately. For example, this is valid Fxml:
<?xml version="1.0" encoding="utf-8"?>
<?import javafx.scene.paint.Color?>
<?import javafx.scene.paint.PhongMaterial?>
<?import javafx.scene.shape.MeshView?>
<?import javafx.scene.shape.TriangleMesh?>
<MeshView id="Pyramid">
   <material>
      <PhongMaterial>
         <diffuseColor>
            <Color red="0.3" green="0.6" blue="0.9" opacity="1.0"/>
         </diffuseColor>
      </PhongMaterial>
   </material>
   <mesh>
      <TriangleMesh>
         <points>0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 -1.0 -1.0 1.0 0.0 0.0 -1.0 0.0</points>
         <texCoords>0.0 0.0</texCoords>
         <faces>0 0 4 0 1 0 1 0 4 0 2 0 2 0 4 0 3 0 3 0 4 0 0 0 0 0 1 0 2 0 0 0 2 0 3 0</faces>
         <faceSmoothingGroups>1 2 4 8 16 16</faceSmoothingGroups>
      </TriangleMesh>
   </mesh>
</MeshView>
This file is definitely valid Fxml, but the Fx2C compiler was not able to handle it: the nodes contain inner text.

If you want more samples and importers for multiple 3D formats (like STL or Collada), follow the next link:

Now it does: for the previous Fxml file, the Fx2C compiler will emit the following code, which is close to the fastest way to define a MeshView:
public final class FxPyramid {
   public MeshView _view;
   public FxPyramid() {
      MeshView ctrl_1 = new MeshView();
      PhongMaterial ctrl_2 = new PhongMaterial();
      Color ctrl_3 = new Color(0.3, 0.6, 0.9, 1.0);
      ctrl_2.setDiffuseColor(ctrl_3);
      ctrl_1.setMaterial(ctrl_2);
      TriangleMesh ctrl_4 = new TriangleMesh();
      ctrl_4.getPoints().setAll(0.0f, 1.0f, 1.0f, 1.0f, 1.0f, 0.0f, 0.0f, 1.0f, -1.0f, -1.0f, 1.0f, 0.0f, 0.0f, -1.0f, 0.0f);
      ctrl_4.getTexCoords().setAll(0.0f, 0.0f);
      ctrl_4.getFaces().setAll(0, 0, 4, 0, 1, 0, 1, 0, 4, 0, 2, 0, 2, 0, 4, 0, 3, 0, 3, 0, 4, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 3, 0);
      ctrl_4.getFaceSmoothingGroups().setAll(1, 2, 4, 8, 16, 16);
      ctrl_1.setMesh(ctrl_4);
      _view = ctrl_1;
   }
}

Tuesday, March 15, 2016

Reifying DSL for FlatCompiler

An important part of FlatCollections is that, at least memory-wise, code can be rewritten only a little to get a big speedup and fewer GCs. But as always there are tradeoffs. One of them is that the generator itself was hardcoded to emit a ListOf<T> (with a reified name like ListOfPoint3D) and a cursor over this list.

This is all great, but what if the ListOf<T> should contain an extra method? Or what if there is a need to generate an extra method for every getter/setter? For this reason there is now a simple (I hope) template generator with reified semantics, which for now works only for classes but is really important.

To define a flat type, you would write something like:
flat Point3D {
  X, Y, Z: double
}
The generated code will then be aware of these fields, which are filled in later.

The code generator is driven by a template like the following:
each fieldNames : fieldName, index {
    sub set@fieldName (TValue value) {
        _list.set(_offset + index, value);
    }

    sub get@fieldName (): TValue {
        return _list.get(_offset + index);
    }
}
Sure, the code looks a bit strange, but it does the job most of the way. There are items like TValue and so on; they are resolved semantically:
class FlatCursor<T> {
    where {
        TValue = T.valueType
        fieldNames = T.fieldNames
    }
    (...) // class content
}

The actual solving happens through a bit of semantic magic:
specialize ListOf { Point3D }
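To make the template concrete, here is a sketch of what the expanded code for the Point3D flat type could look like in Java. This is my own illustration, not the generator's actual output; the class name, the backing List<Double> and the moveTo method are assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical expansion of the set@fieldName / get@fieldName template
// for the Point3D flat type: X, Y, Z map to offsets 0, 1, 2 inside a
// shared backing list of doubles.
class Point3DCursor {
    private final List<Double> _list;
    private int _offset;

    Point3DCursor(List<Double> list) { this._list = list; }

    // Position the cursor on the i-th Point3D (3 doubles per element).
    void moveTo(int i) { _offset = i * 3; }

    void setX(double value) { _list.set(_offset + 0, value); }
    double getX() { return _list.get(_offset + 0); }

    void setY(double value) { _list.set(_offset + 1, value); }
    double getY() { return _list.get(_offset + 1); }

    void setZ(double value) { _list.set(_offset + 2, value); }
    double getZ() { return _list.get(_offset + 2); }
}
```

Each `sub set@fieldName` instance becomes a concrete setter with the field's index baked in, which is exactly the kind of repetitive code you would not want to write by hand.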
I would love to improve it further in the future, but your mileage may vary. The most important part is that soon the reification can work fairly smartly: the more logic I add into this mini-compiler, the more constructs can be supported and the more bugs can be found.

Read the latest code in the GitHub project:

Friday, March 4, 2016

Is the acquisition of Xamarin useful for the typical C# developer?

TLDR In my view, in the short term: yes; in the long term: no way!

The Xamarin/Mono stack missed many features of Microsoft's .Net stack and will always suffer if there is no real economic force behind it. Xamarin Studio, for example, is in a painful state: bugs are fixed slowly, the recommended IDE is still Visual Studio, though you can also use Xamarin Studio for various purposes. It is stuck with Gtk# 2.x, very nicely styled though, and sits on a framework unknown to most developers (Xwt).

Xamarin bought RoboVM, which means that whether you are a C# or a Java developer wanting to target iOS, you may need to rely on Xamarin (now Microsoft).

My perspective on the medium-term plan for the Mono platform: Mono will be a less ambiguous target, and more bugs will be addressed simply by having one implementation across .Net, CoreCLR and Mono. Another good thing: I would assume that in the future CoreCLR on Linux will be merged with Mono, either by migrating the GC of CoreCLR (which is more advanced than whatever Mono had) or by migrating the debugging infrastructure from Mono to CoreCLR. This means that if you target Asp.Net Core 1.0+, you will definitely benefit from the platform correctness and a better experience deploying to Linux.

Another good part of the toolset is simply that Microsoft .Net, as a merged platform, will work directly on iOS, maybe with lower license costs.

But that is just the 1-to-2-year picture for me; after that, I would assume some things will turn more negative for non-Microsoft platforms:
- support may be delayed and slowed down, in particular because supporting .Net will have to be extended to most of Mono's targets, CPU architectures and so on
- no competition, not even partnering competition (as with the Java/OpenJDK ecosystem), will mean that IDE options (SharpDevelop is basically discontinued; maybe Xamarin Studio will be discontinued too) will come from basically two vendors: one with full integration with various frameworks (Microsoft) and one very well integrated for code editing (JetBrains). Both of them may cost money, so I would assume this will not be very startup-friendly
- having close to a monopoly, with a single vendor implementing its own runtime, is kind of a kill-switch for making your next project target .Net, unless you are Microsoft or you already have a big investment in .Net technology

Thursday, February 18, 2016

Question: "Does Java run faster than C and C++ today?"

As I was writing this allocation-free parser, I ported the code to C++ (90% of it, in the sense that I did not use smart pointers), hoping that the lack of bounds checking or other hidden wins would show up.

The single problem is that C++ is very tricky to optimize. I tried my best: I did not use any bounds checking (so I skipped using the STL altogether), I passed everything that was not an integer (such as data buffers) as const reference, and so on. So I did all the low-level optimizations I knew, and the code kept the same level of abstraction as the Java version. For very curious people, on request I will be glad to provide it as a zipped file (the code leaks memory, but the main loop executes with zero memory allocation, exactly like the Java one).

But the biggest bummer for C++ is that it ran slower than Java.

Most of the time the Java code would achieve a bit more than 800 iterations, rarely 900, and rarely something like 770 iterations (there are fluctuations because of the CPU's Turbo, which is very aggressive on a laptop: mine is rated at 2.5 GHz but operates at 3.5 GHz when using one core). With C++ I could iterate over all of QuickFix's test suite in the 700-to-800-iteration range. This was with MinGW GCC 4.9 (32-bit) with -Ofast -flto (for now the fastest configuration). The part where C++ wins hands down compared with Java is memory usage: the C++ implementation used just a bit over 5 MB, while the Java implementation used 60 MB. So there are differences, but still, Java ran visibly faster. I also tried GCC on Ubuntu, but Ubuntu ships GCC 4.8 (64-bit), and at least this code seems not to optimize well there: I got just 440 iterations.

But you know what? The Java code was really straightforward: no configuration or runtime optimization settings. Everything just ran faster. There is not even a debug/release configuration. Java runs just as quick (roughly equivalent to GCC -O3) up to the point it hits a breakpoint; if you hit a breakpoint, it will go back to interpreter mode.

Even if it seems kind of stupid, I think I can draw some conclusions from it. If it is possible in many situations for Java to run this smoothly, an office suite like, say, LibreOffice would be better off gradually rewritten in Java, instead of removing Java because it starts a bit slower. I could imagine a hypothetical future where JavaFX provides the dialogs, later the canvas, and it would work on almost all platforms where JavaFX runs, including but not limited to iPhone (it would require RoboVM though, which today is proprietary) and Android (Gluon), and it would have support for common databases (because JDBC has very wide support) to fill data into the "Excel" (tm) component of the suite.

At last, let's not forget the tooling and build times. Java compilation takes really a fraction of the time; most of the build time is copying JARs.

But as it is, if you have high volume and you require high throughput from your program, try Java; you may really break records.

Tuesday, February 16, 2016

Scanning FIX at 3 Gbps

Have you heard about the FIX protocol? It is a financial exchange protocol. It is used extensively as a de-facto format in many areas, and the format itself is basically many dictionary-like key-value pairs.

So, can you make a quick parser to process FIX files? I wrote a mini FIX parser in Java that uses FlatCollections for tokenizing, and the final numbers are really great. But let's clear the ground: most of the ideas are in fact not mine; they are based on talks about "Mechanical Sympathy" (I recommend the presentations of Martin Thompson), meaning that if you understand the hardware (or at least the compilers and their internal costs) you can achieve really high numbers.

So I looked at the QuickFix library, a standard, open-source (complete) implementation of the FIX protocol, but it has some problems in how its code runs, so I just took all its FIX protocol sample files: around 450 files, 475 KB of ASCII combined. I set up my internal benchmark as follows: considering that I have them in memory, how quickly can I parse them, give every full tag to the user, with enough info to recreate the data? As the code for one file should be really quick (given there is no allocation in row splitting, which I had already achieved), I made the following "benchmark": how many times per second can I iterate over these files (already loaded in memory), split them into rows and tokenize them? The short answer: between 700 and 895 iterations (using one core of an Intel Core i7-4710HQ CPU @ 2.50GHz). The variation, I think, is related to the CPU's Turbo. I am not aware of the code having hidden allocations (so it is allocation-free). With a few allocations (as it was before using FlatCollections) you get into the 500-700 iterations range (or a 2.5 Gbps processing speed).

So, if you have (on average) 800 iterations per second, you can parse around 380 MB/s of FIX messages (or around 3 Gbps) using just one core of one laptop running Java (Java 8u61/Windows). If you want another statistic: most messages are a few tens of bytes, so it is safe to assume that this parsing code scans about 20 million messages per second.
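As an illustration of the general idea (this is a minimal sketch I am adding here, not the actual FlatCollections-based parser; the class and method names are mine): FIX fields are tag=value pairs separated by the SOH byte (0x01), and they can be visited in place without allocating an object per field:

```java
// Minimal allocation-free FIX field scanner: walks a byte buffer of
// tag=value pairs delimited by SOH (0x01), parsing each tag as an int
// directly from the bytes, so no substrings are created per field.
class FixScanner {
    // Sums all tag numbers as a stand-in for "visit every field".
    static long scanTags(byte[] msg) {
        long tagSum = 0;
        int i = 0;
        while (i < msg.length) {
            int tag = 0;
            while (i < msg.length && msg[i] != '=') {
                tag = tag * 10 + (msg[i] - '0'); // parse tag digits in place
                i++;
            }
            i++; // skip '='
            while (i < msg.length && msg[i] != 0x01) i++; // skip the value
            i++; // skip SOH
            tagSum += tag;
        }
        return tagSum;
    }
}
```

A real parser would also hand the value's start offset and length to the caller instead of skipping it, which keeps the whole loop free of per-message garbage.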

I don't endorse switching your QuickFix to this minimal FIX implementation, but who knows: if you need a good starting point (and, who knows, support ;) ) for writing a very fast FIX parser, this is a good place to start.

So, if you want to look inside the implementation:

Saturday, February 13, 2016

Java's Flat Collections - what's the deal? (Part II)

I thought about cases where people would want to use flat collections. The most obvious is, for example, a "Point array" or a "Tuple array", but thinking about it more I found another case which is also kind of common: a "rectangle", a "triangle" or similar constructs.

Typically, when people define a circle for instance, they would build it as:
class Circle {
   Point2f center = new Point2f();
   float radius;
}

Maybe without noticing, if you have to store one hundred circles on a 32-bit machine, you will in fact store much more data than just center.x, center.y, radius x 4 bytes = 12 bytes per circle, i.e. 1.2 KB (more or less) for 100 circles. It looks more like:
- 100 entries for the reference table: 400 bytes
- 100 headers of Circle object: 800 bytes
- 100 references to Point: 400 bytes
- 100 headers of Circle.center (Point2F): 800 bytes
- 100 x 3 floats: 1200 bytes

So instead of your payload of 1.2 KB you are at 3.6 KB; flattening would give a 3X memory compaction.

If you have 100 Line instances, each of which holds 2 instances of Point2f, you will have instead of 1600 B: (ref table) 400 + (object headers) 2400 + (references to the internal points) 800 + (payload) 1600 = 5200 B, which is a 3.25X memory compaction.
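The arithmetic above can be double-checked in a few lines, assuming (as the post does) 4-byte references and 8-byte object headers on a 32-bit-style JVM; the class and method names here are mine, for illustration only:

```java
// Back-of-the-envelope check of the memory figures for 100 Line
// objects, each holding 2 Point2f. Assumes 4-byte references and
// 8-byte object headers (the 32-bit layout used in the post).
class LineOverhead {
    static int javaBytes(int n) {
        int refTable  = n * 4;            // reference-table entries
        int headers   = (n + 2 * n) * 8;  // n Line headers + 2n Point2f headers
        int pointRefs = 2 * n * 4;        // each Line references 2 points
        int payload   = 2 * n * 2 * 4;    // 2 points x 2 floats x 4 bytes
        return refTable + headers + pointRefs + payload;
    }
    static int flatBytes(int n) {
        return 2 * n * 2 * 4; // a flat layout stores only the payload floats
    }
}
```

For n = 100 this reproduces the 5200 B vs. 1600 B figures, i.e. the 3.25X compaction.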

A simple benchmark shows that not only memory is saved, but also performance. If you use Line (with its 2 internal points) and populate flat collections instead of plain Java objects, you get the following numbers:

Setup values 1:12983 ms.
Read time 1:5086 ms. sum: 2085865984

If you will use Java objects, you will have a big slowdown on both reading and writing:
Setup values 2:62346 ms.
Read time 2:18781 ms. sum: 2085865984

So you get more than a 4x speedup on writes (populating the collections) and a 3x speedup on reads by flattening most types.

The last improvement? Reflection works, but sometimes it is ugly to create a type, reflect on it and use it later just to feed this generator of flattened types. So right now the input config is JSON-based, and you can create your own "layouts" (meaning "flat objects") on the fly:
{ "typeName": "Point3D",
  "fields": ["X", "Y", "Z"],
  "fieldType": "float"}
This code would create a flat class Point3D with 3 floats in it named X, Y, Z (meaning the cursor will use a "getX/setX" and so on).

Here is the attached formatted input of the code generator file named: flatcfg.json.

Wednesday, January 27, 2016

Java's Flat Collections - what's the deal? (Part I)

I've moved my job to a Java environment, and as my previous interest in performance is still there, I thought about why Java sometimes runs slow (even typically, after you wait for your code to warm up), and I describe here some solutions that exist around Java's ecosystem.

There are for now 3 competing solutions that aim to enable high-performance Java code through fast execution or OS integration:
- PackedObjects, a way to remove object headers from object collections; sadly it works for now only with IBM JVMs. It should primarily be used by JNI-like code, to speed it up by removing individual copying. It requires medium changes in the compiler and garbage collectors but no language changes (or minimal ones)
- ObjectLayout, a way to give hints to the JVM to allocate arrays contiguously in a structured manner, which may or may not be implemented. It requires GC changes and very few annotations, but no language changes
- Arrays 2.0 (or Project Panama), the project that basically plans to bring .Net's struct type to Java. This is the most extensive of all because it has to change bytecodes, compiler internals and the GC

So, I'm here to present a new solution which I find handy, though it is at a very early stage. It requires no language changes (still, to take advantage of it, you need a few code changes of your own), and it should work with any Java newer than 5.0 (maybe even Java 1.2, but I'm not 100% sure); where it is not fully possible to use this solution, it will be very easy to patch.

Enter FlatCollection, a code generator which flattens your collections and makes it very easy to write high(er)-performance code for many common cases.

How does it work:
- you pick any types whose fields all share the same type (for now I think the code supports only primitive types, as the working prototype uses a Point with x, y integer fields, but very likely by the time you read this, the generator will work with any field type)
- you add all the types, with their full namespace, to the input.flat file
- you run the project to create two flat classes for each type: an ArrayListOfYourType and a CursorOfYourType
- you copy all the generated files into a package named "flatcollections" in your project
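For illustration only (the real file format may differ, and the type names here are hypothetical), input.flat would then contain one fully-qualified type per line, for example:

```
com.example.geom.Point
com.example.geom.Tuple2
```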

Look inside a "realistic" benchmark to see the same code using an array of Point and this mapped ArrayList, in RunBench.java.

In this kind of real-life code, the memory consumption of this collection is about half that of a full array of points, and the performance of both populating and reading it is at least 2x-4x better.

How does it work? It merges all fields into one continuous array of "primitive" types, removing basically one indirection and many allocation costs.
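As a rough sketch of what such a generated class boils down to (illustrative only; the real generated code differs), flattening a Point of two ints means all elements live interleaved in one growable primitive array:

```java
import java.util.Arrays;

// Sketch of the flattening idea: every Point's x,y pair is stored
// interleaved in a single int[], so the whole collection costs one
// allocation instead of one object per point.
class ArrayListOfPoint {
    private int[] data = new int[16]; // x,y pairs, interleaved
    private int size;                 // number of points stored

    void add(int x, int y) {
        if (2 * size + 2 > data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow by doubling
        }
        data[2 * size] = x;
        data[2 * size + 1] = y;
        size++;
    }
    int getX(int i) { return data[2 * i]; }
    int getY(int i) { return data[2 * i + 1]; }
    int size() { return size; }
    void clear() { size = 0; } // reuse the backing array, no reallocation
}
```

Note how clear() just resets the logical size: reusing the collection allocates nothing, which is exactly the property exploited in the FIX benchmark above.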

I will extend the samples in the future to show parsing of CSV files and similar operations. If you reuse the collection via the .clear() call, no reallocation is needed, unless the new "per-row" code allocates more memory than the previous iterations did.

Why is it important to flatten the data layout? Basically, you can reduce the GC hits, and you can map naturally code that was ugly otherwise (say, a Tuple class). Also, the full-GC cost (which involves visiting all the small objects in your heap) is very low on these collections. So I would assume that at least for batch processing, or maybe for games written in Java, it could be a really friendly tool of the trade.

What remains to be done:
- it should be tested to work with collections of any type and to support specializations
- it should work with POJOs whose fields are not exposed, including Tuple classes
- not mandatory, but it should support iterators or other friendly structures