Sunday, May 22, 2016

Flat Collections - May update

Flat Collections was upgraded again in small ways but they are to simplify the design of code generation.

Java does not have equivalent of "struct" keyword and it is really useful sometimes to generate your own arrays of primitives which have typically bigger primitive list of items. These flat collections can be really useful in most of List<Point> kind of scenarios or List<Token> where a token could be a parse information which packs few integers (tokenKind, startPos, endPos).

Also, if there is any Java develop who can help me to review, package this library, write me a note (or contact me via email with "ciprian dot mustiata at gmail.com".

Follow the code here:
https://github.com/ciplogic/FlatCollection

Monday, May 9, 2016

Quickest Java based CSV on Earth...

If you look over the internet, CSV parsing is really solved and it is really quick. You can parse an 120 MB CSV file in around 1 second (using 1 core). Take this file from this repository: https://github.com/uniVocity/worldcities-import

They have their own bench on my machine and the output is (after JVM is warmed up):
Loop 5 - executing uniVocity CSV parser... took 1606 ms to read 3173959 rows.

But, can you beat it by the help of FlatCollections? The answer is obviously yes, and not by a small amount, but also taking into account that the coding is a bit non trivial.

How much it would take to sum the forth column times using a "miniCsv" library?

int[] sum = new int[1]; 
CsvScanner csvScanner = new CsvScanner();
   try {
    csvScanner.scanFile("worldcitiespop.txt", (char) 10, (state, rowBytes) -> {
      int regionInt = state.getInt(3, rowBytes); 
     sum[0] += regionInt;     
    }); 
   }  
   catch (IOException e) {
    e.printStackTrace();
   }
This code would sum the 4th column using this huge file after the JVM is warmed in...
Time: 371 ms

So, really, if you have a small CSV and you have many integers (or if you need to support other types, I will spend a little time to handle more cases) to calculate about, I will be glad to sped it up, just reference me as the "original" coder.

The file has to be UTF8 or ASCII, Latin, or similar byte encoding, but not UTF16 or UTF32.

So, if you feel that you want to take a look into a specialized CSV parser and you see any improvements, please feel free to read, contribute and do whatever you want with it!

https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniCsv

Bonus: there is no commercial license (you can even sell the code, but it would be nice to be credited though).

The idea how to code it would not be possible without my previous work experience and great people doing this stuff for a living (I do it for passion) like Martin Thomson, Mike Barker or similarly open people. Also, I did not hear them without InfoQ.

Sunday, May 1, 2016

How Quick Can You Rotate a 4K (3840x2160) Image?

First of all this mini-competition started at work with this idea of a flamewar: "C++ (native) languages are quicker than Java".

I could say: "obviously it isn't", but we wanted to be tested. So we considered a great test where C++ is known to shine is pointer arithmetic and rotating pixels in an image would be a very friendly C style coding.

So, can you write your own implementation quicker than a Java implementation to rotate 4K images? But I want to say some observations I did as I tested some implementations.

A reasonably quick implementation with Java is this one, where pixels of a 4K image are stored flat inside src array, and dest is a preallocated array of the same size (3840x2160):

public void rotate90(int[] src, int[] dest, int width, int height) {
    IntStream.range(0, height).forEach(y -> {
        int posSrc = y * width;
        int destPos = height - 1 - y;

        for (int x = 0; x < width; x++) {
            int srcPixel = getPixel(src, posSrc);

            setPixel(dest, destPos, srcPixel);
            posSrc++;
            destPos += height;
        }
    });
}

This implementation would run in around 110 milliseconds. This implementation is really useful, because using a single line of code change, it will run using all cores:
      IntStream.range(0, height).parallel().forEach(y -> {

This will make the code to run at 33.7-37 ms.

One colleague from work wrote this implementation (Mykolas):
public void rotate90Mykolas(int[] src, int[] dest, int width, int height) {
    for (int i = 0; i < src.length; i++) {
        dest[(i % width + 1) * height - (i / width + 1)] = src[i];
    }
}

Is it any slower or faster? Looking to instructions, it should run slower, as instead of looping, there is a plain complex math (divisions or multiplications). But in fact it run faster than the single core version: 100 ms.

At the time of writing this blog entry, this code is not written in parallel. but if I will get a new entry, the code will be updated.

Can be written quicker still?

It depends on which hardware, but in short the answer is yes:

This code is starved by memory accesses, so rotating blocks of 32 pixel squares would rotate it much quicker as the data is mostly in the CPU cache:
public static final int SIZE_CHUNK = 32;
static int calculateChunks(int size, int chunkSize) {
    return (size / chunkSize) + ((size % chunkSize == 0) ? 0 : 1);
}

private static void fillChunksSizes(int width, int chunkSize, int stepsX, 
    int[] chunksPos, int[] chunksPosLength) {
    for (int it = 0; it < stepsX; it++) {
        chunksPos[it] = it * chunkSize;
        if (it != stepsX - 1) {
            chunksPosLength[it] = chunkSize;
        } else {
            int reminder = width % chunkSize;
            chunksPosLength[it] = reminder == 0 ? chunkSize : reminder;
        }
    }
}

public void rotate90Chunks(int[] src, int[] dest, int width, int height) {
    int chunkSize = SIZE_CHUNK;
    int stepsX = calculateChunks(width, chunkSize);
    int[] chunksPosX = new int[stepsX];
    int[] chunksPosXLength = new int[stepsX];
    fillChunksSizes(width, chunkSize, stepsX, chunksPosX, chunksPosXLength);

    int stepsY = calculateChunks(height, chunkSize);
    int[] chunksPosY = new int[stepsY];
    int[] chunksPosYLength = new int[stepsY];
    fillChunksSizes(height, chunkSize, stepsY, chunksPosY, chunksPosYLength);

    IntStream.range(0, chunksPosX.length).parallel().forEach(chunckXId -> {
        int startX = chunksPosX[chunckXId];
        int lengthX = chunksPosXLength[chunckXId];
    IntStream.range(0, chunksPosY.length).forEach(chunkYId -> {
        int startY = chunksPosY[chunkYId];
        int lengthY = chunksPosYLength[chunkYId];
            rotateChunkByIndex(src, dest, width, height, startX, lengthX, startY, lengthY);

        });
    });
}

This code runs on average on a Haswell CPU in 7.85 millisecond (so is around 4 times quicker than iterating over the loops "naively").

The quickest of all I could come with is by rotating blocks which are exactly the chunk size of 32 as specialized implementation. Compilers love constants and love them more if they are typically power of 2.

This sped up a little the code, but the code is basically bigger than this previous implementation and some copy/paste of it, and it runs in 7.2 ms.

So, this is it, you can rotate 9.1 images per second with a loop, using a single thread, and if you use all cores in a i7 laptop, and you take into account how compiler optimizes and CPU caching, you can achieve 138.9 images per second running Java. 4K images.

This is 4 GB/s image processing.

But there is one more thing. This coding works very nice in CPUs which hide divisions, with many SIMD supported instructions, with a high end machine, but how does it work with a low end machine (similar with a phone CPU - including iPhone)?

I've took the code and ran it with CPU Intel(R) Celeron(R) N2930@ 1.83GHz (which is an out-of-order 4 core Pentium class CPU).

Numbers totally changed:
Single threaded rotate image: 119.86 ms.
Multithreaded first test:44.44 ms.
Mykolas implementation: 265 ms.
Chunks: 38.4 ms.
Chunks64: 27.1 ms.

Some observations: moving code from an i7-4710HQ 2.5 GHz to an Baytrail CPU, the speed decreased less than 10%. Even using 4 cores Baytrail vs 4 cores+HT I7M, if your software is memory starved, your code will run roughly the same.

Mykolas implementation got 2.5 times slower, because complex math is expensive on Atom based CPUs. Try using multiplications instead of divisions using lower spec CPUs.

The chunks implementation is also very interesting: when you have a math intensive code but you fit into cache, the Atom CPU is roughly 4x slower than an I7M (and I think even more compared with newer CPUs like Skylake).

So, can you try to make a quicker 4K image rotation than 7.2 ms (in a quad core I7M CPU - so, more than 4GB/s pixel processing)? At your request I will give a full source code of the fastest implementation (which is very similar with Chunks implementation, but just longer). Can you process more than 1.1 GB/s of pixels on an Atom based quadcore?

Happy coding!

Tuesday, April 26, 2016

Fx2C Updates - handling loading Fxml 3D objects

Fxml to Java compiler speeds up for low spec machines the speed of showing controls, but one very nice contributor fixed support of adding CSS styles. I never tested it, but I noticed that some other edge cases were not supported.

The main use-case is this one: you want to use Fxml to import Java3D objects, they required the inner text xml tag to be handled separately. For example this Fxml file, is valid Fxml:
<?xml version="1.0" encoding="utf-8"?>
<?import javafx.scene.paint.Color?><?import javafx.scene.paint.PhongMaterial?><?import javafx.scene.shape.MeshView?><?import javafx.scene.shape.TriangleMesh?>
<MeshView id="Pyramid">
  <material>
    <PhongMaterial>
      <diffuseColor>
        <Color red="0.3" green="0.6" blue="0.9" opacity="1.0"/>
      </diffuseColor>
    </PhongMaterial>
  </material>
  <mesh>
    <TriangleMesh>
      <points>0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 -1.0 -1.0 1.0 0.0 0.0 -1.0 0.0</points>
      <texCoords>0.0 0.0</texCoords>
      <faces>0 0 4 0 1 0 1 0 4 0 2 0 2 0 4 0 3 0 3 0 4 0 0 0 0 0 1 0 2 0 0 0 2 0 3 0</faces>
      <faceSmoothingGroups>1 2 4 8 16 16</faceSmoothingGroups>
    </TriangleMesh>
  </mesh>
</MeshView> 
This file is definetly valid Fxml, but the Fx2C compiler will not be able to handle it: nodes contain inner text.

If you want more samples and importers from multiple 3D formats (like STL or Collada) follow the next link:
http://www.interactivemesh.org/models/jfx3dbrowser.html

Now it does, so for previous Fxml file, the Fx2C compiler will export the following code which is close to the fastest way to define a MeshView:
public final class FxPyramid {
   public MeshView _view;
   public FxPyramid() {
      MeshView ctrl_1 = new MeshView();
      ctrl_1.setId("Pyramid");
      PhongMaterial ctrl_2 = new PhongMaterial();
      Color ctrl_3 = new Color(0.3, 0.6, 0.9, 1.0);
      ctrl_2.setDiffuseColor(ctrl_3);
      ctrl_1.setMaterial(ctrl_2);
      TriangleMesh ctrl_4 = new TriangleMesh();
      ctrl_4.getPoints().setAll(0.0f, 1.0f, 1.0f, 1.0f, 1.0f, 0.0f, 0.0f, 1.0f, -1.0f, -1.0f, 1.0f, 0.0f, 0.0f, -1.0f, 0.0f);
      ctrl_4.getTexCoords().setAll(0.0f, 0.0f);
      ctrl_4.getFaces().setAll(0, 0, 4, 0, 1, 0, 1, 0, 4, 0, 2, 0, 2, 0, 4, 0, 3, 0, 3, 0, 4, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 3, 0);
      ctrl_4.getFaceSmoothingGroups().setAll(1, 2, 4, 8, 16, 16);
      ctrl_1.setMesh(ctrl_4);
      _view = ctrl_1;;
   }
}

Tuesday, March 15, 2016

Reifying DSL for FlatCompiler

An important part of FlatCollections is the part that at least memory wise code can be rewritten only a little to get big speedup and fewer GCs. But as always there were tradeoffs. One of them is that the code itself was hardcoded to give a ListOf<T> (with a reified name like ListOfPoint3D) and a cursor out of this list.

This is all great but what if the List<T> should contain an extra method? Or what if there is a need to generate an extra method for every getter/setter? For this reason there is a simple (I hope) template generator which has reified semantics which for now works only for classes but is really important.

To define a flat type, you would write something like:
flat Point3D {
  X, Y, Z: double
}
And the new code will be aware of these fields to be filled later.

The code generator is filled using a templated form as following:
each fieldNames : fieldName, index {
    sub set@fieldName (TValue value) {
        _list.set(_offset+ index,  value);
    }

    sub get@fieldName (): TValue {
        return _list.get(_offset+index)
    }
}
Sure, the code look a bit strange, but it does the job most of the way, and there are items as TValue and so on, they are resolved semantically:
class FlatCursor<T> {
    where {
        TValue = T.valueType
        countFields=T.countFields
        fieldNames = T.fieldNames
    }
(...) //class content

But the solving appears because of a semantic magic:
specialize ListOf { Point3D }
I would love to improve it more in future, but mileage may vary.  But the most important part is that soon the reification can work fairly smart and more I add logic into this mini-compiler, the more constructs may be supported and bugs found.

Read the latest code under GitHub project:
https://github.com/ciplogic/FlatCollection

Friday, March 4, 2016

Is it aquisition of Xamarin useful for typical C# developer?

TLDR In my view in short term: yes, in long term: no way!

Xamarin/Mono stack missed many features of Microsoft's .Net stack and will always suffer if it is an economical force behind it. Xamarin Studio for example is it in a painful state: bugs are slowly fixed, the recommended version it is still Visual Studio, but you can also use Xamarin Studio for various purposes. It is stuck with Gtk# 2.x, though very nice styled and with an unknown underlying framework for developers (Xwt).

Xamarin bought RoboVM, which means that if you are either a C# or Java developer and you want to target iOS, you may need in the past to rely on Xamarin (now Microsoft).

My perspective about medium plan with Mono platform: Mono will be less ambiguous target and more bugs will be addressed just having one implementation in between .Net, CoreCLR and Mono. Another good thing is that I would assume that in future there will be merged the CoreCLR on Linux with Mono, either by migrating the GC of CoreCLR (which is more advanced that whatever Mono had) or migrating the debugging infrastructure from Mono to CoreCLR. This means that if you will target Asp.Net Core 1.0+ you will definitely benefit from the platform correctness and a better experience deploying to Linux.

Another good part of the toolset it is simply that Microsoft .Net as merged platform will work directly to iOS, maybe with a lower license costs.

But this is just for me 1 to 2 years stuff, but after this I would assume that some parts will be more negative for non Microsoft platforms:
- support may be delayed and slowed down, in particular that supporting .Net will be needed to be extensive to most of Mono targets, CPU architectures and so on
- no competition even partnering competition (as is with Java/OpenJDK ecosystem) will mean that IDE options (SharpDevelop is basically discontinued, maybe Xamarin Studio will be also discontinued) will be basically from two vendors, one of them with full integration with various frameworks (Microsoft) and one very well integrated for code editing (JetBrains). Both of them may be for money, so I would assume that will not be so much startup friendly
- having close to a monopoly as the single vendor implementing your own runtime is kind of kill-switch to make your next project to target .Net excluding you are not Microsoft or you already have a big investment into .Net technology

Thursday, February 18, 2016

Question: "Does Java run faster than C and C++ today?"

As I was writing this allocation free parser, I ported the code (90%, in the sense that I did not use smart-pointers) to C++ with hoping that bounds checking or other hidden wins will show off.

The single problem is that C++ is very tricky to optimize.I tried all my best, I did not use any bounds checking (so I skipped using STL all-together), I send as much as I understood everything as const-reference when it was not an integer but a data buffer, and so on. So I did all low-level optimizations I knew and the code was having the same level of abstraction as Java. For very curious people and if requested, I will be glad to give it as a zipped file (the code leaks memory, but when the loop is executed with zero memory allocation - exactly like Java).

But the biggest bummer for C++ is that it ran slower than Java.

Most of the time Java code would achieve a bit more than 800 iterations, rarely 900, and rarely something like 770 iterations (there are fluctuations because of CPU's Turbo, which is very aggressive on a laptop, like it has a stated 2.5 GHz but it operates at 3.5 when is using 1 core). With C++ I could iterate all QuickFix's test suite in 700 to 800 range of iterations. This happened with MinGW GCC 4.9 (32 bit) with -Ofast -flto (as for now being the fastest configuration). The part where C++ wins hands down comparing with Java is memory usage, where the C++ implementation was using just a bit over 5 MB, when Java implementation was using 60 MB. So there are differences, but still, Java was running visibly faster. I tried also using GCC on Ubuntu. But Ubuntu uses GCC 4.8 (64 bit) and at least this code seems not to optimize well and I get just 440 iterations.

But you know what? The Java code was really straight forward, no configuration/ runtime optimization settings. Everything was running just faster. There is not even a debug/release configuration. Java runs as quick (like equivalent with GCC -O3) up to the point it hits a breakpoint. If you hit a breakpoint, it will go back to interpreter mode.

Even it seems kind of stupid, I think that I can see some conclusions of it, if it is kind of possible in many situations for Java to run as smooth, an office suite, like let's say LibreOffice were better off if they were gradually rewritten in Java, instead of removing it because it starts a bit slower. I could imagine a hypothetical future where JavaFX were the dialogs, later the canvas and it would work on almost all platforms where JavaFX runs, including but not limited to: iPhone (it would require RoboVM though, which today is proprietary), Android (GluOn) and would have support for common databases (because of JDBC which has a very wide support) to fill data in the "Excel" (tm) component of the suite.

At last, let's not forget the tooling and build times. Java takes really a fraction in compilation, most of the build time is copying Jars.

But as it is, if you think you have at least high volume and you require a high throughput for your program, try Java, you may really break records.