As I was writing this allocation free parser, I ported the code (90%, in the sense that I did not use smart-pointers) to C++ with hoping that bounds checking or other hidden wins will show off.
The single problem is that C++ is very tricky to optimize.I tried all my best, I did not use any bounds checking (so I skipped using STL all-together), I send as much as I understood everything as const-reference when it was not an integer but a data buffer, and so on. So I did all low-level optimizations I knew and the code was having the same level of abstraction as Java. For very curious people and if requested, I will be glad to give it as a zipped file (the code leaks memory, but when the loop is executed with zero memory allocation - exactly like Java).
But the biggest bummer for C++ is that it ran slower than Java.
Most of the time Java code would achieve a bit more than 800 iterations, rarely 900, and rarely something like 770 iterations (there are fluctuations because of CPU's Turbo, which is very aggressive on a laptop, like it has a stated 2.5 GHz but it operates at 3.5 when is using 1 core). With C++ I could iterate all QuickFix's test suite in 700 to 800 range of iterations. This happened with MinGW GCC 4.9 (32 bit) with -Ofast -flto (as for now being the fastest configuration). The part where C++ wins hands down comparing with Java is memory usage, where the C++ implementation was using just a bit over 5 MB, when Java implementation was using 60 MB. So there are differences, but still, Java was running visibly faster. I tried also using GCC on Ubuntu. But Ubuntu uses GCC 4.8 (64 bit) and at least this code seems not to optimize well and I get just 440 iterations.
But you know what? The Java code was really straight forward, no configuration/ runtime optimization settings. Everything was running just faster. There is not even a debug/release configuration. Java runs as quick (like equivalent with GCC -O3) up to the point it hits a breakpoint. If you hit a breakpoint, it will go back to interpreter mode.
Even it seems kind of stupid, I think that I can see some conclusions of it, if it is kind of possible in many situations for Java to run as smooth, an office suite, like let's say LibreOffice were better off if they were gradually rewritten in Java, instead of removing it because it starts a bit slower. I could imagine a hypothetical future where JavaFX were the dialogs, later the canvas and it would work on almost all platforms where JavaFX runs, including but not limited to: iPhone (it would require RoboVM though, which today is proprietary), Android (GluOn) and would have support for common databases (because of JDBC which has a very wide support) to fill data in the "Excel" (tm) component of the suite.
At last, let's not forget the tooling and build times. Java takes really a fraction in compilation, most of the build time is copying Jars.
But as it is, if you think you have at least high volume and you require a high throughput for your program, try Java, you may really break records.
Thursday, February 18, 2016
Tuesday, February 16, 2016
Scanning FIX at 3 Gbps
Have you heard about FIX protocol? It is a financial exchange protocol. It is used extensively as a de-facto format to process in many areas and the format itself it is kind of many dictionary key-value pairs.
So, can you make a quick parser to process FIX files? I did write a mini FIX parser in Java and it uses FlatCollections for tokenizing and the final numbers are really great. But let's clear the ground: most of the ideas are in fact not mine, and they are based on talks about "Mechanical Sympathy" (I recommend presentations of Martin Thomson) meaning that if you understand the hardware (or at least the compilers and the internal costs of it) you can achieve really of high numbers.
So I looked around to QuickFix library, a standard and opensource (complete) implementation of FIX protocol, but it also has some problems of how the code is running so I took all example of FIX protocol sample files. Files: around 450 files combined at 475KB of ASCII files and I setup my internal benchmark as following: considering that I will have them in memory, how quick can I parse them, give full tag to user and it is good enough info to recreate the data. As the code for one file should be really quick (if there is no allocation in file row splitting, which I already did), I made the following "benchmark": how many times in a second I can iterate these files (if they are already saved in memory), split them into rows and tokenize them. The short answer: between 700 to 895 iterations (using one core of Intel Core i7-4710HQ CPU @ 2.50GHz). The variation I think is related with CPU's Turbo. I am not aware of code having hidden allocations (so is allocation free). If there are few allocations (which were done before usage Flat Collections) you will get in 500-700 iterations range (or 2.5 Gbps processing speed)
So, if you have (on average) 800 iterations per second, you can parse around 380 MB/s FIX messages (or around 3 Gbps) using just one core of one laptop using Java (Java 8u61/Windows). If you want another statistic, most messages are few tens of bytes, so, it is safe assume that this parsing code scans 20 million messages/second.
I don't endorse switching your QuickFix to this minimal Fix implementation, but who knows, if you need a good starting point (and who knows, support ;) ) to write a very fast Quick parser, this is a good point to start.
So, if you want to look inside the implementation:
https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniFix
So, can you make a quick parser to process FIX files? I did write a mini FIX parser in Java and it uses FlatCollections for tokenizing and the final numbers are really great. But let's clear the ground: most of the ideas are in fact not mine, and they are based on talks about "Mechanical Sympathy" (I recommend presentations of Martin Thomson) meaning that if you understand the hardware (or at least the compilers and the internal costs of it) you can achieve really of high numbers.
So I looked around to QuickFix library, a standard and opensource (complete) implementation of FIX protocol, but it also has some problems of how the code is running so I took all example of FIX protocol sample files. Files: around 450 files combined at 475KB of ASCII files and I setup my internal benchmark as following: considering that I will have them in memory, how quick can I parse them, give full tag to user and it is good enough info to recreate the data. As the code for one file should be really quick (if there is no allocation in file row splitting, which I already did), I made the following "benchmark": how many times in a second I can iterate these files (if they are already saved in memory), split them into rows and tokenize them. The short answer: between 700 to 895 iterations (using one core of Intel Core i7-4710HQ CPU @ 2.50GHz). The variation I think is related with CPU's Turbo. I am not aware of code having hidden allocations (so is allocation free). If there are few allocations (which were done before usage Flat Collections) you will get in 500-700 iterations range (or 2.5 Gbps processing speed)
So, if you have (on average) 800 iterations per second, you can parse around 380 MB/s FIX messages (or around 3 Gbps) using just one core of one laptop using Java (Java 8u61/Windows). If you want another statistic, most messages are few tens of bytes, so, it is safe assume that this parsing code scans 20 million messages/second.
I don't endorse switching your QuickFix to this minimal Fix implementation, but who knows, if you need a good starting point (and who knows, support ;) ) to write a very fast Quick parser, this is a good point to start.
So, if you want to look inside the implementation:
https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniFix
Saturday, February 13, 2016
Java's Flat Collections - what's the deal? (Part II)
I thought about cases when people would want to use flat collections. The most obvious are for example an "point array", "Tuple array", but as thinking more I found some interesting case which is also kind of common: "rectangle", "triangle" or similar constructs.
Typically when people define a circle for instance, would build it as:
class Circle{
Point2f center = new Point2f();
float radius;
}
Without noticing maybe, if you have to store for a 32bit machine one hundred of circles, you will store in fact much more data than the: center.x, center.y, radius x 4 bytes = 12 bytes per circle, and for 100 circles is 1.2 KB (more or less), but more like:
- 100 entries for the reference table: 400 bytes
- 100 headers of Circle object: 800 bytes
- 100 references to Point: 400 bytes
- 100 headers of Circle.center (Point2F): 800 bytes
- 100 x 3 floats: 1200 bytes
So instead of your payload of 1.2 KB, you are into 3.6 KB, so there is a 3X memory usage compaction.
If you have 100 Line instances which themselves have 2 instances of Point2f, you will have instead of 1600 B: (refTable) 400 + (object headers) 2400 bytes + (references to internal points) 800 + (payload) 1600 = 5200 B which is a 3.25X memory compaction.
A simple benchmark shows that not only memory is saved, but also the performance. So, if you use Line (with internal 2 points in it) and you would populate flat collections instead of plain Java objects, you will get the following numbers:
Setup values 1:12983 ms.
Read time 1:5086 ms. sum: 2085865984
If you will use Java objects, you will have a big slowdown on both reading and writing:
Setup values 2:62346 ms.
Read time 2:18781 ms. sum: 2085865984
So, you will get more than 4x speedup on write (like populating collections) and 3x speedup on read by flattening most types.
Last improvement? Not only that reflection works, but sometimes it is ugly to create a type, reflect it and use it later for this code generator of flatter types. So right now, everything the input config is JSon based, and you can create on the fly your own "layouts" (meaning a "flat object"):
Here is the attached formatted input of the code generator file named: flatcfg.json.
Typically when people define a circle for instance, would build it as:
class Circle{
Point2f center = new Point2f();
float radius;
}
Without noticing maybe, if you have to store for a 32bit machine one hundred of circles, you will store in fact much more data than the: center.x, center.y, radius x 4 bytes = 12 bytes per circle, and for 100 circles is 1.2 KB (more or less), but more like:
- 100 entries for the reference table: 400 bytes
- 100 headers of Circle object: 800 bytes
- 100 references to Point: 400 bytes
- 100 headers of Circle.center (Point2F): 800 bytes
- 100 x 3 floats: 1200 bytes
So instead of your payload of 1.2 KB, you are into 3.6 KB, so there is a 3X memory usage compaction.
If you have 100 Line instances which themselves have 2 instances of Point2f, you will have instead of 1600 B: (refTable) 400 + (object headers) 2400 bytes + (references to internal points) 800 + (payload) 1600 = 5200 B which is a 3.25X memory compaction.
A simple benchmark shows that not only memory is saved, but also the performance. So, if you use Line (with internal 2 points in it) and you would populate flat collections instead of plain Java objects, you will get the following numbers:
Setup values 1:12983 ms.
Read time 1:5086 ms. sum: 2085865984
If you will use Java objects, you will have a big slowdown on both reading and writing:
Setup values 2:62346 ms.
Read time 2:18781 ms. sum: 2085865984
Last improvement? Not only that reflection works, but sometimes it is ugly to create a type, reflect it and use it later for this code generator of flatter types. So right now, everything the input config is JSon based, and you can create on the fly your own "layouts" (meaning a "flat object"):
{ "typeName": "Point3D", "fields": ["X", "Y", "Z"], "fieldType": "float"}This code would create a flat class Point3D with 3 floats in it named X, Y, Z (meaning the cursor will use a "getX/setX" and so on).
Here is the attached formatted input of the code generator file named: flatcfg.json.
Subscribe to:
Posts (Atom)