Monday, February 9, 2015

Reviving CR

There is some interest into CR, and this is mostly regarding improving and making a stable compiler for embedded. More for this will follow but is good to know that if you will take the latest Git implementation some harder to find bugs were addressed:
- (major) strings do marshal automatically for PInvokes to wchar_t*. This basically means that if you map methods from Dlls/libSO and they use strings on .Net side, they will call (correctly) the underlying library, and it works also as it should (and Mono or .Net does it)
- (minor) Ldflda instruction is working correctly (this is used often when you use structs)

- (minor) Escape Analysis will work reliable with virtual methods' return value. It made the code to fail otherwise for some trivial programs
- (medium) Bugs in devirtualization and remove methods optimizations were also fixed
- (major) try/finally blocks do work now: CR does not support exceptions and will likely never will, but the code of "happy path" will work. This also makes that code using IDisposable to work - also known as "using block".

Feel free to take the code and work on it and of course, if you have any fixes, help us by making them upstream.

CR has a new home also, please redirect your links to here:

Monday, December 29, 2014

CR is fully stalled, new things to come in 2015

As New Year resolutions appear I want to state publicly my interests and my willing to work into my free/OSS projects.

I had working for two projects in the last year: CodeRefractor and at the end of the year to a more interesting (details will follow later) Java game.

This other project is JHeroes 2 a port of FHeroes 2 to Java. FHeroes 2 is a C++ project on top of SDL library and the main advantage of using Java for JHeroes 2 was to make it run in high resolutions laptops with weak CPUs (like mine :) ). For this reason I had as advantage Java's JavaFX technology, which is similar with WPF from .Net world.

I succeed to load a level, graphics from original Heroes 2 game (by porting FHeroes 2 core code to Java) but when it comes to displaying, the frames are visibily higher (FHeroes 2 in 1920x1080P has like 1-2 FPS on Windows, when JHeroes 2 has in my specific testing around 20 FPS) but the CPU is to high! I mean it uses more than 100% of one CPU even I don't do any separate background thread. Profiling I found that Java does create a separate thread in JavaFx and all commands go to this thread and I think this thread creates OpenGL commands which are executed back on the main thread. The bad part is that CPU usage is really high, and there is no way to put sleeps or anything to reduce this CPU usage. Or at least not obvious way to do it.

So I did what every other developer would do if it doesn't want to spend another 1 year of his free time (to get here I spent my 2 months of free time) to see that it doesn't work as intended at the end, I did test similar technologies using a VM and the single two technologies that do seem to work well enough (on my machine) are:
- OpenGL using OpenTK
- Mono + Cairo on Linux (on Windows it runs at around the 10-20 FPS, when on Linux I get between 30-60 FPS but also I have the possibility to put sleeps as all code runs on the main thread)

So, based on this results as I don't want to rewrite all my code to use OpenGL and from Java to rewrite it into C# (maybe in future I will have this will, but for now not so much), I also suspended my work on JHeroes 2.

Also I want to thank for all contributors to CR and for JetBrains offering a free license of Resharper for supporting OpenSource.

So as New Year resolution, I hope to do something else, maybe I will look into small items to improve on CR (very unlikely) or I will want to continue a Mirah like language implementation for .Net (named Calcium, but more on this I will write later).

But what can I say about 2014 is that I found some reassuring things which make work on CR to not be necessary:
- Microsoft did opensource a lot of its stack
- it has a "native compiler" which will be always better supported than CR will ever be
- Mono matured also a lot in the last year. As for me it still doesn't have full optimizations (CR still have in my understanding some optimizations that neither .Net or Mono has: especially Escape Analysis, "pure function evaluation", and this like that) but using LLVM backend, especially on Linux is really a great platform

Saturday, December 6, 2014

How to speedup your cold application startup?

This entry is only about how to do practical solutions to improve startup especially on virtual machines code.

One of the heaviest things that start with your application is the virtual machine itself. Your VM will extract JAR or Assemblies and will scan them for integrity, will compile the hot parts and so on. So the easiest thing to do is either to: put a splash as fast as possible and let in the background time to download all these files. Also, you can use NGen and in future Java's Jigsaw or Excelsior JET or for very courageous to use CodeRefractor.

But this entry is not about this still. Let's say your application starts (at least most of the code is in memory), but users still notice lag because you process XMLs or you have some UI, and processing all the log is really expensive, or sending data over the wire is expensive. So, in short; your application seems not to be responsive.

So right now I want to stress what to do to make your application start faster.

I. "Index and merge everything"

I. 1. Merge XML data

In my experience you have very often data that repeats itself. Especially if you serialize a lot of items (like a database table that you send through the wire over a web service, but is also true on disk), many items repeat.

Even for cases when you unlikely see a line and repeated patterns, data repeats itself, for example in this file:

This is a section of the SVG code (around the big green eye):

<g id="g242" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
<g id="g246" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
<g id="g262" fill="#e5e5b2"></g>
<g id="g290" fill="#FFF"></g>
<g id="g294" fill="#CCC"></g>
<g id="g298" fill="#000">
    <path id="path300" d="m94.401,0.6s-29.001,12.2-39.001,11.8c0,0-16.4-4.6-24.8,10,0,…0,0,4-11.2,4-13.2s21.2-7.6,22.801-8c1.6-0.4,8.2-4.6,7.8-7.8z"></path>
<g id="g302" fill="#99cc32">
    <path id="path304" d="m47,36.514c-6.872,0-15.245-3.865-15.245-10.114,0-6.248,8.373….446,5.065,12.446,11.313,0,6.249-5.572,11.314-12.446,11.314z"></path>

You can see that a lot of "g" tags, a lot of "id" flags, and a lot of "fill" elements and attributes are all over the XML. If you make your tree data-structure that keeps a list of strings (and a dictionary from these strings to the index of data), will make a lot of data smaller.

So instead of a:
class XmlNode { Name, InnerText: String;
Attributes: Dictionary<string, string>();
Children: List<XmlNode>();
class Texts { Data : List<string>();
QuickIndex: Dictionary<string, int>();
class XmlNode {
  Name, InnerText: Integer;
  Attributes: Dictionary<int, int> (); //it can be just an array of integers, which every 2 items, means key/value
  Children: List<XmlNode>();

And to get any text, you will do soemthing like:
string name = texts.Data[xmlNode.Name];

It sounds probably silly at first and overly verbose (of course it can be wrapped with a lot of design patters, here is to show the principle at work!).

Someome may think that: "it doesn't matter, "g" as a string is just 1 character long so we waste 2 bytes because a string is Unicode", but in reality the memory usage for small strings is much much bigger. A string contains the length (4 bytes), the hash (+ 4 bytes), and some implementations add an extra null finishing character. So minimum string of 1 character length is in fact around 12 bytes. Even more, the object ovherhead is if I recall right of 16 bytes. In Java the memory is padded at 8 bytes, so even if you have 12 + 16 = 28 bytes, in fact you will use 32 bytes. If you use an index, for every duplicate you trade an integer (4 bytes) for 32 bytes.
Look for EXI format and OpenEXI implementation.

I. 2. Merge tree like data

This is in a way a similar construction, and you can use the same tricks using a tree of data, but this can be useful not so much on startup but minimizing the memory usage later,

The solution is to use Reflection and to use a shared dictionary of strings, and every time you find a string that is already used, to replace with the previous one. This patch improves hugely a parse tree of NRefactory, but the code can be changed minimally for any other data structures:
(NB: the patch was not accepted upstream, but the idea remains true).

I.3. Merge pixels using palettes

20 years ago the pictures were using a palette of 256 colors (sometimes just 16). This by default merges a lot of data, in fact just if you have more than 20x20 pixels (which is virtually every picture you may use) you will save bytes of memory and disk.

II. Pack everything

Packing of data is know for taking a long long time and is used in special for big data, but in fact, when did you do the last measurement?

As I've wrote some months ago, I'm trying to improve my Java skills porting a C++ application to Java (a FHeroes 2 OSS clone).

Heroes 2 has two files which are basically 2 big indexed files, one named "Heroes2.agg" and "heroes2x.agg" where there are included all the assets. The original "Heroes2.agg" is like 40 MB on disk, and after I compressed using Gzip (which is included in almost any virtual machine, and I found it included also with Qt), it reduces to 20 MB.

So, how much does it take to read from the SSD the unpacked file, and how much would it take to read the index, and unpack all indivudual items which are packed?

On my machine (an fairly old i5M first gen) will take 140 ms for reading a 40 MB file (so, at least in this "benchmark" the disk would read theoretically 285 MB/s, but using the packed file and unpacking all the items (the logic of the game allows to unpack them only when needed, so it may be faster, but I wanted to compare as much as possible apples with apples) is 100 ms. This is almost a 30% time reduction to load all the assets that may be needed at least in this old game. Or counting it differently, it reads the original data at a speed of 399 MB/s.

III. Combine them

When you give an application to your customers or in general when you have your own applications that people want to think about them that they are snappy, make sure you optimize startup by increasing the CPU load (by calculating what data is common) and read fast this merged data and a big chunk of indexes to the original data. If you have a big data set, use any compression at save time and at startup do decompression.

The same principle happens to be the same for 3D objects (where most of 3D coordinates are shared: vertex, normals, texture coordinates, etc.).

In fact the more you merge the data and use binary data you will be surprised how fast will your application go. Mostly because even on SSD (but it is even worse on "traditional" hard drives) the disk is the slowest component by far, so optimizing it, at expense of CPU is the best way to wait less when you run your code.

IV. Conclusions

These solutions are practical but they should be used only when you measure at least before and after to see if there is a speedup. Maybe before optimizing your image loading algorithm, maybe you should not load the picture at all! But if you load it and you don't care in absolute terms if your picture uses just 256 colors.

It is also important to not forget about hashing if you have a huge repository of assets. This is a different talk altogether, but sometimes not the loading of data is expensive but the disk seek of it. If this is the case, try to split your data using hashing and dictionaries. In simpler terms, if you never created hashing do just this: create 3 level directories based on the first three letters. So: "asbest.pic" should be under folder: a/s/b/asbest.pi This can reduce greately the search overhead of your application, as the OS will have to search much less if a file/folder really exists.

Saturday, November 29, 2014

Why Static Code Analysis is important... and why we don't care

Static code analysis as of today is crucial as the software today is much bigger than ever. It is even more important as you have many people considering better code based on their previous experience, but they don't agree what means better.

So, based on this, is very likely that a static code analysis tool will help you a lot to see these errors, and these errors are found early, sometimes in a live session (like: Resharper tool - also written as R# - a very great extension of C# from JetBrains).

But what is the main issue of companies not using them (or not at a big scale)? I found very often (in my usage of R#, but I think is similar for many other companies and tools) more systemic causes:
I - (some) developers don't like to be nagged with a lot of errors/warnings when they want to finish the task fast. Having variables not named consistent or the namespace doesn't follow the location on disk is less of a concern. Even more, some libraries are for example with deprecated fields, and tools like R# will warn you that the code is broken (or at least using obsolete fields) which is a distraction given that users will not change libraries which are "legacy"
- these tools offer solutions which sometimes are at best strange or even harder to be understood for the user who doesn't understand/accept the technology. For example, Linq syntax in C# is weird for some users who don't know it, and even applying best coding practices, it makes your code less prone with side-effects errors, but the developers need to know Linq
- there is a cost upfront you have to make that these tools do not complain (much), but some companies focus on: let's implement this feature, then let's implement the other feature, but when you say: "I want to clean the code based on a tool" is a very hard sell. Is more likely a question of: "is this code working better/faster after this?" And answering honest (like: "no, it will make the code just more readable and less likely to have bugs in feature") will not change the decision.
- they are fairly costly - price is always relative, but if you take a license for every developer, things get expensive really fast. And there is no visibile advantage. Sometimes the tool is an easier sell if you say: "we need a better refactoring tool", and management may understand it (as R# is a very good refactoring tool), but otherwise is hard sell for management.
- small codebases very unlikely need these tools, the bigger the codebase is, the likely you will need these kinds of checks. The reason is that in small codebases you can put all the tool does in memory, and even you can forget a null pointer, this will be found early. If you need a tool that for example processes a batch of pictures and makes the a document, you may need very little refactors, but most likely is enough to use just default Visual Studio's facilities.

As for me, when I was a single freelance developer and I was paid (mostly) per finished feature, a license of R# or Visual Studio was very justified, as full navigation of Resharper and not wanting to look for all weird bugs in my code, was a understandable price I have to justify (for myself).

But later, when I worked in bigger companies, I did not see people using R#. The closest to R# is NRefactory, which brings a similar (but much buggier) experience and with many many limitations. So if you don't mind to work with a much buggier experience of Xamarin Studio or SharpDevelop, please go for it, at least to know what is this static code analysis about!

What I suggest to you anyway is to use a static code analysis, not at least because is easier to let a tool to look if you forget a null pointer check, than to you to do it, or even worse the user to notice it on their machine.

Monday, November 3, 2014

Poor Man's Parser in C#

I am sure that anyone dreaming about writing a compiler will look at least once into parser-generators or things like it. But some may find it daunting and the worst of it, some of them need to have very hard to debug and setup.

So how to write your own parser - which is easy to maintain/write.

Ok, so first of all, please don't blame me if you care about performant parser engine or you want to to do the "correct way". Also, this strategy I found it to work with somewhat simple languages (let's say Pascal, MQL (MetaTrader's trading language), C, and things like it.

The parser has the duty to make a tree based on your tokens.

- write a lexer that gives a list of tokens. You can start from this implementation. It should be very easy to change it for example that some IMatcher-s to accept specific words
- create a Node in a tree with arbitrary number of children as following:
    public class BlockAstNode {
        public List<BlockAstNode> Items = new List<BlockAstNode>();

        public TokenDef InnerToken { get; set; }
        public bool IsTerminal { get; set; }
        public BlockAstNode()
        public BlockAstNode(TokenDef token)
            InnerToken = token;
            IsTerminal = true;

        public BlockAstNode BuildNonTerminal(int startRange, int endRange)
            var result = new BlockAstNode ();
            for (var i = startRange; i <= endRange; i++)
                var node = Items[i];
            Items.RemoveRange(startRange, endRange- startRange+1);
            Items.Insert (startRange, result);
            return result;
- fill a "tree" with one root and as children create AstNode children for every token (using the constructor with TokenDef, where TokenDef is your token class)
- make a bracing folding. Most major languages support a folding strategy. In Pascal is: "begin .. end", in C is { to } or for Ruby/VB: <class> to <end> or <def>..<end>. Every time when you find a block that starts with your begin brace token, you fold it using BuildNonTerminal method, and the result node is placed instead of the full node. Try for simplicity to find the smallest block that can be folded, and repeat until you cannot fold anymore.
- add more folding strategies for common cases: "fold all comments tokens"
- add fold for reserved words tokens: class, while, if, (interface/implementation) and they are separated iterations that visit the tree recursively and simplify the major instructions

Why this works better than regular parser generators?
- every strategy is very easy to debug: you add multiple iterations and if you have a crash in any of them, you will likely have a mismatched index (with +/- one, but no ugly generated code)

Enjoy writing your new language!

Monday, October 13, 2014

Java Optimizer - a primer

As I've posted the previous blog entry about Java running fast (at least for a mini benchmark: to rotate a picture) a great feedback was appearing in comments from user named V.

He exposed one problem of my methodology, the fact that GCC behaves better on Linux: he gets like 151 ms on Windows (also my testing platform) and a great time of 114 ms (on 32 bit) on Linux. I have to say: this is great news (at least for GCC users on Linux)! Java gave a runtime of executing the same task like 130 ms on his machine (regardless on the OS).

So, as a introduction: why Java gets so good numbers even it compiles the code every time it reads the "bytecode" file and it "interprets" all the code?

Java in a very simpler form is having this diagram:

Don't worry of so many steps but the main thing to know is that when you "execute" Java, in fact there is a program, named Java Virtual Machine which reads all the instructions your program have (named Java bytecode) and is packed typically as a Jar file (or many .class files).

In the upper diagram is also shown some part which in fact doesn't run anything but makes sure your program is correctly formatted.

So what happens when you try to write something like Hello world in Java?
1. you write your,
2. you compile it using javac as a result will get one .class file (which you can pack as .jar if you want) which will describe your original program as stack operations (like instead of a = 2+3, will be rewritten as: "push int 2" "push int 3" "add int" "loadint into "a" "
3. you say to java.exe (on Windows, will be just java on Linux/OS X) to "run" your .class file
4. The java.exe will validate first if your .class is well formatted
5. Java will interpret (meaning will run all the .class instructions one by one) up to a threshold is met (typically is 10.000 "loop edges" meaning 10.000 times when the same sequence of instructions is interpreted)
6. if your hot code touches this threshold will compile it using something named like C1 and C2 compilers (also known as client and server compilers). The main differences (which I found presented publicly) between C1 and C2 is that C2 makes more aggressive inlining and C2 also uses a slower technique to allocate registers. On 64 bit Java, C2 compiler is enabled by default, and from Java 8 the "tiered mode" is enabled by default (for me is not clear yet if is for both 32 or 64 bit or just 32 bit, I never tested it) which it means that some "hottier" code is compiled with C1 then is compiled with C2.

So the "tricks" to get fast code in Java are:
- make as many repetitions as possible of hot code: don't run many branches that they get "warmy" but not "hot enough". They will not compile.
- write the code that doesn't do useless stuff in that loop
- if you need really better performance, use C2 (-server) or Tiered mode. They are enabled both by default on JVM from Java 8, so just updating your JVM you get good defaults on performance
- don't execute a lot of method dispatches (also known as virtual calls): try to make your hot loop to use just one class: C2 will likely inline these calls correctly. Sometimes C2 inlines them speculatively, so you can get great performance even you don't care about virtual calls.

Some big differences that Java get (which can show even more speedup than C++ can get) is "over the object file boundaries" meaning that in some cases C/C++ code runs slower because C++ cannot inline from one object file to another (right now is possible to some extent) but certainly is impossible to inline a "hot code" from an Dll/libSO. Even more important, is that the case where Java keeps at runtime always the list of loaded classes, and even you use virtual methods everywhere, if you don't have any class that overrides your class, the calls will be always dispatched as early binded (not-virtual) calls.

One very interesting talk which discusses the optimizations that Java does, is this Gil Tene's one: is a great way to understand what most optimizing compilers do, so is a great talk even you are using C++ (at least the first 1/2 half is showing typical optimizations that CR or .Net or any C++ compiler for the last 10 years will do for you!).

At last, but not at least, Java has some performance pitfalls, some which happen to be also C++ pitfalls:
- CPU stalls: when your data does not fit nicely into CPU cache your code will run typically 10 times slower (or even more for lower memory bandwidth CPUs like ARMs). Yes, you read correctly: not 10% slower, but 10 times slower. Making your algorithm to read and write into consecutive addresses in memory will run much faster. So when you write your program try to think like the hottest of the hottest code to keep data into 256 KB of data both for reads and writes. Java makes it a bit faster as it doesn't have the equivalent of "value type" so an array of 3D coordinates is in fact an array of individual pointers to separate memory structure. For this reason, I would recommend to pack coordinates into arrays and have an index as a separate class to do your processing. But look into this problem: this can hurt your C++ performance also!
- Java has a startup lag and you still have to wait for Java to warm up: this means that is not ideal for games, but there are tricks without tricking Java to run well enough for games: start your game engine as the first thing. This is in my knowledge what Minecraft does (I don't play it, sorry :( ). This let JVM to learn for the first seconds what is your code does and gives a chance to compile most of what your game will do
- some Java APIs (mostly web services) tend to be "fat" meaning you have to process XML or JSon to read your settings, or to communicate or things like this. I recommend to use faster replacements for calling remote data (without optimizing yourself the dataset). Instead of sending raw Xml (or even saving it on disk), use OpenEXI and to send data over the wire, if you have a more complex data-set, I found KryoNet to run very fast. This is also true for C++ coding, but is a more common problem in the "Java mindset" as far as I've noticed.
- Java gives to you mostly the same performance than a fairly good C++ code (maybe a bit slower on Linux but not as much as you should worry, as V showed to me) but you have to improve the other slow parts: don't be smart on writing your own threading solution (is fairly common in C++ world to rewrite everything) but use as much as possible existing solutions: JavaFx has accelerated graphics for UI for example and is excellent for most UIs. If you find it to be slow as it parses a large XML for UI, create just the small UI parts which are expensive by hand, but don't go over-board! I found for example in jHeroes2 that the original game file named Heroes2.agg is like 40 MB, but is it faster to read it, compress every individual item (this big file is a collection of small files), make it like: Heroes2.opt (of 20 MB file) and reading+unpacking is faster (at least on my machine) than reading the 40 MB. This is also because there is such of a huge difference between CPUs and disk speeds.
- The last but not least is the GC: Java's allocator is much faster than C/C++ heap allocator, but is called for almost every object. C/C++ and C# can allocate them on stack (so they "free" on exit of the function automatically) So try that most objects you allocate to not keep trivial data, but also don't keep them to long in memory. The "long pause" (named full GC) in Java is always proportional of how much data you have at the moment the GC triggered. If you find that you have to many pauses, try to double/triple the memory of your application and the GC should call at most half of the times, or use G1 an adaptive memory collector. If you run games, I would expect you will not allocate (excluding very trivial amount of very short-lived data as key strokes or, so the full GC is never called.

Hope this blog entry was useful and I hope (as I may not add many things to CR in the close future) to add some other items like this. Would you find (the reader) useful to see some Youtube entries about "how to" programming tutorials? Would you follow them?

Sunday, October 12, 2014

A case for Java on Desktop (with a benchmark)

When I started CodeRefractor, Windows 8 was somewhat just released, and .Net looked an unstable platform to invest in (for reasons like: Xaml/WPF team was disolved, or Metro runtime being a new way to get locked in into Microsoft's garden, or Xamarin, but not get me started there also).

In the last year many of the shortcomings were fixed but mostly from Xamarin's team and the related projects: MonoGame runs on most platforms nicely, Xamarin added some optimizations (which were part of CR :) ) and made the performance story better, but also even the Microsoft's story is much better in the .Net world (Ryu JIT, an Ahead-of-Time strategy which would give a similar performance profile with CR when finished).

So why run Java on desktop? Here is my top list of today why you should consider Java to C++ or C# on desktop:
- all major IDEs of Java are free of charge (IntelliJ Idea, Eclipse and Netbeans)
- the framework of UIs (JavaFx) is decent and has all the features you would expect on a top-notch UI framework including: CSS styling, GPU acceleration
- Java fixes with Java 8 two of the biggest coding/behavior issues: lambdas make working with UI events, threads etc. sane! The language is very repetitive for my taste, but it makes it very readable and the removal of PerfGen (a nifty bug-by-design in my view)
- Java 8 added Stream API, which as for me the biggest feature is adding of a simple way to write parallel code.

So fine, if you grant me that Java works mostly as language/frameworks as almost all languages of year 2010, why not pick still all other languages? As for me: cross platform and top-notch performance are really huge features. To give a Jar and it is supported on all platforms is a huge deal.

So I made claims about performance, isn't it Java slow? Especially on desktop? I would say: "depends" and "if you know how Java works, Java runs fast".

So I did this weekend a simple Qt application and I want to do something that compilers like C++ do know how to optimize it: a rotation of a picture by 90 degrees (reference hardware: Core i5M-540M, Win64, GCC MinGw 4.8.1-32 bit, Java 8.0 u20 64 bit)

This is the "hottest of the hottest" loop that would rotate a picture in C++ (without using assembly):
void MainWindow::rotatePixelData(int* __restrict src, int* __restrict dest, int width, int height){
    int newWidth = height;
    int newHeight = width;
    for(int x=0; x<width; x++) {
        for(int y=0; y<height; y++) {
            int newX = height-1-y;
            int newY = x;
            dest[newY*newWidth+newX] = src[y*width+ x];

Best time of rotating a 4K picture is: 144 ms (the first time is like 161 ms).

The Java code is the same just is not using "__restrict":
static void rotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        int newHeight = width;
        for (int x = 0; x < width; x++) {
            for (int y = 0; y < height; y++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];

Typical time: 126 ms (the worst, yet not the first 146 ms).

Of course, without wanting to prove myself that I show bad benchmarks, or that I didn't tune the compiler enough, I wanted to make a difference significant from Java:

public class RotatorParallel {
    static List<Integer> buildSteps(int steps){
        List<Integer> result= new ArrayList<>();
        for(int i = 0;i<steps;i++){
        return result;
    public static void quickerRotatePixelData(int[] src, int[] dest, int width, int height) {
        int newWidth = height;
        List<Integer> steps = buildSteps(height);
        steps.parallelStream().forEach((Integer yBoxed)->{
            int y = yBoxed;
            for (int x = 0; x < width; x++) {
                int newX = height - 1 - y;
                int newY = x;
                dest[newY * newWidth + newX] = src[y * width + x];

This code is clearly longer, and it is more contrived, and what it does, is splitting the computation per row and using the parallel Streams API I've told earlier: it will run using all cores.

The most times are 62-64 ms and the slowest of the slowest are 70 ms. (so is exactly 2X speedup)

Some things to consider:
- C++ code is using 32 bits (yet I optimized as much as I could without using assembly). Java is using the same Jar and by default I was using 64 bit. This is a case where you guarantee that with no setup Java gives better numbers overall. If you compile on 64 bit, your code will not run on machines with 32 bit CPUs. I would expect that the code will run close to Java but not faster than 5 ms (but is also possible that it runs slower).
- Java optimizes based on hot code, and to rotate a picture is hot code
- using Stream API has an overhead, so split the processing into bigger chunks if you make it parallel. The most of overheads are with merging data (if you do coding like map-reduce) and some "invisible" boxing, like even the shown sample, the chunk is a row of image, and the overhead of boxing of a simple "int" type is there written in an explicit form
- the code is not CPU cache friendly: if you will split the picture into chunks (like sqsuares of 128x128 and you rotate them, I would expect that both Java and C++ will get better by a factor of 2, but what will not change the relation of Java running slower/faster).
- having multiple cores will make the code faster by the numbers of cores. If you use C++, you can criticize me with: why didn't you use OpenMP, and I can say, yes excluding that OpenMP is not working on all compilers of today (I'm talking here mostly about missing it in Clang/Mac OS X). So you have to make likely your mini-multi-core framework to be cross platform.

Still I think that Java is not optimal for deploy (meaning: you have to ask users to install Java runtime) and developer experience has to improve (hopefully it will improve with the next Java (named 9) by using "Project Jigsaw" and "sjavac"). As for me Jigsaw (which splits Java runtime into modules) is what have to be added in the next release as at least in the public presentations they predicted that the startup of the Java applications will improve still. And this is a big deal in the impression that Java is slow.