I could say: "obviously it isn't", but we wanted to be tested. So we considered a great test where C++ is known to shine is pointer arithmetic and rotating pixels in an image would be a very friendly C style coding.
So, can you write your own implementation that rotates 4K images quicker than a Java implementation? Before you try, I want to share some observations I made while testing a few implementations.
A reasonably quick Java implementation is this one, where the pixels of a 4K image (3840x2160) are stored flat inside the src array, and dest is a preallocated array of the same size:
public void rotate90(int[] src, int[] dest, int width, int height) {
    IntStream.range(0, height).forEach(y -> {
        int posSrc = y * width;
        int destPos = height - 1 - y;
        for (int x = 0; x < width; x++) {
            int srcPixel = getPixel(src, posSrc);
            setPixel(dest, destPos, srcPixel);
            posSrc++;
            destPos += height;
        }
    });
}
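The getPixel and setPixel helpers are not shown in the original listing (and the snippets assume import java.util.stream.IntStream for the stream calls); for a flat int[] pixel buffer I assume they reduce to plain array accesses, roughly like this sketch:

// Assumed helpers, not part of the original listing: for a flat int[] buffer,
// a "pixel" is just the int stored at the given index.
private static int getPixel(int[] pixels, int pos) {
    return pixels[pos];
}

private static void setPixel(int[] pixels, int pos, int pixel) {
    pixels[pos] = pixel;
}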
This implementation runs in around 110 milliseconds. It is also really convenient because, with a single line of code changed, it will run using all cores:
IntStream.range(0, height).parallel().forEach(y -> {
This makes the code run in 33.7-37 ms.
A colleague from work (Mykolas) wrote this implementation:
public void rotate90Mykolas(int[] src, int[] dest, int width, int height) {
    for (int i = 0; i < src.length; i++) {
        dest[(i % width + 1) * height - (i / width + 1)] = src[i];
    }
}
Is it any slower or faster? Looking at the instructions, it should run slower: instead of simple increments inside the loop, there is more complex math per pixel (divisions and multiplications). But in fact it runs faster than the single-core version: 100 ms.
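To see why the index math is correct, it can be unpacked into x and y coordinates; the following rewrite (my own illustration, not code from the post) computes the same destination index:

// Equivalent formulation, for illustration: with x = i % width and y = i / width,
// the destination index (i % width + 1) * height - (i / width + 1)
// simplifies to x * height + (height - 1 - y), the same mapping rotate90 uses.
public void rotate90Unpacked(int[] src, int[] dest, int width, int height) {
    for (int i = 0; i < src.length; i++) {
        int x = i % width;
        int y = i / width;
        dest[x * height + (height - 1 - y)] = src[i];
    }
}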
At the time of writing this blog entry, Mykolas' code is not parallelized, but if I write a follow-up entry, the code will be updated.
Can it be written to run quicker still?
It depends on the hardware, but in short the answer is yes:
This code is starved by memory accesses, so rotating the image in blocks of 32x32 pixels makes it much quicker, as the data being worked on stays mostly in the CPU cache:
public static final int SIZE_CHUNK = 32;

static int calculateChunks(int size, int chunkSize) {
    return (size / chunkSize) + ((size % chunkSize == 0) ? 0 : 1);
}

private static void fillChunksSizes(int width, int chunkSize, int stepsX,
        int[] chunksPos, int[] chunksPosLength) {
    for (int it = 0; it < stepsX; it++) {
        chunksPos[it] = it * chunkSize;
        if (it != stepsX - 1) {
            chunksPosLength[it] = chunkSize;
        } else {
            int reminder = width % chunkSize;
            chunksPosLength[it] = reminder == 0 ? chunkSize : reminder;
        }
    }
}

public void rotate90Chunks(int[] src, int[] dest, int width, int height) {
    int chunkSize = SIZE_CHUNK;
    int stepsX = calculateChunks(width, chunkSize);
    int[] chunksPosX = new int[stepsX];
    int[] chunksPosXLength = new int[stepsX];
    fillChunksSizes(width, chunkSize, stepsX, chunksPosX, chunksPosXLength);
    int stepsY = calculateChunks(height, chunkSize);
    int[] chunksPosY = new int[stepsY];
    int[] chunksPosYLength = new int[stepsY];
    fillChunksSizes(height, chunkSize, stepsY, chunksPosY, chunksPosYLength);
    IntStream.range(0, chunksPosX.length).parallel().forEach(chunkXId -> {
        int startX = chunksPosX[chunkXId];
        int lengthX = chunksPosXLength[chunkXId];
        IntStream.range(0, chunksPosY.length).forEach(chunkYId -> {
            int startY = chunksPosY[chunkYId];
            int lengthY = chunksPosYLength[chunkYId];
            rotateChunkByIndex(src, dest, width, height, startX, lengthX, startY, lengthY);
        });
    });
}
This code runs on a Haswell CPU in 7.85 milliseconds on average (so it is around 4 times quicker than the parallel version that iterates over the rows "naively").
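The rotateChunkByIndex helper is not shown in the post; assuming it simply applies the same row-by-row mapping restricted to one block, a minimal sketch could look like this:

// Assumed helper, not from the original post: rotates one lengthX x lengthY block
// whose top-left corner in the source image is (startX, startY), using the same
// destination index mapping as the plain row-by-row rotation.
private static void rotateChunkByIndex(int[] src, int[] dest, int width, int height,
        int startX, int lengthX, int startY, int lengthY) {
    for (int y = startY; y < startY + lengthY; y++) {
        int posSrc = y * width + startX;
        int destPos = startX * height + (height - 1 - y);
        for (int x = 0; x < lengthX; x++) {
            dest[destPos] = src[posSrc];
            posSrc++;
            destPos += height;
        }
    }
}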
The quickest implementation of all I could come up with adds a specialized path that rotates the blocks which are exactly the full chunk size of 32. Compilers love constants, and love them even more when they are a power of 2.
This sped the code up a little; it is basically a bigger, partly copy/pasted version of the previous implementation, and it runs in 7.2 ms.
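The post does not include that specialized version; as a sketch of the idea (hypothetical naming and my own assumptions, the real code is longer), full 32x32 blocks would go through loops whose bounds are the compile-time constant, while partial edge blocks keep using the generic helper:

// Hypothetical sketch: loops with constant bounds (SIZE_CHUNK = 32) are easier
// for the JIT to unroll and vectorize; edge blocks smaller than 32x32 would
// still go through the generic rotateChunkByIndex helper.
private static void rotateFullChunk32(int[] src, int[] dest, int width, int height,
        int startX, int startY) {
    for (int y = startY; y < startY + SIZE_CHUNK; y++) {
        int posSrc = y * width + startX;
        int destPos = startX * height + (height - 1 - y);
        for (int x = 0; x < SIZE_CHUNK; x++) {
            dest[destPos] = src[posSrc];
            posSrc++;
            destPos += height;
        }
    }
}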
So, this is it: with a naive loop on a single thread you can rotate 9.1 4K images per second, and if you use all the cores of an i7 laptop and take into account how the compiler optimizes and how the CPU caches data, you can reach 138.9 4K images per second in Java.
That is more than 4 GB/s of image processing: a 3840x2160 frame is about 33 MB at 4 bytes per pixel, and 138.9 of them go through every second.
But there is one more thing. This code works very nicely on a high-end machine, on CPUs that hide the cost of divisions and support many SIMD instructions, but how does it behave on a low-end machine (with a CPU similar to a phone's, including the iPhone's)?
I took the code and ran it on an Intel(R) Celeron(R) N2930 @ 1.83GHz (an out-of-order, 4-core, Pentium-class CPU).
The numbers changed completely:
Single-threaded rotate image: 119.86 ms.
Multithreaded first test: 44.44 ms.
Mykolas implementation: 265 ms.
Chunks: 38.4 ms.
Chunks64: 27.1 ms.
Some observations: moving the single-threaded loop from an i7-4710HQ at 2.5 GHz to the Bay Trail CPU, the speed decreased by less than 10%. Even comparing 4 Bay Trail cores against the i7's 4 cores + HT, if your software is memory starved, your code will run roughly the same.
Mykolas' implementation got 2.5 times slower, because complex math is expensive on Atom-based CPUs. Try using multiplications instead of divisions on lower-spec CPUs.
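As an illustration of that advice (my own sketch, not code from the post), the per-pixel division and modulo in the Mykolas loop can be removed by carrying x and y along incrementally, keeping only a multiplication:

// Division-free variant of the Mykolas rotation, for illustration only:
// x and y are updated incrementally instead of being recomputed as i % width
// and i / width on every iteration.
public void rotate90NoDivision(int[] src, int[] dest, int width, int height) {
    int x = 0;
    int y = 0;
    for (int i = 0; i < src.length; i++) {
        dest[x * height + (height - 1 - y)] = src[i];
        x++;
        if (x == width) {
            x = 0;
            y++;
        }
    }
}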
The chunks implementation is also very interesting: when you have math-intensive code but the data fits into the cache, the Atom CPU is roughly 4x slower than the i7 (and I think even more compared with newer CPUs like Skylake).
So, can you make a 4K image rotation quicker than 7.2 ms (on a quad-core i7 CPU, i.e. more than 4 GB/s of pixel processing)? On request I will share the full source code of the fastest implementation (which is very similar to the Chunks implementation, just longer). And can you process more than 1.1 GB/s of pixels on an Atom-based quad core?
Happy coding!