How to speedup your cold application startup?

This entry is only about how to do practical solutions to improve startup especially on virtual machines code.

One of the heaviest things that start with your application is the virtual machine itself. Your VM will extract JAR or Assemblies and will scan them for integrity, will compile the hot parts and so on. So the easiest thing to do is either to: put a splash as fast as possible and let in the background time to download all these files. Also, you can use NGen and in future Java's Jigsaw or Excelsior JET or for very courageous to use CodeRefractor.

But this entry is not about this still. Let's say your application starts (at least most of the code is in memory), but users still notice lag because you process XMLs or you have some UI, and processing all the log is really expensive, or sending data over the wire is expensive. So, in short; your application seems not to be responsive.

So right now I want to stress what to do to make your application start faster.

I. "Index and merge everything"

I. 1. Merge XML data

In my experience you have very often data that repeats itself. Especially if you serialize a lot of items (like a database table that you send through the wire over a web service, but is also true on disk), many items repeat.

Even for cases when you unlikely see a line and repeated patterns, data repeats itself, for example in this file: http://upload.wikimedia.org/wikipedia/commons/f/fd/Ghostscript_Tiger.svg

This is a section of the SVG code (around the big green eye):

<g id="g242" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
<g id="g246" stroke-width="0.5" stroke="#000" fill="#FFC"></g>
(...)
<g id="g262" fill="#e5e5b2"></g>
(...)
<g id="g290" fill="#FFF"></g>
<g id="g294" fill="#CCC"></g>
<g id="g298" fill="#000">
<path id="path300" d="m94.401,0.6s-29.001,12.2-39.001,11.8c0,0-16.4-4.6-24.8,10,0,…0,0,4-11.2,4-13.2s21.2-7.6,22.801-8c1.6-0.4,8.2-4.6,7.8-7.8z"></path>
</g>
<g id="g302" fill="#99cc32">
<path id="path304" d="m47,36.514c-6.872,0-15.245-3.865-15.245-10.114,0-6.248,8.373….446,5.065,12.446,11.313,0,6.249-5.572,11.314-12.446,11.314z"></path>
</g>

You can see that a lot of "g" tags, a lot of "id" flags, and a lot of "fill" elements and attributes are all over the XML. If you make your tree data-structure that keeps a list of strings (and a dictionary from these strings to the index of data), will make a lot of data smaller.

So instead of a:
class XmlNode { Name, InnerText: String;
Attributes: Dictionary<string, string>();
Children: List<XmlNode>();
}
Define:
class Texts { Data : List<string>();
QuickIndex: Dictionary<string, int>();
}
class XmlNode {
Name, InnerText: Integer;
Attributes: Dictionary<int, int> (); //it can be just an array of integers, which every 2 items, means key/value
Children: List<XmlNode>();
}

And to get any text, you will do soemthing like:
string name = texts.Data[xmlNode.Name];

It sounds probably silly at first and overly verbose (of course it can be wrapped with a lot of design patters, here is to show the principle at work!).

Someome may think that: "it doesn't matter, "g" as a string is just 1 character long so we waste 2 bytes because a string is Unicode", but in reality the memory usage for small strings is much much bigger. A string contains the length (4 bytes), the hash (+ 4 bytes), and some implementations add an extra null finishing character. So minimum string of 1 character length is in fact around 12 bytes. Even more, the object ovherhead is if I recall right of 16 bytes. In Java the memory is padded at 8 bytes, so even if you have 12 + 16 = 28 bytes, in fact you will use 32 bytes. If you use an index, for every duplicate you trade an integer (4 bytes) for 32 bytes.
Look for EXI format and OpenEXI implementation.

I. 2. Merge tree like data

This is in a way a similar construction, and you can use the same tricks using a tree of data, but this can be useful not so much on startup but minimizing the memory usage later,

The solution is to use Reflection and to use a shared dictionary of strings, and every time you find a string that is already used, to replace with the previous one. This patch improves hugely a parse tree of NRefactory, but the code can be changed minimally for any other data structures: https://github.com/icsharpcode/NRefactory/issues/448
(NB: the patch was not accepted upstream, but the idea remains true).

I.3. Merge pixels using palettes

20 years ago the pictures were using a palette of 256 colors (sometimes just 16). This by default merges a lot of data, in fact just if you have more than 20x20 pixels (which is virtually every picture you may use) you will save bytes of memory and disk.

II. Pack everything

Packing of data is know for taking a long long time and is used in special for big data, but in fact, when did you do the last measurement?

As I've wrote some months ago, I'm trying to improve my Java skills porting a C++ application to Java (a FHeroes 2 OSS clone).

Heroes 2 has two files which are basically 2 big indexed files, one named "Heroes2.agg" and "heroes2x.agg" where there are included all the assets. The original "Heroes2.agg" is like 40 MB on disk, and after I compressed using Gzip (which is included in almost any virtual machine, and I found it included also with Qt), it reduces to 20 MB.

So, how much does it take to read from the SSD the unpacked file, and how much would it take to read the index, and unpack all indivudual items which are packed?

On my machine (an fairly old i5M first gen) will take 140 ms for reading a 40 MB file (so, at least in this "benchmark" the disk would read theoretically 285 MB/s, but using the packed file and unpacking all the items (the logic of the game allows to unpack them only when needed, so it may be faster, but I wanted to compare as much as possible apples with apples) is 100 ms. This is almost a 30% time reduction to load all the assets that may be needed at least in this old game. Or counting it differently, it reads the original data at a speed of 399 MB/s.

III. Combine them

When you give an application to your customers or in general when you have your own applications that people want to think about them that they are snappy, make sure you optimize startup by increasing the CPU load (by calculating what data is common) and read fast this merged data and a big chunk of indexes to the original data. If you have a big data set, use any compression at save time and at startup do decompression.

The same principle happens to be the same for 3D objects (where most of 3D coordinates are shared: vertex, normals, texture coordinates, etc.).

In fact the more you merge the data and use binary data you will be surprised how fast will your application go. Mostly because even on SSD (but it is even worse on "traditional" hard drives) the disk is the slowest component by far, so optimizing it, at expense of CPU is the best way to wait less when you run your code.

IV. Conclusions

These solutions are practical but they should be used only when you measure at least before and after to see if there is a speedup. Maybe before optimizing your image loading algorithm, maybe you should not load the picture at all! But if you load it and you don't care in absolute terms if your picture uses just 256 colors.

It is also important to not forget about hashing if you have a huge repository of assets. This is a different talk altogether, but sometimes not the loading of data is expensive but the disk seek of it. If this is the case, try to split your data using hashing and dictionaries. In simpler terms, if you never created hashing do just this: create 3 level directories based on the first three letters. So: "asbest.pic" should be under folder: a/s/b/asbest.pi This can reduce greately the search overhead of your application, as the OS will have to search much less if a file/folder really exists.

Code Refractor - Virtual Machines/Compiler performance musings

Saturday, December 6, 2014