Monday, July 6, 2015

Improve the performance of your Selects in LINQ

One thing I learned while working on CodeRefractor is how loops work inside .NET. Something I learned fairly quickly is that by far the fastest loop is over arrays. This is also documented by Microsoft.

In short, especially when using .NET on 64 bit, you will see high-performance code over arrays, so I strongly recommend that if you have data you read from often (for example when using LINQ), you convert it using the ToArray() function.

So let's say you need to extract the Ids out of your "tradeData" variable.
The code may look like this:
return tradeData.Select(it => it.Id).ToArray();
What's wrong with this code? Let's say the "tradeData" variable can have 1,000,000 items, and tradeData itself can be an array or a List<T>. When you profile it, you can see that the iteration takes little time, but most of the time goes into something like 16-18 allocations inside ToArray(). The reason is that ToArray keeps an internal array which is resized several times as the items arrive.
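To illustrate why, here is a simplified sketch (not the actual BCL code, and it assumes the usual System and System.Collections.Generic namespaces) of what ToArray() has to do when the source length is unknown, as is the case for the result of Select():

        // Simplified sketch (not the real BCL implementation): when the source
        // has no known count, ToArray() grows a temporary buffer by doubling it,
        // which costs roughly log2(N) allocations plus the copies between buffers.
        public static TSource[] ToArraySketch<TSource>(IEnumerable<TSource> source)
        {
            var buffer = new TSource[4];
            var count = 0;
            foreach (var item in source)
            {
                if (count == buffer.Length)
                {
                    Array.Resize(ref buffer, buffer.Length * 2); // extra allocation + copy
                }
                buffer[count++] = item;
            }
            Array.Resize(ref buffer, count); // final trim to the exact size
            return buffer;
        }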


So it should be possible to write a "SelectToArray" method that will have much lower overhead:
    using System;
    using System.Collections.Generic;

    public static class UtilsLinq
    {
        public static TResult[] SelectToArray<TValue, TResult>(this IList<TValue> items, Func<TValue, TResult> func)
        {
            var count = items.Count;
            // the result array is allocated only once, with the exact size
            var result = new TResult[count];
            for (var i = 0; i < result.Length; i++)
            {
                result[i] = func(items[i]);
            }
            return result;
        }
    }

Since T[] implements IList<T>, this code works for both arrays and List<T>. It runs about as fast as it can, and there are no hidden allocations.
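For instance, both of these calls compile and run (a small usage sketch):

        var fromList  = new List<int> { 1, 2, 3 }.SelectToArray(x => x * 2);      // works on a List<T>
        var fromArray = new[] { 1.0, 2.0, 3.0 }.SelectToArray(x => x.ToString()); // works on a T[]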

And your code becomes:
return tradeData.SelectToArray(it => it.Id);

A strong recommendation for fast(er) code: when you use Select or SelectToArray, never allocate "class" objects inside it; allocate struct objects instead. If you want to keep a result with multiple data fields, create "struct" types which encapsulate them.
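For example, a minimal sketch (the TradePoint struct and the Id/Price fields are made up for illustration):

        // a value type: stored inline in the result array, no per-item heap allocation
        public struct TradePoint
        {
            public int Id;
            public double Price;
        }

        // usage, assuming the tradeData items expose Id and Price:
        // var points = tradeData.SelectToArray(it => new TradePoint { Id = it.Id, Price = it.Price });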

How fast is it? It is fairly fast.

For this code:
            var sz = 10000000;
            var randData = new int[sz];
            var random = new Random();
            for(var i = 0; i<sz; i++)
            {
                randData[i] = random.Next(1, 10);
            }
            var sw = Stopwatch.StartNew();
            // measure SelectToArray: one exact-size allocation per call
            for (int t = 0; t < 5; t++)
            {
                var arr = randData.SelectToArray(i => (double)i);
            }
            var time1 = sw.ElapsedMilliseconds;
            sw.Restart();
            // measure the standard LINQ Select(...).ToArray()
            for (int t = 0; t < 5; t++)
            {
                var arr = randData.Select(i => (double)i).ToArray();
            }
            var time2 = sw.ElapsedMilliseconds;
You get:
 time1 = 798 ms vs time2 = 1357 ms (Debug configuration)
 time1 = 574 ms vs time2 = 1003 ms (Release configuration)

Not sure about you, but to me this is significant, and it matters even more if you have multiple LINQ/Select statements and you also want the resulting items to be fast to iterate. Similarly, you will get a bigger speedup if you skip the cast to double, but I wanted to show more realistic code where the LINQ lambda does something light (as typically happens, since often there is an indexer involved, or a field access).

NB: this test is artificial, so use these results at your own risk.
Later, I found out there is a method, Array.ConvertAll, which has very similar internals to this extension method (the limitation is that it doesn't work with non-array collections, but if this is not a big inconvenience for you, it is better to use the BCL method).

        public static TResult[] SelectToArray<TValue, TResult>(this TValue[] items, Func<TValue, TResult> func)
        {
            // delegate the work to the BCL implementation for arrays
            return Array.ConvertAll(items, it => func(it));
        }

With the method changed to this, it is even a bit faster, because iterating over the items array is a bit faster this time.
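And if your source is already an array, you can of course call Array.ConvertAll directly (a small usage sketch, reusing the randData array from the benchmark above):

        var doubles = Array.ConvertAll(randData, i => (double)i);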
