Thursday, November 19, 2015

Tricks of the trade for quick sorting in .Net

I rarely need to sort huge datasets, but I want to share some tricks for doing it quickly. I am surprised how few people learn things like this, even in university.

The reason is that as students you are taught to think of the CPU as an ideal machine and to reason in big-O notation. Sorting is in most cases O(N*log N), meaning that if a list of 1,000,000 items takes, let's say, 1 second, then 4,000,000 items will take only slightly more than 4 times as long, something like 4.4 seconds.
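
To make that concrete, here is a quick back-of-the-envelope calculation (the 1-second baseline is just an assumed figure):

            // N*log N growth from 1,000,000 to 4,000,000 items
            double small = 1000000 * Math.Log(1000000);
            double large = 4000000 * Math.Log(4000000);
            Console.WriteLine(large / small); // ~4.4: 4x the items, only slightly more than 4x the time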

So how do you improve the sorting speed? Reduce the constants.

The point of big-O notation is that it describes how the running time scales as the input grows, but it does not take the constant factors into account.

So, let's say you have 3,000,000 items to sort inside a List<String>. This is a hypothetical, but not that hypothetical: there are huge lists of items that you may need to sort in a few milliseconds, and increasing the item count makes it much clearer where the speedups are.

Let's say you add those items into a list and you call List.Sort(). I generated the items semi-randomly (most of the ASCII characters are picked at random). On 64-bit .Net (4.6) on an oldish Intel i5-2500K, this runs in a fairly short time: 13067 ms, or about 13 seconds.
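
For reference, a minimal sketch of the setup along these lines (the generator below is an assumption for illustration, not the exact code I used; it assumes using System, System.Collections.Generic and System.Diagnostics):

        // Hypothetical data generator and timing harness, just to show the setup;
        // the real strings were semi-random as described above.
        static List<string> MakeRandomTexts(int count, int length = 20)
        {
            var random = new Random(42);
            var texts = new List<string>(count);
            for (int i = 0; i < count; i++)
            {
                var chars = new char[length];
                for (int j = 0; j < length; j++)
                    chars[j] = (char)random.Next(32, 127); // random printable ASCII
                texts.Add(new string(chars));
            }
            return texts;
        }

        // usage:
        var texts = MakeRandomTexts(3000000);
        var watch = Stopwatch.StartNew();
        texts.Sort();
        Console.WriteLine("List.Sort: Time: " + watch.ElapsedMilliseconds);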

Can you speed it up more? The simplest trick is to compare the first char first, as a separate check, and only fall back to a full string comparison when the first chars are equal. This brings it down to 10189 ms. Another small improvement is to sort over arrays instead of lists; it is a fairly small gain, but still quicker (9940 ms).

The comparing sort-key struct looks like the following:
    struct CompareStr : IComparable<CompareStr>
    {
        public char FirstChar;   // cached first character, compared before the full string
        public string Text;

        public CompareStr(string text)
        {
            if (string.IsNullOrEmpty(text))
                FirstChar = '\0';
            else
                FirstChar = text[0];
            Text = text;
        }

        public int CompareTo(CompareStr other)
        {
            // cheap char comparison first; fall back to the full string only on a tie
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
    }

And the sort routine for those texts is:

            var combinedKeys = texts2.Select(text => new CompareStr(text)).ToArray();
            Array.Sort(combinedKeys);
            var resultList = combinedKeys.Select(combined => combined.Text).ToList();

But can we do better? I think so, so let's change FirstChar to pack the first two chars into a 32-bit unsigned int (a char is essentially equivalent to a UInt16). The time improves greatly again, to 6220 ms, which is less than half of the original time:

    struct CompareStr2 : IComparable<CompareStr2>
    {
        public uint FirstChar;   // first two chars packed into one uint
        public string Text;

        public CompareStr2(string text)
        {
            if (string.IsNullOrEmpty(text))
                FirstChar = 0;
            else if (text.Length == 1)
                FirstChar = (uint)(text[0] << 16);
            else
                FirstChar = (uint)((text[0] << 16) + text[1]);
            Text = text;
        }

        public int CompareTo(CompareStr2 other)
        {
            if (FirstChar != other.FirstChar)
                return FirstChar.CompareTo(other.FirstChar);
            return Text.CompareTo(other.Text);
        }
    }
And the sorting routine is very similar to the first one:
            var combinedKeys = new List<CompareStr2>(texts2.Count);
            combinedKeys.AddRange(texts2.Select(text => new CompareStr2(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts2.Count);
            resultList.AddRange(items.Select(combined => combined.Text));


Can it be written to be even faster?
   
    struct CompareStrDouble : IComparable<CompareStrDouble>
    {
        double _firstChar;
        public string Text;

        // Packs the leading characters of the string into a single double:
        // each successive char is weighted by a further factor of 1/65536,
        // so the key orders strings by their first few characters
        // (beyond roughly the first three chars the 52-bit mantissa runs out
        // of precision, and ties on the key fall back to the full string comparison).
        static double ComputeKey(string text)
        {
            if (text == null)
                return 0.0; // treat null like the empty string, as the other keys do

            var basePow = 1.0;
            var powItem = 1.0 / (1 << 16);
            var result = 0.0;
            foreach (char ch in text)
            {
                result += basePow * ch;
                basePow *= powItem;
            }

            return result;
        }

        public CompareStrDouble(string text)
        {
            _firstChar = ComputeKey(text);
            Text = text;
        }

        public int CompareTo(CompareStrDouble other)
        {
            if (_firstChar != other._firstChar)
                return _firstChar.CompareTo(other._firstChar);
            return Text.CompareTo(other.Text);
        }
    }
   
For reference, this is the sorting code:
        List<string> SortSpecialDouble(List<string> texts)
        {
            var combinedKeys = new List<CompareStrDouble>(texts.Count);
            combinedKeys.AddRange(texts.Select(text => new CompareStrDouble(text)));
            var items = combinedKeys.ToArray();
            Array.Sort(items);
            var resultList = new List<string>(texts.Count);
            resultList.AddRange(items.Select(combined => combined.Text));
            return resultList;
        }
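
To get an intuition for why the double key preserves the ordering of the first few characters, here is a tiny illustration (the literal values are worked examples, not output from the benchmark). Beyond roughly the first three characters the 52-bit mantissa can no longer distinguish strings, and the CompareTo fallback does the rest of the work:

            // "AB" vs "AC": the second char lands 16 binary digits below the first,
            // so the keys compare exactly like the ordinal string order.
            double keyAB = 'A' + 'B' / 65536.0;   // 65.00100708...
            double keyAC = 'A' + 'C' / 65536.0;   // 65.00102233...
            Console.WriteLine(keyAB < keyAC);     // True, same order as "AB" < "AC"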

This sorting key is really, really fast: 2292 ms, which is over 5 times quicker than the original List.Sort (for strings).

Some things to consider:
- Sorting huge datasets may expose a design flaw in your application: you should usually filter the data before sorting it. The use case still matters when you have local data to sort, and the same key trick can be used with Linq's OrderBy and friends (see the sketch after this list). If your data comes from a database, the database can sort it for you, so there is often no point in sorting it yourself.
- Sort only when you need it: I am thinking here of cases like a grid with many thousands of items that the user can sort by clicking a column header. With 300,000 items (a huge grid, BTW), List.Sort takes around 1030 ms versus roughly 200 ms for the fastest key used here.
- Lastly, this algorithm is not culture aware, but it is damn quick! This is a tradeoff that may break sorting for some languages. For instance, in Romanian (my native language) S sorts near the Ș char, but with this algorithm Ș will land after Z. The same happens with Russian and other non-Latin characters. So before using this kind of sorting, make sure you can afford slightly wrong ordering for non-English words.
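
For the OrderBy route mentioned above, a minimal sketch would look like this (it reuses the CompareStrDouble key from earlier and assumes a using System.Linq directive; note that OrderBy is a stable sort, so ties may end up ordered slightly differently than with Array.Sort):

            // Reuse the cheap key with Linq: OrderBy calls CompareTo on CompareStrDouble,
            // so it benefits from the same fast primary comparison.
            var sorted = texts
                .OrderBy(text => new CompareStrDouble(text))
                .ToList();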

Running on Mono (I was really curious, weren't you?) gave really horrendous times, even for only 30,000 items (all times in milliseconds):
.Net times:
List.Sort: Time: 51
Sort by first char: Time: 65
Sort by first two chars: Time: 33
Sort by double key: Time: 30

Mono time:
List.Sort: Time: 411
Sort by first char: Time: 329
Sort by first two chars: Time: 106
Sort by double key: Time: 39

The fastest implementation was consistently faster, but things got worse very quickly as the item count grew. For 300,000 items:
.Net times:
List.Sort: Time: 1073
Sort by first char: Time: 709
Sort by first two chars: Time: 420
Sort by double key: Time: 193

Mono times:
List.Sort: Time: 5825
Sort by first char: Time: 4917
Sort by first two chars: Time: 2268
Sort by double key: Time: 409

And for 3,000,000 items the times are really huge:
.Net times:
List.Sort: Time: 12939
Sort by first char: Time: 9851
Sort by first two chars: Time: 6143
Sort by double key: Time: 2259

Mono times:
List.Sort: Time: 81496
Sort by first char: Time: 70736
Sort by first two chars: Time: 38233
Sort by double key: Time: 5886

So, using the last algorithm presented here, Mono at least seems to be more than 10 times faster with this primitive comparison between strings. I assume that string comparison under Mono is not well optimized (comparing two strings in Mono looks to be around 6-8x slower than in .Net), so the tricks used in this algorithm can give your users a huge speedup. If you sort by string comparison on a Xamarin-supported platform, you may really want to use techniques similar to the ones described here.
