Monday, May 9, 2016

Quickest Java based CSV on Earth...

If you look over the internet, CSV parsing is really solved and it is really quick. You can parse an 120 MB CSV file in around 1 second (using 1 core). Take this file from this repository: https://github.com/uniVocity/worldcities-import

They have their own bench on my machine and the output is (after JVM is warmed up):
Loop 5 - executing uniVocity CSV parser... took 1606 ms to read 3173959 rows.

But, can you beat it by the help of FlatCollections? The answer is obviously yes, and not by a small amount, but also taking into account that the coding is a bit non trivial.

How much it would take to sum the forth column times using a "miniCsv" library?

int[] sum = new int[1]; 
CsvScanner csvScanner = new CsvScanner();
   try {
    csvScanner.scanFile("worldcitiespop.txt", (char) 10, (state, rowBytes) -> {
      int regionInt = state.getInt(3, rowBytes); 
     sum[0] += regionInt;     
    }); 
   }  
   catch (IOException e) {
    e.printStackTrace();
   }
This code would sum the 4th column using this huge file after the JVM is warmed in...
Time: 371 ms

So, really, if you have a small CSV and you have many integers (or if you need to support other types, I will spend a little time to handle more cases) to calculate about, I will be glad to sped it up, just reference me as the "original" coder.

The file has to be UTF8 or ASCII, Latin, or similar byte encoding, but not UTF16 or UTF32.

So, if you feel that you want to take a look into a specialized CSV parser and you see any improvements, please feel free to read, contribute and do whatever you want with it!

https://github.com/ciplogic/FlatCollection/tree/master/examples/MiniCsv

Bonus: there is no commercial license (you can even sell the code, but it would be nice to be credited though).

The idea how to code it would not be possible without my previous work experience and great people doing this stuff for a living (I do it for passion) like Martin Thomson, Mike Barker or similarly open people. Also, I did not hear them without InfoQ.

No comments:

Post a Comment