Faster Compression

Reader Warning: If you are not familiar with the Linux command line, you'd best turn back now and come back later for other, less technical posts. :)

I was getting tired of waiting for next-generation sequencing files (6 to 40 GB uncompressed) to compress and decompress, so I decided to speed things up a bit while spreading the work more evenly across my 11 idle cores.

I found pigz (pronounced pig-zee) and lbzip2, two Linux utilities that are compatible with gzip and bzip2 respectively and are specifically designed to use multiple cores. To figure out their relative merits against their single-core predecessors, I decided to have a little bit of fun. Here is a set of timings I collected on a small test file (with real ASCII sequence data):

In summary, I am extremely impressed by the boost that a single letter (and 11 idle cores) can give to compression speed. Also, because lbzip2 fully accelerates both compression and decompression (unlike pigz, whose decompression stays essentially single-threaded), the bzip2 format becomes not only feasible, but completely logical!
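Since both tools keep the standard gzip and bzip2 formats, files written by the parallel versions should be readable by the classic ones, and the other way around. Here is a minimal sketch of how one might check that on a given machine. The `roundtrip` helper and the tiny test file are made up for illustration, and any tool that isn't installed is skipped rather than treated as an error.

```sh
#!/bin/sh
# Sketch: compress with one tool, decompress with another, and compare the
# result with the original file. "roundtrip" is a hypothetical helper name.

roundtrip() {
    packer=$1
    unpacker=$2
    input=$3
    # Skip the check if either tool is missing on this machine.
    for t in "$packer" "$unpacker"; do
        if ! command -v "$t" >/dev/null 2>&1; then
            echo "$t: not installed, skipped"
            return 0
        fi
    done
    # -c writes to stdout, -dc decompresses from stdin to stdout;
    # cmp -s compares stdin (-) with the original and stays quiet.
    if "$packer" -c "$input" | "$unpacker" -dc | cmp -s - "$input"; then
        echo "$packer -> $unpacker: identical"
    else
        echo "$packer -> $unpacker: MISMATCH"
        return 1
    fi
}

# Tiny stand-in for a real sequence file.
demo=$(mktemp)
printf 'ACGTACGTTTGACCA\nGGCATTACGATCGAT\n' > "$demo"

roundtrip pigz gzip "$demo"      # parallel writer, classic reader
roundtrip gzip pigz "$demo"      # classic writer, parallel reader
roundtrip lbzip2 bzip2 "$demo"
roundtrip bzip2 lbzip2 "$demo"

rm -f "$demo"
```

On a real data set you would point `roundtrip` at an actual sequence file instead of the two-line demo file.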

Another benefit of using bz2 over gz is the ability to index into these files quickly with random seeks, as explained in this nice blog post.

Caveat: the example command given for measurement is written for the zsh shell.
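For readers who want to try a similar measurement, here is a minimal sketch in portable sh rather than zsh. It is not the exact command behind the timings in this post: the helper names (`now_ms`, `make_sample`, `time_tool`) and the synthetic test data are made up for illustration, and it assumes GNU `date` for the `%N` (nanoseconds) format. Tools that aren't installed are simply skipped.

```sh
#!/bin/sh
# Sketch: time compression with whichever of gzip, pigz, bzip2 and lbzip2
# are installed. Assumes GNU date (%N); the sample data is synthetic.

# Print the current time in milliseconds since the epoch.
now_ms() {
    echo $(( $(date +%s%N) / 1000000 ))
}

# Write N lines of 60 pseudo-random bases (A, C, G, T) to stdout.
make_sample() {
    awk -v n="$1" 'BEGIN {
        srand(1)
        for (i = 0; i < n; i++) {
            line = ""
            for (j = 0; j < 60; j++)
                line = line substr("ACGT", int(rand() * 4) + 1, 1)
            print line
        }
    }'
}

# Run "<tool> -c <file>" and print the tool name, the elapsed wall-clock
# time in milliseconds and the compressed size in bytes.
time_tool() {
    cmd=$1
    input=$2
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: not installed, skipped"
        return 0
    fi
    start=$(now_ms)
    size=$("$cmd" -c "$input" | wc -c | tr -d ' ')
    end=$(now_ms)
    echo "$cmd: $((end - start)) ms, $size bytes"
}

# Roughly 1.2 MB of ASCII test data; use a real file for meaningful numbers.
sample=$(mktemp)
make_sample 20000 > "$sample"

for tool in gzip pigz bzip2 lbzip2; do
    time_tool "$tool" "$sample"
done

rm -f "$sample"
```

A file this small mostly measures startup overhead, so the gap between the single-core and multi-core tools should only become clear on inputs closer to the multi-gigabyte files discussed above.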

Also, if you'd like to comment, please do so on my G+ post here.