2014-02-02

Statistics for the mildly curious

So you've heard about statistics in the news, and maybe you even took a course a long, long time ago in a classroom far, far away, but the memories have all long since contributed to the inevitable heat death of the universe. But for whatever reason, you're now curious: what is this statistics thing all about?

Now sure, you could pick up a copy of "Statistics for Dummies" or some-such, but I wanted to give you the essence of statistics in a single blog post using some corny examples. So let's get started!

Summarization

Say you plan on entering the local cucumber growing competition later in the growing season, and you are interested in how your current crop is faring. One way to proceed is to go outside and merely look at the cucumbers on the vine to get a feel for how big they are. This system quickly leaves us wanting because it is completely subjective: we cannot accurately record our observations, which makes it harder to compare this year's crop with cucumbers in past or future seasons, or with our estranged cousin Lester's crop in the great cucumber-growing country of Latvia.

So we need an objective quantification of what we're interested in. One way to do this is to pick a few cucumbers (we have a large crop) and then weigh them. Now if you only picked one cucumber, which weighed (or massed, if you're picky) 1.2kg, then it is easy enough to write that down, send it to our dear aunt Myrtle (another cucumber aficionado in the family), or save it in our filing cabinet. But let's say we pick 70 cucumbers. We certainly don't want to memorize all 70 numbers, or even send all 70 weights to poor aunt Myrtle.

So we want some way to summarize these 70 numbers into a few more meaningful numbers. Thus we introduce the concept of summary 'statistics' to represent this dataset in a more compact way. Most people are familiar with the average or mean, but some others include: standard deviation, variance, median, mode, minimum, maximum, quartiles, quantiles, deciles, coefficient of variation, etc... And rather than explaining all of those here, if my readership (Hi Bryan!) demands it, I could do another blog post describing some of them in more detail, or you could always check Wikipedia.

For the dataset of our 70 cucumber masses (which you can download here if you want to follow along at home), we can compute a few statistics to summarize the dataset as:
Summary Stats:
Mean:         2.809481
Std Dev:      0.589431
Minimum:      1.827191
1st Quartile: 2.345176
Median:       2.751632
3rd Quartile: 3.079201
Maximum:      4.628404
Good, now dear aunt Myrtle can be proud of our large average cucumber mass, and our consistent (low variance) growing results.
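
If you'd like to reproduce those numbers yourself, here's a minimal Julia sketch. The file name and its one-mass-per-line format are placeholders on my part, so adjust them to wherever you saved the data:

# A minimal sketch: load the 70 masses and print the same summary statistics.
using Statistics

masses = parse.(Float64, readlines("cucumber_masses.txt"))  # placeholder file name, one mass (kg) per line

println("Mean:         ", mean(masses))
println("Std Dev:      ", std(masses))
println("Minimum:      ", minimum(masses))
println("1st Quartile: ", quantile(masses, 0.25))
println("Median:       ", median(masses))
println("3rd Quartile: ", quantile(masses, 0.75))
println("Maximum:      ", maximum(masses))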

Parameter Inference

But we can do better: oftentimes, we can describe our data more concisely and more accurately if we make a distributional assumption.

For example, say we are a casino owner and are concerned about possible tampering with a new shipment of dice we just received. To investigate, we paid some poor soul to roll a few dice several thousand times and record the results (available to download here). We then remember the lesson our cousin Victor taught us with his prize winning veggies and summarize the 7500 rolls as:

Summary Stats:
Mean: 3.512000
Minimum: 1.000000
1st Quartile: 2.000000
Median: 4.000000
3rd Quartile: 5.000000
Maximum: 6.000000

Yet we're left feeling unsatisfied, because we don't really care so much about these particular statistics. Instead, we'd like some way to consider the chance of rolling each of the various faces of the die. So to do that, we assume that the dice rolls follow some *probability distribution*; in this case, the simple discrete, or categorical, distribution. That is, we assume there is some probability p_1 of rolling a 1, p_2 of rolling a 2, and so on, such that p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1. We call these values p_i parameters; given values for the parameters, we have specified a particular realization of the general model.

At this point, we have a model and some data, but we're stuck: we don't have any way of combining these two together!

Luckily our cousin Vinny has a few opinions in these matters and recommends a solution: count up the number of ones in your dataset and divide by the total number of rolls and use this proportion as the value for p_1. Then rinse and repeat for the other die faces and voila, you have values for all the parameters in your probability model.
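
In code, Vinny's recipe really is just a couple of lines. Here's a rough Julia sketch, assuming the 7500 rolls have been loaded into an integer vector called rolls (my name for it, not something in the download):

# Vinny's recipe: count each face and divide by the total number of rolls.
p = [count(r -> r == face, rolls) / length(rolls) for face in 1:6]
# p[1] is the estimate of p_1, p[2] of p_2, and so on; the six entries sum to 1.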

Statisticians call this general process inference, where data is used to infer, or estimate, the parameter values for an assumed model. While this technique of counting and dividing by the total seems logical, it is merely one potential way of going about the procedure, and there are much more principled ways of performing parameter inference.

Aside: While Vinny's method feels good, it is dangerous to proceed this way in general because the method is ad hoc and has no theoretical justification. What I mean is: we have no idea how 'good' Vinny's procedure is; it may be biased, introduce a lot of error, or even be flat-out incorrect, and unless we have more theoretical justification for an inference procedure, we must be extremely wary of the resulting estimates. Fortunately, much of statistics is focused on resolving the inference problem in a number of situations, so don't despair; unfortunately, we don't have the time or space to go into more detail here.

Model Based Inference

Unfortunately, spending so many long nights at the casino performing parameter inference on our dice shipments has caused problems at home. The household pet beetle, Sarah, is pregnant, and we're not sure the beetle cage proudly displayed in the living room is going to be large enough to house Sarah and her spawn. If only there were some way to know (or at least predict) how many babies Sarah is likely to have in the coming days, then we could sleep easier at night.

Fortunately, over the last thirty years, we've kept meticulous records of our beetles and their clutch sizes. Furthermore, we noticed that the larger and longer a beetle is, the more offspring it tends to have, so we recorded the mother's carapace length along with the number of beetle babies born in each clutch.

But now we hit a roadblock, because after searching Wikipedia, we can't seem to find any probability distribution designed to model this type of dependence between a beetle's length and the number of babies it has. Fortunately, we can create new probability distributions using the standard distributions (like the categorical distribution we used for dice, or a Gaussian, Poisson, or any other distribution) as building blocks to describe the situation we have. For example, in this case, we might assume that there is a linear relationship between the carapace length of a beetle and the number of children it is going to have.

"But wait a second, couldn't I just use the best fit line in Excel on my beetle clutch data?" The answer to that question is yes, but it leaves much to be desired: A linear best fit line will give us a prediction of the number of beetle babies born, but what confidence would we then have in that prediction?  And if our cage only can hold 15 babies, what is the chance that Sarah will have 15 or less babies?

For these types of probabilistic questions, we need a probabilistic model, which we can construct by gluing simpler distributions together to best fit the situation we have. For example, we can assume that the number of babies, B, is distributed (written with the '~' tilde character) according to a Poisson distribution, which has a single parameter (unlike the 6 we had earlier for our categorical dice distribution) that we set equal to the length L of the beetle times a constant a. Thus:

B ~ Poisson(a * L)

Using this, we can perform parameter inference for our model using our data, then plug Sarah's length into the model and see what the chance is that she'll have 15 or fewer babies.
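
To make that concrete, here is a rough Julia sketch. I'm using maximum likelihood for the inference step (for this particular model it boils down to a single division), and the variable names, Sarah's length, and the cage limit of 15 are all placeholders of mine:

# lengths[i] is mother i's carapace length, babies[i] her clutch size (placeholder names).
a_hat = sum(babies) / sum(lengths)   # maximum-likelihood estimate of a in B ~ Poisson(a * L)

lambda = a_hat * 2.3                 # expected clutch size if Sarah's length were 2.3

# Chance of 15 or fewer babies: add up the Poisson probabilities for k = 0 through 15.
prob = sum(exp(-lambda) * lambda^k / factorial(k) for k in 0:15)
println("Chance Sarah has 15 or fewer babies: ", prob)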

I know I'm rushing through the details here, but I hope you still grabbed the main concept: we add complexity by building up larger models using simple distributions as building blocks, and then use these models to learn from the data and make predictions.

Hypothesis Testing

"But wait," I hear you say, "you haven't said anything about that beloved old warhorse of mine: the T-test. Isn't that an integral part of statistics?"

Ahh yes, the t-test. Don't worry my old friend, we'll get there in good time.  But first, let's back up a little.

Lester, our cucumber growing cousin from Latvia, is clamoring for a fight. To him, the thought of some 'foreigner' growing heavier cucumbers is a great insult to the pride of Latvia. In a smug email, he then retorted that his cucumbers were obviously larger and gave us his list of 30 cucumber weights to prove it.

So the question becomes, are his cucumbers really larger than ours?

One straightforward way to proceed is to calculate the mean of his cucumbers and compare it to the mean of ours. But this method is problematic: neither of us weighed every cucumber, merely a sampling (we're assuming Lester sampled uniformly and didn't bias his sample by weighing only his largest ones), and he could have just gotten lucky and randomly picked a few good ones.

So to make progress, we need to more precisely state the question we would like to ask of our data. One of the simplest sets of assumptions we can make about our cucumber data is that it can be modeled using a Gaussian distribution.

The Gaussian distribution has two parameters, one which controls its location, and the other its spread. In fancy terminology, we can say that it is parameterized by the mean and standard deviation.

Now we can specify our question precisely enough to answer it mathematically: Is the mean of the Gaussian modeling cousin Lester's cucumbers' masses larger than the mean of the Gaussian modeling my cucumbers' masses?

To answer it, we can either do all the calculations ourselves or perform a t-test, which was designed to answer this exact question.
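
If we go the do-it-yourself route, the core calculation is surprisingly small. Here's a Julia sketch of the unequal-variance ('Welch') flavor of the t statistic, with made-up vector names for the two samples:

using Statistics

# ours and lesters hold the two samples of cucumber masses (placeholder names).
m1, m2 = mean(ours), mean(lesters)
v1, v2 = var(ours), var(lesters)
n1, n2 = length(ours), length(lesters)

t = (m2 - m1) / sqrt(v1 / n1 + v2 / n2)   # how many standard errors Lester's mean sits above ours
# Comparing t against the appropriate t distribution then gives a p-value,
# which is exactly what a canned t-test routine reports for you.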

Summary

Now, I have glossed over many details to make this whole introduction more palatable, and I couldn't even show you the result of a t-test without going into the details of how to interpret it (which are fraught with subtleties that would take time to cover properly). But I hope I have given you a high-level overview of some of the main concepts of statistics.

If you're interested in learning more, then of course there are many details to cover in each of the topics above, but I've also completely ignored the large topic of Bayesian versus Frequentist statistics. While they both cover the same activities above, they do so in philosophically different ways. This division often causes large areas of disagreement among amateurs and mid-level statisticians, but I would say that most professionals know where the dividing line is and pick a side while acknowledging that the other side is a valid way to proceed given certain assumptions and requirements. But that is another topic for another time!

2013-10-07

Long or Short? Entertaining or Cerebral?

I had a few minutes and I was contemplating writing another speed blogging post when I began to think about the merits of shorter blog posts vs. longer blog posts. So I'm going to post the benefits of both and then, if you're reading this, please let me know what kind of blog posts you'd like to see in the comments at the bottom!

Short Posts
Everybody and everything tells us that, in this day and age, we are busy individuals, and the quicker someone can give us the relevant sound bite or headline, the better. So brevity is a definite plus for drawing in readers, and it certainly makes posts easier to write. In fact, one of the reasons I'm bringing this up at all is my startling list of blog post drafts, which continues to grow despite their near-zero completion rate. The pressure and writer's block associated with writing long posts is apparently greater than my desire to share them with the rest of the world.

And finally, the process of boiling an argument down to the bare essentials, stripping away all the cluttering details, is valuable in itself, and something that is critical in sales (another post that is in the works), parenting, life, etc... And the YouTube channels that I now admire all center around shorter videos, most under three minutes. I feel myself drawn away from the longer TED talks and university courses as a result.

Long Posts
One of my original inspirations for blogging is the intellectual powerhouse that is Paul Graham. His series of essays, going back several years, is a flood of thought-provoking prose on topics from high school to entrepreneurship. His essays are usually quite long, and are more stream-of-consciousness than bullet-pointed persuasion pieces.

And it seems that one of the points of a blog in the first place is to go into more depth than would be afforded by a tweet or Google+ post. And how often do we get a chance to really sit down and examine a problem in full detail in our normal lives?

Summary
To confuse things even more, I'll give three more opinions (or perhaps even meta-opinions) about this argument:

(1) So it seems like short vs. long is more a question of priorities: Do I want to attract readers who are unable to sit through a 2000-word monologue? Or do I want to develop my thinking skills and to heck with those pesky readers and their short attention spans!

(2) Of course, now that I'm thinking about it, I may have it all wrong again. It's not about short or long, but about how long I need to say what I need to say. This post itself is an argument for that direction, because in the process of writing it, I have clarified my thoughts enormously compared to the jumble I had when diving in.

(3) And finally, perhaps it doesn't matter what I say, or how long it takes me to say it, as the key is getting into the habit of writing things down. According to what I've read, it is the habit of writing regularly that is important, not how you write or what you write. (This thought may be a bit out of the blue, as it's unmotivated by any of the above, but I'm putting it in here anyway.)

So what do you think? Did you like the rambling second half of the post, or the more structured and orderly first half? What would you like to see?

2013-08-24

Speed Blogging - Flow and Mindfulness

I'm going to try something today that I've been meaning to do for a while: speed blogging.

Basically setting a timer for a few minutes and seeing how much of a blog post you can get out in a small amount of time. This way I don't burn 2 hours on a blog post and feel less inclined to do it the next time because of the huge time commitment. So I've got 6 minutes and twelve seconds left on my egg timer (literally we are boiling eggs) and we'll see how far I get.

So Bryan has started a new blog (located here) and in his most recent post he talks about mindfulness. In due time I will address all of the Hero's Journey content he has been writing about, but I wanted to share one thought about mindfulness itself.

Specifically, he gives the example of driving to the grocery store, and before you know it, you find yourself in the parking lot (or even in the store) without realizing where the last few minutes had gone.

This is a great example of not being mindful: the thinking process was dominated by thoughts of worry about the future or regret about the past. (Although I'm not sure how I would classify it if those were good thoughts about the past or future.)

But on the other hand, this same phenomenon can occur in something called "Flow" (obligatory wikipedia link). Flow is where you get so wrapped up in a project that you lose track of the world around you, the passage of time, and even your own physical state of being (tired/hungry). I would say that being in a state of Flow is actually one of the best examples of mindfulness, because you are no longer even trying to live in the present, but instead your thought processes are completely embedded in the present. You have no thought for past, future, or even the external world.

So one way to practice being mindful is to practice Flow. Luckily there is quite a bit of literature on how to initiate and maintain states of flow, and I'm hoping the Wikipedia page will lead the way (for both you and me).

Wow! A whole blog post in under 6 minutes, pretty cool. A little stressful, but a fun exercise nonetheless. And 53 seconds to spare too!

2013-07-12

(Statistical) Trouble with Trouble (the board game)

"Don't move until you see it." (credit to Joshua Waitskin)
I was at the in-laws house a few weeks ago when I stumbled upon the classic board game of Trouble.

After a few microseconds of nostalgia, my mind raced to the question that I'm sure most people have the first time they see this game: "Does the dice 'popper' mechanism introduce any observable bias into the roll patterns?"

Now, for those that don't have experience with the game, it comes with a nifty little 'pop-o-matic' dice roller in the middle. Upon pushing this clear plastic half-hamster-ball bubble, the dice will rattle around a few times and then let you know how far to advance your pieces.

But there's the rub. To my eye, the dice didn't seem to have enough space to move around in the little bubble to sufficiently randomize each roll. So could this introduce some pattern to the supposedly random rolls?

To find out I recruited the nearest volunteer I could lay my grubby mitts on (my lovely wife) and had her 'roll' the dice a few times while I typed the results into my computer. After a few rolls (185 to be exact... which reminds me I need to add another entry to the "I have the best wife in the world" list), it was time to do a few simple calculations. First, let's look at the number of 1's, 2's, 3's, etc... we rolled overall:
1 rolled 34 times
2 rolled 32 times
3 rolled 27 times
4 rolled 35 times
5 rolled 25 times
6 rolled 32 times
Hmm... looks pretty even to me. If we wanted, we could use some fancy statistics to give us a probability that this distribution of counts is 'uneven', but with the typical wagers on a game of Trouble being what they are ($0), we'll let it slide for now. And besides, it makes intuitive sense that we don't see any biases showing up in these counts, as a symmetric die with even a small amount of randomization should end up pretty uniformly distributed.
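
(Okay, I can't resist entirely. For the curious, the 'fancy statistics' here would be a chi-squared goodness-of-fit test, and the arithmetic is short enough to sketch in a few lines of Julia:)

counts   = [34, 32, 27, 35, 25, 32]
expected = sum(counts) / 6                        # 185 / 6, about 30.8 rolls per face
chisq    = sum((counts .- expected) .^ 2 ./ expected)
# chisq comes out around 2.6, comfortably below the usual 5% cutoff of about 11.07
# for 5 degrees of freedom, so the overall counts really do look even.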

But now we turn to the interesting story of second order interactions. And by that I mean: given that we start with (for example) a 6 showing, are we more likely to see certain faces on the next roll?

To investigate, I considered each observation in pairs (like 5 to 3, 3 to 6, ...), and then binned them into the 36 possible combinations (6*6). With this I obtained a matrix where the element in the first row and second column would be the number of times we saw a 1 and then a 2.
to:       1   2   3   4   5   6
from 1:   5   2   5   3   6  13
from 2:   3   5   5   6   7   5
from 3:   3   4   2  11   2   5
from 4:   6   6   9   6   4   4
from 5:   5  10   3   3   1   3
from 6:  11   5   3   6   5   2
If you look at it for a second, something might jump out at you: There's a diagonal line of entries that are all significantly higher than the surrounding elements! Specifically, we see something like a 1 going to a 2 only twice, but we saw 13 1's rolling to a 6.

To explain this, I only need to show you one picture: a die 'unrolled' (so punny) so you can see all the faces.

Don't move until you see it... (a classic inside joke from watching too many Chessmaster videos narrated by the wonderful Joshua Waitskin). Notice how the 1 and the 6 are on opposite sides of the die, and so are the 3,4 and 2,5 pairs. And notice how these are the exact same pairs that are over-represented in our sample! So this lends evidence to the theory that the popper tends to flip the die over at a better-than-chance rate.

So next time you play Trouble, you'll have to decide whether to tell everybody this trick in the spirit of good sportsmanship, or to keep it to yourself and go on to wipe out Grandma's life savings in your ruthless ploy of statistically biased, high stakes Trouble.

Appendix 1: And I must say, I feel slightly proud, because a little bit of light Googling turned up nothing in regards to this bias, so I may very well be the first person on the entire internet to have documented it... or more likely: I simply didn't exercise enough Google-fu to find it.

Appendix 2: If you want the data, then say no more: '15264425415362261611516125166443431234131525646324116165231641523346213413545111466224436542365144532265251616362616443364434526132522342556143432242161165242434252413461634534352156162' although I warn you that there were some unobserved rolls by the interruption of a certain curious 3 year old. If you want the code, then let me know, but it is only about six lines of Julia.

Appendix 3: Bonus points, fit these observations to a Markov chain model to use in your next game with Grandma.
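
And if you'd rather not wait on me for the code, here's a quick Julia sketch of the same idea (not my original script, just a reconstruction), which also gets you the pair counts and, as a bonus, the Markov chain of Appendix 3:

data  = "152644254153..."                # paste the full roll string from Appendix 2 here
rolls = [c - '0' for c in data]          # characters to integers 1 through 6

# Count how often face a is immediately followed by face b.
counts = zeros(Int, 6, 6)
for (a, b) in zip(rolls[1:end-1], rolls[2:end])
    counts[a, b] += 1
end

# Row-normalize the counts to get the Markov chain transition probabilities.
P = counts ./ sum(counts, dims=2)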

2013-07-08

Turning a New Leaf

Recently, I've gained a clarifying insight into my own inner workings.

It started with my brother, Bryan, spontaneously gifting me one of my Amazon Kindle wishlist books. I must admit that when I first saw the title he had lovingly purchased for me, I couldn't remember putting that particular book on my list and worried that I had wasted his effort and money by having an out-of-date wishlist.

Fortunately I was wrong, vastly wrong.

The book is called Mindset, by Carol Dweck. The premise is simple and you can read about it in a little more detail on her book website: there are two general ways that people tend to think about their own abilities. She calls these the growth mindset and the fixed mindset. People with fixed mindsets tend to agree with statements of the form "I have a limited ability for ____ and although I may be able to improve with practice, any improvement will be ultimately limited by my fundamental capacity." On the other hand, people with growth mindsets tend to see their abilities as essentially limitless, and determined by the amount of effort and practice that they've spent developing them.

This may sound rather simple or even like plain old common sense: "Believe in yourself." "You can do anything you put your mind to." The Little Engine That Could. etc... But the real novelty that I found in this book was the extensive ramifications that propagate from these two mindsets, and the pervasive hold that the fixed mindset has on our society, almost lurking beneath the surface. Let me give you an example:

Imagine eight year old Jimmy comes home from school one day with boundless energy and a gigantic smile. "MOM, DAD!" he cries, "I made an A+ on my spelling test!" "Wowee! Look at that," his dad says proudly as he gives Jimmy a hug, "what do I always tell you? You're smart as a whip, just like your mother."

Stop. Did you see the mistake? Probably not, but that's okay. Let's fast forward a bit and see how this scenario plays out. After basking in the praise and attention he earned as a result of his test grade, Jimmy returns to school the next day. Here the teacher stands at the front of the room and gives an announcement:

"I have exciting news everyone. I have here the signup sheet for the school's annual spelling bee that will happen next month. All of your parents will be invited, and the best speller will win a $50 gift certificate."

Now switch back to Jimmy for a second, do you see the little gears in his head whirring? For some kids, the internal monologue goes a bit like this, "Whoa! I'd love to impress my parents, I'm going to try so hard and win this for them."

On the other hand, it often sounds a lot more like this, "... I've never been that good at spelling, and if I look bad in front of my parents, they won't think I'm smart anymore. I probably just got lucky on that test anyway, I don't think I'll sign up."

Now you can start to see what I meant by his dad's 'mistake'. Because he was praised for being "smart", Jimmy has even more incentive to do things that retain his outward appearance of smartness without sticking his neck out too far and exposing any chance of failure. After all, if he's already been labeled smart, then why jeopardize that status by taking such a risk? Equivalently, since he has a fixed ability (set in from birth), why risk showing people the upper bound on his limits, or even worse: finding those limits out for himself?

Long story short (I could probably go on for a long time about this, as the idea has far-reaching implications in business, education, and relationships): I have discovered that while in many respects I love challenging myself to grow and mature, I recognize the tell-tale signs of the fixed mindset in many aspects of my life:

  • As a child and as an adult I have loved learning, about most anything. But when it comes time to put this knowledge into practice (which I usually have the desire to do), I promptly switch gears and learn about something else. Thus a long line of unfulfilled projects has accumulated on the TODO list of my past: writing music, making a 3D model in Blender, writing short stories, writing open source code projects (using any number of the languages/toolkits/libraries that I have studied and pored over).
  • In applying to undergraduate and graduate colleges, I never even applied to the top-tier schools. Yes, there were reasons at the time for applying where I did, but I think a strong unspoken element remained: by not applying, I can always say "I didn't even apply to those schools, but I probably would have gotten in," whereas by applying and getting rejected, it would have been clear to me (and the rest of the world) where my limits are (and that they even exist).
  • Now in graduate school, I have many creative and exciting ideas for things to try in research, the knowledge to (start) implement(ing) them, and the freedom to do so from my wonderful advisers. But time and time again I find myself applying myself most fervently to the easier problems and putting off the truly challenging and interesting problems.
The last bullet point is the most frustrating and perplexing. I have all the resources at my fingertips: time, (enough) freedom, ability, potential collaborators. What the heck is my problem? At times I tried to blame this on circumstances or the people around me: so-and-so wasn't supporting me, I work best in teams, I find the easy stuff the most interesting, etc... But after reading Mindset, I realize the problem lies squarely in having a fixed mindset. With this mindset, if I try and fail, then I've found my ultimate limitations and announced them to the world. Whereas if I make up excuses about everything around me, then it was never in my control in the first place.

This may sound logical and obvious. But it has taken me this long to figure it out because I'm usually so good at the growth mindset. At different levels of the work/life hierarchy I am a master of trying hard until you succeed: Don't understand higher-kinded types in Haskell? Don't understand how to develop distributed, fault-tolerant systems with ZeroMQ using the PAXOS algorithm? Don't understand measure-theoretic probability theory? Stochastic processes? etc...? Then go back and try again, find different resources, attack it from a different angle, talk to people about it. Just keep going!

So for some reason, when it comes to learning, I have a growth mindset to the extreme, but when it comes to doing, I have a fixed mindset. Why that came about I'm not sure, but now that I've recognized it, I'm going to take actions to fix it. 

My first real action is this blog post itself. In posting this I am a) automatically increasing my accountability when people subsequently ask me about my progress and equally importantly b) actually DOING something (anything!) without worrying about what people will think when they read it/find mistakes/disagree/think my points are overly simplistic or what have you. 


And more generally, I have made a list of how I spend my (non-family) free time and broken it into two categories:

  • Learning
    • Reading Books
    • Reading the Internet
  • Doing
    • Writing (blog posts mainly, but who knows for the future)
    • Building
      • Open source projects (on/using Julia)
      • Binky (I'll write a blog post on this someday soon)
    • Growing (stretch learning)
      • Math (eg: HoTT book and prerequisites) (but doing it Right meaning doing exercises, rather than just reading the chapters)
      • Piano (requires more practice than learning)
Currently I spend 99.9% of my time in the Learning category, but I'm going to be blocking some websites and making very specific plans for how to spend much more time in the Doing category. And of course I'll be targeting my graduate research head-on without worrying about failure. I'll also be making very detailed plans moving forward and using some of the motivational techniques I have learned over my years of (excessive, considered in isolation) learning to stick to all of this. :)

So next time you're afraid of something: fail early and fail often [1]. I'm pumped!

[1] - This is wonderfully explained more in the Lean Startup doctrine that has recently gained such a large following in the entrepreneurial world. 

2013-02-21

Harnessing Computers - One (Linux command line) step at a time

This post again is mainly for Linux command line users, but it may have more general appeal as well:

Computers often make our lives easier in innumerable ways, but it is often quite a challenge to figure out the 'best' (or even a good) way to perform a single task. I'd like to share an example where I hit a sweet spot in this effort while backing up some pictures the other day.

The general problem is simple: you have multiple computers/devices and you aren't sure whether a set of pictures was backed up to your house-wide backup solution (you do have one, right?). This becomes especially difficult when you aren't the main operator of some of those devices, so it becomes really confusing who took what pictures off the camera, when, and whether they were backed up.

The naive solution is to look at the set of pictures in question and browse through your backup trying to find the same folder name, then check the pictures inside. To speed this up, I used the find command in Linux, which works as follows:
find <dir> -iname <filename>
Where -iname specifies that we don't care about case, and dir specifies the directory where the search begins (the search is recursive, so it includes all files and folders underneath).

So after picking a file name at random from the set of pictures in question (IMG_4071.JPG), I searched and found a few results. Then rather than browsing to this location and checking the files manually, I decided to use a little more of find's magic: We can also tell find to perform a command on each file that matches:
find <dir> -iname <filename> -exec <cmd> \;
Where cmd operates on each matching file separately and the escaped semicolon (\;) tells find that the command is finished. So I ran the following command on both the folder in question and the backup directory, and compared the results:
find <dir> -iname IMG_4071.JPG -exec md5sum {} \;
Now md5sum is a program that computes a short string (a hash) from the data in a file; the same file always gives the same string, but different files essentially never do. Thus if two images have matching md5sums, they contain the same image data with overwhelming probability.

This way I could quickly determine, at a glance, if I had already backed up an image (and by extension, a set of pictures) without having to search through hundreds of directories, and tens of thousands of pictures.

And yes, this could be extended to even fancier methods, but I think this hits a sweet spot between the amount of work put in (writing a single command) and what I needed from it.

2013-01-18

Faster Compression

Reader Warning: If you are not familiar with the linux command line, you best turn back now and try coming back later for other, less technical posts. :)

I was getting tired of waiting for next-generation sequencing files (6 - 40GB uncompressed) to compress and decompress, so I decided to speed things up a bit while feeding my 11 idle cores more evenly.

I found pigz (pronounced pig-zee) and lbzip2, which are gzip- and bzip2-compatible Linux utilities specifically designed to utilize multiple cores. To figure out their relative merits against their single-core predecessors, I decided to have a little bit of fun. Here is a set of timings I collected on a small test file (with real ASCII sequence data):


In summary, I am extremely impressed at the boost that a single letter (and 11 idle cores) can give to compression speed. Also, with lbzip2 fully accelerating both compression and decompression (unlike pigz), the bzip2 format becomes not only feasible but completely logical!

Another benefit of using bz2 over gz is the ability to quickly index into these files with random seeks, as explained in this nice blog post.

Caveat: the example command given for measurement is written for the zsh shell.

Also, if you'd like to comment, please do so on my G+ post here.