Optimizing Python + Numpy Code

So I decided to learn Cython, and in doing so, I found an interesting post comparing Matlab speed to Numpy + Python and Numpy + Cython. Obviously I was intrigued, and with the available code, pasted into vim I was off and running!

The first thing to do is the first step in ALL optimization workflows: profiling! Using the wonderful cProfile.run('GMMGaussianMC(100000)','prof.dat') we can glean the following: (output pasted here for readability).

  • Hmmm.. 32 seconds (out of 60) in numpy.linalg.solve
  • Only 10 seconds is spent 'in' GMMGaussian (and 3 in GMMGaussianMC), so the bulk is in Numpy
This explains why Tim's Cython conversion did not net any gain (it's all numpy bound). Thus while chewing on the numpy problem, I started implementing some simple (but in retrospect, probably minor) common python + numpy optimizations: vectorizing code by pulling it out of the inner for loop. And inlining the function call (I'd be curious to see how much this helped, as it is a rather ugly change in general). 

This took us down from 30 seconds to about 22 seconds, not bad, but not great either.

Now, for the linear algebra. It appears that a great deal of our time is spent in the numpy.linalg.solve function, why is this? And after a bit of searching I found this insightful mailing list post (see towards the bottom). It appears that MATLAB does a lot of heuristics in order to use the most efficient way of solving a linear system, whereas numpy takes the easy out approach by calling the generic LAPACK solver (DGESV for anyone interested). After rejecting the idea of compiling a faster LAPACK because it would require rebuilding Numpy, I decided to look more closely at the linear algebra at use in the code.

Sure enough, I found that we were dealing with purely diagonal matrices which makes the prospect of simply inverting and multiplying possible. BAM!!! This takes us down to 8 seconds, which is comparable to the MATLAB script.

Profiling now we can see that everything is either in the GMMGaussianMC function, or quick numpy functions like randn, dot, inner, etc... There is not much we can do about this to my knowledge, so we are now done! (link to final code)

Now time to go learn about the theory behind this algorithm... 

P.S. Sorry about the poor writing quality in this post, I'm trying to talk and write at the same time, and the result is a poor mess of both :)