On IPython and Reproducible Research

I’m attending the last day of the Keystone Symposium on Precision Genome Engineering and Synthetic Biology.  The afternoons are free, and the skiing is kind of weak, so when I need a break from TALENs and Cas9 (so much Cas9), I’m learning Python.

What’s particularly interesting is the community that’s trying to position Python as the next big thing in scientific computing: the successor to R, MATLAB, Mathematica, etc.  I used to think of Python as a “programming language” like C or Java or Perl, where you write a program to do what you want, then run it on your data.  (And there are plenty of resources to support using it that way; PyDev comes to mind.)  I knew from my first brush with it 15 years ago (!!) that it had a REPL interface: you can bring up a Python “command line” and type expressions in, and the interpreter will evaluate them for you and give you the answer.  I didn’t really think much of it; I figured it was useful for noodling around, learning the language, debugging, etc.

Boy was I wrong.

IPython is a Python shell with proper support for interactive computing, like R or MATLAB.  It extends “traditional” Python with support for parallel and distributed computing, tight integration with several visualization toolkits, and a browser-based notebook that lets you record your data analysis workflow along with the results, and then share the whole thing trivially with coworkers and collaborators.  It makes literate programming absolutely effortless.

(I should note that IPython isn’t the only player in this space; Spyder and Enthought Canopy are two of the other efforts to make Python well-suited for interactive scientific programming.)

The other part of the equation is a set of libraries for data handling and analysis.  SciPy and Sage are two “meta” libraries, bundling together a lot of mature software for importing, manipulating and analyzing data; building and running models; doing computational experiments, etc.  I was particularly happy to discover pandas, a library for handling structured data similar to data frames in R.  The toolkit isn’t quite as developed as R’s or MATLAB’s, but it’s growing as companies embrace the open source ethos of using Python tools for their own work, improving those tools and then contributing their improvements back to the community.  Adoption seems to be particularly strong in the academic community; it even saw a spot on Nature.com recently.
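To give a flavor of the data-frame style of working, here is a minimal sketch using pandas.  The column names and numbers are made up purely for illustration; the point is only that filtering and grouped summaries read much like their R counterparts.

```python
import pandas as pd

# Hypothetical measurements, invented for this example.
df = pd.DataFrame(
    {
        "sample": ["a", "a", "b", "b"],
        "replicate": [1, 2, 1, 2],
        "signal": [0.50, 0.70, 1.10, 1.30],
    }
)

# Keep only the rows that meet a condition, much like subsetting
# a data frame in R.
strong = df[df["signal"] > 0.6]

# Group by one column and summarize another.
mean_signal = df.groupby("sample")["signal"].mean()

print(strong)
print(mean_signal)
```

Typed into an IPython session, each of these lines can be run and inspected one at a time, which is where the interactive shell and the library start to complement each other.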

Which brings me to reproducible research.  Philip Bourne is one of my science idols; he was the founding editor-in-chief of PLoS Computational Biology and the originator of the “Ten Simple Rules” series (if you are a researcher in any field and you haven’t browsed these, you should!).  He has long been an advocate of reproducible research, but especially in computer science and computational biology it can be difficult to document exactly the steps you took to generate your data or do your analysis.  The last time I heard him speak on the subject, he was advocating standard directory layouts to organize data and using GNU Make to automate the running of tools, programs and scripts.  Clunky and time-consuming to say the least.

An IPython notebook completely obviates that.  It lets you record exactly what you did (the Python code) along with the rationale (in beautiful rich text) and the output, all stored in one place.  It makes it trivial to publish your work so that others can reproduce it, but the importance goes way beyond that.  I’ve learned the hard way that keeping a good notebook isn’t for some speculative person who picks up my work when I’m gone; it’s for me-in-six-months.  Keeping track of where I’ve been mentally, and what I’ve tried that didn’t work (or occasionally did), is astoundingly important … and anything that can make that easier is something that I’ll adopt enthusiastically.
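To make “all stored in one place” concrete: a notebook file is a JSON document holding a list of cells.  The sketch below builds a tiny one by hand with only the standard library.  The field names follow my understanding of version 4 of the notebook format and the cell contents are hypothetical, so treat the exact schema as an assumption rather than a reference; in practice the notebook application writes this file for you.

```python
import json

# A minimal, hand-written notebook: one rich-text cell for the rationale
# and one code cell with its recorded output. Contents are hypothetical,
# and the schema is a simplified sketch of notebook format version 4.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["Check the arithmetic before trusting the pipeline."],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["2 + 2"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["4"]},
                }
            ],
        },
    ],
}

# Serialize to the text that would be saved on disk, then read it back.
serialized = json.dumps(notebook, indent=1)
restored = json.loads(serialized)

# The rationale, the code and the output all travel together in one file.
cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Because the code, the explanation and the result sit side by side in a single text file, sharing the analysis amounts to sharing that file.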

So, now I’m a Python enthusiast.  I’m not looking forward to scaling the learning curve, but the underlying language makes a lot more sense to me than, say, R (which I’ve been using for a decade and still don’t feel particularly comfortable in).  If only I could get easy integration between IPython and my Drupal-based online notebook…

Postscript – I know that Mathematica has had a notebook interface since the late 1980s.  IPython’s strikes me as more flexible, better looking, and based on open standards, and you can get it without paying a zillion dollars.  (-:
