First impressions of the IPython Notebook
I've been using the IPython Notebook for less than a week, and I can already tell that there's no way I'm going back to my former data analysis workflow. Here are my first impressions.
My former workflow
Before discovering the IPython Notebook, I analyzed data with Python for many years using the following workflow.
I first create a new directory for each experiment and then enter the following loop:
This cycle repeats dozens of times per day, and thousands of times throughout the course of a project. My experiment directory gets littered with hundreds of files containing:
There are no semantic links between those files, so I can't easily tell, say, which output diagram was produced by which script (running with which set of command-line arguments), or which part of a particular note refers to which output.
The best I can do to impose some order amidst this chaos is to give each file a sensible name and organize all files into a well-groomed directory hierarchy. Attempting to do so either leads to cryptic long filenames such as the ones shown in this monstrosity:
or I just give up and name my files something useless like
The extreme overhead of organizing my code, output, and notes takes time away from doing real work.
Data analysis with the IPython Notebook
Here is how I now do data analysis with the IPython Notebook.
I first create a new notebook file for each experiment and then enter the following loop:
This workflow looks eerily similar to my original one, with one superficial but massively important difference: everything related to my analysis is located in one unified place. Instead of saving dozens or hundreds of code, output, and notes files for an experiment, I keep everything bundled together in a single file.
The IPython Notebook drastically reduces the overhead of organizing code, output, and notes files, which allows me to spend more time doing real work.
I no longer need to get annoyed by creating files like
Also, if I show an output diagram to my colleagues and they give me a suggestion, I can write it down as a note right below that diagram rather than putting it in a totally separate note file.
Setup: I got the IPython Notebook up and running very quickly thanks to the wonderful Enthought Canopy distribution. I just downloaded and installed the distribution, launched the Canopy IDE, and created a new IPython Notebook from within it.
Update in Dec 2013: I found that the Canopy IDE on Mac is a bit laggy. What works better instead is installing Canopy as usual and then launching IPython Notebook directly in the Web browser by switching to my experiment directory and running:
Notebook organization: I typically keep one notebook file for each experiment, structured in the following way:
I keep repeating the pattern of Code -> Output -> Notes for subsequent rounds of analysis until my notebook gets too big, and then I usually start a new notebook file.
Separating analysis and plotting cells: One useful idiom is to keep separate cells for analysis and plotting code, especially if the analysis takes a long time to run. The basic idea is to assign the results of an analysis to a global variable, and then, in a separate cell, read the contents of that global variable to generate graphs. That way, you can experiment with many different types of plots without re-executing the (long-running) analysis code.
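Here is a minimal sketch of that idiom, with cell boundaries marked by `# %%` comments. The `slow_analysis` function, the `results` variable, and the text-based `plot_bars` helper are hypothetical stand-ins that I made up for illustration; in a real notebook the plotting cells would more likely call a library such as matplotlib.

```python
# %% Cell 1: long-running analysis. Run it once and keep the result in a global.
import time
from collections import Counter


def slow_analysis(words):
    """Hypothetical stand-in for an expensive computation."""
    time.sleep(0.1)  # pretend this takes minutes
    return Counter(words)


results = slow_analysis(["a", "b", "a", "c", "a", "b"])


# %% Cell 2: a first attempt at a plot. It only reads `results`.
def plot_bars(counts, symbol="#"):
    """Render counts as a text bar chart, most frequent first."""
    lines = []
    for key, n in counts.most_common():
        lines.append(f"{key} {symbol * n}")
    return "\n".join(lines)


print(plot_bars(results))

# %% Cell 3: a different view of the same data, without re-running Cell 1.
print(plot_bars(results, symbol="*"))
```

Only Cell 1 pays the cost of the analysis. Cells 2 and 3 can be edited and re-executed as often as needed because they never call `slow_analysis` themselves.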
Refactoring into Python modules: The IPython Notebook is great for writing many small snippets of analysis and plotting code, but it's not a full replacement for traditional Python source files. Thus, after prototyping in the notebook, I usually refactor some of my code into individual Python files so that I can edit more comfortably using my preferred text editor. I then import those modules in the topmost code cell of each notebook.
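As a rough sketch of what that topmost cell might look like: the module name `analysis_helpers` and its `mean` function are hypothetical, and the snippet writes the file into a temporary directory only so that it runs on its own. In practice the file would already sit next to the notebook. The `importlib.reload` call is my own assumption about how to pick up edits without restarting the kernel, not something the workflow above requires.

```python
# Topmost notebook cell: import helper code that lives in a regular .py file.
import importlib
import sys
import tempfile
from pathlib import Path

# For a self-contained demo, write a hypothetical module into a temp directory.
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "analysis_helpers.py"
module_path.write_text("def mean(xs):\n    return sum(xs) / len(xs)\n")

# Make the directory importable, then import the module as usual.
sys.path.insert(0, str(module_dir))
import analysis_helpers

print(analysis_helpers.mean([1, 2, 3]))

# Simulate editing the file in a text editor: the new version handles
# empty input instead of raising ZeroDivisionError.
module_path.write_text(
    "def mean(xs):\n"
    "    xs = list(xs)\n"
    "    return sum(xs) / len(xs) if xs else float('nan')\n"
)

# Reload so the running kernel sees the edited code.
importlib.reload(analysis_helpers)
print(analysis_helpers.mean([]))
```

Keeping the imports in one cell at the top makes it obvious which external files a notebook depends on.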
Beware of global namespace pollution: Because all cells in a notebook execute in a single global namespace within one Python process, all top-level variables are shared across cells. While this is a major convenience most of the time, it can also lead to problems if you're reusing the same global variable in multiple cells and forget which cell was most recently executed. The most subtle of these bugs arise for temporary variables defined within loops, since those are still globally scoped and persist even after loop exit. When in doubt, just restart your IPython kernel and re-execute all of your cells from scratch.
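The loop-variable case can be shown with a small sketch. The variable and function names here are made up for illustration, and the `# %%` comments mark where the cell boundaries would be.

```python
# %% Cell A: a loop at the top level of a cell leaves its variables behind.
totals = []
for row in [[1, 2], [3, 4], [5, 6]]:
    tmp = sum(row)
    totals.append(tmp)

# Both `row` and `tmp` are still defined after the loop exits, so a later
# cell that uses `tmp` by mistake fails silently instead of raising NameError.
print(row, tmp)


# %% Cell B: wrapping the loop in a function keeps the temporaries local.
def row_totals(rows):
    out = []
    for r in rows:
        t = sum(r)
        out.append(t)
    return out


totals_again = row_totals([[1, 2], [3, 4], [5, 6]])

# `r` and `t` never reach the shared namespace.
print("r" in globals(), "t" in globals())
```

Wrapping cell bodies in small functions is one way to limit the problem, at the cost of making intermediate values harder to inspect interactively.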
Wish list: Here are some potential features that would improve the user experience of this already wonderful tool:
Last modified: 2013-12-02