Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Burrito: Wrapping Your Lab Notebook in Computational Infrastructure

research paper summary
Burrito: Wrapping Your Lab Notebook in Computational Infrastructure. Philip J. Guo and Margo Seltzer. USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2012.
Researchers in fields such as bioinformatics, CS, finance, and applied math have trouble managing the numerous code and data files generated by their computational experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.

We created a Linux-based system called Burrito that automates aspects of this tedious experiment organization and notetaking process, thus freeing researchers to focus on more substantive work. Burrito automatically captures a researcher's computational activities and provides user interfaces to annotate the captured provenance with notes and then make queries such as, "Which script versions and command-line parameters generated the output graph that this note refers to?"
@inproceedings{GuoBurrito2012,
  author = {Guo, Philip J. and Seltzer, Margo},
  title = {BURRITO: Wrapping Your Lab Notebook in Computational Infrastructure},
  booktitle = {Proceedings of the 4th USENIX Workshop on the Theory and Practice of Provenance},
  series = {TaPP'12},
  year = {2012},
  location = {Boston, MA},
  url = {http://dl.acm.org/citation.cfm?id=2342875.2342882},
  acmid = {2342882},
  publisher = {USENIX Association},
  address = {Berkeley, CA, USA},
}

(This summary was adapted from the Burrito project home page.)

Computational researchers and data scientists have a lot of trouble managing the plethora of code and data files generated by their experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.

To reduce these annoyances, we created Burrito (GitHub repo), a Linux-based activity monitoring system that automates as much of this experiment organization and notetaking process as possible, thus freeing researchers to focus on their actual work of getting insights from data.

Burrito automatically captures a researcher's computational activities with no perceptible run-time slowdown. It provides user interfaces to annotate the captured activities with notes and then make queries such as, “Which script versions and command-line parameters generated the output graph that this note refers to?”

The Problem

The process of hacking on experimental/researchy code is messy:

  • You constantly adjust your code and run it with different parameters, which generates a ton of output data files.
  • Version control systems are too cumbersome for your rapidly changing workflow. Instead, to keep track of what each file represents, you create weird filenames to encode metadata such as command-line parameters or version numbers (see screenshot below for an example):

  • It's hard to assess how changes to your source code and execution parameters led to corresponding changes in output data files.
  • You read documentation web pages, PDF files, sample code, and other resources while you hack, so it's hard to remember which resources influenced you to make specific edits.
  • You try to be disciplined about keeping notes, but you often forget what exact code or data your notes refer to, since they change so rapidly.

Burrito System Overview

Burrito solves the above problems by wrapping a layer of computational infrastructure around your normal Linux work environment. It consists of eight main components:

  1. A versioning filesystem that automatically tracks all edits to all of your files and allows you to view old file versions. This eliminates the need to use version control systems or weird file naming conventions.
  2. A tracer that records the origin (provenance) of files, telling you which program invocations created or read from each file, and what their parameters were.
  3. A tracer that records your GUI interactions, such as which application windows you were viewing at specific times.
  4. A set of plugins that record your activities within specific applications, such as which MATLAB commands you were running and which web pages you were visiting.
  5. A real-time Activity Feed that allows you to view and take notes on your recent activities.
  6. An Activity Context Viewer that displays what else you were reading and writing when hacking on some part of your code.
  7. A Computational Context Viewer that shows how changes to your source code and execution parameters affected your experiment's output files.
  8. A Lab Notebook Generator that creates an HTML summary of your activities over a given time period.
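
To make the provenance tracer (component 2) more concrete, here is a minimal Python sketch of the kind of query it enables. The `Invocation` record, the `producers` and `upstream_files` helpers, and the sample file names are all hypothetical illustrations, not Burrito's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """One program run (hypothetical record, not Burrito's real schema)."""
    argv: tuple        # command and its parameters
    reads: frozenset   # paths the run read from
    writes: frozenset  # paths the run created or modified


def producers(log, path):
    """Return the invocations in `log` that wrote `path`, in log order."""
    return [inv for inv in log if path in inv.writes]


def upstream_files(log, path):
    """Return every file that directly or transitively fed into `path`."""
    seen = set()
    stack = [path]
    while stack:
        current = stack.pop()
        for inv in producers(log, current):
            for src in inv.reads:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
    return seen


# Made-up trace: a cleaning step followed by a plotting step.
log = [
    Invocation(("python", "clean.py", "raw.csv"),
               frozenset({"clean.py", "raw.csv"}),
               frozenset({"clean.csv"})),
    Invocation(("python", "plot.py", "--bins", "20"),
               frozenset({"plot.py", "clean.csv"}),
               frozenset({"graph.png"})),
]

for inv in producers(log, "graph.png"):
    print("graph.png was written by:", " ".join(inv.argv))
print("upstream files:", sorted(upstream_files(log, "graph.png")))
```

Walking the read/write edges backwards like this is one simple way to answer "which script versions and command-line parameters generated this output?" once the tracer has recorded each run.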

Activity Feed

This app is a sidebar anchored to the left side of your Linux desktop background. It displays a real-time stream of your actions and allows you to annotate any action with notes. Since all notes are linked with their original context, you no longer need to worry about creating, organizing, and locating a mess of text files that contain your scattered notes. Here's a screenshot of the Activity Feed in action:

Our Activity Feed's UI is inspired by the Facebook News Feed: New events appear at the top of the feed and push down older ones. It currently displays six types of events:

  1. A Bash command event shows a group of Bash shell commands executed in the same directory. You can click on any command to copy it to the clipboard and paste it into a terminal to re-execute.
  2. A website visit event shows a set of web pages that you just viewed. You can click on any page title to open its link in a web browser.
  3. A file modification event shows a group of files modified by a particular process. For example, saving a source code file in a text editor creates a new file modification event, as does executing a script that generates an output data file.
  4. A digital sketch event shows a thumbnail view of a sketch that you've just drawn using, say, a Wacom pen tablet.
  5. You can create a status update event by entering text in the status text box and pressing the “Post” button. This is the main way for you to describe what you're working on at a given moment, which helps place other events in context.
  6. You can create a checkpoint event by clicking on either the “Happy Face” or “Sad Face” button and then entering a note in the pop-up text box describing why you're happy or sad about the current state of your experiment. A “happy checkpoint” is like making a commit in a version control system, and a “sad checkpoint” is like filing a bug report in a bug tracking system.
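
The feed described above can be modeled roughly as a list of timestamped events that carry their own notes. The sketch below is only an illustration under that assumption; the `Event` and `ActivityFeed` classes and their field names are hypothetical and are not taken from Burrito's code.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    """A single feed entry (hypothetical model)."""
    kind: str        # e.g. "bash", "website", "file", "sketch", "status", "checkpoint"
    when: datetime   # when the event happened
    summary: str     # short description shown in the feed
    notes: list = field(default_factory=list)  # annotations attached to this event


class ActivityFeed:
    """Keeps events and returns them newest first, like a news feed."""

    def __init__(self):
        self._events = []

    def post(self, event):
        self._events.append(event)

    def annotate(self, event, note):
        # The note lives on the event itself, so it stays linked to its context.
        event.notes.append(note)

    def newest_first(self):
        return sorted(self._events, key=lambda e: e.when, reverse=True)


feed = ActivityFeed()
run = Event("bash", datetime(2012, 6, 15, 9, 0), "python plot.py --bins 20")
status = Event("status", datetime(2012, 6, 15, 9, 5), "Trying a bar graph instead")
feed.post(run)
feed.post(status)
feed.annotate(run, "first attempt, bins look too coarse")

for event in feed.newest_first():
    print(event.when, event.kind, event.summary, event.notes)
```

Storing each note on the event it annotates is what removes the need for separate, scattered text files of notes.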

You can click on a file modification event in the feed and select the following actions:

  • Open the version of the chosen file either before or after that modification.
  • Diff multiple old versions of a chosen file by launching the Meld visual diff tool (see screenshot below).
  • Revert the file to the chosen version.
  • Watch the file for changes. The Activity Feed will report a warning if a future modification causes that file to differ from the chosen version. This action creates a simple regression test.
  • View the context surrounding edits to the chosen file (see the next two sections).
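
The "watch" action behaves like a simple regression test: pin one version of a file and warn when a later version differs. One plausible way to sketch that idea is to compare content hashes, as below. The `FileWatch` class and the sample contents are hypothetical; the source does not say how Burrito detects the difference internally.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the file contents."""
    return hashlib.sha256(data).hexdigest()


class FileWatch:
    """Pins one version of a file and flags later versions that differ."""

    def __init__(self, path, pinned_content: bytes):
        self.path = path
        self.expected = fingerprint(pinned_content)

    def check(self, current_content: bytes):
        """Return a warning string if the content changed, else None."""
        if fingerprint(current_content) != self.expected:
            return f"WARNING: {self.path} differs from the watched version"
        return None


watch = FileWatch("results.txt", b"mean=0.42\n")
print(watch.check(b"mean=0.42\n"))  # unchanged content, so no warning
print(watch.check(b"mean=0.40\n"))  # changed content, so a warning string
```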

Here's a screenshot showing the Activity Feed on the left and a visual diff of two file versions on the right:

Activity Context Viewer

This application enables you to answer questions such as, “When I left work last week, I was editing this part of my script and had a collection of reference materials open ... what were they?”

You launch this application with a target text file (e.g., source code) as its argument. The GUI is a table view where each row represents a “version” of the chosen file (determined based on heuristics), displaying these four columns of data:

  • Diffs of this file between the previous and current versions.
  • Resources read while working on this version, including which web pages, documents, and source code you viewed.
  • Resources written while working on this version, including other code that you edited, and checkpoints, status updates, and digital sketches that you created.
  • Annotations that you can add to this file version.
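
As an illustration of the first column, the diff between each pair of consecutive versions can be computed with Python's standard `difflib` module. This is a sketch only: the `version_diffs` helper and the version labels are hypothetical, and the heuristics that Burrito uses to decide what counts as a "version" are not modeled here.

```python
import difflib


def version_diffs(versions):
    """Given (label, text) pairs ordered oldest first, return one
    (label, unified_diff_text) row per version after the first."""
    rows = []
    for (old_label, old_text), (new_label, new_text) in zip(versions, versions[1:]):
        diff_lines = difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
        rows.append((new_label, "\n".join(diff_lines)))
    return rows


versions = [
    ("v1", "x = 1\nprint(x)"),
    ("v2", "x = 2\nprint(x)"),
    ("v3", "x = 2\nprint(x * 10)"),
]

for label, diff_text in version_diffs(versions):
    print(f"== changes leading to {label} ==")
    print(diff_text)
```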

Computational Context Viewer

This application allows you to answer questions such as, “What effects did changes in my source code files and execution parameters have on my experiment's output files?”

You launch this application with a target output file (e.g., a graph generated by a script) as its argument. The GUI displays all versions of the chosen file in reverse chronological order, along with what led to the creation of each version.

Here is a screenshot of the Computational Context Viewer GUI showing three versions of an output graph file and what led to their creation:

The screenshot above shows three versions of an output graph file in reverse chronological order (right column). It also shows the command-line parameters of the executions that produced each graph version (middle column). Finally, it shows diffs in the source code files that, when executed, caused the changes between consecutive graph versions (left column). The first row's diff is the code responsible for highlighting the three center bars in yellow, and the second row's diff is the code responsible for turning the output file from a line graph into a bar graph.
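
The row structure of this view can be sketched as follows: sort the runs that produced the output file, compare each run with the one before it, and list the result newest first. The `Run` record and `context_rows` helper are hypothetical names for illustration, and the sample runs are made up.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One execution that produced a version of the output file (hypothetical)."""
    timestamp: int     # when the run happened (any sortable value)
    argv: list         # command-line parameters of the run
    script_text: str   # contents of the script at the time of the run


def context_rows(runs):
    """Return one row per run, newest first, flagging what changed
    relative to the run immediately before it."""
    ordered = sorted(runs, key=lambda r: r.timestamp)
    rows = []
    previous = None
    for run in ordered:
        rows.append({
            "timestamp": run.timestamp,
            "argv": run.argv,
            "code_changed": previous is not None and run.script_text != previous.script_text,
            "args_changed": previous is not None and run.argv != previous.argv,
        })
        previous = run
    rows.reverse()  # reverse chronological order, as in the viewer
    return rows


runs = [
    Run(1, ["plot.py", "--bins", "10"], "plot(kind='line')"),
    Run(2, ["plot.py", "--bins", "10"], "plot(kind='bar')"),
    Run(3, ["plot.py", "--bins", "20"], "plot(kind='bar')"),
]

for row in context_rows(runs):
    print(row)
```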

Lab Notebook Generator

This final application generates a customizable HTML report that summarizes your activities over a given time period. You can use these reports as the basis for writing papers, tutorials, and theses.
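
A report of this kind amounts to filtering recorded activities by time and rendering them as HTML. The sketch below assumes that events are simple `(datetime, text)` pairs; the `notebook_html` function and the sample events are hypothetical and say nothing about the layout of Burrito's real reports.

```python
from datetime import datetime
from html import escape


def notebook_html(events, start, end):
    """Render events with start <= time < end as a chronological HTML list."""
    selected = sorted((when, text) for when, text in events if start <= when < end)
    items = "\n".join(
        f"  <li><b>{when:%Y-%m-%d %H:%M}</b> {escape(text)}</li>"
        for when, text in selected
    )
    return f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>"


events = [
    (datetime(2012, 6, 15, 9, 5), "Status: trying a bar graph"),
    (datetime(2012, 6, 15, 9, 0), "Ran plot.py with bins < 20"),
    (datetime(2012, 6, 16, 8, 0), "Outside the requested period"),
]

report = notebook_html(events, datetime(2012, 6, 15), datetime(2012, 6, 16))
print(report)
```

Escaping the event text with `html.escape` keeps characters such as `<` in shell commands or notes from breaking the generated page.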


Created: 2012-06-15
Last modified: 2017-10-03