Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

IncPy: Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting

research paper summary
Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting. Philip J. Guo and Dawson Engler. International Symposium on Software Testing and Analysis (ISSTA), 2011.
Programmers across a wide range of disciplines (e.g., bioinformatics, neuroscience, econometrics, finance, data mining, information retrieval, machine learning) write scripts to parse, transform, process, and extract insights from data. To speed up iteration times, they split their analyses into stages and write extra code to save the intermediate results of each stage to files so that those results do not have to be re-computed in every subsequent run. As they explore and refine hypotheses, their scripts often create and process lots of intermediate data files. They need to properly manage the myriad of dependencies between their code and data files, or else their analyses will produce incorrect results.

To enable programmers to iterate quickly without needing to manage intermediate data files, we added a set of dynamic analyses to the programming language interpreter so that it automatically memoizes (caches) the results of long-running pure function calls to disk, manages dependencies between code and on-disk data, and later re-uses memoized results, rather than re-executing those functions, when guaranteed safe to do so. We created an implementation for Python and show how it enables programmers to iterate faster on their data analysis scripts while writing less code and not having to manage dependencies between their code and datasets.
@inproceedings{GuoIncPy2011,
 author = {Guo, Philip J. and Engler, Dawson},
 title = {Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting},
 booktitle = {Proceedings of the 2011 International Symposium on Software Testing and Analysis},
 series = {ISSTA '11},
 year = {2011},
 isbn = {978-1-4503-0562-4},
 location = {Toronto, Ontario, Canada},
 pages = {287--297},
 numpages = {11},
 url = {http://doi.acm.org/10.1145/2001420.2001455},
 doi = {10.1145/2001420.2001455},
 acmid = {2001455},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {dependency management, scientific workflows},
}
Towards Practical Incremental Recomputation for Scientists: An Implementation for the Python Language. Philip J. Guo and Dawson Engler. USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2010.
Computational scientists often prototype data analysis scripts using high-level languages like Python. To speed up execution times, they manually refactor their scripts into stages (separate functions) and write extra code to save intermediate results to disk in order to avoid re-computing them in subsequent runs. To eliminate this burden, we enhanced the Python interpreter to automatically memoize (save) the results of long-running function executions to disk, manage dependencies between code edits and saved results, and re-use memoized results rather than re-executing those functions when guaranteed safe to do so. There is a ~20% run-time slowdown during the initial run, but subsequent runs can speed up by several orders of magnitude. Using our enhanced interpreter, scientists can write simple and maintainable code that also runs fast after minor edits, without having to learn any new programming languages or constructs.
@inproceedings{GuoTaPP2010,
 author = {Guo, Philip J. and Engler, Dawson},
 title = {Towards Practical Incremental Recomputation for Scientists: An Implementation for the {Python} Language},
 booktitle = {Proceedings of the 2nd Workshop on the Theory and Practice of Provenance},
 series = {TAPP'10},
 year = {2010},
 location = {San Jose, California},
 url = {http://dl.acm.org/citation.cfm?id=1855795.1855801},
 acmid = {1855801},
 publisher = {USENIX Association},
 address = {Berkeley, CA, USA},
}

(This summary was adapted from the IncPy project home page, which was created in early 2010.)

IncPy (Incremental Python) is an enhanced Python interpreter that speeds up script execution times by automatically memoizing (caching) the results of long-running function calls and then re-using those results, rather than re-computing them, when it is safe to do so.

When you first run your script with the IncPy interpreter, it might be ~20% slower since IncPy needs to determine which functions are safe and worthwhile to memoize, what their dependencies are, and then memoize their results to a persistent cache. After you make some edits to your script, subsequent runs can be much faster since IncPy can skip calls to functions whose dependencies are still satisfied and load their results directly from the cache. That way, IncPy can greatly speed up your iteration and debugging cycles.
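To make the mechanism concrete, here is a minimal hand-rolled sketch of persistent memoization with code-dependency tracking, written in modern Python. All of the names in it are hypothetical, and it is only an illustration of the idea; IncPy's actual implementation does this work transparently inside the interpreter, with no decorator required:

    import functools
    import hashlib
    import inspect
    import os
    import pickle

    CACHE_DIR = ".memo_cache"

    def persistent_memoize(func):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Hashing the function's source text means that editing the code
        # invalidates its old cache entries, a crude analogue of IncPy's
        # dependency tracking.
        src = inspect.getsource(func).encode("utf-8")
        src_hash = hashlib.sha1(src).hexdigest()

        @functools.wraps(func)
        def wrapper(*args):
            key = hashlib.sha1(pickle.dumps((src_hash, args))).hexdigest()
            path = os.path.join(CACHE_DIR, "%s_%s.pkl" % (func.__name__, key))
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip re-execution
            result = func(*args)           # cache miss: run the function
            with open(path, "wb") as f:
                pickle.dump(result, f)     # persist the result for future runs
            return result

        return wrapper

    @persistent_memoize
    def summarize(filename):
        # stand-in for a long-running pure function
        return sum(len(line) for line in open(filename))

Unlike this sketch, IncPy also decides automatically which calls are pure and long-running enough to be worth memoizing, and it tracks the files a function reads and writes as dependencies.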

IncPy is designed to be a drop-in replacement for the Python 2.6 interpreter, so it should work seamlessly with all of your existing scripts and 3rd-party libraries. You don't need to learn any new language features or programming idioms to get its benefits.

How can IncPy be useful for me?

If you've written Python scripts that run for at least a few minutes, then you've probably encountered the following dilemma:

  1. Simple code, but slow runs: If you keep your code relatively simple, then it takes unnecessarily long to re-execute after minor edits, since the Python interpreter re-runs your entire script, even the parts that your edits did not affect.
  2. Complicated code + temp data files, but faster runs: If you write extra caching code to save intermediate results to disk (and later load them back), then subsequent runs of your script can be much faster. However, your code is now more complicated, and you must manage the temporary data files that your script generates (see the sketch after this list).
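
Concretely, option 2 means writing staging boilerplate by hand. Here is a hypothetical sketch of that pattern (the file and function names are made up for illustration); note how easily the cached file can silently go stale:

    import os
    import pickle

    def parse_raw_data(path):
        # stand-in for a slow parsing stage
        return [line.strip() for line in open(path)]

    def load_records():
        # Hand-written caching boilerplate: reuse a prior run's output
        # if it exists; otherwise compute it and save it for next time.
        if os.path.exists("stage1.pkl"):
            with open("stage1.pkl", "rb") as f:
                return pickle.load(f)
        records = parse_raw_data("input.log")
        with open("stage1.pkl", "wb") as f:
            pickle.dump(records, f)
        return records

    # Bug hazard: if parse_raw_data() or input.log later changes,
    # stage1.pkl becomes silently stale, which is exactly the kind of
    # consistency bug that IncPy's dependency tracking prevents.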

As your project progresses, you might end up writing a collection of ad-hoc scripts, each reading some input files, munging the data, and writing intermediate results out to temporary files that other scripts then read and munge.

For instance, a diagram of the Makefile that I created during a summer internship, in which I wrote dozens of Python scripts to munge software bug database and employee personnel data, illustrates this pattern. In that diagram, each rectangle represents a Python script, each ellipse represents a data file (shaded ellipses represent the final results of my analyses), and each arrow shows a script either reading from or writing to a data file. To speed up execution times, I re-factored my scripts to load and save several layers of intermediate datasets (white ellipses), so that I could tweak portions of my analyses without waiting for the entire workflow to re-execute. As a consequence, my code got more bloated, and I also had to keep track of over a dozen intermediate data files. This experience made me realize that an enhanced Python interpreter could do all of this caching and dependency management automatically, so I set out to create IncPy.

By running your scripts with IncPy rather than the regular Python interpreter, you can keep your code simple while still getting the benefits of faster execution times. In particular:

  • You can potentially edit and debug your code, re-execute it, and see new results in a few seconds rather than waiting for minutes (or even hours)
  • You no longer need to write code to cache your intermediate results to temporary data files, which simplifies your code base and eliminates possible data consistency bugs
  • You no longer need to manually track dependencies between intermediate datasets and the code and data that created them. No more asking yourself, "Oh wait, is that dataset outdated now that I've updated this other one? Which script should I run to re-create this dataset from that one?"

How does IncPy differ from other approaches to memoization?

  • IncPy is fully automatic, so you don't need to figure out which functions are safe and worthwhile to memoize.
  • IncPy provides an on-disk persistent cache, which can be shared across executions of the same script or related scripts that call shared functions; contrast this with ordinary in-memory memoization, sketched after this list.
  • IncPy guarantees correctness by tracking dependencies between cached results and the code and data that produced them and clearing cache entries when dependencies are broken, so you don't need to manage the intermediate datasets.
  • IncPy is designed for a general-purpose imperative programming language, with an emphasis on having a low run-time overhead, in contrast to related work on automatic memoization for functional and domain-specific languages.
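
For comparison, ordinary in-memory memoization in modern Python looks like the sketch below (functools.lru_cache arrived in Python 3.2, after IncPy, which targets Python 2.6). Its cache lives only as long as one process, so every fresh run of a script starts cold; the cross-run case is precisely what IncPy's persistent on-disk cache handles:

    import functools

    @functools.lru_cache(maxsize=None)
    def tokenize(text):
        return tuple(text.split())

    tokenize("a b c")  # computed
    tokenize("a b c")  # served from the in-memory cache
    # Once the process exits, the cache is gone; the next run of the
    # script recomputes everything from scratch.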

Can you show me a quick demo?

Sure: a 6-minute screencast on the IncPy project home page demonstrates some of IncPy's basic capabilities.


Read the full conference and workshop papers, cited above, for details.
