10th anniversary of coming up with my first independent research project idea
Today is July 24, 2019, and exactly ten years ago I came up with the idea for IncPy. By that point I had been working in research labs for over six years as both an undergraduate and graduate student, but this was the very first research project idea that I came up with on my own. I recounted that fateful moment at the end of the Intermission chapter of The Ph.D. Grind:
“And then, on July 24, 2009—halfway through my internship—inspiration suddenly struck. In the midst of writing computer programs in my MSR [Microsoft Research] office to process and analyze data, I came up with the initial spark of an idea that would eventually turn into the first project of my dissertation. I frantically scribbled down pages upon pages of notes and then called my friend Robert to make sure my thoughts weren't totally ludicrous. At the time, I had no sense of whether this idea was going to be taken seriously by the wider academic community, but at least I now had a clear direction to pursue when I returned to Stanford to begin my fourth year.”
I don't want to get too sentimental, but IncPy launched my research career over the past decade. Amongst other things, it:
I could go on and on ... but basically without IncPy I wouldn't have a research career at the moment. I could've probably found some way to finish up my Ph.D. by contributing to other people's research agendas, but I wouldn't have had the skills or motivation to continue doing this for a living after graduating.
Initial Project Notes
I thought it'd be fun to share my notes from that summer day ten years ago when “I frantically scribbled down pages upon pages of notes and then called my friend Robert to make sure my thoughts weren't totally ludicrous.” Looking back, I'm amazed at how clear the idea was in my head even from that inception moment; the final ISSTA paper largely followed this initial pitch. Here it is, unedited from the top of my IncPy project notes file:
Idea conceived on: 2009-07-24
Problem: People who write ad-hoc data analysis scripts (in, say, Python) often need to explicitly save intermediate results to disk in order to have their scripts run quickly when making incremental changes.
Proposed solution: Hack the Python interpreter so that it monitors how long each function executes for and then selectively memoizes the results of expensive but side-effect-free functions to disk, and then uses those cached values on subsequent runs (until the underlying data changes).
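The core of that proposal can be sketched as a decorator that persists function results to disk, keyed by the arguments. This is only an illustrative approximation, not how IncPy actually worked (IncPy lived inside the interpreter and required no annotations at all); the cache directory and the zero-second threshold here are made up for the demo, whereas the real system would only memoize calls that take a long time:

```python
import functools
import hashlib
import os
import pickle
import tempfile
import time

# Hypothetical demo values; IncPy chose these automatically inside the interpreter.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "incpy_cache_demo")
THRESHOLD_SECS = 0.0  # real system: only memoize calls slower than some threshold

def memoize_to_disk(func):
    """Cache results of an (assumed pure) function on disk, keyed by its arguments."""
    @functools.wraps(func)
    def wrapper(*args):
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):
            # cache hit: reuse the memoized result from a previous run
            with open(path, "rb") as f:
                return pickle.load(f)
        start = time.time()
        result = func(*args)
        if time.time() - start >= THRESHOLD_SECS:
            # only worth persisting if the call was expensive
            with open(path, "wb") as f:
                pickle.dump(result, f)
        return result
    return wrapper

@memoize_to_disk
def expensive(n):
    # stand-in for a slow data-processing step
    return sum(i * i for i in range(n))
```

On the second run of a script, `expensive(10)` would be answered from the on-disk cache instead of being recomputed.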
Target user audience:
What this project is NOT:
Claims of effectiveness:
Possible next steps:
Motivation: I've written lots of ad-hoc data analysis scripts with basically the following workflow:
[input file] -> parse file -> do processing -> [output result]
now the 'do processing' step might be sophisticated, so oftentimes I want to serialize intermediate results to disk so that when I re-run my script, I don't have to process the entire base file all over again. e.g.,:
[input file] -> parse file -> process 1 -> [intermediate file] -> process 2 -> [output result]
now I might have multiple input files, multiple intermediate stages, etc., and soon this starts getting really annoying. what i would really like to do is to write a straight-up procedural Python program that does ALL the processing from input to final output and NOT HAVE TO EXPLICITLY SAVE AND RESTORE INTERMEDIATE FILES. the interpreter should be smart enough to realize when it can memoize intermediate results to disk (and later read them back). in the general case, this is a HARD problem, but i think that it's quite doable if we restrict our domain.
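The explicit save/restore boilerplate being complained about might look something like this sketch (the filenames and processing stages are hypothetical stand-ins):

```python
import os
import pickle
import tempfile

# hypothetical intermediate file that the programmer manages by hand
INTERMEDIATE = os.path.join(tempfile.gettempdir(), "stage1.pkl")

def parse_and_process1(lines):
    # stand-in for an expensive first processing stage
    return [len(line) for line in lines]

def process2(values):
    # stand-in for a cheap second stage
    return sum(values)

def run(lines):
    # the manual caching boilerplate that clutters every such script:
    if os.path.exists(INTERMEDIATE):
        with open(INTERMEDIATE, "rb") as f:
            stage1 = pickle.load(f)
    else:
        stage1 = parse_and_process1(lines)
        with open(INTERMEDIATE, "wb") as f:
            pickle.dump(stage1, f)
    return process2(stage1)
```

Multiply this pattern by several input files and several intermediate stages and it quickly dominates the script; the point of the proposal is that the interpreter could do all of this bookkeeping automatically.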
let's imagine memoizing at a function level:
def foo():
    a = bar()
    b = baz()
    return a + b

def bar():
    <pure, no dependencies>

def baz():
    for line in open('data.txt'):
        ...
    return <something>
When the interpreter executes 'x = foo()' for the first time, foo(), bar(), and baz() must all be run. The return values are all memoized to disk (if they had parameters, their parameter values would also be memoized). Now if I execute the program again and data.txt has changed (i.e., I updated my original dataset), then when I run 'x = foo()', it will NOT need to run bar() again since it can use the memoized value, but it WILL need to run baz() again since data.txt has changed.
(Note that it's only worth memoizing if we observe that a function takes a LONG TIME to run. How long? Maybe longer than 10 seconds? Otherwise, memoizing isn't really worthwhile and can introduce extra disk read/write overhead, which can easily dominate actual execution time.)
If I modify any function, then its result will have to be recomputed.
This is really like using Makefiles to minimize the amount of source files that need to be compiled and linked, except that we are operating on the granularity of individual Python functions. We must solve the sequence of data dependency constraints to execute a conservative over-approximation of the functions.
Note that this formulation requires the programmers to at least use the function as a level of abstraction. If he/she simply wrote all the computation together in one large function, then we can't memoize anything.
Also, note that we can't memoize unless we can prove that the function is pure. That might be tricky to do, but I suspect that lots of functions written for data analysis scripts are pure (or nearly pure).
Last modified: 2019-07-24