Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Philip Guo

(a.k.a. Philip J. Guo, Philip Jia Guo)

Professional Biography

In Fall 2014, Philip will start as an assistant professor of computer science at the University of Rochester. His main research interests are in human-computer interaction, online education, learning technologies, and software engineering.

In addition to research, Philip is passionate about building software tools for online education. In 2010, he created a free web-based tool for learning programming called Online Python Tutor (pythontutor.com), which over 200,000 people have used so far.

Philip received a Ph.D. in Computer Science from Stanford University in 2012 and S.B. and M.Eng. degrees in Electrical Engineering and Computer Science from MIT in 2006.

His Ph.D. dissertation was one of the first to identify the unique software needs of computational scientists and to develop five new tools to address those needs. One of those tools, CDE, has around 10,000 users. Most of his software is open-source on GitHub (github.com/pgbovine/).

In 2012, Philip wrote a popular free e-book called The Ph.D. Grind (phdgrind.com), which is the first known detailed account of an entire Ph.D. experience. So far, over 100,000 people have downloaded it, and hundreds of readers have sent him heartfelt email responses. He also writes a monthly blog column for the Communications of the ACM, and his personal website (pgbovine.net) gets over 250,000 visitors per year.

Awards and Honors

  • National Science Foundation (NSF) Graduate Fellowship (2009 – 2011)

  • National Defense Science and Engineering (NDSEG) Graduate Fellowship (2006 – 2009)

  • MIT Charles and Jennifer Johnson Thesis Award for Outstanding Computer Science Master of Engineering Thesis (2006)

  • Software Engineering In Practice Best Paper Award for Characterizing and Predicting Which Bugs Get Reopened

  • ACM SIGSOFT Distinguished Paper Award for HAMPI: A Solver for String Constraints

  • CHI Best Paper Honorable Mention for Two Studies of Opportunistic Programming: Interleaving Web Foraging, Learning, and Writing Code

Theses

  • Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, Department of Computer Science, Stanford University, May 2012.
    [Show Abstract | PDF]

    Research programming is a type of programming activity where the goal is to write computer programs to obtain insights from data. Millions of professionals in fields ranging from science, engineering, business, finance, public policy, and journalism, as well as numerous students and computer hobbyists, all perform research programming on a daily basis.

    My thesis is that by understanding the unique challenges faced during research programming, it becomes possible to apply techniques from dynamic program analysis, mixed-initiative recommendation systems, and OS-level tracing to make research programmers more productive.

    This dissertation characterizes the research programming process, describes typical challenges faced by research programmers, and presents five software tools that I have developed to address some key challenges. 1.) Proactive Wrangler is an interactive graphical tool that helps research programmers reformat and clean data prior to analysis. 2.) IncPy is a Python interpreter that speeds up the data analysis scripting cycle and helps programmers manage code and data dependencies. 3.) SlopPy is a Python interpreter that automatically makes existing scripts error-tolerant, thereby also speeding up the data analysis scripting cycle. 4.) Burrito is a Linux-based system that helps programmers organize, annotate, and recall past insights about their experiments. 5.) CDE is a software packaging tool that makes it easy to deploy, archive, and share research code.

    Taken together, these five tools enable research programmers to iterate and potentially discover insights faster by offloading the burdens of data management and provenance to the computer.
  • Philip J. Guo. A Scalable Mixed-Level Approach to Dynamic Analysis of C and C++ Programs. Master of Engineering thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 2006. (Charles and Jennifer Johnson Thesis Award for Outstanding Computer Science M.Eng. Thesis)
    [Show Abstract | PDF | Webpage]

    This thesis addresses the difficult task of constructing robust and scalable dynamic program analysis tools for programs written in memory-unsafe languages such as C and C++, especially those that are interested in observing the contents of data structures at run time. In this thesis, I first introduce my novel mixed-level approach to dynamic analysis, which combines the advantages of both source- and binary-based approaches. Second, I present a tool framework that embodies the mixed-level approach. This framework provides memory safety guarantees, allows tools built upon it to access rich source- and binary-level information simultaneously at run time, and enables tools to scale to large, real-world C and C++ programs on the order of millions of lines of code. Third, I present two dynamic analysis tools built upon my framework—one for performing value profiling and the other for performing dynamic inference of abstract types—and describe how they far surpass previous analyses in terms of scalability, robustness, and applicability. Lastly, I present several case studies demonstrating how these tools aid both humans and automated tools in several program analysis tasks: improving human understanding of unfamiliar code, invariant detection, and data structure repair.

Publications

In computer science, conferences (not journals) are often the primary publication venues for new research; workshops are for discussing works-in-progress.

Here is my Google Scholar profile.

Projects where I was the lead author:

  • Philip J. Guo. Online Python Tutor: Embeddable Web-Based Program Visualization for CS Education. In Proceedings of the ACM Technical Symposium on Computer Science Education (SIGCSE), March 2013. [Show Abstract | PDF | Webpage]

    This paper presents Online Python Tutor, a web-based program visualization tool for Python, which is becoming a popular language for teaching introductory CS courses. Using this tool, teachers and students can write Python programs directly in the web browser (without installing any plugins), step forwards and backwards through execution to view the run-time state of data structures, and share their program visualizations on the web.

    In the past three years, over 200,000 people have used Online Python Tutor to visualize their programs. In addition, instructors in a dozen universities such as UC Berkeley, MIT, the University of Washington, and the University of Waterloo have used it in their CS1 courses. Finally, Online Python Tutor visualizations have been embedded within three web-based digital Python textbook projects, which collectively attract around 16,000 viewers per month and are being used in at least 25 universities. Online Python Tutor is free and open source software, available at pythontutor.com
  • Philip J. Guo. CDE: Automatically Package and Reproduce Computational Experiments. Invited book chapter in Implementing Reproducible Computational Research, 2013 [PDF coming soon]

  • Philip J. Guo and Margo Seltzer. Burrito: Wrapping Your Lab Notebook in Computational Infrastructure. In Proceedings of the USENIX Workshop on the Theory and Practice of Provenance (TaPP), June 2012.
    [Show Abstract | PDF | Webpage]

    Researchers in fields such as bioinformatics, CS, finance, and applied math have trouble managing the numerous code and data files generated by their computational experiments, comparing the results of trials executed with different parameters, and keeping up-to-date notes on what they learned from past successes and failures.

    We created a Linux-based system called Burrito that automates aspects of this tedious experiment organization and notetaking process, thus freeing researchers to focus on more substantive work. Burrito automatically captures a researcher's computational activities and provides user interfaces to annotate the captured provenance with notes and then make queries such as, "Which script versions and command-line parameters generated the output graph that this note refers to?"
  • Philip J. Guo. CDE: Run Any Linux Application On-Demand Without Installation. In Proceedings of the 2011 USENIX Large Installation System Administration Conference (LISA), December 2011.
    [Show Abstract | PDF | Webpage]

    There is a huge ecosystem of free software for Linux, but since each Linux distribution (distro) contains a different set of pre-installed shared libraries, filesystem layout conventions, and other environmental state, it is difficult to create and distribute software that works without hassle across all distros. Online forums and mailing lists are filled with discussions of users' troubles with compiling, installing, and configuring Linux software and their myriad of dependencies. To address this ubiquitous problem, we have created an open-source tool called CDE that automatically packages up the Code, Data, and Environment required to run a set of x86-Linux programs on other x86-Linux machines. Creating a CDE package is as simple as running the target application under CDE's monitoring, and executing a CDE package requires no installation, configuration, or root permissions. CDE enables Linux users to instantly run any application on-demand without encountering "dependency hell".

    This paper subsumes the following other papers:

    • Philip J. Guo. CDE: A Tool For Creating Portable Experimental Software Packages. In Computing in Science and Engineering: Special Issue on Software for Reproducible Computational Science, Jul/Aug 2012.
      [Show Abstract | PDF]

      One technical barrier to reproducible computational science is that it's hard to distribute scientific code in a form that other researchers can easily execute on their own computers. To help eliminate this barrier, the CDE tool packages all software dependencies required to rerun Linux-based computational experiments on other computers.

    • Philip J. Guo and Dawson Engler. CDE: Using System Call Interposition to Automatically Create Portable Software Packages. In Proceedings of the 2011 USENIX Annual Technical Conference (short paper), June 2011.
      [Show Abstract | PDF | Extended PDF]

      It can be painfully hard to take software that runs on one person's machine and get it to run on another machine. Online forums and mailing lists are filled with discussions of users' troubles with compiling, installing, and configuring software and their myriad of dependencies. To eliminate this dependency problem, we created a system called CDE that uses system call interposition to monitor the execution of x86-Linux programs and package up the Code, Data, and Environment required to run them on other x86-Linux machines. Creating a CDE package is completely automatic, and running programs within a package requires no installation, configuration, or root permissions. Hundreds of people in both academia and industry have used CDE to distribute software, demo prototypes, make their scientific experiments reproducible, run software natively on older Linux distributions, and deploy experiments to compute clusters.

  • Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, Jeffrey Heer. Proactive Wrangling: Mixed-Initiative End-User Programming of Data Transformation Scripts. In Proceedings of the 2011 Symposium on User Interface Software and Technology (UIST), October 2011.
    [Show Abstract | PDF]

    Analysts regularly wrangle data into a form suitable for computational tools through a tedious process that delays more substantive analysis. While interactive tools can assist data transformation, analysts must still conceptualize the desired output state, formulate a transformation strategy, and specify complex transforms. We present a model to proactively suggest data transforms which map input data to a relational format expected by analysis tools. To guide search through the space of transforms, we propose a metric that scores tables according to type homogeneity, sparsity and the presence of delimiters. When compared to "ideal" hand-crafted transformations, our model suggests over half of the needed steps; in these cases the top-ranked suggestion is preferred 77% of the time. User study results indicate that suggestions produced by our model can assist analysts' transformation tasks, but that users do not always value proactive assistance, instead preferring to maintain the initiative. We discuss some implications of these results for mixed-initiative interfaces.
  • Philip J. Guo and Dawson Engler. Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting. In Proceedings of the 2011 International Symposium on Software Testing and Analysis (ISSTA), July 2011.
    [Show Abstract | PDF | Webpage]

    Programmers across a wide range of disciplines (e.g., bioinformatics, neuroscience, econometrics, finance, data mining, information retrieval, machine learning) write scripts to parse, transform, process, and extract insights from data. To speed up iteration times, they split their analyses into stages and write extra code to save the intermediate results of each stage to files so that those results do not have to be re-computed in every subsequent run. As they explore and refine hypotheses, their scripts often create and process lots of intermediate data files. They need to properly manage the myriad of dependencies between their code and data files, or else their analyses will produce incorrect results.

    To enable programmers to iterate quickly without needing to manage intermediate data files, we added a set of dynamic analyses to the programming language interpreter so that it automatically memoizes (caches) the results of long-running pure function calls to disk, manages dependencies between code and on-disk data, and later re-uses memoized results, rather than re-executing those functions, when guaranteed safe to do so. We created an implementation for Python and show how it enables programmers to iterate faster on their data analysis scripts while writing less code and not having to manage dependencies between their code and datasets.

    This paper subsumes our earlier workshop paper:

    • Philip J. Guo and Dawson Engler. Towards Practical Incremental Recomputation for Scientists: An Implementation for the Python Language. In Proceedings of the USENIX Workshop on the Theory and Practice of Provenance (TaPP), February 2010.
      [Show Abstract | PDF]
      Computational scientists often prototype data analysis scripts using high-level languages like Python. To speed up execution times, they manually refactor their scripts into stages (separate functions) and write extra code to save intermediate results to disk in order to avoid recomputing them in subsequent runs. To eliminate this burden, we enhanced the Python interpreter to automatically memoize (save) the results of long-running function executions to disk, manage dependencies between code edits and saved results, and re-use memoized results rather than re-executing those functions when guaranteed safe to do so. There is a ~20% run-time slowdown during the initial run, but subsequent runs can speed up by several orders of magnitude. Using our enhanced interpreter, scientists can write simple and maintainable code that also runs fast after minor edits, without having to learn any new programming languages or constructs.
  • Philip J. Guo. Sloppy Python: Using Dynamic Analysis to Automatically Add Error Tolerance to Ad-Hoc Data Processing Scripts. In Proceedings of the 2011 International Workshop on Dynamic Analysis (WODA), co-located with ISSTA, July 2011.
    [Show Abstract | PDF | Webpage]

    Programmers and data analysts get frustrated when their long-running data processing scripts crash without producing results, due to either bugs in their code or inconsistencies in data sources. To alleviate this frustration, we developed a dynamic analysis technique that guarantees scripts will never crash: It converts all uncaught exceptions into special NA (Not Available) objects and continues executing rather than crashing. Thus, imperfect scripts will run to completion and produce partial results and an error log, which is more informative than simply crashing with no results. We implemented our technique as a "Sloppy" Python interpreter that automatically adds error tolerance to existing scripts without any programmer effort or run-time slowdown.
  • Philip J. Guo, Thomas Zimmermann, Nachiappan Nagappan, Brendan Murphy. "Not My Bug!" and Other Reasons for Software Bug Report Reassignments. In Proceedings of the 2011 ACM Conference on Computer Supported Cooperative Work (CSCW), March 2011.
    [Show Abstract | PDF]

    Bug reporting/fixing is an important social part of the software development process. The bug-fixing process inherently has strong inter-personal dynamics at play, especially in how to find the optimal person to handle a bug report. Bug report reassignments, which are a common part of the bug-fixing process, have rarely been studied.

    In this paper, we present a large-scale quantitative and qualitative analysis of the bug reassignment process in the Microsoft Windows Vista operating system project. We quantify social interactions in terms of both useful and harmful reassignments. For instance, we found that reassignments are useful to determine the best person to fix a bug, contrary to the popular opinion that reassignments are always harmful. We categorized five primary reasons for reassignments: finding the root cause, determining ownership, poor bug report quality, hard to determine proper fix, and workload balancing. We then use these findings to make recommendations for the design of more socially-aware bug tracking systems that can overcome some of the inefficiencies we observed in our study.
  • Philip J. Guo, Thomas Zimmermann, Nachiappan Nagappan, Brendan Murphy. Characterizing and Predicting Which Bugs Get Fixed: An Empirical Study of Microsoft Windows. In Proceedings of the 2010 IEEE International Conference on Software Engineering (ICSE), May 2010.
    [Show Abstract | PDF]

    We performed an empirical study to characterize factors that affect which bugs get fixed in Windows Vista and Windows 7, focusing on factors related to bug report edits and relationships between people involved in handling the bug. We found that bugs reported by people with better reputations were more likely to get fixed, as were bugs handled by people on the same team and working in geographical proximity. We reinforce these quantitative results with survey feedback from 358 Microsoft employees who were involved in Windows bugs. Survey respondents also mentioned additional qualitative influences on bug fixing, such as the importance of seniority and interpersonal skills of the bug reporter.

    Informed by these findings, we built a statistical model to predict the probability that a new bug will be fixed (the first known one, to the best of our knowledge). We trained it on Windows Vista bugs and got a precision of 68% and recall of 64% when predicting Windows 7 bug fixes. Engineers could use such a model to prioritize bugs during triage, to estimate developer workloads, and to decide which bugs should be closed or migrated to future product versions.
  • Philip J. Guo and Dawson Engler. Linux Kernel Developer Responses to Static Analysis Bug Reports. In Proceedings of the 2009 USENIX Annual Technical Conference (short paper), June 2009.
    [Show Abstract | PDF]

    We present a study of how Linux kernel developers respond to bug reports issued by a static analysis tool. We found that developers prefer to triage reports in younger, smaller, and more actively-maintained files, first address easy-to-fix bugs and defer difficult (but possibly critical) bugs, and triage bugs in batches rather than individually. Also, although automated tools cannot find many types of bugs, they can be effective at directing developers attentions towards parts of the codebase that contain up to 3X more user-reported bugs.

    Our insights into developer attitudes towards static analysis tools allow us to make suggestions for improving their usability and effectiveness. We feel that it could be effective to run static analysis tools continuously while programming and before committing code, to rank reports so that those most likely to be triaged are shown to developers first, to show the easiest reports to new developers, to perform deeper analysis on more actively-maintained code, and to use reports as indirect indicators of code quality and importance.
  • Philip J. Guo, Jeff H. Perkins, Stephen McCamant, Michael D. Ernst. Dynamic Inference of Abstract Types. In Proceedings of the 2006 International Symposium on Software Testing and Analysis (ISSTA), July 2006.
    [Show Abstract | PDF]

    An abstract type groups variables that are used for related purposes in a program. We describe a dynamic unification-based analysis for inferring abstract types. Initially, each run-time value gets a unique abstract type. A run-time interaction among values indicates that they have the same abstract type, so their abstract types are unified. Also at run time, abstract types for variables are accumulated from abstract types for values. The notion of interaction may be customized, permitting the analysis to compute finer or coarser abstract types; these different notions of abstract type are useful for different tasks. We have implemented the analysis for compiled x86 binaries and for Java bytecodes. Our experiments indicate that the inferred abstract types are useful for program comprehension, improve both the results and the run time of a follow-on program analysis, and are more precise than the output of a comparable static analysis, without suffering from overfitting.

Projects where I was an assistant:

  • Thomas Zimmermann, Nachiappan Nagappan, Philip J. Guo, Brendan Murphy. Characterizing and Predicting Which Bugs Get Reopened. In Proceedings of the 34th ACM/IEEE International Conference on Software Engineering, Software Engineering In Practice (SEIP) track, June 2012. (SEIP Best Paper Award)
    [Show Abstract | PDF]

    Fixing bugs is an important part of the software development process. An underlying aspect is the effectiveness of fixes: if a fair number of fixed bugs are reopened, it could indicate instability in the software system. To the best of our knowledge there has been on little prior work on understanding the dynamics of bug reopens. Towards that end, in this paper, we characterize when bug reports are reopened by using the Microsoft Windows operating system project as an empirical case study. Our analysis is based on a mixed-methods approach. First, we categorize the primary reasons for reopens based on a survey of 358 Microsoft employees. We then reinforce these results with a large-scale quantitative study of Windows bug reports, focusing on factors related to bug report edits and relationships between people involved in handling the bug. Finally, we build statistical models to describe the impact of various metrics on reopening bugs ranging from the reputation of the opener to how the bug was found.
  • Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, Michael D. Ernst. HAMPI: A Solver for String Constraints. In Proceedings of the 2009 International Symposium on Software Testing and Analysis (ISSTA), July 2009. (ACM SIGSOFT Distinguished Paper Award)
    [Show Abstract | PDF]

    Many automatic testing, analysis, and verification techniques for programs can be effectively reduced to a constraint-generation phase followed by a constraint-solving phase. This separation of concerns often leads to more effective and maintainable tools. The increasing efficiency of off-the-shelf constraint solvers makes this approach even more compelling. However, there are few, if any, effective and sufficiently expressive off-the-shelf solvers for string constraints generated by analysis techniques for string-manipulating programs.

    We designed and implemented HAMPI, a solver for string constraints over bounded string variables. HAMPI constraints express membership in regular languages and bounded context-free languages. HAMPI constraints may contain context-free-language definitions, regular-language definitions and operations, and the membership predicate. Given a set of constraints, HAMPI outputs a string that satisfies all the constraints, or reports that the constraints are unsatisfiable.

    HAMPI is expressive and efficient, and can be successfully applied to testing and analysis of real programs. Our experiments use HAMPI in: static and dynamic analyses for finding SQL injection vulnerabilities in Web applications; automated bug finding in C programs using systematic testing; and compare HAMPI with another string solver. HAMPI's source code, documentation, and the experimental data are available at http://people.csail.mit.edu/akiezun/hampi/.
  • Adam Kiezun, Philip J. Guo, Karthick Jayaraman, Michael D. Ernst. Automatic Creation of SQL Injection and Cross-site Scripting Attacks. In Proceedings of the 2009 IEEE International Conference on Software Engineering (ICSE), May 2009.
    [Show Abstract | PDF]

    We present a technique for finding security vulnerabilities in Web applications. SQL Injection (SQLI) and cross-site scripting (XSS) attacks are widespread forms of attack in which the attacker crafts the input to the application to access or modify user data and execute malicious code. In the most serious attacks (called second-order, or persistent, XSS), an attacker can corrupt a database so as to cause subsequent users to execute malicious code.

    This paper presents an automatic technique for creating inputs that expose SQLI and XSS vulnerabilities. The technique generates sample inputs, symbolically tracks taints through execution (including through database accesses), and mutates the inputs to produce concrete exploits. Ours is the first analysis of which we are aware that precisely addresses second-order XSS attacks.

    Our technique creates real attack vectors, has few false positives, incurs no runtime overhead for the deployed application, works without requiring modification of application code, and handles dynamic programming-language constructs. We implemented the technique for PHP, in a tool Ardilla. We evaluated Ardilla on five PHP applications and found 68 previously unknown vulnerabilities (23 SQLI, 33 first-order XSS, and 12 second-order XSS).
  • Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, Scott R. Klemmer. Two Studies of Opportunistic Programming: Interleaving Web Foraging, Learning, and Writing Code. In Proceedings of the 2009 ACM Conference on Human Factors in Computing Systems (CHI), April 2009. (Best Paper Honorable Mention)
    [Show Abstract | PDF]

    This paper investigates the role of online resources in problem solving. We look specifically at how programmers—an exemplar form of knowledge workers—opportunistically interleave Web foraging, learning, and writing code. We describe two studies of how programmers use online resources. The first, conducted in the lab, observed participants Web use while building an online chat room. We found that programmers leverage online resources with a range of intentions: They engage in just-in-time learning of new skills and approaches, clarify and extend their existing knowledge, and remind themselves of details deemed not worth remembering. The results also suggest that queries for different purposes have different styles and durations. Do programmers' queries "in the wild" have the same range of intentions, or is this result an artifact of the particular lab setting? We analyzed a month of queries to an online programming portal, examining the lexical structure, refinements made, and result pages visited. Here we also saw traits that suggest the Web is being used for learning and reminding. These results contribute to a theory of online resource usage in programming, and suggest opportunities for tools to facilitate online knowledge work.

    This paper subsumes the following other papers:

    • Joel Brandt, Philip J. Guo, Joel Lewenstein, Scott R. Klemmer. Opportunistic Programming: How Rapid Ideation and Prototyping Occur in Practice. In Workshop on End-User Software Engineering (WEUSE), co-located with the International Conference on Software Engineering (ICSE), May 2008. [PDF]

    • Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, Scott R. Klemmer. Opportunistic Programming: Writing Code to Prototype, Ideate, and Discover. In IEEE Software Volume 26, Issue 5: Special issue on End-User Software Engineering, Sep/Oct 2009. [PDF]

    • Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, Scott R. Klemmer. How the Web Helps People Turn Ideas Into Code. Book chapter in No Code Required: Giving Users Tools to Transform the Web, edited by A Cypher, M. Dontcheva, T. Lau, J. Nichols. Morgan Kaufmann, 2010.

  • Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, Chen Xiao. The Daikon system for dynamic detection of likely invariants. In Science of Computer Programming, 2007.
    [Show Abstract | PDF]

    Daikon is an implementation of dynamic detection of likely invariants; that is, the Daikon invariant detector reports likely program invariants. An invariant is a property that holds at a certain point or points in a program; these are often used in assert statements, documentation, and formal specifications. Examples include being constant (x = a), non-zero (x != 0), being in a range (a <= x <= b), linear relationships (y = ax+b), ordering (x <= y), functions from a library (x = fn(y)), containment (x E y), sortedness (x is sorted), and many more. Users can extend Daikon to check for additional invariants.

    Dynamic invariant detection runs a program, observes the values that the program computes, and then reports properties that were true over the observed executions. Dynamic invariant detection is a machine learning technique that can be applied to arbitrary data. Daikon can detect invariants in C, C++, Java, and Perl programs, and in record-structured data sources; it is easy to extend Daikon to other applications.

    Invariants can be useful in program understanding and a host of other applications. Daikon's output has been used for generating test cases, predicting incompatibilities in component integration, automating theorem proving, repairing inconsistent data structures, and checking the validity of data streams, among other tasks.

    Daikon is freely available in source and binary form, along with extensive documentation, at http://pag.csail.mit.edu/daikon/.
  • Brian Demsky, Michael D. Ernst, Philip J. Guo, Stephen McCamant, Jeff H. Perkins, Martin Rinard. Automatic Inference and Enforcement of Data Structure Consistency Specifications. In Proceedings of the 2006 International Symposium on Software Testing and Analysis (ISSTA), July 2006.
    [Show Abstract | PDF]

    Corrupt data structures are an important cause of unacceptable program execution. Data structure repair (which eliminates inconsistencies by updating corrupt data structures to conform to consistency constraints) promises to enable many programs to continue to execute acceptably in the face of otherwise fatal data structure corruption errors. A key issue is obtaining an accurate and comprehensive data structure consistency specification.

    We present a new technique for obtaining data structure consistency specifications for data structure repair. Instead of requiring the developer to manually generate such specifications, our approach automatically generates candidate data structure consistency properties using the Daikon invariant detection tool. The developer then reviews these properties, potentially rejecting or generalizing overly specific properties to obtain a specification suitable for automatic enforcement via data structure repair.

    We have implemented this approach and applied it to three sizable benchmark programs: CTAS (an air-traffic control system), BIND (a widely-used Internet name server) and Freeciv (an interactive game). Our results indicate that (1) automatic constraint generation produces constraints that enable programs to execute successfully through data structure consistency errors, (2) compared to manual specification, automatic generation can produce more comprehensive sets of constraints that cover a larger range of data structure consistency properties, and (3) reviewing the properties is relatively straightforward and requires substantially less programmer effort than manual generation, primarily because it reduces the need to examine the program text to understand its operation and extract the relevant consistency constraints. Moreover, when evaluated by a hostile third party "Red Team" contracted to evaluate the effectiveness of the technique, our data structure inference and enforcement tools successfully prevented several otherwise fatal attacks.

Selected Presentations

  • Online Python Tutor: Web-Based Program Visualization for CS Education, invited talk at Sonoma State University - Computer Science Colloquium, November 2012, Rohnert Park, CA (Video)

  • The Ph.D. Grind: Candid Discussions About Ph.D. Life, invited talk at UC Riverside - Computer Science Colloquium, October 2012, Riverside, CA

  • Online Python Tutor, invited talk at Hacker School, October 2012, New York, NY

  • The Ph.D. Grind: Candid Discussions About Ph.D. Life, Google Tech Talk, August 2012, Mountain View, CA (Video)

  • Burrito: Wrapping Your Lab Notebook in Computational Infrastructure, given at the USENIX Workshop on the Theory and Practice of Provenance (TaPP), June 2012, Boston, MA (Video)

  • CDE: Run Any Linux Application On-Demand Without Installation, given at the USENIX Large Installation System Administration Conference (LISA), December 2011, Boston, MA

  • Proactive Wrangling: Mixed-Initiative End-User Programming of Data Transformation Scripts, given at the Symposium on User Interface Software and Technology (UIST), October 2011, Santa Barbara, CA

  • Using Automatic Persistent Memoization to Facilitate Data Analysis Scripting, given at the International Symposium on Software Testing and Analysis (ISSTA), July 2011, Toronto, Canada

  • Sloppy Python: Using Dynamic Analysis to Automatically Add Error Tolerance to Ad-Hoc Data Processing Scripts, given at the International Workshop on Dynamic Analysis (WODA), July 2011, Toronto, Canada

  • CDE: A tool for automatically creating reproducible experimental software packages, invited talk at the "Reproducible Research: Tools and Strategies for Scientific Computing" workshop, July 2011, Vancouver, Canada

  • CDE: Using System Call Interposition to Automatically Create Portable Software Packages, given at the USENIX Annual Technical Conference, June 2011, Portland, OR (Video)

  • CDE: Using System Call Interposition to Automatically Create Portable Software Packages, Google Tech Talk, February 2011, Mountain View, CA (Video)

  • Towards Practical Incremental Recomputation for Scientists, given at the USENIX Workshop on the Theory and Practice of Provenance (TaPP), February 2010, San Jose, CA

  • The potentials and challenges of implementing automatic test generation using combined concrete and symbolic execution: A survey of research papers from 2006 to 2008, invited talk at Fujitsu Laboratories, October 2009, Sunnyvale, CA

  • Linux kernel developer responses to static analysis bug reports, given at the USENIX Annual Technical Conference, June 2009, San Diego, CA

  • Automatic Creation of SQL Injection and Cross-site Scripting Attacks, given at the International Conference on Software Engineering (ICSE), May 2009, Vancouver, Canada (also an invited talk at the Samsung R&D center, May 2009, San Jose, CA)

  • Dynamic Inference of Abstract Types, given at the International Symposium on Software Testing and Analysis (ISSTA), July 2006, Portland, Maine

Last modified: 2013-04-25