Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Work log from my 3rd year of Ph.D. (2008-2009)

I kept a daily log of all of my work-related activities for an entire year of my Ph.D. (my 3rd year, 2008-2009). Here it is in full, with some light commentary from ten years later (2019).

These are my raw work logs from Sep 2008 to June 2009, which encompass my entire 3rd year of Ph.D. in computer science. I kept logs only for that year as a personal experiment to give some structure to my otherwise-unstructured days. Notes:

  • I highly recommend reading Year 3 of The Ph.D. Grind before reading this log since it provides the high-level context surrounding these events.
    • That was the lowest point of my six years in grad school, so I might have sounded overly negative. Reading it makes me cringe a bit now, so please don't quote me on any of this material!
  • I didn't take any classes or serve as a TA that year, so this log reflects all of my work activities, which is mostly research.
  • I've lightly edited some log entries here to remove passwords, private info, and certain expletives; omitted parts are [[marked in double brackets]].
  • I've also reformatted slightly to wrap lines that were too wide.
  • The bullet points below some entries contain my hindsight commentary written on 2019-07-17.

September 2008

- created new Dragon Book website
  - ~/MyDocuments/Documents - Summer 2008/dragon-book-website/

- read Joel L's honors thesis in preparation for meeting with Joel
  Brandt about CHI paper submission
- met with Joel Brandt to discuss CHI submission

- started implementing SQL parse tree matching for Ardilla [[to iterate
  on]] what we submitted in ICSE 09 paper

- started brainstorming and analyzing Adobe Flex search data in
  preparation for data analysis section of CHI 09 submission
  • Ardilla ICSE 2009 paper, submitted at the start of Sep 2008. I got roped into helping out on this project last-minute when I was visiting Boston on vacation in August, and Adam (first author, my former labmate when I was an undergrad at MIT) saw me hanging out in MIT CSAIL and asked whether I wanted to help out. This paper was totally unrelated to my Ph.D. work, but it used similar ideas as my lab's work on Klee.
- met with Joel to discuss Adobe search log data analysis strategy

- hacked on Python code to analyze Adobe search log data
- read over draft of Dawson's CACM paper and hand-wrote comments on
  hard copy

- began categorizing URLs for Adobe Flex search log data for CHI

- met with Joel to discuss what measurements to make using Adobe search
  log data

- set up code for categorizing query refinements for Adobe Flex search
  log data in preparation for generating numbers for paper
- refactored Python code for Adobe Flex search log data parsing and
  munging to make things easier for myself in the near future

- started generating numbers for CHI paper using Python to export to
  .csv and then R to crunch numbers

- responded to Joel's email request for tweaked numbers by exporting
  more stuff to .csv using Python and then using R to crunch numbers
  • It was really painful to do data science/visualization in Python and R before today's modern libraries came on the scene.
- refactored R code for doing data analysis so that it's less ugly

- generated more numbers for CHI paper and started writing into LaTeX
- organized and wrote up a semi-coherent outline/draft of Adobe log
  analysis section of CHI paper into the LaTeX itself (rather than
  having them as scattered notes)

- created some bar charts in R and LaTeX tables for CHI paper to make
  Joel's job of choosing and interpreting them easier
- edited Dragon Book website in accordance with Jeff Ullman's

- modify search term classifier to detect anything in CamelCase or
  camelCase as a code search - regexp: [a-z][A-Z][a-z]
  and also some other heuristics

- (semi-)automated numbers generation for CHI paper by organizing
  scripts better and making a Makefile

- Generate more data for Adobe log analysis study in accordance with
  Joel's requests

- Turned off a filter for sessions and re-ran numbers and updated paper
  accordingly; thankfully I had my Makefile set up already, so it was a
  cinch to re-generate all the numbers when something changed ;)
- Clean up R code (e.g., removing pesky print statements and just leaving
  insights in comments)
- Hand-classify more webpage types so that we can get to 80% classified,
  and also eliminate 'Adobe bug DB' category
- Run some statistical tests and generate p-values for stuff
- Re-ran numbers in R with Joel's new hand-coded session categories

- Typed up all the data we want to present for Adobe query log study

- Ported software research lunch website over from Cristi's acct. to my
  • I organized the weekly software systems inter-group lunch talks (called 'software lunch') that year in the department.
- edit, edit, edit CHI submission

- make more figures

- make a LaTeX summary table that describes our over-arching idea
- edit, edit, edit CHI submission

- submit CHI paper!
- edited Dragon Book website according to Jeff Ullman's suggestions

- installed Zotero because I plan to use it to track and take notes on
  the papers that I read
  • I didn't end up using Zotero (or any other research notetaking or reference management software) for very long!
- met with Cristi to lay out some TODOs for me in the near future

- read paper "Towards automatic discovery of deviations in binary
  implementations with applications to error detection and fingerprint
  generation" and wrote up notes on Zotero

- read paper "DART: Directed Automated Random Testing" and wrote up
  notes on Zotero
  • Cristi was the senior student in Dawson's lab at the time, where he led the Klee project. He was about to graduate and was looking for tenure-track faculty jobs.
- Started reading about web servers and HTTP to get myself oriented in
  preparation for cross-checking servers

- Created new wiki page as I begin investigating web servers to check
  with Klee:

- Started installing and investigating thttpd in
  cluestick:~/thttpd-2.25b/ since it looks like a small and tractable
  candidate to try first ... but stopped shortly after compiling it
  since Dawson busted in and suggested for me to really solidify the
  coreutils/busybox cross-checking results before moving onto a new
  application domain

- Started looking into cross-checking once again and made some tweaks to
  my HTML reporter for cross-checking located at:

- Started a 4-hour run of about a dozen selected UNIX utilities at:
  These apps were selected both because they were simple text-processing
  apps and also because they yielded lots of mis-match errors in
  previous Klee runs

- read paper "EXE: Automatically Generating Inputs of Death" and wrote
  up notes on Zotero

- Ugh, canceled batch of 4-hour runs and started a new batch of 2-hour
  runs cuz cristi told me that i need to enable -emit-all-errors to get
  Klee to properly print out ALL cross-checking errors:
  • I spent almost the entire year working on Klee, which ended up being my last year on that project. I never got any publishable results so it didn't contribute to my dissertation. At the start of Fall 2008 I used Klee for cross-checking to verify behavioral equivalence of different variants of the same software specification.
  • Our internal wiki was useful for taking and sharing research notes within our lab. Later I set up my own wiki for personal projects, which I could update with notes even from my early iPhone (mobile web browsing was still in its early days back then).
- Began manually investigating mismatches in 2-hour run of
  cross-checking runs on selected unix utils:

- Updated repos. on hexapod and kicked-off a 4-hour run batch:
  pgbovine@hexapod ~/FT-Apps/util-crosscheck/bin/pg__2008-09-24__4-hour-run

- After trying to get to work better and avoid
  pesky piping errors, cristi suggested for me to just use run-bout to
  execute programs using inputs from .bout files.  I hacked run-bout to
  redirect the target programs' stdout/stderr to files so that I can
  capture them in my Python script for post-processing.

- Read "RA: Research Assistant for the Computational Sciences" paper and
  typed up notes on Zotero
  • cluestick, hexapod, etc. were the names of some of our research lab's Linux compute servers housed in the basement of the Stanford CS Department building. Back then it was still more convenient to have our own servers than to rent cloud-based services.
  • I'd often kick off long compute jobs on those servers, lasting anywhere from an hour to up to around 10 hours (overnight runs), which I'd check up on the next morning.
- Convert to using klee/scripts/run-bout to run
  target programs concretely (it even has a timeout feature!)
  rather than hacking my own Python code to reinvent the wheel

- Re-factored into two stages, the first of which
  executes coreutils/busybox versions of program and creates Python data
  structures holding their results, and the second of which reads in
  those structures and outputs to HTML.  This will greatly facilitate
  furtuer processing of cross-checking results.

- Began grouping reports by 'category', which correspond to whether
  {coreutils, busybox} X {stdout, stderr} are non-null and/or match

- Started new folder in EXE-Util-Tests/coreutils-busybox-cck-results to
  store partial results from coreutils/busybox cross-checking

- Kicked off a 1-hour run of ALL coreutils-busybox apps here:
  • EXE and FT were names for earlier versions of Klee.
- Looked through reports for
  which is now helpfully separated by categories and wrote up my notes
  about trends in each category (as a quasi-taxonomy)

- Shot off the following email to cristi and dawson about possible next
  steps for me:

i've categorized results by whether the stdout/stderr outputs from
coreutils/busybox are non-null and/or identical and looked thru results
of each category, and the most fruitful category for actual mismatch
bugs is when both coreutils/busybox stdout are NON-NULL and their
stderr's are NULL.  this is an indication of both programs accepting the
input and doing something sensible (hence no output on stderr), so any
differences are likely to be functionality mismatches.

one idea for making klee work better on cross-checking is to have a mode
where it simply exits early whenever it detects ANY write to stderr.
what this will accomplish is to prune all the paths that result in
executions where there was *some* error output and only generate tests
for the most fruitful category i described above.  of course, i want to
make this only an option and not the norm (since it will miss stuff).
how hard would this be for me to implement?  can i just muck around in
the cross-checking code, or do i have to modify klee itself?

- Kicked off a batch of 2-hour runs in
  after cristi told me about the hack you need to do to symbolically
  compile printfs in order to suppress spurious .cck.err msgs.

- Updated software lunch web site with new schedule
Looked through reports for
and noticed that running printf/vprintf symbolically doesn't seem to
remove any spurious stdout differences, so I'm gonna undo that
suggestion by cristi to hack klee/configure to compile them symbolically

Looked into 4-hour run results from:
and compared them with 2-hour run results, noticing that lots of times
there were just more inputs generated but not much else success.

Kicked off a 1-hour run of ALL coreutils-busybox apps here:

which were run with Klee just trying to generate inputs, not trying to
do any checking
- uploaded Dragon Book web site to
  • The prior year I was a TA for Monica Lam's compilers course, which used this textbook. She wanted me to update the book's website, which I started on 2008-09-08 (see its entry above). And HOLY COW, still works (at least for now), and it's the exact page that I uploaded almost 11 years ago! Check out this screenshot:

- generated .pickle intermediate files for
  to facilitate further analysis at a later time

- followed cristi's advice and removed uclibc from klee/ and recompiled
  to again try his strategy of compiling printf/vprintf symbolically in
  order to hopefully eliminate spurious stdout differences resulting
  from not tracking printf's
  - hmmm, there seem to be lots of 'different exit codes' sort of errors
  - uniq seems to have lots of spurious stdout differences before
  compiling printf symbolically, ugh still there after compiling printf
  - ughhh still doesn't seem to work properly!
  - hmmm, try again and it sort of works better for some mysterious
    reason, wtf?

- on hexapod, tried to debug cause of spurious printf differences using
  'uniq' as the test program since uniq contains obvious false positives
  (and also 'wc' as well)
  - hmmm, now it seems like there are some false positives whenever
    either outputs to stderr (e.g., in 'wc', 'uniq'), but it's okay when
    they ONLY output to stdout with no output to stderr.  this is
    really puzzling :0
    - also problematic when NEITHER outputs to stdout, wtf???
    - maybe has something to do with calls to fflush(stdout) ???
    which screws something up in stdout, thereby causing it to be a
    different size with junk contents?
    - so it seems like sometimes when in reality neither busybox nor
      coreutils has stdout, it says that busybox stdout is of length 0
      and coreutils is of length > 0, hmmmm ...

- I still can't figure out exactly what's going on, but recompiling with
  symbolic printf seems to eliminate at least some false positives.
  I'm going to call it a day now and just kick off a batch of 2-hour
  runs, which will hopefully come back with fewer # of false positives:
- Used "cscope -b -q" in klee/ to generate a cscope index for navigating
  all Klee source files specified in cscope.files
  and added to my .vimrc
  so that I can easily navigate klee code base

- Look over 2-hour run results in:
  which should hopefully not have as many false positives since I
  compiled printf symbolically (i hope!)
  - see results analysis in:

- Re-built Klee in debug mode with klee/
  and dawson just suggested "Just make sure to not add if statements
  when you put in printf's etc, or the forks won't work."
  - started running stuff through gdb but doesn't seem too informative
    because it's too low-level and i dunno what i'm looking for yet
  - i'm gonna try again with Release version and printfs in Klee since it
    seems to run MUCH faster

- cristi suggested for me to use -write-paths and -replay-path=
  in order to replay paths within klee itself.  this might help me to
  debug better.
  - per cristi's suggestion, i had the .path files output the file and
    line numbers of branches so that i can get an idea of how execution
    flows through each path
    - my current implementation is super ugly right now, but at least it
      works ... not too informative, though :(

- i'm still investigating the mismatches, and my current hypothesis is
  that there is a subtle difference between an argument full of null
  bytes and no argument being present at all.  klee generates a series
  of, say, 11 null bytes for a command-line argument, so when it's
  replayed, that argument is present (but just all null), but for some
  reason, within klee, it treats it as that argument never existing ...
  dunno whether this is a red herring, though :(
  - hmmmm, might be a red herring, though, since when I force
    len(argc) == 1, I can't seem to get any mis-matches :/

- i have a strong hunch it has SOMETHING to do with command-line option
  parsing, like in:
    while ((optc = getopt_long (argc, argv, "clLmw", longopts, NULL)) != -1)
  of wc.c.  Perhaps we're not modeling certain functions???

- tried to simplify the FT-Apps/util-crosscheck/*.c files to make it as
  easy to debug as possible :0

- throw up my hands for now and generate a new round of 1-hour runs with
  the hope of being able to see if I can find more patterns for
  mismatches occurring:
  • It was really hard for me to form an accurate mental model of how Klee worked since I wasn't involved in originally developing it and it's a super-complex piece of software.

October 2008

- per dawson's suggestion, tried to run cross-checking wc without
  symbolic argv's, tweaking --sym-args in

  - hmmmm, by doing --sym-args 0 1 2 rather than --sym-args 0 1 10
  it seems to go down lots more paths ...
  anyways, i'll keep it short for now (at length 2), but the same
  problems still manifests itself.  i guess that's good, right?
  it will shorten the debug cycle.

- implemented a -write-verbose-paths option to Klee to have it write
  *.verbose-path files for each test case which details the exact file
  and line number of each branch taken in that path.  This is a lot more
  human-readable than the 1's and 0's outputted by -write-paths.

- tried to debug wc diffs using -write-verbose-paths and -replay-path
  - it seems like I can isolate a case where BOTH coreutils/busybox exit
    with an error (file not found), but busybox version doesn't write to
    stdout while coreutils version DOES write to stdout
  hmmmm, something to do with stdout_writes not being symbolic?

- I think that there might be some sort of issue with sharing the
  symbolic stdout between two runs.

- tried commenting out parts of coreutils wc to try to get it to not
  print anything to stdout, by making it exit() early at various points
  ... to klee-compile ONLY wc, do this:

pgbovine@hexapod ~/FT-Apps/coreutils-6.11/obj-klee/src
$ klee-gcc  -I. -I../lib -I.././../src  -I.././../lib   -g -MT wc.o -MD
-MP -MF .deps/wc.Tpo -c -o wc.o .././../src/wc.c; klee-gcc  -g
-Wl,--as-needed -o wc wc.o ../lib/libcoreutils.a  ../lib/libcoreutils.a

Woohoo, so I've found that in wc_file() in wc.c, the call to this
error() function means the difference between having it output something
to stdout and nothing to stdout
    //exit (EXIT_FAILURE); // pg - NO STDOUT WRITE
    error (0, errno, "%s", file);
hmmmm ... maybe error() in lib/error.c (?) redirects stderr to stdout
somehow?  or only when it's running within Klee?  because I see a
print-out of the following:

pgbovine@hexapod ~/FT-Apps/util-crosscheck/bin
$ ~/FT-Apps/coreutils-6.11/obj-gcc/src/wc ''
@@@ FOO @@@
/home/pgbovine/FT-Apps/coreutils-6.11/obj-gcc/src/wc: :
  No such file or directory
@@@ BAR @@@

while running natively but NOT while running in Klee with this:

    fprintf(stderr, "@@@ FOO @@@\n");
    //exit (EXIT_FAILURE); // pg - NO STDOUT WRITE
    error (0, errno, "%s", file);
    fprintf(stderr, "@@@ BAR @@@\n");

hmmm, perhaps it's using the uclibc version of error() rather than the
one in coreutils-6.11 ... hmmm, I still can't figure out which version
of error() it uses, but if I don't run error() from coreutils, then that
gets rid of spurious prints to stdout

- as a first step towards concocting a simple test case, just have one
  'run' try to print some [[expletive]] to error()and see how stdout differs.
  in util-crosscheck/*.c, don't even TRY to run the target programs.

- email to dawson and cristi at the end of my work day:
so i think i've isolated the cause of at least one of the nasty
problems.  the minimal test case for cross-checking 2 programs A and B:

Program A: do nothing
Program B: do nothing but call error() to issue an error message.

error() isn't supposed to print out anything to stdout (should only
print to stderr, i think, right?), but for some reason, Klee thinks that
Program A prints nothing to the symbolic stdout but Program B prints
some junk to stdout.  Even though during concrete replay, error()
doesn't print anything to stdout.

I still can't figure out what code error() belongs to (have a hunch it
might be uclibc but dunno for sure) or how to fix it yet, but I will
create a regression test based on this finding.


to make the example more realistic, we have:

busybox version: call busybox error function to issue error msg
coreutils version: call error() to issue error msg

the busybox error function behaves nicely and doesn't output anything to
symbolic stdout, but the coreutils error() outputs some junk to symbolic
stdout (but when it's run in replay, it doesn't output anything to
stdout) ... hmmm, perhaps different versions of error() are being called
from within Klee and outside of Klee during replay.

- started creating small regression tests for cross-checking in
  and was able to isolate the error() bug in a tiny test case

- cristi suggests to simply klee_alias __error (from uclibc) to nop in
  order to hide the error() problem for now, and keep chugging along
  to see what other false positives/negatives it can uncover.

- ohhh, I think I found the problem now!  the smallest test case is
  fprintf(stderr, "%d", 1) which triggers fprintf to print to stdout
  somehow (due to some screwy thing we're doing with uclibc I think)

- kicked off a run in:
  which is the result of after I hacked error() to not be pathological
- Looked at results from:
  which is the result of after I hacked error() to not be pathological
  - Sources of false positives:
    - again with different stdout contents (busybox of length 0 while
      coreutils of length non-zero when they should both be 0)
    - still lots of different exit code errors :(

- hmmmm, perhaps since the problem is in fprintf, I should simply
  override it with a stub version rather than overriding error()
  since other functions call fprintf and [[expletive]] too

- ok, great the problem isn't in error(), it's in uclibc fprintf()!!!

- oh wow, after looking at uclibc code itself, check out what's going on
  with fprintf:

int fprintf(FILE * __restrict stream, const char * __restrict format, ...)
  va_list arg;
  int rv;

  va_start(arg, format);
        if (stream==stdout || stream==stderr) {
          rv = vprintf(format, arg);
        } else {
          rv = vfprintf(stream, format, arg);

  return rv;

notice how if you pass in a stream of stdout or stderr, it does a
vprintf(format, arg) while ignoring the stream ... now what's in

/* DWD/CrC */
/* Copyright (C) 2004       Manuel Novoa III    
 * GNU Library General Public License (LGPL) version 2 or later.
 * Dedicated to Toni.  See uClibc/DEDICATION.mjn3 for details.

#include "_stdio.h"


int vprintf(const char * __restrict format, va_list arg)
  return vfprintf(stdout, format, arg);

This was apparently hacked by daniel to simply print to stdout!!!  so
even if you wanted to print to stderr, you would end up printing to
stdout!!!  WTF???

The original uclibc fprintf.c didn't manifest this weird behavior.
Apparently Daniel had made this change but not made any comments ...
this could be a big problem, so I gotta undo it somehow :(

Ohhhh, I see, because when you run normally (without printf symbolic),
for speed, you want to NOT have fprint(stdout, ...) fprintf(stderr, ...)
to go slowly, so you want to execute them uninstrumented.  But for
cross-checking, I want printfs symbolic, so this [[expletive]]
[[expletive]] up.

My fix is to do the following to fprintf.c:

int fprintf(FILE * __restrict stream, const char * __restrict format, ...)
  va_list arg;
  int rv;

  va_start(arg, format);

/* DWD/CrC/PG */

// ok, this is really subtle, but if we are NOT running with symbolic
// printfs (for improved speed), we should simply print stdout and
// stderr to native stdout using vprintf, but if we are running with
// symbolic printfs (for completeness, esp. for cross-checking), we
// should always run instrumented code (e.g., vfprintf)
        if (0) {
        if (stream==stdout || stream==stderr) {
          rv = vprintf(format, arg);
        } else {
          rv = vfprintf(stream, format, arg);

  return rv;

- Began a batch of 2-hour runs in hexapod pg__2008-10-03__2-hour-run
  which is after I applied the fprintf fix
  • Cristi (senior Ph.D. who was about to graduate), Daniel (Ph.D. student in my year but left early), and Dawson (our advisor) were the creators of Klee; they wrote the highly-influential OSDI 2008 paper.
- Looked at results in:
    hexapod pg__2008-10-03__2-hour-run
  which is after I applied the fprintf fix to uclibc
  for full results

- it seems like Klee thinks coreutils exits with 0 pretty much all the
  time, hmmm

- another weird behavior is that when I run programs by themselves for
  short periods of time, spurious errors seem to trigger less frequently

- started looking into uniq to track down a spurious stdout mismatch
  - so for a certain test, arg0 is '-d', but the path in Klee that
    generated this input doesn't go through the argument parsing code
    that handles '-d' in the coreutils version.  that's disturbing!!!
  - uniq.c calls getopt_long to parse command-line params from argv
  - hmmm, but from the verbose paths trace, it seems like the klee
    version uses uclibc getopt.c rather than the version bundled with
    coreutils ... hmmm, is that gonna be a problem?

- wrote a really small regression test getopt-test to try to tickle
  option parsing behavior

- OH [[EXPLETIVE]], I think my mistake was in not realizing that there are
  certain options you're supposed to pass into Klee (e.g.,
  -emit-all-errors) and others that you're supposed to pass into the
  target program to be used to set up the model environment (e.g.,
  --sym-args), and if these are confused, then strange things happen
  because Klee is executing the target program with different argc and
  argv than you expect.
  - I have since documented to be more careful about
  - but this turns out not to be a deal-breaker

- kicked off a 1-hour run in hexapod pg__2008-10-04__1-hour-run
  after simplifying options to:
    '--sym-args 0 1 2 --sym-files 1 8 --sym-stdout'
  • A typical daily workflow back then was coming into the office in the morning, checking to see the results of last night's overnight (~10-hour) experimental run, seeing what went wrong, figuring out how I could improve upon it, implementing some code changes, and then kicking off another overnight run at the end of the workday. I repeated this similar process for several months.
- look at 1-hour run in hexapod pg__2008-10-04__1-hour-run
  after simplifying options to:
    '--sym-args 0 1 2 --sym-files 1 8 --sym-stdout'
  - uniq still has the problem where with the '-d' option, coreutils
    seems to NOT execute the path with the '-d' option set but busybox
    does ... something with argument parsing or environment vars or
    something else entirely???
  - also still lots of seemingly spurious 'different exit codes' errors

- wow, so for a certain input and the '-c' (count) option on wc, Klee
  thinks that it's possible for coreutils/busybox to have different
  outputs, but when it's concretized and run, the outputs are identical
  ... e.g., in Klee:

      busybox: 1  @ @ H@@
    coreutils:  @ @ H@@

  but when replayed, they both are equal modulo spaces:

      busybox: 1  @ t ``P
    coreutils:       1  @ t ``P

  could this have something to do with the constraint solver's
  particular choice of concrete outputs???

  Hmmmm, OR it might mean that coreutils version wasn't going down the
  path with the '-c' command-line option!  Since the coreutils version
  within Klee didn't seem to print the 1.

- if you look at busybox libbb/getopt32.c, that code looks [[EXPLETIVE]];
  maybe it's not capturing the constraints properly at the bit level or
  something, it's [[EXPLETIVE]] code
  - oh wow, busybox libbb/getopt32.c screws with optind and [[expletive]],
  like it screws with global variables defined by getopt.h!!!
    - esp. the optind variable, which might be [[expletived]] up across runs
      since getopt() is only supposed to be called ONCE per run but it's
      called TWICE, once for busybox (whose own getopt calls the real
      getopt), and another time for coreutils

    - [[EXPLETIVE]] ME!  I think I found the one-line fix:
        optind = 1;
      because getopt() and friends use the global variable optind, you
      need to RESET this to 1 if you are going to call it more than once
      per program invocation.  getopt() is only meant to be called once
      per invocation, but since we're doing cross-checking, we're
      calling it TWICE (once for busybox and again for coreutils).
    - added a regression test cck-regression-tests/getopt-test-2 for this

- kick off a batch of 2-hour runs in hexapod:pg__2008-10-05__2-hour-run
  after applying getopt() fix
- Look at 2-hour run in hexapod:~/pg__2008-10-05__2-hour-run
  - wc: no false positives or negatives!
  - uniq: some false positives when both output to stderr
    - generates LOTS of uninteresting and similar (CO, BE) cases
  - uname: false positives for (CO, BO) because we are doing simple
    string comparison, but the DATE fields are different across runs;
    could be eliminated by canonicalizing times as something like
    XX:XX:XX or something
  - tee: some spurious exit code differences for (CE, BE)
  - head: a bunch of spurious different exit code differences for (CE, BE)
          or maybe they're NOT spurious after all
          oh wow, so busybox 'head' still reports error code 0 for 'No
          such file or directory', could be a bug
  - fold: some spurious exit code differences for (CE, BE)
  - date: some 'false positives' for (CO, BO) but could be eliminated by
    canonicalizing all numbers to X's in dates.
    - some false negatives where it's not picking on true error code diffs
  - comm: no false positives!

- have run-bout collect the exit code of applications so that we can
  print them in HTML report

- so the biggest problem I still see is that sometimes Klee thinks that
  exit codes are different when concrete execution says they're not, and
  vice versa ... hmmm

- ah i see, i surmise that it's an error due to getopt() again.  I
  already reset optind, but I need to also reset optarg.

- kicked off 2-hour run in hexapod:~/pg__2008-10-06__2-hour-run
  (done after resetting optarg to hopefully eliminate more false
- per Daniel's suggestion, used llvm-nm to see what global symbols from
  libc might clash between busybox and coreutils versions:

  see FT-Apps/util-crosscheck/clashing-symbols

These are the variables that seem to be shared between bb and coreutils:

optarg and optind I already reset, and __errno_location seems to return
a pointer to errno, so I want to reset that

- Look at 2-hour run in hexapod:~/pg__2008-10-06__2-hour-run
  (done after resetting optarg to hopefully eliminate more false
    - spurious exit code diffs in test000020.bout, 23, 27, 32, 35, 39, 44, 47
    - false negative on stdout diffs in test000008.bout and 33
      - they are actually different modulo whitespace but Klee doesn't
        pick up on it
    - false negative on exit code diff in test000212.bout, 210
    - spurious exit code diff in test000036.bout, 39, 45, 50, 75, 80
    - spurious exit code diff in 179, 190, 217, 243, 256, 263, 264, 409, 470
    and lots more
    - whoa model.err files wow!
  tee, uniq, wc:
    - lots of spurious exit code diffs
  uname: clean!

  Ok, so lots of spurious exit code diffs still remain, although the
  stdout diffs seem to be gone

- hmmm, daniel suggested for me to do a fix to run-bout.c because I
  was clobbering the exit code output file more than once

- now with that fix in place, it seems like more false positives gone
  wc: clean!
  uname: clean!
  uniq/tee/fold: a bunch of false positives where it thinks coreutils
  exits with 0 but it actually exits with 1
  head: test000508.bout false negative where it thinks that error codes
        are identical when they are in fact different
        a bunch of false positives where it thinks busybox exits with 0
        when it actually exits with 1

- tweaked some cross-checking code to simplify handling of exit codes,
  which I think makes a lot more sense now ... I dunno what the code was
  doing before :0

- hmmm, seems like I can't really reproduce the false positives when I
  replay the seed .bout files
  ugh seems a bit non-deterministic

- ohhh, so collation order seems to affect the output of comm; the
  larger problem is what variables are defined in the test environment
  divergence occurs in comm.c w.r.t hard_LC_COLLATE variable

  - ahhh, daniel suggests to use rather than calling
    run-bout directly in order to have it set up the proper environment!
  - well, that wasn't completely appropriate, but at least I learned
    that you need to start with an EMPTY environment (using 'env -i')
    and then source in the vars that you need in order to replay

    env -i /bin/bash -c "source $HOME/klee/models/simple/testing-env;
    $HOME/FT-Apps/busybox-08-12-08/obj-gcc/bin/comm test000008.bout

  - hacked to actually execute programs in the
    (near)-virginal testing environment where they execute within Klee

  - ugh, tee still has a false positive where Klee thinks that coreutils
    exits 0 but it actually exits 1
    e.g., pg__2008-10-06__2-hour-run/tee-data/test000012.bout
    - wow, when i recompile busybox with debug info, so that i can see
      it in .verbose-paths, there's a lot of weird [[expletive]] that
      seems to be executing!

  pgbovine@hexapod ~/FT-Apps/util-crosscheck/bin
  $ KLEE_EXTRA_ARGS="-only-seed
  --emit-all-errors --write-verbose-paths" tee

- Kick off a run in hexapod pg__2008-10-07__2-hour-run
Look at results in hexapod:~/pg__2008-10-07__2-hour-run, and they seem
pretty solid ...
  - There are still false positive exit code mismatches, where Klee
    thinks that coreutils exited with 0 when it actually exited with 1,
    but I'm going to punt on them for now since they're not showstoppers
    - e.g., uniq, tee, head, fold
  - also some false negatives where it doesn't detect exit code

Try to calculate coverage of these tests on both coreutils and busybox
versions by doing:

cd ~/FT-Apps/coreutils-6.11/obj-gcov/src wc ~/pg__2008-10-07__2-hour-run/wc-data/
cd ~/FT-Apps/busybox-08-12-08/obj-gcov/bin wc ~/pg__2008-10-07__2-hour-run/wc-data/

to get separate coverage numbers for coreutils/busybox

created a script called ~/FT-Apps/util-crosscheck/
with optional --gen-zcov-html option to replay and visualize coverage

Generate coverage information and manually looked through zcov HTML
files, hunting for insights:
    - coreutils usage() not being called with status = 0 (success), so
      not printing out actual usage and instead just printing out a
      compact "try --help for more information" message, so missing some
      coverage there
    - batch_convert() function not getting called because the 'f' option
      is not being supplied on command-line (could fix by forcing that
    - wow, 100% line coverage in busybox but only 59% in coreutils
    - same issue with full usage() error message not being printed, and
      only an abbreviated message being printed ... no big deal
    - lots of functionality that bb doesn't implement that doesn't get
      tested by coreutils version ... hmmm, interesting, so if we can
      achieve very high coverage on one version (presumably the simpler
      one), then it doesn't matter if we get super high coverage on the
      other one since it's most likely functionality not implemented by
      the simpler one.
      - hmmm, perhaps this is why cristi wanted to run busybox FIRST
        since it implements less functionality

- To assess the effects of Klee run-time on coverage, kick off a few
  runs of varying durations:


(we already have 2-hour data from pg__2008-10-07__2-hour-run)

- Started writing a draft of the paper just so we can have a centralized
  place to put all of our ideas and proposals.
  • I started drafting up my own Klee follow-up paper based on my ongoing cross-checking experiments, but that never ended up getting submitted.
Kicked off a 1-hour run of ALL benchmarks in:
  - hexapod pg__2008-10-10__1-hour-run-ALL
- Read camera-ready version of Klee paper just to see what's already
  been written up about cross-checking, and fortunately, not much has

- Generate .pickle files and coverage numbers for
  in preparation for analysis
- FT-Apps/util-scripts/helper/ lists names for
  busybox/coreutils cck versions of programs that don't exist:

cal.log:ERROR: Could not find ./cal.bc or ./cal
catv.log:ERROR: Could not find ./catv.bc or ./catv
dos2unix.log:ERROR: Could not find ./dos2unix.bc or ./dos2unix
install.log:ERROR: Could not find ./install.bc or ./install
length.log:ERROR: Could not find ./length.bc or ./length
[[.log:ERROR: Could not find ./[[.bc or ./[[
realpath.log:ERROR: Could not find ./realpath.bc or ./realpath
test.log:ERROR: Could not find ./test.bc or ./test
true.log:ERROR: Could not find ./true.bc or ./true
usleep.log:ERROR: Could not find ./usleep.bc or ./usleep
uudecode.log:ERROR: Could not find ./uudecode.bc or ./uudecode
uuencode.log:ERROR: Could not find ./uuencode.bc or ./uuencode

so I should probably remove them.

- Graphed out preliminary coverage numbers from 1-hour run of all cck
  programs in LaTex using
  - eh, this isn't really that useful cuz it only accounts for lines in
    the utility itself and not ELOC in libraries as well - use LLVM to
    count ELOC coverage like in Klee OSDI paper

- Make a cursory pass through results in:
  and noticed that there are A LOT of test cases generated where both
  reject input, which suggests that generating valid inputs is a real
  problem :/
  - hmmm, what can we say about coreutils/busybox implementations of
    programs that yield LOTS of tests where both ACCEPT input rather
    than lots of tests where both do NOT accept input?  Perhaps they're
    a lot more 'similar' than we'd expect?
    - could we define a similarity function that way?

- whoami is a great small example where there is 1 test case where both
  accept input and 12 test cases where neither accept input

- Talked with cristi about division of labor on cross-checking project:
  - I'm gonna continue with coreutils/busybox and he'll try to find
    a new target application to check

- Implemented, which creates a summary of all
  the cross-checking outputs in a particular run in one single HTML file

- cristi also suggests doing 20-minute runs rather than 1-hour for
  brevity, because especially with DISABLE_COV_ONLY_NEW on, you might
  get overwhelmed by the sheer number of paths

- started creating a crappy script for automating
  cross-checking runs (could obviously make much better in near future!)

- kicked off a run in hexapod:pg__2008-10-15__20-min-run-ALL
- Improved and to print
  out better information about cck run results
  - For, output how often error codes match or
  - For, print out 'input accept category' to
    facilitate debugging and matching with output of

- Started manually looking at results in
  and writing up observations in:
- continue looking thru results in
  and writing up observations in:

- oh wow, grouping by stderr messages can be instructive because you can
  see what cluster of command-line options lead to the SAME (or similar)
  reported error

- kicked off a run to report ALL paths for coverage measurements:
- continue looking thru results in
  and writing up observations in:

- from reading thru all the raw reports, I can usually distill it down
  to a small handful of mismatch bugs, most of which consist of one
  version being a bit more lenient on what inputs it allows thru and
  accepts (or implementing more functionality, usually coreutils)
- finished looking thru results in
  and writing up observations in:

- exit codes matter because they serve like function return values when
  programs are strung together into scripts!

- started writing up summary of cu/bb cross-checking results in HTML

- wrote a custom text to HTML converter to generate cu/bb cross-checking
  results HTML file since it was too ugly to manually write the HTML
  • I wrote Python scripts to parse the raw experiment outputs and generate organized-looking webpages containing sortable HTML tables that were easier to read.
- finished custom text to HTML coverter in
  so that I can write in semi-legible text format and have it rendered
  in pretty HTML

- created a new uclibc svn repository so that we can keep track of our
  patches to it and merged all changes into it so that we can keep
  better track of them (before it was just a tarball within Klee)

- finished first draft of an HTML summary of the dozens of
  coreutils/busybox cross-checking mismatches that I found by manually
  looking through thousands of error reports
  - and emailed out to dawson and cristi to have them take a look
- started playing around with making custom input filters for argc/argv
  for programs in order to drive execution down more successful paths

- started looking into fixing keeda after its fatal hard disk corruption
  problem, ughhh
  - managed to 'rename' bugspray to keeda and to restart netapp mount,
    svn/git, webserver, and mailing list

- kicked off a run with input filtering on selected multi-line apps
  in hexapod:pg__2008-10-21__2-hour-run-MULTILINE-FILTERING
  • I volunteered to fix one of our main compute servers after its hard disk crashed since I wanted to feel somewhat useful to the lab, despite my lack of progress on research.
- did a ton of IT/sysadmin work setting up the new keeda (formerly
  bugspray), mailing list, svn/git, etc. - ughhh annoying but had to be
- spent the entire freakin' day trying to restore the mc MoinMoin wiki
  and getting screwed by SELinux - notes are in:

  ~/MyDocuments/Documents - Stanford 3rd year Ph.D./

- finally got [[expletive]] sort of working, though.  wrote a script that
  converts old mediawiki format to moinmoin format (doing some VERY
  simple conversions of links)
- looked at results in
  hexapod:pg__2008-10-21__2-hour-run-MULTILINE-FILTERING and didn't
  see anything new arising from having multiple-line inputs except for
  one bug in tac that I reported in

- a la daniel's suggestion, try to use klee_prefer_cex() with
  something like 'hello' as the preferred counterexample string for
  stdin and file A to see if we can group tests that way, and it seems
  to work pretty well to try to generate those inputs unless otherwise

- kicked off a full set of 1-hour runs (with klee_prefer_cex()) so
  that now hopefully unconstrained inputs can be clumped together:

- get Klee installed for Roger and orient him on how to use it to run
  his algorithms applications
  • Roger was an undergrad who worked on Klee partly under my supervision. I did a pretty bad job supervising since I didn't even know how to work well with Klee myself!
- another full day of IT excitement!

- scraped data off of keeda's failing (primary) hard disk and put onto
  netapp for backup
- seems like /web/metacomp has bad sector issues

- got wiki fully-functional again
- wrote backup script and added to cron tab
  - backs up important keeda files here:

- re-add people onto old mailing lists
  - Files located here
  - config.pck is simply a Python pickle file with a field 'members'
    which contains the list members
  - People on mc list: [[cut out private info]]
    - haven't added yet
- updated the wiki to document some of my IT labors, with the hopes
  that it will be easier to restore machines in the future

- updated scripts to generate summary reports of coverage and mismatch
  data for an entire run consisting of many cu/bb utilities
  - can now print out rough measure of coverage and % equivalent paths

- managed to resurrect the CS343 Spring 2007 wiki on keeda!
- created FT-Apps/util-scripts/ to graph out how
  frequently .bout files are generated
  - for most programs, tests (.bout files) are generated at a pretty
    consistent rate, but some programs experience spurts followed by
    relatively long stalls of inactivity

- Look at results in hexapod:pg__2008-10-17__30-min-run-ALL which
  generates all paths, not just those that cover new lines
  - hmmm, interesting, when we tell Klee to generate all tests
    regardless of whether new lines are covered, it generates tests at
    an alarmingly consistent rate (pretty much linear)

- Look at results in hexapod:pg__2008-10-24__1-hour-run-ALL, which
  uses klee_prefer_cex() rather than forcing files and stdin to
  contain printable contents, which seems to generate more tests than
  finds more potential mismatches due to lack of restrictions on inputs
  - hmmm, finds some diffs involving non-printable characters, such as
    tac (pg__2008-10-24__1-hour-run-ALL/tac.html#CATCO-BO)

  - lots of possible false positives resulting from null bytes in
    inputs ... should I disallow null bytes in inputs at least?
    - maybe stick to isprint() or '\n' or '\t'
    - one guy says ASCII 0-8, 11-31, > 126 are illegal
    - hmmm, it seems REALLY slow to enforce these bounds for all bytes
      in a file, ugh

- it seems like the 'twin' goals of generating more tests and
  creating fewer redundant tests are at odds with one another ...
  e.g., we generate two tests with seemingly identical arg0 as '//'
  but they are actually different sequences of bytes:
    '//\0' (3 bytes)
    '//\0{M 8{ \0' (11 bytes) - notice that the null-terminated prefix
                                is the same
  I guess technically command-line arguments should be parsed like
  normal null-terminated strings, so we should maybe impose some sort
  of null termination requirement, right?

- cristi and i both independently confirmed that the isprint()
  constraint on the symbolic stdin and file might be royally screwing
  us over; when running without it (and only constraining no null
  bytes or not constraining at all), it seems to be much faster

- kicked off a run on hexapod:pg__2008-10-27__1-hour-run-ALL after
  klee_cex_prefer files/stdin to be 'foo\nbar\n' for canonicalization
- refactored FT-Apps/util-scripts/
to make the code a bit clearer and stuff

- worked with cristi to refactor cross-checking code to hopefully make
  it faster and generate more tests
  - eliminated isprint() restriction for symbolic arguments; will
    hopefully speed things up
  - did various other code clean-ups

- used apache Virtual Hosts to sysadmin the cs343 website better

- manually look through and jot
  down handwritten notes on how I could reproduce these results
  semi-automatically (e.g., by post-processing the CCKExecutionResults
- created scripts like in order to generate
  separate zcov files visualizing executions for each .bout file so
  that I can get some intuition for how Klee's search works.

- looked at some path coverage information using zcov for 'dirname'
  and 'tr' and noticed that there were LOTS of redundant paths with
  identical branch decisions
  - (could output path files in Klee)
  - LOTS of identical paths in dirname.c itself, but probably
    different paths in library code that it calls
  - actually, a lot of redundant test cases (with nearly identical
    outputs) contain pretty much the same coverage in the utility
    programs themselves but probably differ slightly in constraint
    generation in libc code (which zcov doesn't pick up on)

- if you canonicalize the .pc files (using,
  then similar constraints usually yields similar .bout files
  which makes sense because .bout files are generated directly from
  path constraints in .pc file
  - .path files are also sorta similar - could do edit distance
  - could be used in clustering ... edit distance between .path files
    - the more similar the .path files, the more similar the
      constraints, and hence the more similar the .bout files

- wrote up a long-ass proposal email for directed search to eliminate
  dependencies, which i've posted on the wiki here:
- [[expletive]], zcov svn repo is an old version that's incompatible :(
  need to somehow port to new version
  - the problem was that the svn db format was '3' instead of '2'
  so I did a dirty-ass hack where I simply changed
  /netapp/projects/repositories/zcov/db/format to 2
  and changed the format of 'current' file slightly
  and moved around files in revs/ and revprops/ to match the structure
  of other repositories with db format 2, such as uclibc
  - seems to work!

- checked in a change to zcov-genhtml to add an --ignore-no-cov
  option in order to print out briefer zcov reports for measuring

- modify to take timestamp deltas from .bout
  files and assembly.ll and convert them into delta times and archive
  them in each CCKExecutionResults object ... that way, we don't have
  to mine the file modification dates using stat later on (so that the
  original files can be destroyed)

- woohoo, finally got Mailman mailing list working on keeda again!
- Kicked off a run at hexapod:pg__2008-10-31__1-hour-run-ALL with the
  new simplified code refactored by cristi and me (which gets rid of
  costly isprint() constraints on inputs)

- Created new mailing lists on keeda Mailman (decided to forget about
  trying to re-import the old lists since directions on the internetz
  seemed WAY too grimy and complicated)

- Looked at results of hexapod:pg__2008-10-27__1-hour-run-ALL
  - oh wow, some of these tests take way longer than 1 hour to
    terminate, up to 20 to 25 minutes overtime
  - good that it generates in the high hundreds or sometimes in low
    thousands numbers of paths!

- Meet with Dawson to talk about improving search and wrote up
  findings on wiki:

November 2008

- Read read-write set paper for EXE, Liblit's statistical bug
  isolation paper, and skimmed a paper on combinatorial group testing
  because Dawson recommended for me to read those while thinking about
  trying to improve search in Klee

- Sketched out algorithm for targeted search in Klee in my research
  notebook, incorporating some ideas from papers I read today
  • After my Klee cross-checking adventures (July-Oct 2008) totally failed, I spent the next two months on a new direction trying to improve its search algorithms so that it could hopefully achieve better test coverage on our benchmark programs.
- started implementing my custom targeted searcher by first
  re-orienting myself with the Klee searcher code so that I can get an
  intuition for what's going on and figure out how to easily diagnose
  whether changes I've made had positive or negative effects.
  - code understanding is the first step towards my goal
    - wow, cscope really comes in handy when trying to perform
      program understanding

- the default searcher in is:
  --switch-type=internal --randomize-fork --use-random-path
  --use-interleaved-covnew-NURS --use-batching-search
  --batch-instructions 10000
- this interleaves a RandomPathSearcher (coin flip at every branch)
  with a WeightedRandomSearcher using CoveringNew (trying to cover new code)

- disable root login in keeda and restart sshd for added security

- change svn/git check-in notifies back to the respective mailing lists rather than ones
  and alias the cs ones to keeda versions

- woohoo, can use klee_message() to print messages cleanly to
  messages.txt, which I can parse and post-process

- started implementing an experimental searcher on cluestick and added
  the --use-interleaved-min-similar-branches-NURS option for it

- can do apples-to-apples coverage comparison of different search
  strategies by running with --stop-after-n-tests=
  --dump-states-on-halt=false (just stop immediately after halting)
  - and using to print results of run.stats

- cool, preliminary tests show that my
searcher (which isn't even aware of achieving new coverage) fares a
bit better than the 'state-of-the-art' --use-interleaved-covnew-NURS
searcher on tough test cases like 'tr' and 'ls', although it runs
quite a bit SLOWER (but I can optimize that later)
  - it runs A LOT slower when you don't have DISABLE_COV_ONLY_NEW on,
    probably because it doesn't explicitly aim to achieve new line

- nice, initial results look promising!  created two searcher options:
    (minimize # similar branches + random)
    (minimize # similar branches + try to cover new code + random)

- created ~/FT-Apps/targeted-search directory for scripts dealing with
  targeted search

- ran an ad-hoc overnight run in:
  pgbovine@cluestick ~/FT-Apps/coreutils-6.11/obj-klee/src

- started a wiki page for the project at:
- looked at ad-hoc 30-minute runs on several benchmarks and initial
  results look encouraging:
    --use-interleaved-min-similar-branches does a few percentage
    points better in branch and statement coverage than default
    interleaved cov-new search

- try to visualize the output of my new searcher by printing out the
  counts for number of true and false branches taken at each point
  - hmmm, lib/Executor.cpp forks() at more than just branches in the
    code ... should I also visualize non-branch forks?

- dawson busted in and wasn't really feelin' my counts approach
  because he says that it doesn't capture the essence of what he's
  going after, which is how different that paths are from one another
  in terms of their components
  - he made some suggestions, which I wrote up on wiki page:

- added an --output-branch-cov-counts option to klee to periodically
  dump information about branch coverage to run.branches
  and --branch-cov-counts-write-interval to control frequency (default
  at 2 seconds)
  - hopefully I can visualize the output to get a sense of how search
    is working
- looked at ad-hoc 1-hour runs on several benchmarks and initial
  results look decent:

| Path                                        |          I |       Tr |  BCov |  SCov | Stmts |    Q# |       Tq |       Ts |     St |
| test-expr-DEFAULT_COVNEW-3600-secs          |  145679480 |  3611.49 | 82.32 | 87.00 |  1115 |  4799 |  2125.59 |  2173.99 |  22627 |
| test-expr-MIN_SIM_BRANCHES-3600-secs        |   62223463 |  3600.51 | 81.10 | 86.64 |  1115 |  1708 |   407.87 |   429.48 |  20275 |
| test-expr-MIN_SIM_BRANCHES_COVNEW-3600-secs |   78457854 |  3600.77 | 85.37 | 90.76 |  1115 |  1873 |   478.16 |   504.19 |  18695 |
| test-expr-RANDOM_PATH_SEARCH-3600-secs      |   91496858 |  3648.70 | 81.10 | 83.32 |  1115 |  3637 |  1689.59 |  1778.36 |  34953 |
| test-ls-DEFAULT_COVNEW-3600-secs            |  102508097 |  1337.11 | 30.51 | 38.02 |  6134 |  2048 |   250.17 |   299.38 |  14601 |
| test-ls-MIN_SIM_BRANCHES-3600-secs          |   84437032 |  3362.51 | 33.33 | 43.94 |  6134 |  5547 |   414.06 |   453.37 |  16988 |
| test-ls-MIN_SIM_BRANCHES_COVNEW-3600-secs   |   96949166 |  3601.09 | 33.85 | 44.34 |  6134 |  4437 |   298.97 |   387.59 |  17621 |
| test-ls-RANDOM_PATH_SEARCH-3600-secs        |   73465857 |  3606.49 | 32.65 | 40.77 |  6134 |  4757 |  1537.54 |     1709 |  18059 |
| test-sort-DEFAULT_COVNEW-3600-secs          |     647802 |   132.23 | 25.12 | 36.72 |  4807 |   504 |    46.39 |    47.37 |    196 |
| test-sort-MIN_SIM_BRANCHES-3600-secs        |     634151 |   261.91 | 22.77 | 33.99 |  4807 |   630 |    88.66 |    89.77 |    121 |
| test-sort-MIN_SIM_BRANCHES_COVNEW-3600-secs |     843678 |   358.37 | 28.96 | 39.71 |  4807 |   918 |   203.36 |   208.14 |    266 |
| test-sort-RANDOM_PATH_SEARCH-3600-secs      |     253339 |   181.54 | 10.52 | 17.25 |  4807 |   105 |     5.72 |     6.04 |    137 |
| test-tr-DEFAULT_COVNEW-3600-secs            |  104125391 |  3645.16 | 71.64 | 77.23 |  2262 | 12257 |  2238.39 |  2397.93 |  40234 |
| test-tr-MIN_SIM_BRANCHES-3600-secs          |   75757269 |  3601.78 | 65.92 | 71.93 |  2262 |  5998 |   637.18 |   710.21 |  38905 |
| test-tr-MIN_SIM_BRANCHES_COVNEW-3600-secs   |   80052969 |  3601.96 | 67.91 | 75.55 |  2262 |  8219 |   805.86 |   899.96 |  40594 |
| test-tr-RANDOM_PATH_SEARCH-3600-secs        |   19174264 |  3616.05 | 55.47 | 62.56 |  2262 |  5091 |  1968.38 |  2141.77 |  33244 |
| test-uniq-DEFAULT_COVNEW-3600-secs          |   56427498 |   991.00 | 85.16 | 92.94 |   652 |  4858 |   324.53 |   398.21 |  30263 |
| test-uniq-MIN_SIM_BRANCHES-3600-secs        |   59534576 |  3602.79 | 88.28 | 94.02 |   652 |  6492 |   395.37 |   467.02 |  33927 |
| test-uniq-MIN_SIM_BRANCHES_COVNEW-3600-secs |   60093089 |  3604.53 | 87.50 | 94.02 |   652 |  9651 |  1027.86 |  1129.14 |  32149 |
| test-uniq-RANDOM_PATH_SEARCH-3600-secs      |   26016740 |  3615.52 | 82.03 | 87.88 |   652 |  4206 |  1294.89 |  1470.91 |  32933 |
| Total (20)                                  | 1218778573 | 53581.52 | 39.86 | 48.71 | 59880 | 87735 | 16238.52 | 17701.84 | 446788 |

- fixed sendmail to get mail sending properly to outside of Stanford
  (using and hacked /etc/mail/access file to only
  allow mails to be sent from addresses (hopefully)

- committed my initial changes of my min-similar-branches searcher
  and diagnostics (--output-branch-cov-counts) into klee repo.

- Dawson seems to think that simple measures of line/branch coverage
  vs. time should be enough to get a sense of how well stuff is
  working, and we don't need more complicated metrics unless they're
  for diagnostic purposes.  Graph out how fast each achieves a given
  amount of coverage.
  - if the optimizations don't improve line coverage by much, then
    they're not worth doing

- began implementing searcher that tries to take adjacent pairs of
  branch decisions into account:

- however, when testing apps like 'ls' which contain LOTS of
  command-line option possibilities, it gets really stuck in option
  parsing :( gahhh, we don't have grammar-based input generation.
  - i guess dawson says that's what RWSets were about :/

- kicked off a set of 1-hour overnight runs on:
  pgbovine@cluestick ~/FT-Apps/coreutils-6.11/obj-klee/src
- created some scripts in FT-Apps/targeted-search to analyze
  run.branches and termStates.branches files in order to better
  compare efficacy of search strategies

- realized that results from different searchers might be skewed in
  terms of time performance because in some modes, we OUTPUT MORE
  TRACE DATA to log files (e.g., run.branches, termStates.branches)
  - fixed by always outputting the same amount of data regardless of
    the searcher used

- both min-similar-branches and min-adj-branches searchers sometimes
  dominate covnew searcher initially during the first 100 or so tests,
  but then covnew ends up catching up later, which suggests that a
  hybrid scheduling strategy might perhaps work well

- created FT-Apps/targeted-search/
and plotted out the branches taken by terminated states
  - upon first impressions, most states have an EXTREMELY similar
    (maybe even identical) prefix, perhaps corresponding to library or
    init. code

- 'formalized' the problem a bit as follows to dawson:

the problem as i see it is as follows:

you are given a set of terminated states T, each containing a sequence
of branches

you have a set of states S on your queue and need to pick one to run.
you want to pick the state in S that's "least similar" to all the states
in T, since that will give you the best shot (maybe) of covering new
code.  the problem is to define the scoring function that can score each
element in S against the elements of T.

the two searchers that i've implemented thus far are special-case
solutions to this problem:

1.) my original solution simply counts how many times a particular
branch appears in T
2.) the two-branch interleaving searcher counts how many times a
two-branch adjacent sequence appears in T
- removed --write-verbose-paths option from Klee cuz it's useless for
  me now and just clutters up the code

- re-factored code that deals with handling branch sequences to be a
  bit cleaner and more consistent

- added a Klee option called --only-track-two-way-branches
  to only track branch sequences for branches that can go both ways

- wow, if you only track branch sequences for internal forks and
  branches that only go both ways, then your sequences are actually
  quite short (a dozen or so branches rather than thousands)!

- implemented a Klee option called --only-count-covnew-state-branches
  to only count branch sequences for terminated states that have
  covered new code

- met with dawson and talked about ways to get me un-stuck on this
  search problem and jotted down lots of stuff in notebook

- started implementing a PGSearcher class to mess around with getting
  better intuition about how search works for realzz
  - use something like this to test:
    SEARCH="--switch-type=internal --use-pg-search" tr
- keep messing around with PGSearcher class to get a better feel for
  how search works and experiment with implementing some simple
  policies ...
  - use the following command-line to test:

pgbovine@hexapod ~/FT-Apps/coreutils-6.11/obj-klee/src
$ SEARCH="--switch-type=internal --use-pg-search
--output-branch-cov-counts --dump-states-on-halt=false" tr

compare with interleaved covnew + random:
$ tr --output-branch-cov-counts --dump-states-on-halt=false

- the biggest problem with using sets of adjacent pairs is that you
  quickly run out of unique unseen pairs ... perhaps do triples too?
  - at most there will be 1 pair that's unseen, just from my cursory

- kicked off a set of overnight runs on various benchmarks
on hexapod
- initial results from 30-minute runs of PGSearcher vs. default covnew
  search look promising!  PGSearcher terminates far fewer states than
  default covnew (since it's MUCH, MUCH slower in its current ugly
  unoptimized form), but for every state it terminates, it makes
  better progress in terms of coverage than covnew

results summary (for the LAST state that terminated for PGSearcher,
what is the coverage obtained by PGSearcher vs. covnew vs. random path):

  expr: PGSearcher terminated 3773 states
                   (# covered , # uncovered instructions)
      random path: (794,321)
      default:     (683,432)
      PGSearcher:  (796,319)

   ls: PGSearcher terminated 91 states
                   (# covered , # uncovered instructions)
      random path: (1641,4493)
      default:     (1688,4446)
      PGSearcher:  (1924,4210)

  ptx: PGSearcher terminated 251 states
                   (# covered , # uncovered instructions)
      random path: (1041,2194)
      default:     (1799,1436)
      PGSearcher:  (2098,1137)

sort: PGSearcher terminated 125 states
                   (# covered , # uncovered instructions)
      random path: (829,3978)
      default:     (1453,3354)
      PGSearcher:  (2137,2670)

  tr: PGSearcher terminated 1963 states
                   (# covered , # uncovered instructions)
      random path: (1347,915)
      default:     (1582,680)
      PGSearcher:  (1552,710)

uniq: PGSearcher terminated 2302 states
                   (# covered , # uncovered instructions)
      random path: (537,115)
      default:     (589,63)
      PGSearcher:  (595,57)

  - seems like for tests that already achieve high coverage,
    PGSearcher doesn't fare much better, but for ones with pathetic
    coverage, PGSearcher really shines (at least in the beginning)

- kicked-off a series of 5-hour runs only using PGSearcher to see how
  much coverage it can get if I run it for longer
- created FT-Apps/targeted-search/
  to plot statement coverage comparisons for termStates.branches files
  to compare different searchers in terms of search achieved per
  terminated state

- saved a copy of targeted-search results in:

- cristi pointed out that the coreutils tests used for the OSDI paper
  are all in EXE-Util-Tests/coreutils-6.10/exe-tests-3

- plotted out coverage graphs in

- started manually slicing the tr program as tr-slice.c in
  pgbovine@hexapod ~/FT-Apps/coreutils-6.11/src
  to see if I can get up to 100% coverage:

to compile with gcov, use:

cd ~/FT-Apps/coreutils-6.11/obj-gcov/src

gcc  -I. -I../lib -I.././../src  -I.././../lib   -g -fprofile-arcs
-ftest-coverage -MT tr-slice.o -MD -MP -MF .deps/wc.Tpo -c -o tr-slice.o

gcc  -g -fprofile-arcs -ftest-coverage  -Wl,--as-needed -o tr-slice
tr-slice.o ../lib/libcoreutils.a  ../lib/libcoreutils.a

to compile with llvm use:

cd ~/FT-Apps/coreutils-6.11/obj-klee/src

klee-gcc  -I. -I../lib -I.././../src  -I.././../lib   -g -MT tr-slice.o
-MD -MP -MF .deps/tr-slice.Tpo -c -o tr-slice.o .././../src/tr-slice.c

klee-gcc  -g  -Wl,--as-needed -o tr-slice tr-slice.o
../lib/libcoreutils.a  ../lib/libcoreutils.a ../lib/__setvbuf.o

- ugh, it seems like manually slicing is pretty damn hard :(
  - add klee_silent_exit(0) in as many places as covered code reaches
  as possible, in order to encourage Klee to zonk out early during
  those paths
  - ugh, it's really hard to figure out code what to truncate, because
  there are all sorts of dependencies

- ok, i'm giving up on the slicing idea for now since it seems
  non-trivial to get it to work well, and there are easier fish to fry
  right now

- at cristi's suggestion, used the --seed-out-dir klee option to
  replay seeds and see if my search can improve upon the best that
  covnew can do, since covnew flattens out after about an hour or so
  of trying hard :0

- added a --pg-search-covnew option to use in addition to
  --use-pg-search when you want to break ties when states fork (i.e.,
  when both sides end in two-branch sequences that have been seen
  before) by resorting to the CoveringNew heuristic
  - hmmmm, it seems just anecdotally from running 'tr', if I use
    --pg-search-covnew, it tends to get stuck on local maxima since
    the heuristic isn't perfect ... randomization might actually get
    me farther :0
- checked-out latest busybox from svn to see if I can still replicate
  cck bugs I found in:

- tried to replicate bugs found in cu-bb-cck-results.htm and first
  report the ones that are most likely to be mismatches via the
  busybox bug database under 'Standards Compliance' category:
  - made a note of reported bugs in:
      [[username/password omitted]]

- met for 2 hours with roger to answer more of his Klee questions and
  to get him started with a concrete project involving cross-checking
  student implementations of routers for the networking class that
  he's currently TAing
- saved results from --pg-search-covnew option in and looked at them:
  - doing --pg-search-covnew and results don't look nearly as good as
    random; it does generate more tests, though, but bottoms out kinda
    like default covnew search, so it sucks

- totally eliminated the --pg-search-covnew option since it seems
  useless for now ... can always revert in SVN later if I want to
  bring it back ... it just bloats up the code right now

- hmmmmmm, pg-search-covnew is crappier than regular pg-search in
  terms of coverage, but it does generate significantly greater
  numbers of tests, which makes me suspect that for some reason, the
  random picking of pg-search is picking shallower states which take
  longer to terminate, hmmm, dunno

- gonna focus efforts on trying to optimize default PGSearcher in
  order to boost number of tests generated per minute

- one issue with PGSearcher as it is currently implemented is that
  after a while, there will be no more unseen pairs and it will
  essentially devolve into a random search that favors SHORT paths, I
  guess with the hopes that it will go down 'different' paths :/

TIMEOUT=300 SEARCH="--switch-type=internal --use-pg-search
--output-branch-cov-counts --dump-states-on-halt=false" tr

- I am still concerned that after a while, ALL pairs would have been
  seen already, so we are back to square one again and just picking
  sorta randomly, favoring shallower paths ...
  - perhaps one solution is that after we've seen a few rounds with no
    states with any unseen branch pairs, clear out
    theStatisticManager->seenBranchPairs and start afresh again
    - or somehow use counts of how many branch pairs have been seen

- did a first-pass at optimizing PGSearcher and seemed to get pretty
  solid improvements in time performance, so kicked off an overnight
  run on 6 selected benchmarks
- after doing optimizations on PGSearcher, put results in:

- results look promising - optimized PGSearcher gets much farther than
  unoptimized PGSearcher and doesn't suffer in terms of coverage
  and is also sort of competitive with default covnew
  - it generates approximately the same coverage but much faster than
    covnew, but then covnew eventually catches up in most cases

- investigate why 'ls' and 'sort' bonk out early with errors in
  PGSearcher, 'sort' doing so almost immediately after launch
    See: EXE-Util-Tests/coreutils-targeted-search-results/2008-11-13

- 'ls' is complaining about too many open files, so up it somehow ...
  - add 'pgbovine hard nofile 100000' to /etc/security/limits.conf

ahhh clever, su as root and then su into your own username to run with
higher ulimit:

su - root
ulimit -n 100000
su - pgbovine
ulimit -n # check to make sure it's 100000 now

- to run 'ls', you MUST have a higher ulimit for number of open files!

- ah, it turns out that 'sort' was running out of memory due to a
  big-ass buffer.  The OSDI 2008 paper mentioned how they had to
  modify sort.c to make the buffer smaller, but cristi said that they
  didn't propagate the changes into GIT.  I made the change and it
  seems to work fine now:

diff --git a/coreutils-6.11/src/sort.c b/coreutils-6.11/src/sort.c
index 8b2eec5..0f9ade7 100644
--- a/coreutils-6.11/src/sort.c
+++ b/coreutils-6.11/src/sort.c
@@ -241,7 +241,9 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);
 static size_t sort_size;

 /* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+//#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+// pgbovine - changed to make smaller so that Klee doesn't run out of memory

 /* Array of directory names in which any temporary files are to be created. */
 static char const **temp_dirs;

- in order to try to find benchmarks on which Klee currently does
relatively badly on, do the following:

pgbovine@hexapod ~/EXE-Util-Tests/coreutils-6.10/exe-tests-3
$ --sort=SCov *-data

candidates for [[expletive]]-ness:
  ls, sort, head, tail, tac, wc, ptx

- do an overnight run comparing apples-to-apples of PGSearcher and
  default covnew for all of these apps:
    ls, sort, head, tail, tac, wc, ptx
  making sure to set ulimit -n to 100000 first
- added results of latest run of PGSearcher vs. default covnew
  searcher in EXE-Util-Tests/coreutils-targeted-search-results/2008-11-15

- from a cursory review of latest results, it seems like PGSearcher
  always gets decently high coverage fast but then PLATEAUS and
  flatlines like nuts while covnew at least attempts to cover new code
  and make more forward progress, albeit a bit slower, which suggests
  that there is potential for improvements

- optimized PGSearcher by taking into account the NUMBER of times the
  final branch pair of a state has been seen before by any
  terminated state, not just WHETHER it's been seen before

- curious, when forking at multi-way switches, each branch direction
  has a final branch sequence pair that has been seen an IDENTICAL
  number of times by previous terminated states ... could investigate
  more thoroughly
  - hmmm, they all have the SAME final branches - that's odd!!!
  - added a TODO in the code to look into that behavior

- kicked-off an overnight run with my 'improved' PGSearcher, but I'm
  afraid that it might get stuck on local maxima still and might do
  even worse :/

- read over cristi's job application materials on paper and made
  comments on them
- looked at attempt to 'improve' PGSearcher and results SUCKED!
  in EXE-Util-Tests/coreutils-targeted-search-results/2008-11-16

- made another run with original_behavior boolean set to TRUE
  in EXE-Util-Tests/coreutils-targeted-search-results/
  and, fortunately, results seemed to match those on 2008-11-15
  albeit with some small disparities due to non-determinism

- made another run with original_behavior set to TRUE only in
  selectState() but not in update() to see its effects
- for the run with with original_behavior set to TRUE only in
  selectState() but not in update() to see its effects, it looks
  pretty damn similar to original_behavior set to TRUE on both, which
  implies that the NEW behavior on selectState() is responsible for
  the massive suckage in performance

- in conclusion, don't try to use counts of number of times branch
  pairs are hit by terminated states, because they sort of suck :/
  maybe they keep getting you screwed over instead of letting
  randomness take its course

- manually inspect coverage on 'wc' and 'tac', which seem to both
  skyrocket to high coverage and plateau almost immediately
  - for 'wc', the main impediment to 100% coverage is the MB_CUR_MAX
  macro, which effectively makes a sizable block of code dead
    - also a few other error lines uncovered, which are minor
    - hmmm, PGSearcher doesn't cover the "case 'w': print_words = true"
    option while covnew does, hmmm
    - wow, for PGSearcher, wc achieves the plateau coverage in only 6
      minutes vs. covnew takes a LOT longer to get there
  -for 'tac', both PGSearcher and covnew reach plateau coverage pretty
    - boring error cases not covered, which could hopefully be covered
      with telling system calls to fail
    - tac doesn't cover a small block of code due to the file (or
      stdin) output being WAY smaller than the buffer, so there's no
      need to run code that writes out full buffers to stdout:

/* The number of bytes per atomic write. */
#define WRITESIZE 8192

static void
output (const char *start, const char *past_end)
  static char buffer[WRITESIZE];          <-- notice STATIC!
  static size_t bytes_in_buffer = 0;      <-- notice STATIC!
  size_t bytes_to_add = past_end - start;
  size_t bytes_available = WRITESIZE - bytes_in_buffer;

  if (start == 0)
      fwrite (buffer, 1, bytes_in_buffer, stdout);
      bytes_in_buffer = 0;

  /* Write out as many full buffers as possible. */ <-- NOT COVERED
  while (bytes_to_add >= bytes_available)
      memcpy (buffer + bytes_in_buffer, start, bytes_available);
      bytes_to_add -= bytes_available;
      start += bytes_available;
      fwrite (buffer, 1, WRITESIZE, stdout);
      bytes_in_buffer = 0;
      bytes_available = WRITESIZE;

  memcpy (buffer + bytes_in_buffer, start, bytes_to_add);
  bytes_in_buffer += bytes_to_add;

similar problem with other block of code in 'tac' which isn't being
covered ... we simply aren't reading BIG ENOUGH files

- crap, for 'head', we get decimated and stuck at a local maxima:
    PGSearcher: 26.70% of 382
    covnew:     77.23% of 382
what prevents even covnew from achieving 100%?
  - again, certain EXIT_FAILURE error cases arising from failing
    system calls
    - e.g, diagnose_copy_fd_failure() function only called when
      copy_fd fails, in order to print a useful error msg.
  - again, seems like some code that's not covered is code that deals
    with larger files (e.g., buffering issues), e.g., with comments

/* If there's not enough room, link the new buffer onto the end of
   the list, then either free up the oldest buffer for the next
   read if that would leave enough lines, or else malloc a new one.
   Some compaction mechanism is possible but probably not
   worthwhile.  */

- gave cristi in-person and on-paper comments on his job application

- perhaps going down shorter paths is better since there's more of a
  chance that execution diverges more down those

- need to somehow interleave with hunting for new code!!!
  or else will suck on something like 'head'

- ugh, it's not that useful to look at the zcov results because
  there's not a tight mapping to what's going on with search, so I'm
  gonna try to stare at branch sequences instead

- try to simplify branch sequences by detecting cycles and duplicates,
  etc. in order to try to get a really compact and easily
  understandable and comparable representation
  - FT-Apps/targeted-search/

- got Klee installed on IBM Thinkpad T60p
  • Cristi (the first author of the original Klee papers) was applying for tenure-track faculty jobs at that time.
- refactored and cleaned-up Klee code to reduce the entropy that my
  experimentation with search in the past month have caused

- added more details to termStates.branches and a
  --detailed-branch-seqs option to output full gory details about
  branch sequences

- ah, i realize that when a state branches in:
  ExecutionState *ExecutionState::branch()
  the FALSE side gets its coveredNew reset to False, which makes sense
  so that it can keep on hunting for new code to cover ... the TRUE
  side has coveredNew set to True so that when it terminates, it will
  cover that new code while the False side lingers.  hmmm
  - so that's why on some branch sequences, coveredNew is not
    monotonic - it starts at false, then gets set to true, then might
    jump back to false when it's on the 'false' side of a two-way

- hmmm, came up with an idea to tabulate numbers of branch pairs taken
  by states that have coveredNew and NOT coveredNew increment count
  for every coveredNew terminated state and decrement count for every
  NOT coveredNew terminated state and try to pick state to run whose
  branch pairs have the highest count
  - implemented it in PGSearcher and replaced the !original_behavior
    with using these tallies rather than the raw counts

- kicked off a set of 1-hour runs with improved PGSearcher using the
  above tally heuristics
- looked at results of tally enhancement in:
and results seem promising ... still getting killed by covnew on
'head', though, and some other benchmarks don't generate nearly as
many tests, but end up achieving really good coverage
- crap, but with tallies doesn't seem to work significantly better
  than without tallies and just having default PGSearcher and leaving
  some things to random chance :/
  - sometimes a bit better, sometimes a bit worse, but nothing
    shocking or breathtaking

- my tentative conclusion is that I really need to go after new code
  somehow one way or another ... need to smartly interleave with
  covnew ... perhaps starting with PGSearcher and then when things
  start to flatline, trying to kick-start it with looking for new
  code, and then switching back, etc?

- created a new searcher called PGSupervisedSearcher that wraps around
  PGSearcher and runs it until its performance starts to suck, and
  then switches to an alternate searcher ... invoke with this to
  alternate PGSearcher and the default 'best' searcher:

SEARCH="--switch-type=internal --randomize-fork --use-random-path
--use-interleaved-covnew-NURS --use-batching-search --batch-instructions
10000 --use- supervised-pg-search"
KLEE_EXTRA_ARGS="--dump-states-on-halt=false --output-branch-cov-counts" tr

- added a convenience option so can be invoked this way:

SEARCH="--switch-type=internal --randomize-fork
KLEE_EXTRA_ARGS="--dump-states-on-halt=false --output-branch-cov-counts" tail

- kicked off a 1-hour overnight run - wish me luck :)
  • I meant a set of 1-hour experimental runs on the server, which would probably total around 10 hours so I would have to wait overnight for results the next morning.
- looked at results of initial runs with PGSupervisedSearcher
  versus default search:

- all the 1-hour runs took the entire hour except:
  - test-tail-SUPERVISED_PG_SEARCH-3600-secs ended early in 20 minutes
    - can't find error msg., maybe out of memory???
  - test-wc-SUPERVISED_PG_SEARCH-3600-secs ended slightly early in 50
    - lots of STP timeouts but can't find error msg.

- results look great overall :)

- try to use matplotlib for plotting since PyX SUCKS!
  on fedora: sudo yum install python-matplotlib.i386
  on ubuntu: sudo apt-get install python-matplotlib

- created FT-Apps/targeted-search/
  script for using matplotlib to make pretty plots of
  termStates.branches files for comparison

- made some kick-ass PNG plots comparing default, pg-search, and
  supervised-pg-search!!!  see:

- kicked off another 1-hour overnight run, reduced the switching
  threshold down from 60 sec to 30 sec, and added more tracing print
  statements for diagnostics
- installed a brand-new 320GB hard drive in cluestick and re-installed
  Fedora Core 9 on it and put it back in machine room (its old hard
  drives were fried or something)

- met with Roger and went over how to get the router code from his
  class into shape to do cross-checking

- looked over results with PGSupervisedSearcher set to a 30-second
  threshold rather than 60 seconds in:
  and results look sorta mixed
  worse on:     head ptx wc
  better on:    ls tail
  approx. same: sort tac

- added dynamic time-slice adjustment between searchers based on how
  well each searcher is doing ... e.g., if searcher A terminates more
  states with new coverage than searcher B, then A is given a
  time-slice equal to 1.5x its previous slice and B is given a
  time-slice equal to (1/1.5)x its previous slice.  however, if
  neither terminates any new states with new coverage, reset both to
  the default starting value of 30 seconds

- kicked off another 1-hour overnight run
- looked at results of dynamically adjusting timeslices and they were
  sorta mixed ...

- a fixed time-slice of 60 seconds seems to do pretty well

- added FT-Apps/targeted-search/
  and make a HUGE-ass set of 1-hour runs for ALL coreutils benchmarks
- gave cristi comments and revisions on his OSDI 2008 presentation and
  on his job materials

December 2008

- wrote some scripts to analyze and summarize the results of 1-hour
  runs on ALL coreutils benchmarks to compare Supervised PGSearcher
  to default:
  summary HTML file in:
- realized that most benchmarks have their coverage 'flatten out'
  after a very short period of time, so pointless to run them for

- benchmarks that only run for a few seconds are insignificant, since
  it doesn't really matter which searcher did better
  • Note how my workflow involved analyzing each day's experimental data and producing HTML files (e.g., summary.html) with tables and visualizations of results. I would create a new subdirectory (e.g., coreutils-targeted-search-results/2008-11-29/) for each day's custom experiments. Nowadays using computational notebooks like Jupyter would probably be a cleaner way to organize these experiments (although I'd still need to organize a collection or hierarchy of notebooks).
- ran benchmarks concretely with to get actual line
  coverage rather than LLVM instruction coverage, in order to make a
  fairer apples-to-apples comparison

- taking cristi's suggestion, ran tests for Klee OSDI paper in
  augmented with Supervised PGSearcher tests in:
  to see if I could make improvements

- wrote and refined these scripts for looking at searcher comparisons:

Looking at:

- pretty cool - 9 benchmarks have a > 3% improvement over OSDI (when
  run with OSDI tests + PGSearcher)
    du, sort, pathchk, ptx, ls, chmod, fmt, paste, fold

- hmmm, I'm sort of disturbed that the 1-hour run with extra tracing
  enabled to generate termStates.branches doesn't fare as well as the
  Klee OSDI 1-hour runs sometimes, probably because the Klee OSDI
  1-hour runs didn't have any tracing overhead

- cristi is happy about some of the coverage improvements over OSDI,
  even though they were only a few percent ... since he knew that it
  would be hard to get them even by running the default searcher for
  longer time
  - pretty cool that some of the ones where OSDI had lower coverage
    did better with PGSearcher

- another interesting observation is that LLVM instr. coverage
  sometimes doesn't have anything to do with benchmark .c line

- added FT-Apps/util-scripts/
  to streamline the process of replaying .bout files and making zcov

- cristi (and dawson) suggests trying to study one benchmark in
  particular and try to figure out how to get near-100% coverage on it
  - so i'm gonna pick 'ls' and study it in detail

- looking into ls in detail:
  - lots of code in 'usage' isn't being covered since 'ls --help'
    never invoked
  - ls has a [[expletive]]-ton of options, and cristi suspects that we're not
    nearly covering all of them
    - in reality, we seem to be covering most, but definitely not all
  - some unreachable code due to MB_CUR_MAX environment variable,
    which we can't do anything about:
  - the color-related code seems not to be run at all :(

- hmmm, it seems like we can get greater diversity in behaviors if we
  favor divergence EARLIER during execution rather than later (e.g.,
  at the big-ass 'switch' statement that determines which command-line
  option to activate)

- maybe the problem is one of input generation?  seems like i can
  achieve pretty damn good coverage myself if I just slam in some
  inputs from a grammar ... hmmm
  • I wouldn't ever get around to grammar-based input generation for Klee, but I did end up using that technique when I helped on the Hampi paper (see later entries).
- looking into producing better Klee coverage output so that I can get
  a better intuitive sense of what the heck the searcher is doing
  - ExecutionState class has a coveredLines field, interesting!

- hmmm, for 'ls' seems like we're tracking coverage for ls.c and
  system.h but not for coreutils-6.11/lib/ or uclibc stuff

- FT-Apps/coreutils-6.11/lib/functions.txt doesn't seem completely
  up-to-date since some functions aren't probably being excluded from
  being tracked for coverage
  - ok, solved - added a handful of more functions from libcoreutils.a
    to exclude list!
  - ok now i'm more confident that LLVM instruction coverage is more
    indicative now of line coverage and can be used for fair
    comparisons :)

- there can be a disparity between LLVM instruction coverage and line
  coverage in benchmark .c file itself since, for instance, 'rm' uses
  both functions in rm.c and in remove.c - if we just measure rm.c, we
  might get really high coverage, but in both, it might get low
  coverage since remove.c is like a library function that can be used
  in many ways.
    look at those entries with low LLVM instruction coverage but HIGH
    line coverage in single benchmark .c file itself
    - e.g., rm, readlink

- hmmm, sometimes states terminate with covnew even though NO new
  instructions have been covered ... this seems like a bug
  - simple to reproduce on readlink - see termStates.branches
  - ahh i see, say a state S reached new code early during execution,
    so its coveredNew bit is set, but later a state S' derived from it
    might reach that same code (sensible since it shares a common
    prefix with S) and terminate earlier than S, thereby generating a
    test case that covered that 'new' code, so by the time S
    terminates, NO new code will have been covered that S' didn't
    already cover.
  - ahhh, but remember that when a state branches, the original keeps
    coveredNew and coveredLines set but the new copy must start out
    with a brand new empty set and work up from nothingness again
  - so perhaps this isn't really a bug after all :/

- hmmm, a simple hack could be to have Supervised PGSearcher run
  DEFAULT search FIRST and then switch to PGSearcher when it hasn't
  made adequate progress ... that way, hopefully it does no worse than
  default on more cases?
  - DONE: implemented as --put-pg-searcher-last command-line option

- modified FT-Apps/targeted-search/ to
  set timeouts for each benchmark based roughly on how long it
  actually took to run in:

- kicked-off a big-ass run of all coreutils comparing default,
  PGSupervisedSearcher and PGSupervisedSearcher with
  --put-pg-searcher-last option
- Created this wiki page to begin investigating a possible ISSTA paper
  submission comparing Klee tests to manual developer tests:

- created FT-Apps/coreutils-klee-vs-manual/ to compare manual tests to
  Klee-generated tests

- hmmm, need to somehow incorporate failures into the picture to get
  even better coverage, so re-run with failures like:
  and then zcov-merge the .zcov files

- met with Dawson to catch up:
  - he doesn't seem that into the idea of an ISSTA case study paper
    [[cut out private info]], but could still be do-able
    - however, he still encourages me to look through the manual
      developer tests to see if I can gleam some insights
  - he's more into improving the search more and doesn't want me to
    give up on search so easily.  take the benchmarks where i suck and
    try to improve them!

- created zcov/src/zcov-genhtml-comparison to compare coverage from
  two .zcov files

- uploaded comparisons of Klee and manual tests at:
- manually looked through some benchmarks in:
  and made notes here regarding first-pass of Klee vs. manual tests:

- stopped about half-way through after 'stat' (sorting by decreasing
  line coverage for manual tests) because I got tired!
- finished the work I started yesterday with looking through benchmarks in:
  and (tentatively) finalized notes here regarding first-pass of Klee
  vs. manual tests:
  - emailed out summary of my observations to cristi and dawson

- started working on draft for USENIX '09 short paper submission about
  Coverity bug reports and Linux

- updated keeda's nightly backups to keep 7 backups, one for each day
  of the week, just to be a tad more paranoid of backup trauma
  • My USENIX 2009 short paper, which was a resubmission of a rejected USENIX 2008 submission last year. I submitted it in Jan 2019, and it was accepted! I was super proud of this paper since it was my first publication from work that I (somewhat) independently initiated the prior year, even though it didn't end up in my Ph.D. dissertation. But it did help me get an MSR (Microsoft Research) internship with Tom Zimmermann (see later entries), which then inspired my dissertation work.
- worked on draft of USENIX 2009 short paper submission, tentatively
  finishing up the body sections and starting to write introduction
- went over the basics of the Klee codebase with Peter to get him
  oriented and on-board with the project

- keep working on draft of USENIX 2009 short paper submission, and
  completed first full draft (that's 3/4 of a page over the 6-page
  • Peter originally worked on Klee in 2006-2007 as a visiting masters student (my 1st year of Ph.D.) and then returned in 2008 as a new Ph.D. student in our lab. He was already familiar with the project and the people.
- set up new wiki acct for Peter B. and a new acct named PhilGuo for
  myself since the old one got screwy somehow ... also hacked the
  access control list to enable permissions for us
- finished first draft of USENIX submission and sent it out to some
  people for comments
- met with Roger to answer his questions and to give him pointers on
  setting up an automated framework for doing cross-checking on
  student router code

- read over Adam Kiezun's proposed paper draft on Hampi string
  constraint solver and some related work to begin thinking about
  whether I want to help out on that project
  • Adam roped me back into helping out on an Ardilla follow-up paper on a system called Hampi. Once again this wasn't part of my main Ph.D. work at all, but I was happy to help since I desperately wanted to feel like I was making progress on something, since my attempts at improving Klee's search algorithms were majorly failing.
- thought more about how I could help out with integrating Hampi
  string solver into Klee and perform a compelling evaluation
  - wrote email to Adam about my initial thoughts:
so i did some reading and brainstorming tonight, and here are my current

there seems to be 2 ways to use Hampi with a concolic test generator for
C programs (e.g., EXE):

- as a replacement for standard C library string functions, in order to
  speed up run-times (like MSR work you referred me to)
- for grammar-based input generation (like your PLDI work with patrice)

unfortunately, i can't think of anything compelling to do with either at
the moment.  C standard library string functions are quite small and map
pretty closely to STP bitvector constraints, so I don't know if we'll
get immense performance speed-ups with Hampi (perhaps we will, i dunno).
i'm afraid that it might actually slow down things since we need to
externally interface to Java code or via IPC.  thus far, solving hairy
string constraints hasn't seemed to be a limiting factor in the programs
that the latest version of EXE (which is now called "Klee") has been
tested on.  please skim the latest paper to see what we're currently

and as for grammar-based input generation, i suppose we could use that
to test compilers or pre-processors, but that would essentially be a
copy of what you guys did with the javascript interpreter for PLDI.

if i'm to go thru the effort of integrating Hampi and Klee, i'd want a
visible pay-off in terms of increased coverage achieved or time needed
to achieve equivalent coverage, at least for a certain class of
programs.  can you think of a type of program (e.g., what programs
process strings heavily and don't do much else???) for which this could
work well?  Klee is robust enough to handle sizable C programs (it
compiles them down to LLVM bytecode and then interprets the bytecode).

anyways please let me know your thoughts, and i'll keep brainstorming in
the meantime.


one more thing to ponder before i go to sleep ... here's one possible
use i just thought of:

Use Hampi as a clean and expressive way to write input filters for
concolic execution: Write constraints on the string input to a target
application in a human-readable form (CFGs, regexps, etc.), use Hampi to
compile those down to STP constraints, and then feed those into Klee
(EXE) as pre-conditions for the input.

Currently we run Klee on Linux command-line apps that take as input
either a string from stdin or argv[].  these inputs are totally
unconstrained, so we hit many error paths in the code for invalid
inputs.  we can write C code to manually filter invalid inputs to avoid
exploring error paths, but that's laborious and infeasible for anything
but super simple filters (e.g., all letters should be [a-zA-Z] or insert
3 newlines in input).  We could pitch Hampi as a compiler that takes
human-readable string constraints and compiles them down to STP
constraints, thus making it easier to write input filters for concolic
executors.  The evaluation would be to show how writing really simple
and small input filters can lead to better results during concolic

  - also emailed Daniel and Cristic to ask for their input

- installed Hampi on laptop and started trying to play around with it
- started thinking more seriously about pursuing side project related
  to graph mining of conferences and citations by their papers to
  papers in other conferences
  - emailed Sunghun Kim at MIT to ask for his suggestions
- decided to put side project related to graph mining of conferences
  and citations on hold since the datasets kind of suck ...
  here are my notes for reference, though:

Mine ACM database to get proceedings of conferences/journals and which
universities have papers in them ... can trace academic geneaology of
authors too
- hmmm an interesting trend would be to see people in lower-ranked
  conferences/journals and how they reference works in higher-ranked
  conferences/journals and vice versa - there seem to be different
  sub-communities even within the same field
  - hypothesis: lower-ranked papers link up to higher-ranked papers
    more frequently than vice versa
  - hypothesis: there is great clustering in conferences since people
    from, say, top-tier schools form 'communities' around certain
- related work:
    - How can we investigate citation behavior?: a study of reasons
      for citing literature in communication
  [[username and password]]
- citeseerx has an XML option!
- the problem with citeseerx is that it doesn't have complete
  proceedings :(  ugh.  ACM Portal has full proceedings but isn't
  as easily parsable
- ehhh can't seem to get a good enough dataset to show what I want to
  show ... in particular, lower-ranked places don't seem to have as
  reliable citations, etc.
- read some papers related to using grammars to deal with input
  generation for concolic execution and thought more about how to
  integrate Klee and Hampi

- received comments and critiques from cristic in-person about my
  USENIX short paper submission

- talked to adam kiezun on phone about how i can contribute to his
  Hampi paper and decided to give a shot to using it to generate input
  filters for klee

- started trying to get hampi to print out STP constraints in
  STP-parsable form but not succeeding that well, emailed adam and
  vijay asking for help
  - Klee interacts with STP via its c_interface (and shared memory if
    it forks a separate STP process), so if i can hack c_interface to
    add the appropriate functionality for seeding with constraints,
    then i might be in good shape
  • I decided to take the plunge on helping out on Hampi since it would allow me to apply Klee to help out another system. In fact, I did publish research based on Klee, but it just ended up in the Hampi paper rather than my own independent Klee paper :) But I'm very proud of the fact that my experiences working so closely with Klee gave me the idea of how to improve its performance by integrating it with Hampi.
  • Vijay was a co-author on the Hampi paper and had also previously worked on an earlier version of Klee called EXE. Small world.
  • STP, a constraint solver project led by Vijay, was used as a component in both Hampi and Klee.
- talked to vijay on phone and also received his email, and he agreed
  to add an option to STP so that it can read in a constraints file
  (with no QUERY statements) from a file and parse it into an internal
  data structure, ready for Klee to add new constraints onto it

- started spelunkering (spelling?) thru the Klee code dealing with
  constraints, looking for possible places that I could insert code
  dealing with Hampi pre-conditions

  - lib/Solver.cpp - STPSolver::computeInitialValues()
    - also runAndGetCex() and runAndGetCexForked()
    - check out DebugPrintQueries option

  - Ah, I see: Klee adds a bunch of STP ASSERTs (one for each
    constraint) and then does an STP QUERY( FALSE ) to see whether the
    constraints are unsatisfiable.  If so, then that means that the
    original asserts were all satisfiable and there's a concrete
      ASSERT (x > 0)
      ASSERT (x < 5)
    This is conceptually ((x > 0) ^ (x < 5)) -> FALSE
    so the only way this entire formula is satisfiable is if the LHS
    is FALSE, which is impossible, so the entire formula is
    unsatisfiable, with a counterexample of x = 3 making the LHS TRUE,
    thus making TRUE -> FALSE, which is impossible

  - the big question is how to properly name the input string buffer
    from Klee properly so that it matches the name of the Hampi
    constraints on it ... hmmm, I could just manually do
    klee_make_symbolic_name() instead of relying on a symbolic argv[]
    provided by one of Klee's standard models - CUTE/DART (and their
    spin-offs) don't seem to have argv[] models, so they just
    explicitly make some input string buffer symbolic.
    - hmm klee_make_symbolic_name() doesn't seem to make a special
      name in the .cvc file, only in the .bout file, so that might not
      work to canonicalize the name

  - lib/util/STPBuilder.cpp builds up STP expressions from Klee
    - STPBuilder::buildVar(), STPBuilder::buildArray()

  - lib/SpecialFunctionHandler.cpp SpecialFunctionHandler::handleMakeSymbolic()
    contains code to handle when klee_make_symbolic() is called from
    the target program ... this then calls Executor::executeMakeSymbolic()

  - it seems like a singleton Klee STPSolver object is created and
    persists, being destroyed at the end of execution.

write email to klee-dev asking about how klee uses the STP C API:
Hi guys,

I'm trying to experiment with how Klee uses STP, so I started by digging
through the code, and now I have some questions.  Here's my current
impression of how Klee interacts with the STP c_interface API:

// execution begins
VC vc = vc_createValidityChecker()

// ??? presumably some objects are made symbolic here ???

// During execution, Klee builds up constraints in its own Expr data
// structures and keeps them along with each state as its path condition

// execute one STP query:
at each branch (or other place where STP is called):

  for c in current_state.constraints:
    vc_assertFormula(vc, c)

  res = vc_query(vc, FALSE)

  // find concrete counterexample if query unsat.:
  if res == UNSATISFIABLE:
    for object in symbolic objects:
      object.concrete_value = vc_getCounterExample(vc, object)


// execution ends


I have the following questions:
  - what's the effect on STP of making a value symbolic from within
    Klee?  If I make, say, a 10-byte array symbolic, what does that
    translate to in STP?  I presume a variable declaration, right?  What
    API call does this?

  - i noticed that if i do a simple klee_make_symbolic_name(), even with
    a sensible name:

      char x;
      klee_make_symbolic_name(&x, sizeof(x), "input");

    the .cvc file still prints out that variable with a
    garbled name like:

      arr149  : ARRAY BITVECTOR(32) OF BITVECTOR(8);

      - first, is there a way to make these names friendlier than arr149
        (or to get the names in some deterministic manner)
      - second, why is this variable a ARRAY BITVECTOR(32) OF
        BITVECTOR(8)?  I can understand making it BITVECTOR(8) since
        it's a 'char' variable, but why ARRAY BITVECTOR(32)?

  - if there is only one global vc object that persists throughout
    execution, how does its state 'reset' between queries?  is that what
    vc_push() and vc_pop() does?

  - is my mental model missing something significant about how Klee uses
    STP?  are there important functions I'm ignoring?

Thanks in advance for your patience with my n00b questions :)


  • In order to integrate Klee with Hampi, I had to understand not only how Klee worked but also how it interfaced with STP, the constraint solver that both it and Hampi were built upon.
- helped sbkim debug annoying configuration and compile errors on
  cluestick via email ... I think I fixed the problem, though, see my
  patch for further explanation:

+ pgbovine: arg, newer kernel versions have an inotify.h header
+ that leads to compile errors of the following form:
+In file included from /usr/include/asm/fcntl.h:2,
+             from /usr/include/linux/fcntl.h:5,
+             from /usr/include/linux/inotify.h:12,
+             from /home/sbkim/FT-Apps/busybox-08-12-08/miscutils/inotifyd.c:32:
+/usr/include/asm-generic/fcntl.h:117: error: redefinition of a header
+See here for solution, which is to use sys/inotify.h instead:
+The latest SVN busybox should compile properly, but since we're sticking
+with an old version, this patch is necessary :/

- updated the CrossChecking wiki page to add yet more GIT repositories
  that people need to check out before FT-Apps will work ... sigh
  dependencies :/

- played around with creating small buffers and manually using
  klee_silent_exit() in the form of:

#define silent_assert(expr) if(!(expr)) {klee_silent_exit(0);}

  and klee_assume() to constrain inputs.
  - found the curious fact that whenever you have a disjunction (OR
    clause), Klee will fork and create an additional test

  - after trying to manually write a simple STP -> Klee C code
    converter in Python, decided that it would probably be easier to
    work off of the official STP grammar
    - 'let' expressions really work like let* in Scheme - they can be
      nested, so simply compile those down into intermediate variable
      assignments in C
  • sbkim = Seungbeom
  • I used our lab's internal wiki a ton to document software setup instructions and other idiosyncrasies of how our research software worked.
- incorporated comments from cristic and drayside into my USENIX '09
- made minor edits to USENIX '09 draft to trim things down a bit and
  then sent it out to dawson for comments; won't plan to work on it
  until the new year
- read up on bison parser generator and started looking into how to
  write a STP -> Klee C code converter by leveraging STP's built-in
  parser.  I think a wise strategy would be to let STP's parser
  convert to AST, and then somehow traverse that AST and print out C
  - oh wow STP already has LispPrint() in AST class for
    pretty-printing, could write a KleeCPrint() or something
    - LispPrint1(ostream &os, int indentation) is the real meat
    - ASTNode::PL_Print(ostream &os, int indentation) prints out in
      PRESENTATION LANGUAGE, which is what we think of as the STP
      format, I think

- hacked stp/AST/AST.cpp to add a new Klee_Print() function to print
  expressions in a form that Klee will like

- added a test driver in ~/stp/c-api-tests, to compile:
    g++ -I$HOME/stp/c_interface stp-to-klee-test.c -L$HOME/stp/lib -lstp
    -o stp2klee

- got basic printing working that translates simple expression without
  'let' definitions:

  bv0 : BITVECTOR(32);
  ASSERT(bv0[7:0] = 0hex66
   AND bv0[15:8] = 0hex6F

   AND bv0[23:16] = 0hex6F


  assert((((bv0[0] == 0x66) && (bv0[1] == 0x6F)) && (bv0[2] == 0x6F)))

- got 'let' definitions working and have a decently working prototype
  in: stp/c-api-tests/cvc-to-c.c

- created my own backup script in:
- started investigating some possible bugs and misunderstandings in my
  .cvc to C translation, namely dealing with multi-byte assignments
  and comparisons (I was naively using single-byte operators)

- hmmm, consolidating several STP arrays into one seems to generate
  fairly decent strings belonging to grammar, but it still results in
  lots of bogus \x00 and \x7f values that could be annoying.
  I guess if you cut out trailing \x00 (null bytes), it's not as
  annoying, but there are still leading null bytes.  i guess it's cuz
  the grammars only specify that certain SUBSTRINGS match, not
  PREFIXES (maybe?)
- implemented a more generalized form of memcmp for multi-byte '='
  since we might span multiple bytes: e.g.,:
    "let_k_74 = bv1[79:0]"
  (use memcmp return values for implementing multi-byte >, <,
   etc? or punt on them for now since the semantics aren't clear)

  and a generalized memcpy for multi-byte LET assignments:
    "LET ... let_k_74 = bv0[79:0]"

- crap it seems like Klee works fine with ONE hampi assert but does
  funny things when there are several asserts, since they each create
  another bv* variable

- crap still has lots of funny null bytes, something is still not
  right here ... perhaps i made some mistake in translating from cvc
  to C code

- read over ICSE 2009 reviews and adam's rebuttals draft and made my
  own semi-smart comments in response, since Adam needs to send out
  rebuttals to reviewers soon
  • Rebuttals are author responses to initial peer reviews of a paper submission.
- realized that when you make assignments to symbolic variables based
  on boolean values, Klee will try to negate those in order to
  generate alternates ... e.g.,:

  unsigned char bv0[10];
  klee_make_symbolic_name(bv0, 10, "bv0");
  unsigned char poo = ((bv0[4] == 'a') || (bv0[4] == 'b'));

  it will create 3 tests, one where bv0[4] == 'a' (poo True), one
  where bv0[4] == 'b' (poo True), and one where bv0[4] == '\0' (poo

  - the effect of this is that the number of possible paths quickly
    explodes since there are many such boolean assignments, and Klee
    will try to go down paths to satisfy all of them

    - ha, but if I do a klee_silent_exit() when the condition is
      false, then it won't generate a ton of test cases ... hmmm

- hmmm, perhaps one solution is to write simpler Hampi input files so
  that simpler STP constraints are generated
  - if you just have ONE assert in your Hampi input file, things seem
    much cleaner, since you're just specifying, say, a CFL

- made minor enhancements to STP and C code output generation (e.g.,
  putting in variable and size declarations) in order to further
  automate the process of converting from Hampi to C code for Klee

January 2009

- gave adam kiezun a second round of comments on our ICSE 2009
  submission rebuttal
- re-organized files relevant to Hampi->Klee in
  ~/MyDocuments/Desktop-files/issta09-tmp/HampiKlee GIT repo.

- ideas for target programs (if grammars are mad-ass complicated or
  inputs are too big, Klee will fork too many paths and will get poor
  coverage, in my estimation)
  - ideally, want something where we get stuck in front-end error
    processing code without a grammar but do well with a grammar
    - perhaps look at what programs related papers have used:
      Majumdar and Xu ASE 2007: bc, logictree, cuetools, lua, wuftpd
        they used existing yacc/lex grammars, but we need to re-write
        in Hampi syntax

- added bc to FT-Apps, get it to compile for Klee, and try to alter it
  so that it takes input from a symbolic buffer rather than argv[],
  just to avoid having to use the standard environment model

- shieeettt, when I try to run bc, Klee croaks with the following error:

DEMAND FAILED:Executor.cpp:initializeGlobals:416: lib_begin() ==
m->lib_end()> is false: "XXX do not support dependent libraries"

perhaps because it includes -lfl (Flex library)???

woohoo, if I just exclude -lfl from the linker invocation command in, it works file :)

- added -DKLEE_BC flag when compiling for Klee, so that I can use
  #ifdefs in code

- piping an stdin buffer into bc seems to trigger it to evaluate one
statement and then exit, which is something that we want I think

- added ~/FT-Apps/bc-1.06/ to facilitate running Klee
  with a symbolic stdin

- good, good, running with with REALLY small-sized
  stdin, such as 1-byte, causes lots of parse errors and illegal
  characters and stuff, which is something that using Hampi can
  hopefully avoid :)  e.g.,:

illegal character: ^�
(standard_in) 1: illegal character: ^�
(standard_in) 1: illegal character: ^�
(standard_in) 1: parse error
(standard_in) 1: illegal character: (standard_in) 1:
KLEE: increased time budget from 14.5564 to 16.5114
illegal character: \177

- need to hack ~/klee/models/simple/ to restrict symbolic stdin buffer
  contents - woohoo added a filter_stdin.c file that can be
  programmatically generated

- kicked-off an overnight run on hexapod:~/FT-Apps/bc-1.06/
  to see how much performance sucks without an input filter

- received comments from Dawson regarding my USENIX '09 draft and
  placed in:
- edit USENIX '09 paper draft to address dawson's comments

- [[EXPLETIVE]]! I just switched to the officially-sanctioned stylesheet and
  my paper inflated by a whole friggin' page - shieeeetttttt!

- sent out a draft to drayside and dawson for review

- made a pass editing camera-ready of CHI 2009 paper and sent
  revisions to joel
  • We finalized the accepted CHI 2009 paper camera-ready version around this time.
- printed out USENIX '09 paper draft and read it once more, making
  on-paper edits and transferring them to LaTeX; now it's trimmed down
  almost to 6 pages

- moved 2009-01-05 results of running bc with NO FILTER to:
  and checked into GIT for later analysis

- started creating a Hampi grammar for bc in:

- one problem I noticed right off the bat is that Hampi simplifies the
  constraints too much before sending them to STP, so that it's
  adequate for finding one particular solution, but not for generating
  ALL possibilities - in particular, it chooses only ONE option for
  - emailed Adam about it and he responded pretty quickly with:
  Try changing,
  which seemed to work

- created a convenience script to convert Hampi -> C code:

- [[expletive]], Klee is hitting MEMORY LIMITS with 10-byte inputs, maybe make
  inputs smaller?  phew, 6-byte inputs seem to work fine

- kicked-off a 1-hour run of bc on hexapod with 6-byte inputs, which
  seems to be the sweet spot

- hmmm, multiple negations like '---' are causes of syntax errors and
  - gotta refine the grammar more based on bc.y

- also, lots of really 'boring' strings being generated - should I
  specialize the grammar a bit to get it to generate more interesting

- added a utility script to replay .bout files for bc:

Do this to properly run gcov (the -o option is key since object files
live in different places as .c files):
  pgbovine@hexapod ~/FT-Apps/bc-1.06/bc
  $ gcov bc.c -o ../obj-gcov/bc/
  # for some reason, it doesn't work well when you do *.c

- cool, for 6-byte input, with filter, it gets 18.67% of bc.y code
  executed, but without filter, it gets 12.35% executed, a 51%
  - bc.y implements the majority of 'central' code for bc, delegating
    out to functions in other files when necessary
- trimmed down USENIX '09 paper draft back to exactly 6 pages and
  pretty much finalized

- started learning to use PLY, since I want to eventually create a
  parser to transform .y files into .hmp files
  - got an initial quick-and-dirty version working within a few hrs.,
    pretty sweet
- final edits on USENIX '09 paper draft and SUBMIT
- played with cleaning up my Yacc->Hampi converter a bit more, but ...

- discovered that the PLY distribution has a Yacc parser example!
  ughhh ~/Desktop/ply-2.5/example/yply
  - ok, hacked yply to get a Yacc->Hampi converter sorta working:

- got a decent grammar going for bc in:

- kicked off a new 1-hour run with filter and 6-byte input in:

- stashed away results and replayed them for coverage using:
pgbovine@hexapod ~/FT-Apps/bc-1.06

$ python obj-gcov/bc/bc

- yes, initial results look promising - 26% coverage in bc.y using
  filter versus 12% without filter
  - as expected, without filter, hits more error-handling code

- added a script to
  hexapod:~/FT-Apps/bc-1.06/ to automate the coverage comparison

emailed Adam early results on bc:
i've got some early numbers for bc, which are quite encouraging.  i'm
running for 1 hour using a 6-byte symbolic input string for now - if we
go much bigger, Klee tends to barf and run out of memory :(  i've only
included coverage for the 'interesting' (non-trivial) files

NO filter:
(most inputs illegal and rejected - i can count exact number later)
27 inputs generated

File 'bc.y'
Lines executed:12.35% of 332
bc.y:creating 'bc.y.gcov'

File '.././../bc/execute.c'
Lines executed:6.05% of 380
.././../bc/execute.c:creating 'execute.c.gcov'

File '.././../bc/load.c'
Lines executed:17.16% of 134
.././../bc/load.c:creating 'load.c.gcov'

File '.././../bc/storage.c'
Lines executed:23.02% of 404
.././../bc/storage.c:creating 'storage.c.gcov'

File '.././../bc/util.c'
Lines executed:31.45% of 337
.././../bc/util.c:creating 'util.c.gcov'

WITH filter provided by Hampi grammar:
(most inputs legal and accepted, with some rejected due to semantic errors)
1434 inputs generated

File 'bc.y'
Lines executed:26.81% of 332
bc.y:creating 'bc.y.gcov'

File '.././../bc/execute.c'
Lines executed:37.11% of 380
.././../bc/execute.c:creating 'execute.c.gcov'

File '.././../bc/load.c'
Lines executed:40.30% of 134
.././../bc/load.c:creating 'load.c.gcov'

File '.././../bc/storage.c'
Lines executed:39.36% of 404
.././../bc/storage.c:creating 'storage.c.gcov'

File '.././../bc/util.c'
Lines executed:41.25% of 337
.././../bc/util.c:creating 'util.c.gcov'

improvements are noticeable :)  i'll keep hacking on this tomorrow,
perhaps trying different input sizes.

- also, it's a significant point that SO MANY more inputs were
  generated with the filter - because without filter, you tend to get
  stuck and [[expletive]] (maybe in big parsing tables???  ask cristi)

- kicked-off an 8-byte 1-hour run in hexapod ~/FT-Apps/bc-1.06
- At Adam's suggestion, created filters for all input sizes from 1 to
  10 bytes and run Klee separately on each to measure coverage:

- kicked off a set of 1-hour runs from 1 to 10 input bytes on hexapod using

- following on the examples used in the Directed Test Generation using
  Symbolic Grammars paper, downloaded logictree and tried to set it up
  to run with Klee:

- wrote a first (super rough) draft describing my Hampi/Klee
  experiments and emailed it off to Adam
  • By early January 2009 I had submitted my own USENIX short paper and was working mostly full-time on the Hampi paper by integrating Klee with it.
- kicked-off a set of 1-hour runs from 1 to 30 input bytes on hexapod
  and NO FILTER using ~/FT-Apps/logictree-0.2.0

- added cuetools program to test suite, so now we have 3 of the 5
  programs that the symbolic grammars ASE paper uses

  - we will test the cueprint utility within the suite (1 of 3 utils)
    in order to take input from stdin, cueprint must be run with:
      ./cueprint -i cue
    to specify that it should parse the .cue format - I'm gonna
    hard-code that in for running with Klee - DONE

  hmmm, perhaps a more interesting thing to run is the converter from
  cue to toc format - could involve more code to be run:
    default invocation: ./cueconvert -i cue -o toc
  hacked to translate from cue to toc by default so that it can simply
  take input from stdin
- converted logictree CFG from Yacc -> Hampi and played around with
  different input sizes and test to see what's the largest that
  doesn't run out of memory and terminate early
  - early terminations such cuz they result in 'incomplete'
    (i.e., underconstrained) inputs

  - decided to generate logictree input filters of sizes from 1 to 7
    bytes, which seems to work reasonably well
    - kicked off a set of 1-hour runs on hexapod

- moving onto working on cueconvert (from the cuetools suite), which
  is a program that converts .cue files to .toc files for music

  - tried to convert .cue file CFG from Yacc -> Hampi but got STUCK
    because I found a bug where Hampi thinks this is UNSAT:

var v : 10;

cfg program := "FOO" |
    "FOO" trailing_spaces;

cfg trailing_spaces := " "
    | trailing_spaces " ";

// Hampi can only deal with bounded CFGs
reg Pbound := fix(program, 10);
assert v in Pbound;
emailed to Adam, and he said he'd fix it

- met informally with dawson and peter, and it seems like dawson is
  pumped to have peter and me working on the underconstrained
  (tainted) execution stuff that he thought about way back in 2007 -
  he thinks that it's a great way to up the bug counts using symbolic
  execution versus static tools (right now symbolic execution can't
  find nearly as many bugs as static)
  - dawson suggested starting to check small pieces of code, such as
    library code, that are fairly stand-alone and to try to re-find
    the old bugs I found using FT back in 2007

- woohoo, adam fixed that bug, which had to do with Hampi not properly
  handling multi-character terminals (it thinks they're all

- crap, Hampi doesn't seem to support newline characters when
  specified as "\n", so a hack I'll do is to instead use the '#'
  character and then search-and-replace the hex values in the
  resulting C file before feeding into Klee.

- tried to simplify the cue.hmp grammar a bit so that it can scale

- kicked off runs of cueconvert both WITH and WITHOUT input filters
  • After my Klee search algorithm experiments from the past two months failed (and after my own USENIX and Hampi work were done-ish), I decided to give Klee one more shot with something completely different. Peter, who recently re-joined our lab, and I planned to work together on Klee-UC (underconstrained execution) from Jan-March 2009, but that ended up never getting submitted either. Afterward both of us quit working on Klee altogether, and Peter left the Ph.D. program a bit later.
- initial runs for Hampi/Klee experiments have settled, and started
  writing scripts in FT-Apps/grammar-based-input-filtering to analyze
  the results

- ah crap, i realized that bc doesn't accept CAPITAL letters for
  variable names, so have to tweak the grammar a bit and re-run

- whoa I think Klee found some infinite loops in logictree:
  echo '@x $x x' | ./logictree
  echo '@x $y x' | ./logictree
  echo '@x $y z' | ./logictree

  @ means FOR ALL, $ means THERE EXISTS

  wrote an email to Adam, Vijay, and David Dill asking about these
  weird infinite loops, as well as to the developer:
    Thadeu Cascardo [[email omitted]]

  both Vijay and David Dill seem to think that it's not supposed to
  infinite loop, so I will file it as a bug!

- re-run bc with filter from 1 to 6 bytes with tweaked grammar (to
  reflect the fact that capital letters aren't legal input)

- set up new Thinkpad T61 laptop and updated my hacked versions of
  Hampi and STP to the latest versions (and merged conflicts)
  • Finding bugs in research software prototypes (from both our own lab and others' labs) while trying to make progress on my research was also a recurring activity that year.
- met with roger to answer some of his questions regarding setting up
  the Klee cross-checking framework for student router code; he seems
  to be making good consistent progress

- collected results of new set of bc runs with tweaked grammar (that
  eliminates those pesky illegal capital letters)

- got coverage numbers in:
  and wrote more scripts to analyze them

- started writing up results table in Hampi paper itself
- keep writing up the Klee evaluation section in the Hampi paper draft
  - finished first draft of my section and emailed other authors
  • I wrote only the Klee evaluation section of the Hampi paper but wasn't too involved toward the end of the month as it was finalized and submitted.
- started making first full pass through the Hampi paper draft
- more light editing of Hampi paper draft

- re-read ISSTA '07 invited paper on under-constrained execution and
  started thinking about how to implement it in Klee
  • Dawson and Daniel's ISSTA 2007 invited paper, where they first laid out the high-level vision for Klee-UC. Daniel left the Ph.D. program early so he didn't have the time to implement this original vision. It was now up to Peter and me to turn this vision into a working system that we could hopefully publish.
- another pass at editing Hampi paper draft
- met with Peter and Dawson about starting to hack on
  under-constrained stuff

- went over some of the Klee code with Peter to get an understanding
  of how to begin implementing underconstrained execution

- oh, wow, there's already a klee_under_constrained() function
- and look in lib/ExecutorReferents.cpp for functions like:
    MemoryObject * Executor::lazyAllocate()

- started hacking on a new implementation of underconstrained
  execution from 'scratch'
- try to trace thru Klee code implementing load/store operations to
  get a sense of how things work, semi-painful :0

- gave derek rayside some feedback on his job application materials

- sit down with peter to look more at klee code before starting to
  implement underconstrained execution, especially get experience with
  looking at disassembled LLVM bytecodes to get a sense of how
  compilation works from C -> LLVM

- made up an initial set of regression tests for underconstrained

- Decide that Executor.cpp, esp. the handler for
  Instruction::GetElementPtr, is a great place to start hacking

- Made a final read-through of CHI 2009 camera-ready and gave minor
  comments to joel, so that he can finalize it
- cleaned-up STP -> C printing code to prep for an SVN check-in, but
  won't have the green light from vijay to check in until after ISSTA
  submission on Jan 30, for fear of rocking the boat

- start reviewing ECOOP 2009 paper submission for Viktor Kuncak

- work with peter on whiteboard to further solidify our understanding
  of what we want underconstrained execution to do on some simple test
  cases, and then start hacking on it

- woohoo got the first thing working, tracking what addresses are
  underconstrained and stuff ...
- created a new SVN branch called klee-pp-underconstrained so that
  peter and i can check in our changes to it without disrupting the
  rest of the tree

- keep hacking on underconstrained implementation

- met with dawson, roger, cristi, and seungbeom to orient seungbeom on

- get seungbeom set up in Gates 326

- installed Klee on new Thinkpad laptop and updated public wiki page
  for Klee installation instructions
  • Seungbeom was a new Ph.D. student who worked on Klee for a few years.
- finished ECOOP 2009 paper submission review for Viktor and emailed
  it to him
Read over email from Peter regarding UC implementation and ponder it:
I did do some more 'destructive thinking' Friday (sorry, I generally
can't start hacking until I have a clear picture in my head of how and
why this is gonna work ... which can be a bad thing, because it prevents
me from getting started, but that's why I got you ;). Anyway, I'm not
sure the is_base bit is the 'right' solution. The problem is when you
deal with something alike the p++ case (it should be equivalent to the
p++ case, but I think this example is more clear):

int *p, *q;
q = p + 4;
q[3] = 42;

The result of the 'p + 4' somehow gets stored into 'q'. This happens by
making 'q' symbolic and then adding the 'read(p) + 4' expression as an
update in q's update list. Next, you create a GEP(read(q), 3) and try to
dereference that... You see the problem? What's the base here? 'q'?
Ultimately 'p' of course...

So these update lists make the expression trees more complicated then
you'd think at first. Basically you have to traverse all the updates
recursively to get the complete tree. I'm not sure whether this really
complicates our ideas ... maybe not.

I'm not sure whether hardcoding an is_base bit is going to give you the
right answer all the time. Different objects can be used differently in
different expressions, but if this is_base bit is set somewhere in the
update list you might incorrectly consider that object the base in all
expressions that are built on top of it. Or maybe I'm just completely
wrong :) Just trying to understand why Dawson and Daniel believed they
needed these referents and what they are supposed to do...

So much for this monologue...
- made this into a regression test called derived-ptr.c
  - should I mark 'q' as UC also because 'p' is UC (or not)?

- look around the Klee code more to figure out how expressions and
  update lists are printed
  hmmmm, UpdateList objects have references to the corresponding
  MemoryObject, which can be used to get block size?
  - lib/Updates.cpp deals with update lists and friends, but not much
    interesting there
  - uses ExprPPrinter::printOne to print one Expr
    - update lists are printed when printing read expressions
    - when you print out something like arr8022, that means the update
      list is EMPTY and the ROOT MemoryObject has an ID of 8022

  - ObjectState knownSymbolics gets set during write operations

- start implementing is_GEP_base bit in Expr as an early hack
for detecting base expressions in GEP

- brainstormed more on paper and wrote email to Peter to better
  organize my thoughts:

don't worry about groking these right now ... i'll hopefully go over
them with you tomorrow afternoon if i'm awake (just wanted to dump some
hand-written notes to digital format to collect my thoughts)

Our current plan regarding dealing with UC data: "Whenever there is a
LOAD or STORE involving dereferencing an address whose base and/or
offset contains a READ of UC data, then we must take some action or else
the program will most likely crash"
  - this allows us to defer taking action until the last possible
    moment, that is, when the pointer needs to be dereferenced

I think this is true ... GEP (special Add) Expr nodes contain 2 Expr
children, each of which can be one of 3 types:
  - a constant
  - a read of data (e.g., a 32-bit read is actually a Concat of 4 8-bit
    reads), which can be concrete, symbolic, or symbolic+UC
  - a GEP node I don't believe that anything else is possible because
    LLVM exclusively uses GEP operations to implement address arithmetic

Therefore, given an arbitrary tree involving GEP nodes, how should we
handle reads of UC data?

For a GEP node N with base node N_b and offset node N_o:
  - If N_b is a read of UC data, then create a new memory block, mark
    all its bits symbolic+UC, change N_b to a constant that points to
    the new block, and mark it as no longer UC
  - If N_o is a read of UC data, then add the following constraint:
    assert(N_b <= N < N_b + size(MemoryObject referred to by N_b)) how
    to determine the size of MemoryObject referred to by N_b?  Keep
    following base pointers of GEB nodes until you hit a root, which
    must either be a constant or read - in either case, it must resolve
    to some MemoryObject
  - If N_b is itself a GEP node, then you don't have to do anything
    special; likewise if N_o is itself a GEP node

I don't know how UpdateLists fit into this picture.

Also, I don't know in which scenarios you would 'lose' track of the base
MemoryObject, thus requiring referrents.

Okay, sleep time.
- revised my mental model of what a GEP node looks like:
A GEP Add Expr node has 2 children (base & offset), each of which is
  - constant
  - multi-byte read (for base)
  - a multiply of a constant and a multi-byte read (for offset)
  - a GEP node

- cleaned up implementation of GEP Add Expr node by adding explicit
  'base' and 'offset' references inside of Add Expr and setting them
  explicitly in static factory methods

- with peter, figured out how to add constraints to the current
  ExecutionState using state.addConstraint()

- hacked some more on implementing handling for UC offsets

  - [[expletive]], the Expr trees generated by array-struct.c are sort
    of optimized and violate my invariants about offset fields only
    being a multiply of a constant and multi-byte read.

  ref offset = mo->getOffsetExpr(address);

  (Add w32 8
    (Mul w32 12
      (Concat w32
        (Read w8 3 arr8021) (Read w8 2 arr8021) (Read w8 1 arr8021)
          (Read w8 0 arr8021))))

  - crap, it seems like base/offset are OUT OF SYNC with left/right
  since Klee optimizes sequences of GEPs in array-struct.c example,
  so the original Exprs at the time of building are LOST :(

- okay I found a better way to handle UC offsets (while ignoring UC
  bases altogether for now) ... checked in my changes
- made another pass through Evaluation section of Hampi paper

- pair-programmed with peter to start implementing lazy allocation of
  memory blocks for underconstrained execution

- started looking through Sections 1-4 of Hampi paper
  • It was fun to work together with Peter doing pair programming on Klee since I was no longer working solo, but too bad it didn't last!
- finished another pass thru Sections 1-4 of Hampi paper and typed up
  and committed updates

- pair-programmed with peter and finished a VERY preliminary simple
  version of underconstrained execution that works on a few super
  simple test cases in our test suite :)

- attended cristi's job talk and gave him feedback on it

- gave seungbeom some advice on getting started with modeling fake
  socket in Klee

- made another pass over evaluation section of Hampi paper
  and typed up and committed updates
- went over mike's hand-written comments on Hampi paper and integrated
  the edits into the paper itself

- hacked more with peter on UC stuff

- made another pass over evaluation section of Hampi paper
  • UC = underconstrained execution (Klee-UC)
- made a final full pass over Hampi paper and made my comments; it's
  up to adam and vijay to submit

- met with roger and got him acquainted to the Gates 326 office and
  gave him some pointers about setting up scripts

- figured out with peter that a gnarly failure in UC tests was
  actually not an error after all!  look in:

- read VLDB 2008 Google deep web crawl paper and got some
  inspirational ideas from it to apply to improving Klee search

- the general goal is to improve coverage via search, but one specific
  goal is to try to get past the initial parsing stage of programs
  WITHOUT having to specify a grammar
  - the only requirement is to mark whether a path terminated in a
    parse error or not (probably easy by annotating a 'parse error'
    exit function)
  - collect sets of constraints from state that terminated with and
    without parse errors
  - try to direct execution by selecting branches that result in path
    conditions that are the 'most different' from terminated bad
    states (crap did that MSR TR try to use path condition diffs?)

- trace through programs that i ran with hampi grammars and see whether
  i could find patterns in their constraint sets upon termination of
  error paths in order to design a better searcher
  - complete constraints list for state stored in state.constraints
    can print them or compare them, etc., or store in some big-ass
    data structure whenever a state terminates
  - yyerror seems to be the function that's called for a syntax error
  - hmmm, seems like it gets stuck in LEXER code a lot, not even parser
  for logictree
    - yy_get_previous_state() gets called a lot and seems to take a long
      time, maybe due to the large array lookups in yy_accept and yy_ec
  hmmm, switching to bc since it gets more parse errors:
    - same deal with yy_get_previous_state() being called a lot, except
      now yy_accept array is A WHOLE LOT BIGGER :(
    - but lots of bc stuff times out because of max. instruction time
      exceeded, rats :(
  - hmmm, maybe something as simple as staying away from constraint
    solver calls that take FOREVER would be wise (to avoid timeouts)

  - oh wow if you look at constraints, the only ones that we really care
    about are the ones affecting the array (e.g., arr1656 for logictree)
    that specifies the concrete value for stdin ... the other ones might
    be fluff?  hmmmmmmm
    - holy [[expletive]], quite a few test cases have the stdin constraints
    - hmmmm, why don't we just focus on the constraints for the stdin
      buffer, which is only a few bytes long!
    - if the first state terminates with a constraint that the first
      character must be equal to X, then favor running states WITHOUT
      that constraint

February 2009

- started exploring more about implementing a searcher that prefers
  states with stdin constraints that are the most different from those
  of terminated states

- started implementing pseudo-code for a 'distance metric' between
  states as reported by how different their constraints are from one
  another w.r.t. stdin input

- started implementing a new WeightedRandomSearcher based on
  constraint distance metric

- crap, in initial tests, using constraint distance metric does WAY
  WORSE than covnew, probably since it doesn't try to aggressively
  seek out new code to cover

- did an overnight run at:
- crap, the initial run of the constraint distance metric weighted
  random searcher SUCKED [[EXPLETIVE]] compared to default covnew ...
  sigh, it's pretty damn hard to beat covnew with something simple :(

- give up on this stupid constraint distance metric thing for now and
  save what I have in:
  ~/MyDocuments/Documents - Stanford 3rd year Ph.D./research/

- wrote an email reply to Tom Zimmermann at MSR expressing my interest
  in working with him for an internship and briefly introducing myself
  and my experiences in the area of mining software repositories

- wrote an email to Andy Begel at MSR thanking him for introducing me
  to Tom and also expressing some interest in his group's projects

- met with Ranjit Jhala (prof. visiting from UCSD) 1-on-1; we talked a
  bit about implementing analyses for nasty C code - even though I was
  doing dynamic analysis stuff and he was doing static
  analysis/theorem proving, what we had in common was that he wanted
  to implement his techniques (easy in ML but hard for nasty C code)
  for nasty-ass C code - sent him a follow-up email reminding him that
  his students can feel free to contact me if they want to talk about
  choosing an implementation for C code:

Hi Prof. Jhala,

Great meeting you today!  I just wanted you to have my email address on
record in case you or your students want to contact me to talk about
analysis implementation frameworks for C (e.g., source-level vs.
bytecode-level, CIL vs. LLVM, etc.).  Since you guys are planning to
start from scratch with implementing the liquid types stuff for C, it's
important to pick an appropriate framework, or else implementation will
be quite painful, as I've learned from some of my previous hacking
experiences :)


- sent adam my hacks to hampi and my Yacc->Hampi converter python
  script, so that he can post it on the hampi website

- pair-programmed with peter to get this benchmark working
  the solution (for now) is to try to aggressively simplify offsets to
  - also figured out a more elegant and robust way to figure out
    whether Read expressions are reads of underconstrained values
  • Tom Zimmermann played a pivotal role in The Ph.D. Grind: Intermission. My USENIX 2009 short paper was what got his attention for interviewing me to become his first ever intern. Andy Begel (whom I cold-emailed in 2005-2006 when I was applying to Ph.D. programs) provided a vital introduction for me. Thanks, Andy!
  • Ranjit ended up remembering me from that one meeting so long ago and sending me a really kind email out of the blue after he read The Grind when it came out in 2012. He then gave me valuable advice during my first faculty job search in 2012-2013, almost four years before I came to UCSD.
- started automating the regression tests in:

- met with dawson and he seemed pretty adamant about shooting for the
  SOSP March 9 deadline

- fixed some more subtle bugs with peter to get pretty much all the
  tests in our UC regression suite working (fingers crossed!)

- got most of the tests in our regression suite working, with the
  exception of mixed-index-3.c

- woohoo, got 'make regressall' to work in
  ~/FT-Apps/underconstrained-tests/ by making the diffs invariant of
  pointer values and MemoryObject IDs, so that it's easier to take
- committed code into STP repository for converting CVC to C code

- created a new [[email address cut]] mailing list for
  on-campus MC research group members

- talked on phone with Tom Zimmermann at MSR, and he's interested in a
  summer project involving evaluating what types of bug reports are
  acted on by developers at Microsoft

- more 1337 klee-uc hacking with peter
  - started getting *a[i] to work, where i is UC
    by implementing symbolic UC shadow memory
- created an informal talk for software lunch (tomorrow), which is
  basically a survey of work in concrete+symbolic execution and some
  challenges that researchers in the field face

- did a very rough initial run-thru of talk with robert, and he gave
  me some useful comments

- did a night grind with peter to properly implement forking for a
  tricky case where we do *a[i] then *a[0] where i is UC, so we need
  to fork two cases, one where (i == 0) and the other where (i != 0)
- accepted MSR offer for summer 2009 internship

- revised my software lunch talk and presented it at software lunch,
which was (a bit surprisingly) VERY WELL RECEIVED
  - alex aiken complimented me on how he would like to see more survey
    talks like that since he doesn't have time to read all the papers
    in a certain area

- got yet another tricky peter test case working with peter, one more
  down, hopefully not too many more to go :0
- checked in yet another test that peter and i worked on together

- started looking into Klee regression suite, and it doesn't seem to
  automatically run ... probably needs some tweaks to get it to work
- read over Vijay Ganesh's BuzzFuzz ICSE '09 paper camera-ready draft
  and emailed him some comments on it

- added a --symbolic-means-UC option to Klee to make make_symbolic calls
  actually mean make_UC calls, in order to facilitate running regression
  tests with symbolic and UC
- filled out online 'paperwork' for my MSR internship application

- started porting over Klee regression suite to compare symbolic with
  UC, and thus far, found pretty much NO differences
  but that's probably because there's no way for the original klee
  symbolic tests to stress the UC aspects (or else they would've
  crashed long ago)

- got pretty much all the Klee regression tests moved over to compare
  with UC, and everything works the same way - unfortunately (or
  fortunately) didn't uncover any new bugs :0

- started looking into old Linux drivers work I did back in 2007 to
  re-acclimate myself (see hexapod:~/ft-on-linux-drivers/)
  • Peter and I dug up my first set of Klee experiments from 2007 (my first year of Ph.D.) to see if we could use Linux drivers as test cases for our current Klee-UC work.
- alright [[expletives]], downloaded linux- sources and try
  once again to find driver bugs - will need to set up lots of grungy
  infrastructure and some custom checkers
  - maybe focus on newer files since their developers are more likely
    to be around, and also they are less likely to have already been
    patched for stupid bugs
  - use online git summary:;a=summary

- start with linux-, which was created last
  year and has very few patches
  - hmmm, but it's not for x86 ...

- hmmm, did 'sudo yum install kernel-devel' to try to get some darned
  kernel headers on hexapod

- maybe helpful tutorial:

- used 'make -n' in the kernel sources base directory to see how an
  individual driver builds

- got hello world driver working

- worked with peter to get function skipping and unknown global vars
  marked as UC working
- setting up Makefiles to try to get linux drivers to compile for Klee

  - hmmm, a bit of nastiness with #includes ... by default, I have asm/
    set to x86, so I'll try to just stick with x86 drivers for now

- started porting over driver test harnesses and some stubs from 2007
  SOSP work

- read "Thorough Static Analysis of Device Drivers" EuroSys '06 paper

- one idea is to make 'int' return values of unknown functions
  symbolic rather than UC, because kernel functions often return int
  status codes

- started getting Klee to actually run on the dtlk.c driver!!!
  - fixed some issues with initializing globals that are of unknown
    size (simply create a 0-byte thing for them)

- got some rudimentary handling of inline assembly working, which is
  vital for kernel code, because there's lots of small inline asm

- hand-picking some more char/ drivers to get compiled - looking at
  online git histories to see which ones have been touched recently by
  developers (those have the highest chances of garnering a response)
  • Our 2007 SOSP conference paper submission work was documented in Year One of The Grind (search for 'March 2007' in that chapter).
- edited drivers Makefile to have Klee run on drivers with the set of
  'good' options used for OSDI 2008 paper

- starting to run init() of initial 20 char/ drivers that I got to
  compile and seeing which ones crash Klee
  - found a few Klee assert crashes that I'm going to debug with Peter
    later this afternoon
    - no biggie, added them to regression suite, phew:
    1.) don't flip out when a function pointer is passed as an
        argument to an external function (skip-function-5.c)
    2.) don't assume that inline asm will return any output -
        sometimes they don't have any output, so their return
        type is effectively 'void'

- had a lunch meeting with jbrandt and steven dow to brainstorm ideas
  about an empirical study they want to run on how programmers

- do better error reporting when inconsistent constraints are added to
  state, such as the "if (!p) {*p}" case

- better ported models/stubs from 2007 SOSP work and had to modify
  kernel header files to use my own model versions of functions (e.g.,
  locking functions) rather than defaults

- kicked off first-ever overnight run of 20 drivers testing init and
  shutdown on hexapod!
  • Steven is now my next-door office neighbor at UCSD. He was a postdoc at Stanford back then, and I think we talked only once during that lunch with Joel Brandt!
- looked at first batch of results and saw some potential bugs, but we
  need to get replay working and also a decent trace of what's
  happening during execution, in order to properly diagnose these

- worked with peter to get --replay-paths working with UC, but for
  some reason, it totally croaks ... crap we got defeated!
- started adding execution traces to Klee runs to help diagnose
  errors, taking into account Dawson's comments:

> pretty much all the drivers ran fine and generated lots of test cases, and
> we even got a few potential errors, but they're pretty hard to diagnose with
> just a backtrace ... we're gonna add some tracing to see which functions and
> branches each error path took (any advice for making these traces manageable
> and not too verbose?)

yeah; the old ft had some code for it.  you need to record all the
branch points and function calls.  then just print out
where event is a branch, or function call on the current stack.

you likely want to not include function calls that have already
returned unless an interesting event happened.  (e.g, acquisition)

then have a script rip out the code in each of these files (with a bit
of context) and spit out.

you also want to say what is symbolic, most likely.

  - print each trace as a test*.trace file in klee-out-*

- used some python and gcc magic to rename special functions within
  drivers to generic names like:
  so that they can be called more easily from my test harness

- hmmm instead just overrode registration functions to do it in a
  CLEANER way instead of doing python/gcc magic, e.g., override
  things like register_chrdev()

- kicked-off set of 1-hour overnight runs doing more comprehensive
  checking (with tracing now)
- added an --emit-all-errors option for running Klee to hopefully
  avoid the problem of end-of-execution asserts firing only ONCE (for
  the first failure) and then ignoring subsequent failures

- started trying to diagnose some potential bugs found by last night's
  overnight run of 12 char/ drivers

  - fixed a bug in my model where spin_unlock() wasn't calling the
    right stub function, so it erroneously reported inconsistent
    locking (pesky definitions within Linux header files)

  - added klee_silent_assert() macro to assert pre-conditions but not
    give errors for them

  - eliminated some more spurious errors by fixing up my model code

- one good side-effect of having external functions return UC values
  is that it will allow for branching both ways depending on return
  value checks

- kicked off another run of those same 12 drivers for the heck of it

- crap, a FP for read error for toshiba.c, since it uses a generic
  function for read() which it doesn't define:
    .read   = seq_read
  - hmmm, I should probably override those so that we don't get
    spurious UC errors, or just NOT report errors arising from UCness,
    which is the more general solution

- try a TON of char/ drivers, all of those with standard
  'file_operations' structs, in the hopes of getting more bugs
    checking more code means finding more bugs!!!
  - "grep 'struct file_operation' *.c" is a pretty good way of picking
    out driver candidates

  - crap an impediment to successful compilation is missing .h files
    due to different architecture-specific asm/ files
- turn OFF --allow-external-sym-calls for now since it sometimes can
  cause printf's to crash (when called with symbolic args)

- implemented ExecutionState::areConstraintsUC vector, which
  keeps track of which constraints added to the current path condition
  involve a UC value - these can be used to suppress/rank warnings,
  the intuition being that the more constraints there are involving UC
  values, the less likely something is a 'true bug'

- kicked off another overnight run, hoping that this time the
  areConstraintsUC will help us, as will having function pointer
  support (that Peter just added today)

- tried to add lots of more drivers, but there are different ways of
  initializing different drivers - e.g., PCI drivers initialize and
  register their callback functions in some special way, etc.
  or video drivers, etc.

  - woohoo got a list of 243 drivers that COMPILE (on x86
    presumably) with file_operation structs defined, listed
    in: FT-Apps/linux-drivers/scripts/all_COMPILING_drivers_with_fops.txt
- briefly look at results of overnight run from last night:
  - some 'drivers' terminated early because they aren't actually
    standalone drivers - they're simply .c files that don't even
    define module_init() and module_exit()
    - tty_ldisc, vc_screen, selection

  - lots of others fail due to this assert:
    DEMAND FAILED:ExprFindUC.cpp:visitRead:22:  is false: "Don't
    have an ObjectState for this read expr anymore, can't read shadow
    - apm-emulation, applicom, dtlk, efirtc, genrtc, hpet,
      istallion, lp, nvram, ppdev, rtc, stallion, sx, tlclk,

  - some do pretty well in terms of coverage:
    - ass-kicking 93.38% statement coverage in cs5535_gpio
      (that's probably counting our test harness code too, which we
      can exclude just like libc)

  - some don't properly init certain functions, e.g., sonypi,
    since they need a special 'probe' function to initialize
    (it's a platform_driver)

- added support for platform driver registration and probing

- oh [[expletive]], i think sonypi RUNS OUT OF MEMORY!!! std::bad_alloc()
  - ah interesting, it croaks while trying to PRINT OUT an
    operand of a spinlock_t parameter to an external function call
    (that's really weirddddd, but un-interesting from our perspective)
    - fixed by simply not printing out arguments to file
      (no biggie)

- got ~50 more drivers to compile by pulling in more local #include
  files, including BOTH .c and .h files

- met with roger to talk about his router cross-checking progress;
  agreed that he should look into hacking Klee searcher to try to
  prevent going down tons of redundant paths where, say, one version
  REJECTS the input and the other ACCEPTS in the input in tons of
  un-interesting ways

- model usb_register() and usb_register_dev() for registering USB
  drivers and accessing their read/write/ioctl/etc. functions

- found an LLVM bug (maybe?) involving struct forward declarations and
  emailed about it

hi all,

i'm trying to use LLVM to compile some linux kernel code, and i noticed
a mismatch with gcc.  here is a simplified test case:

struct foo {
  int a;
  int b;
  int c;

static struct foo x; // 'forward' declaration?

int bar() {
  printf("a: %d, b: %d, c: %d\n", x.a, x.b, x.c);

static struct foo x = {
  .a = 1, .b = 2, .c = 3,

int main() {
  return 0;

when this code is compiled with gcc and run, stdout prints "a: 1, b: 2,
c: 3", which means that it takes the true declaration of x, initialized
to 1, 2, 3.  however, when it's compiled with llvm, llvm emits the
following code for x:

@x = internal global zeroinitializer    ; <*> [#uses=3]

which seems to me like it's taking the first declaration of x, which is
a forward declaration.  is that the correct behavior?  i believe that
the kernel developers intended for the second (real declaration) of x to
be visible, even in bar(), but that's not what's happening with llvm.
is there an easy workaround where i can get llvm to emit code
initializing x to {1,2,3}?  thanks!

  - this prevents USB drivers from registering properly ughhhhhhh
    since their structs are declared using these funky forward decls

- looked into linux- to model
  get_user() and put_user(), copying from SOSP 2007 attempted models

- kicked-off a [[expletive]] GIANT run with all 291 drivers!!!
- started putting a paper outline into SVN:
  svn co svn+ssh://$

- started looking into diagnosing bugs with peter:
  - MAJOR FAIL on first attempt:
    - false positives arise due to the fact that when you pass
    a part of a struct into an external function, you make that ENTIRE
    struct UC - in this case, 'cam' is made UC
      video_register_device(&cam->vdev, VFL_TYPE_GRABBER, -1)

  - similar error in:
    where the entire object is made UC when it's passed as an argument
    to an external function

  - this has two mutex_locks() of different mutexes, which flags an
    error since our rules say don't sleep with spinlocks held ...
    maybe that restriction is too strong???
    at least our checker worked, although we don't know if it's too
    strict of a check :/

  - crap this case might be exposing a bug in our UC tracking, [[expletive]]:

    hmmm, weird, having ptr errors with things like p->q->r where p
    is UC, so q and r should have new blocks made up accordingly

  - woohoo re-found dtlk bug from 2007! (i don't think it's a *real*
    bug, but for the purposes of any checker, it's real)

  - woohoo found a fresh ioctl bug!

  - woohoo found some [[expletive]] in linux-
    - could be an int underflow ...
    - man ioctl's are profitable
  • Peter, Dawson, and I were going to try to make a Klee-UC submission to SOSP in March 2009, but we didn't end up getting good enough experimental results to make it. Cancelling paper submission plans was a recurring theme throughout that year.
- looked through more bugs and try to create minimized test cases of
  places where I think there's a bug in Klee-UC
  - write up findings in:

  - create simplified regressions from looking at these bugs, placed
    in: FT-Apps/linux-drivers/simplified-regressions

- met with dawson and peter to talk about strategy for the next 2 weeks

- have kfree() be a NOP so that we can still keep track of locks and
  semaphores, even if they were freed before end of execution (i.e.,
  those stashed in dynamically-allocated driver data structures)

- don't have ObjectState objects be freed (by their refcounting
  ObjectHolder containers), so that we don't lose them when we are
  trying to read out their UC values from Expr later

- kicked off another run of all ~290 drivers with updated models and
  version of Klee, which can now produce .uc to track what branches
  came from constraints with UC
- fixed a few bugs in the regression suite inspired by linux drivers
  (with peter) in FT-Apps/linux-drivers/simplified-regressions
  - make copy_from_user create a fresh buffer and memcpy it
  - made a larger buffer (3KB vs. 1KB) for UC base pointers, to avoid
    erroneous pointer errors resulting from overflows of those buffers
    from HUGE driver data structures
  -  for made-up UC buffers, return an address in the MIDDLE of the 3K
     buffer (1K past the beginning) in case the caller wants to do
     pointer arithmetic BACKWARDS (e.g., for kernel container_of()
     macro) so that we don't get erroneous pointer errors

- added more detailed printing of path constraints in *.uc files by
  adding file/line numbers and whether each fork could go both ways

- did another overnight run with today's bugs fixed
- cleaned-up regression suites to account for our new changes

- fixed our model to dis-allow request_irq of IRQ 0

- ugh we get misleading bugs at the end of execution (e.g., IRQs still
  being held) because we don't properly call some shutdown functions
  e.g., .release function for pci drivers

- attempted to triage more error reports with peter but got
  discouraged by the low signal-to-noise ratio

- greatly simplified stubs in an attempt to eliminate false errors :0

- made yet another giant overnight run
- added some more small regression tests due to kernel driver failures

- unified tracing of UC, function call, and branch events into .trace
  file to make it cleaner and more robust (hopefully)

- worked with peter to diagnose a weird-ass case of integer overflows
  :0 ahhhh
  - okay, so let's say we have a constraint: (x * 12 < 24)
    where x is unsigned.  we want to use the expression (x * 12)
    to index into an array of structs with the correct alignment - so
    we want the property that (x * 12) % 12 = 0
  - the 2 intuitive satisfying assignments are x = 0 and x = 1,
    and thus 0 * 12 = 0 and 1 * 12 = 12, both are 0 modulo 12.
    - now, however, due to [[expletive]] overflows, if you have
      x = 0x15555557, then x * 12 = 20, which is < 24, so that's also
      a satisfying assignment.  HOWEVER, since x * 12 = 20, we are going
      to index into byte 20 of our struct, which is BAD!  we only want
      to create indices which are 0 modulo 12.
    - one tentative solution is to add another constraint:
        ((x * 12) / 12) == x
      to ensure that we don't overflow
- fixed weird multiply overflow bug from yesterday

- was able to mark all globals in driver code (NOT in harness code) as
  UC using a --mark-all-globals-UC option

- added ability to link in external harness library in Klee using:
  --link-with-external-lib option

- added a klee_run_all_functions() special hook that tells Klee to run
  ALL functions within the driver - just blast it and shove code thru!

- kicked off another big-ass overnight run, this time running ALL
  functions in each driver file!!!
- re-factored some Klee-UC code so that it doesn't look as ugly and is
  easier to add new functionality

- added a --skip-funcs-after-depth option to implement customizable
  layers of function skipping

- supervised peter while he wrote his first python script ;)
    to do .trace file parsing

- kicked off a big-ass run with--skip-funcs-after-depth=1
- remember to turn off the --optimize option if we're doing
  klee_run_all_functions() or else some functions will be optimized

- minor code-cleanup re-factorings

- added a --eligible-fn-substr option to only run certain functions
  (matching desired substrings) when klee_run_all_functions() is

- patched up some error reporting to hopefully filter out more false

- kicked off another big-ass run with--skip-funcs-after-depth=1
- helped Roger get up-to-speed on hacking on the searcher to improve
  its performance for cross-checking (i.e., eliminating redundant
  'uninteresting' paths)

- helped Seungbeom get up-to-speed on running Klee better and
  diagnosing errors; unfortunately replay doesn't work too well :/

- triaged a few more bugs with peter but got discouraged; dawson
  doesn't seem too happy with our slowness thus far and the fact that
  we won't make the SOSP deadline - tried to boost morale for peter
  and me by going to peet's for coffee :)

- added a --write-traces option to Klee mainline branch because I
  think it can help Seungbeom and Roger with their debugging

- copied klee SVN repo to klee-sbkim and klee-rogliao (and turned off
  email notify hooks for them) in order to give Seungbeom and Roger
  their own branches to work off of, respectively.  cristi gets
  annoyed if you try to make a branch off of regular klee since the
  SVN repo wasn't created with a trunk/, so it gets cluttered, sigh
  • This was around when Peter and I knew that we didn't want to keep working on Klee, even though we persisted for a few more weeks. We took solace in Peet's coffee.
  • Also I was unofficially advising Roger (undergrad) and Seungbeom (new Ph.D. student), since technically I was the most senior student in the lab beside Cristi, and Cristi was super busy with faculty job applications and interviews that year. I did a bad job since I couldn't even make meaningful progress on my own projects back then :(
- made a quick fix to get extern global arrays to work properly -
  since we don't know their sizes, make up a 2K block for them and
  call it a day

- kick off another full run with calling all functions and skipping
  all function calls

March 2009

- again, as usual, try to triage false positives and create small
regression tests for them ...

- found lack of bounds check in drivers/media/video/saa5246a.c
  even though it's probably not a real bug because the caller will
  likely never pass in 64 arguments:

static int i2c_senddata(struct saa5246a_device *t, ...)
  unsigned char buf[64];
  int v;
  int ct = 0;
  va_list argp;
  va_start(argp, t);

  while ((v = va_arg(argp, int)) != -1)
    buf[ct++] = v; <-- possible overflow here,
                       but caller never passes that much in

    - same bug in saa5249.c

  - something funky with list macros causing possible FPs:

  list_for_each(entry, &dvb_adapter_list) {
    struct dvb_adapter *adap;
    adap = list_entry(entry, struct dvb_adapter, list_head);
    if (adap->num == num) <-- klee flags invalid ptr error here

  - something with this inline function causing possible FPs:

static inline int pci_write_config_word(struct pci_dev *dev, int where, u16 val)
  return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);

- ahhh, I can now replay single functions to try to get test cases
  minimized, hmmmmm

- crap, klee seems to be thinking that some things aren't UC when in
  fact they should be, leading to lots of false pointer errors

- lots of linked-list [[expletive]] seems to screw us up

- hmmm, when I run functions individually, some of the errors seem to
  disappear, funky!  ohhh, is it because when there are globals that
  different functions can access, their constraints interfere with one
  - hmmm, that shouldn't really happen in theory, but it might be
    simpler to try to run one function at a time

- ahhhh, we might be getting spurious .readonly errors because our
  pointers resolve to some incorrect object in read-only land

- hmmm, for DAC960, it seems like we don't properly 'contaminate' a
  symbolic value when it's compared to a UC value ...

if (copy_from_user(&UserCommand, UserSpaceUserCommand,
         sizeof(DAC960_V1_UserCommand_T))) {
  ErrorCode = -EFAULT;
ControllerNumber = UserCommand.ControllerNumber;
    ErrorCode = -ENXIO;
if (ControllerNumber < 0 ||
    ControllerNumber > DAC960_ControllerCount - 1) <-- bounds-check
                                        -- ControllerNumber is symbolic
                                        -- but DAC960_ControllerCount is UC
                                        -- so ControllerNumber should be made UC
Controller = DAC960_Controllers[ControllerNumber];<--Klee reports ptr error here

  - added a regression for it:

- wow, peter pointed out and fixed a really subtle error where we
  aren't using 'wos' (writeable ObjectState) when we try to write,
  thus causing contamination across states and not obeying
  copy-on-write policy ... this fixed the following test:

- added simplified-regressions/tpm_bios.c to exemplify this demand

  DEMAND FAILED:Executor.cpp:executeMemoryOperation:3286:
  isGEP> is false: "GEP bases that are Add nodes MUST be GEP!"

'fixed' for now by not being so demanding on this check and instead
printing out a warning

- kick off another full run with calling all functions and skipping
  all function calls
  • FP = False Positive. Klee's job is to automatically find bugs in programs that it's run on; an FP occurs when Klee thinks it's found a program bug, but it's not a real bug. FPs are usually caused by bugs or known limitations in Klee itself.
- this exposes a readonly error - investigate:

env ONLY_RUN_FNS="--eligible-fn-substr=cpufreq_set_cur_state"
make linux-

  - added to simplified-regressions/processor_thermal.c

- fw-cdev bug seems like it could be meaty and maybe real --- look into it
  with peter
  - in the ioctl_queue_iso function
    - seems sorta real to me, added to ~/FT-Apps/linux-drivers/REAL-BUGS-FOUND

- lots of possibly annoying .uc.err errors
  - hmmm, with linked list stuff:

 * list_empty - tests whether a list is empty
 * @head: the list to test.
static inline int list_empty(const struct list_head *head)
  return head->next == head;

  added simplified-regressions/ucm.c

- hmmm, what about a loop with a UC bound?  should suppress error
  message - this comes up quite often, actually:

  for (i = 0; i < UC; i++)
    A[i] = 42; <-- could go out of bounds?

  - added a regression test in UC-array-bound.c

  or the dtlk case where we make a UC array and check for null values:
    static unsigned int dtlk_portlist[] =
      {0x25e, 0x29e, 0x2de, 0x31e, 0x35e, 0x39e, 0};

    for (i = 0; dtlk_portlist[i]; i++);

  this normally terminates since dtlk_portlist is a null-terminatd
  array, but when we make dtlk_portlist UC, we might never null
  terminate and instead go out of bounds

  - added a regression test in UC-null-term-array.c

  hmmm, a lot of these manifest themselves as 2-way UC on same line
  and then followed by a terminate 1-way NON-UC ... interesting

  e.g., drivers/idle/i7300_idle.c
    while (debugfs_file_list[i].file != NULL) { ...; i++; }

  where debugfs_file_list is a UC array

  - ok peter accounted for this in his ranking system, nevermind

- conversed with dawson in person and via email with dawson and peter
  about what should happen when concrete and symbolic values interact
  with UC values via comparisons
- drivers/macintosh/adb.c exposes another issue with overflowing a
  null-terminated UC array:

i = 0;
while ((driver = adb_driver_list[i++]) != NULL) { ... }

adb_driver_list is a null-terminated global array, but since we make
all globals UC, we're kinda hosed
- how do we properly taint 'i' with UCness in this context?
  - we see this situation in quite a few drivers

- woohoo found a seemingly real divide by zero bug in

- drivers/media/radio/radio-terratec.c
  could potentially underflow buffer ... hmmmm, can this path really

// freq1 is UC
static int tt_setfreq(struct tt_device *dev, unsigned long freq1)
  int freq;
  int i;
  int  temp;
  long rest;

  unsigned char buffer[25];
  freq = freq1/160;     <-- UC

  rest = freq*10+10700; <-- UC
  while (rest!=0)       <-- UC
    if (rest%temp == rest)
      buffer[i] = 0;
      buffer[i] = 1;
      rest = rest-temp;
    temp = temp/2;

hmmmm, if this loop can run more than 14 times, then we've underflowed
the array ... is that possible, or will the 'rest = rest-temp' condition
cause rest to be equal to 0 before that happens.

- figured out how not to get screwed by some optimization which was
  causing us to report some wesuck.err as pointer errors ... basically
  in executeMemoryOperation(), we were calling:
    address = state.constraints.simplifyExpr(address)
  which constant folded address before we could grab myBase base
  expression from it, thus causing us to miss finding bases of made-up
  blocks - we moved this optimization to later *after* we already
  found myBase, and things should work properly now

- eliminated lock checking code for simplicity's sake, since they're
  not very effective when running single functions anyways

- got these test cases to work:
  - simplified-regressions/yam.c
  - simplified-regressions/ide-tape.c
by disabling SimplifySymIndices optimization
(maybe at the cost of performance!)

- did another big-ass full run of all drivers, hoping for a cleaner
- keep on triaging bugs ... this one in isdn/hardware/eicon/divamnt.c
  is sorta a FP due to pre-conditions on the structure of the revision
  string, which always contains a '$' but i guess it should be
  defensive coding to do a null check

static char *getrev(const char *revision)
  char *rev;
  char *p;

  if ((p = strchr(revision, ':'))) {
    rev = p + 2;
    p = strchr(rev, '$'); <-- can return null, but doesn't really in
                              practice, but COULD in the future
    *--p = 0; <-- reports possible ptr error here if p is null
  } else
    rev = "1.0";

  return rev;

- made a provision not to report wesuck.err when the index is
  RIDICULOUSLY LARGE, since that's still likely to be a true overflow

- added a printStdout() method to Expr, so that we can print Exprs in

- diagnosed a weird possible bug in LLVM and added a test case here:

- made another overnight run, this time printing the # of elapsed
  instructions in the trace too
- added more 'fake' branch events when state terminates with error, in
  order to clean up the .trace file and make it easier to post-process

- diagnosed a particularly destructive test case in:
  - simplified-regressions/serio_raw.c

- kick off another usual run
  • As you can probably tell by now, I ended most days by kicking off an overnight (~10-hour) run on our servers with the latest Klee modifications from that day's work.
- revised peter's scripts/ to clean up the code a
  bit and also to implement a more comprehensive ranking metric that
  takes more than simply the FINAL branch line into account

- triaged some mysterious crashes (without assertion fails) from last
  night's run and put them in TODO list to investigate

- fixed a memory corruption bug in callFunctionUC() where the
  destination block of the 'memcpy' is too small, so there's actually
  an overflow in our own Klee code ;)
  - fixed by making the source block of the memcpy AT MAXIMUM the size
    of the destination block
  - also added some nicer bounds checks to lib/Memory.cpp

  These programs that used to segfault should no longer do so now:
  - make linux-
  - make linux-
  - make linux-
  - make linux-
  - make linux-
  - make linux-

- met with roger to debug some of his issues with trying to improve
  the searcher for cross-checking; he seems to be fairly up-to-speed

- kicked off another big-ass run, this time only keeping information
  for trace files that contain error reports!
- read over a nearly-complete draft of Dawson's CACM paper about his
  experiences commercializing static analysis tools

- fixed this error:
    KLEE: WARNING: unable to write output test case, losing it
  - could have to do with not properly writing out .bout files
      make linux-
      make linux-
- ah, I see, there's a problem writing 0-byte objects to .bout
  - patched for now by allowing writes of 0-byte objects to .bout

- cleaned up scripts for making Klee runs on Linux

- make a big-ass run of only ioctls to see what happens ;)

- checked out latest kernel GIT in:
  and tried to update my vcs-mining/ database to reflect the new data
  coming in over the past 2 years
  - woohoo created this new database:
- read over ~/FT-Apps/linux-drivers/notes/SOSP-2007-possible-bugs.txt
to get a sense of what kinds of bugs I found with EXE 2 years ago
  - a lot of those bugs are init/shutdown bugs (not doing a proper
    error return value for init, so shutdown is called for a
    non-properly-initialized driver) and other [[expletive]] that i'd
    need to run with a driver to find, or things that require more
    modeling than i'm willing to do at the moment :/
  - or with model-specific stuff like registering/unregistering
    drivers, modules, and other resources, which we can't easily do
    when we're unit testing

- start looking at ioctl results here:

- woohoo found ioctl bugs in vicam and cdev

- try to figure out how to get ALL drivers compiled, not just x86 ones
  ... seems like kernel symlinks include/asm -> include/asm-x86
  and runs arch/x86/Kconfig
  - each sub-directory has a Makefile and a Kconfig config file
    - for modules consisting of more than 1 file, we have something


// ah i see, this can either be obj-y or obj-m ... see
// ~/linux- and ~/linux-
obj-$(CONFIG_RIO) += rio.o

rio-objs := rio_linux.o rioinit.o rioboot.o riocmd.o rioctrl.o riointr.o \
            rioparam.o rioroute.o riotable.o riotty.o

it seems like rio.o is built from rio-objs, which in turn is built from
a bunch of .o files, each corresponding to a .c file

another idiom is the following:

obj-$(CONFIG_XILINX_HWICAP) += xilinx_hwicap_m.o

xilinx_hwicap_m-y := xilinx_hwicap.o fifo_icap.o buffer_icap.o

using the -y suffix, hmmm

modules ('m' config option) are first compiled to .o, then linked to .ko

check out scripts/Makefile.modpost

- created and refined scripts/ and
  Makefile to be able to build any module for x86 platform and try to
  compile as many modules as humanly possible ;)

- did another big-ass ioctl run with 1869 drivers!!!
- created a new version of scripts/
that works strictly better than the original version (as verified via
cross-checking over all the driver/ Makefiles)
  - this new version uses PLY to do lex/parse rather than my ad-hoc
    parser and has code that's a bit cleaner
- met with seungbeom to talk about some issues he's facing with
  running klee on servers - identified 2 issues:
    - atoi() takes freakin' forever, so one possible solution is just
      to write a model for it to return symbolic ints
    - the searcher gets stuck going down uninteresting paths that
      originate from a single 'bad' branch direction, so one possible
      solution is to make annotations on certain branches to favor
      only one side
- look into big-ass batch of ioctl results in:
  - woohoo! found some more real bugs
    - hmmm, interesting, some bugs come from strings received from the
      user not being properly null-terminated

  - unfortunately, we get lots of wesuck.err (made-up blocks not big
    enough), but at least it means that (hopefully) we're catching
    them fairly well

  - seems to have lots of FPs when dealing with linked-list-like

- created some regression tests out of examples that had FPs

- modified scoring function in scripts/
- brainstormed ideas with peter on whiteboard about how to reduce
  false positives

- improved and simplified ranking with peter the man

- looked into simplified-regressions/timerdev.c and sr_ioctl.c

- crap, with the use of container_of() and also the fact that we pass
  a pointer to the MIDDLE of made-up blocks, calls to free() will
  likely be bogus since we're trying to free the MIDDLE of a block of
  memory and not the beginning
  - Klee's resolveExact (called from executeFree) doesn't quite work
    for made-up objects, since we return a pointer to the MIDDLE of
    the block and you should only call free on pointers to the
    BEGINNING of blocks
- implemented fix for this:
  - Klee's resolveExact (called from executeFree) doesn't quite work
    for made-up objects, since we return a pointer to the MIDDLE of
    the block and you should only call free on pointers to the
    BEGINNING of blocks
    - one proposal is that we can detect whether you're trying to
      call executeFree on a pointer in the middle of a made-up block,
      and if you are, then FREE THE ENTIRE BLOCK (if you're pointing
      to an address <= the 'start', where 'start' is 1K into the 3K
      block ... remember we return pointers to middle of blocks)
  - see ~/FT-Apps/underconstrained-tests/simple/UC-free.c

- met with dawson and peter for a long time and came up with a game
  plan for after spring break - written up here:

- kicked off a HUGE-ASS run on hexapod that will likely last a whole
  • Peter and I lost momentum on Klee after missing the SOSP submission deadline in early March. We halfheartedly did a few more experiments here and there, but spring break was also coming up, so that felt like a good time to step back and reassess.
- coalesced notes for USENIX '09 camera-ready and made a TODO list of
  what needs to be done before April 21 deadline:


  - knocked off some minor things from TODO list

- emailed [[name omitted]] to thank him for his help on Coverity dataset
  (and then sneak in a question about sending an updated dataset, so
  it doesn't sound like such a favor)
  • I spent late March and most of April finalizing the camera-ready version of my accepted USENIX 2009 paper, which gave me some sense of concrete progress.
  • Coverity is a company founded by Dawson and his first batch of Ph.D. students; they make static program analysis tools. We got some data from them about bugs found in Linux kernel code, which I used for my empirical software measurement research.
- dig through old scripts in ~/vcs-mining/2007-12-01-good-db/ to
  re-orient myself

- crap, realized that there was a bug in how I calculated the numbers
  for Coverity bug reports vs. fixes for user-reported bugs
  (I was counting fixes for ALL bugs, which includes the Coverity
  ones, ughhhh) ... re-calculated!

- cleaned up the section "Static analysis bug reports as indicators of
  user-reported bugs" in USENIX '09 paper

- Look into bug and triage rates vs. days since last patch (since we
  can make the argument that files that nobody touched for a long
  time, it's likely that nobody is around to fix those bugs)

- see ~/vcs-mining/2007-12-01-good-db/experiments/
  - this says that Coverity bug reports are much more likely to affect
    files that have been modified recently (for the INITIAL SCAN only,
    which cuts out recency biases for subsequent scans)

- created ~/vcs-mining/2007-12-01-good-db/experiments/
  to try to explore whether certain parts of the codebase are more
  likely to be buggy
  - hmmmm, results don't look very promising, nix it for now

- added in a table and section from USENIX '08 rejected paper showing
  that triaging initial scan reports in a file increases the chances
  of reports for that file in subsequent scans being triaged

- One reviewer suggested: "complementing the numbers with a picture of
  the types of files against which these bugs were reported.  Are
  there any interesting correlations there?  E.g., are driver
  developers substantially less likely to respond to bugs than
  filesystem developers?"
  - squeezed in some info about this stuff in the paper; dunno how
    interesting it is, though

- did a bunch of thinking about cannibalizing the first half of my
  canned ICSE '09 submission into an ASE '09 submission
  • I was thinking about submitting some other results that I never submitted to ICSE 2009 last year (from my 2nd-year work on empirical software measurement of Linux development) to ASE in May 2009, and then pushed that draft back to submit in Sep 2009 for ICSE 2010. But I never ended up submitting anything since I got too busy with my MSR internship work that summer.
- transferred over cannibalized portions of ICSE 2009 draft to ASE
  2009 draft, and wrote down comments as notes in the Intro

- create a new directory structure for the updated dataset:

- created first experiment using new dataset, which explores skew in
  developer contributions over time:
  and integrated into paper as a LaTeX table

- started thinking about survival curves for developers here:

April 2009

- started playing around with graph theoretic measures of devs. in:
  and wrote something rough up in LaTex

- re-ran developer lifetimes experiment on new dataset:

- re-ran % of patches written by top 1% devs for files with varying
  levels of patch activity on new dataset:

- updated overall long-tail plot of developer contributions in:
- worked with Joel Brandt on polishing up IEEE Software journal
  version of our CHI paper
- re-ran % of patches of certain sizes and properties written by
  prolific vs. normal devs in:
  and added new comprehensive table to paper draft

- met with roger to help him a bit with improving searcher for

- worked with peter on Klee-UC again to start fixing the broken:
    p is UC
    q = p
    p = NULL
  case ... got it somewhat working but not fully working yet
- look through Coverity Scan log messages to see if I can get any
  qualitative insights for USENIX '09 paper camera-ready
  - hmmm can't really see any useful insights off the bat, since our
    paper only cares about whether reports were triaged or not, not
    whether they were marked as false positive, true bug, etc., so if
    a report wasn't triaged, there's no log message for it :0
  - did get one juicy quote from adrian bunk, though, and put in paper

- look through LKML for developer discussions about coverity checkers:
interesting quotes:

  """The very fact that these "check after use" cases were not fixed
  yet means that they are not hit frequently, if ever, regardless of
  how popular the driver is. This means that we have (relatively)
  plenty of time to fix them. At least Coverity or a similar tool will
  keep reminding us to take a look and decide what must be done after
  we carefully checked the code."""
  """Coverity and similar tools are a true opportunity for us to find
  out and study suspect parts of our code. Please do not misuse these
  tools!  The goal is NOT to make the tools happy next time you run
  them, but to actually fix the problems, once and for all. If you
  focus too much on fixing the problems quickly rather than fixing
  them cleanly, then we forever lose the opportunity to clean our
  code, because the problems will then be hidden."""

  """ Considering the very important flow of patches you are sending
  these days, I have to admit I am quite suspicious that you don't
  really investigate all issues individually as you should, but merely
  want to fix as many bugs as possible in a short amount of time. This
  is not, IMVHO, what needs to be done. Of course, if you actually
  investigated all these bugs and are certain that none of these
  checks can be removed because pointers can actually be NULL in
  regular cases, then please accept my humble apologizes and keep up
  with the good work."""

  Regarding the difficulty of trying to launch an effort to
  systematically audit Linux for security flaws:

  "Read" and "Audit" are two different things.  I can read a
  changeset and see that a[10] got a 20 character string into it; I
  will NOT see that a particular execution path 17 function calls long
  under one obscure but possible and deliberately activatable
  condition causes memory corruption.
  I've been monitoring the kernel security stuff for a long time too.
  There are several obvious trends and I think most are positive

  - Tools like coverity and sparse are significantly increasing the
    number of flaws found. In particular they are turning up long time
    flaws in code, but they also mean new flaws of that type are being

- try to correlate number of patches with number of coverity bugs
  at the time of the initial scan:
  - ugh, don't use these results because they're sorta dubious

- keep debugging Klee-UC simplified-regressions/q-equals-p-2.c
  regression test with peter, and got it to work, yay!

- kicked off a big-ass run of all ioctls with the q-equals-p-*.c
  case working (hopefully)

- dawson suggested throwing out all arch/ files because they might not
  be scanned by coverity
- for the final section of coverity bugs predicting user-reported
  bugs, add in probabilities per file or module of having
  user-reported bugs when there is a coverity bug, to make the
  discussion more concrete and tie it into the quote that developers
  actually use coverity bugs as 'dousing rods' for triaging more shady
  places in the codebase
  - only taking initial scan bugs into account (maybe excluding
    arch/), what's the probability of having a bugfix patch within N
    months for a file with coverity scan bugs vs. file with no
    coverity scan bugs:


    (or, more generally, how many bugfix patches per file?)

- updated this experiment from USENIX '08 submission:
  and stuffed it into the USENIX '09 camera-ready paper

- read over imran's paper draft and give him comments on it:
  "Hard Data on Soft Errors: A Large-Scale Assessment of Real-World
  Error Rates in GPGPU"

- met with dawson to talk about camera-ready draft, he really wants to
  figure out something actionable to say about the results
- added:
to get summaries of triage, resolution, and false positive rates for
different checker types
  - put results in LaTeX by expanding the table to fill up both columns

- re-ran this script:
  and put updated results in LaTeX

- added:
  to record more accurate information about triage/fix/FP rates per

- polish, polish, polish draft
- polish, polish, polish USENIX '09 draft, re-factoring the paper to
  its (hopefully) finalized 8-page form and moving some sections

- sent an updated 8-page draft to george candea for approval, and to
  dawson ... gonna context-switch to some other stuff for the time
  being ...

- found some related work on Linux developers for ICSE '10 draft ...
  mostly [[details omitted]] from economics and social science perspective
  • For the rest of the year (after leaving Klee) I extended my 2nd-year project (empirical software measurement of Linux development) into an ICSE paper to be submitted in Sep 2009. But I ended up not submitting since I prioritized my MSR summer 2009 internship work instead, which itself resulted in a successful ICSE paper.
- met with roger to help him with his Klee cross-checking searcher

- went over some light Klee-UC whiteboard strategizing with peter
- started creating 2009 Computer Forum poster for USENIX '09 paper
  - finished an initial draft with 12 slides, which is the identical
    number that I used for last year's Computer Forum poster
- implemented:
to double-check false positive rates since dawson thought that they
looked unusually high
  - the numbers seem right, though

- read over an on-paper draft of USENIX '09 paper and made some small
  revisions, as well as thinking about some more experiments to run

- added
  for comparing number of lines in Coverity fix and user-reported

- added
  to explore Pr(triage bug in file | previous bug in file was triaged)
  vs.        Pr(triage bug in file | previous bug in file was marked FP)
  vs.        Pr(triage bug in file | previous bug in file was marked true bug)
- integrated results of:
  $ python
into USENIX '09 paper

- updated various TODOs in the paper and tightened it up a bit more

- double-checked coverity false positive results in response to
  email from coverity guys, and resolved some conflicts there
  and edited the paper accordingly
  - since there's so much contention about false positive rates, i'm
    just gonna put their RANKS (so that i can justify calculating
    spearman's rho) and not the actual numbers

- strategized more about Klee-UC with peter at Peet's

- sent latest draft out to cristi and dawson for comments
- diagnosed more Linux bugs with peter using Klee-UC results from:
    pgbovine@hexapod ~/FT-Apps/linux-drivers/2009-04-07-ioctl-run
  found some new bugs, re-found some old ones, and got some new
  insights ... not bad!

- received some helpful comments from cristi about USENIX '09 paper
- integrated cristi's comments into USENIX '09 camera-ready

- finalized poster and present at Computer Forum annual poster session

- look over more Linux bugs with peter using Klee-UC results from:
    pgbovine@hexapod ~/FT-Apps/linux-drivers/2009-04-07-ioctl-run
- continue looking over more Linux bugs with peter using Klee-UC
  results from:
    pgbovine@hexapod ~/FT-Apps/linux-drivers/2009-04-07-ioctl-run

- received comments on USENIX '09 paper from dawson via phone call
- created
  to explore clustering in triaged vs. untriaged reports WITHIN one
  particular scan for every file, sorta like Ted's FSE 2004 paper
  - ooooooh, made a pretty 2-D clustering diagram like Ted's paper ;)

- keep hacking on USENIX '09 paper camera-ready

- met with Tony Dong (Ph.D. student in Construction Engineering and
  Management) for the first time to introduce him to Python and SQLite

- re-calculate the numbers for 'time before triaged' section and found
  that lots of the percentage differences aren't statistically
  significant, [[expletive]] :(
- integrated david maxwell's comments into the paper

- keep hacking on USENIX '09 paper camera-ready

- integrated george candea's comments to try to mention the word
  'Coverity' less throughout the paper

- sent out another draft for dawson to look at
- for USENIX '09 camera-ready, made a pass on a hard-copy and then
  integrated my own comments and bug fixes

- integrated dawson's comments that he sent via email

- sent out an almost-finalized draft to george candea for final
  approval (fingers crossed!)
- double-checked accuracy of related work citations in USENIX '09

- made yet more edits lalala

- took dawson's suggestion to run a last-minute test to try to
  quantify the fact that checker type was most important factor in
  determining whether a bug report will be triaged

- did a late-night grind to integrate more of my on-paper edits,
  TIRED!  zzz
- made a final pass and submitted USENIX '09 camera-ready!!!
- re-read Ardilla paper and some related work on XSS attacks and
  concolic testing for web apps. in order to prepare to make my ICSE
  '09 talk on Ardilla
  - emailed some questions to adam kiezun et al. to ask for some
    clarifications as i begin to create my presentation

- emailed a copy of my USENIX '09 camera-ready to these people:
  - Bill Pugh (UMD & Google)
  - David Maxwell (Coverity)
  - John Penix and J. David Morgenthaler (Google)
  - Tom Ball and Nachi Nagappan (MSR)
  - Suhabe Bugrara
- started creating Ardilla ICSE '09 presentation
  • Since Adam (the first author) couldn't make it to ICSE that year in Vancouver to present the Ardilla paper, his advisor (who was also my undergrad/masters advisor) paid for me to go present the talk instead. It was a fun trip!
- keep working on Ardilla ICSE '09 presentation, finished 1st complete
  draft (22 slides for a 25-minute talk)

- gave a practice talk in front of ridddlr, got some feedback, and
  improved the slides slightly

May 2009

- helped roger debug more router cross-checking stuff

- gave Ardilla ICSE '09 presentation at software lunch and received
  some feedback

- updated Ardilla ICSE '09 presentation using feedback I got

- got Suhabe setup with Klee stuff
  • Suhabe was a 4th-year Ph.D. student who transferred to our lab to work on Klee for the remainder of his Ph.D. years. By this time I was no longer working on Klee but was still getting new lab members oriented with it.
- updated Ardilla ICSE '09 presentation a little bit and sent to
  mernst for review
- updated Ardilla ICSE '09 presentation slightly according to mike's
  comments and also tweaked the color scheme more, then sent him a PDF
  for review
  • Mike was my former undergrad/masters thesis advisor, and also the faculty advisor of the Ardilla and Hampi papers.
- augmented the linux-Feb_2002-Mar_2009.db database with
  checkin messages from GIT repo, so that i can prepare to
  automatically group them into categories

- added:
  where i try to do a better job of automatically classifying log
  messages into categories
  - here are some sources of false categorizations:
    - some patches falsely marked as TRIVIAL:
      - ffc659fb4fc423f4a9570415f825fb81cc16e28f "trivial cases", a
        pretty big patch still, though, so isn't that trivial
      - but some trivial patches are actually pretty big when they
        affect multiple files, so don't discount them simply due to
        size alone ...
    - hmmm, add 'cosmetic' as a synonym for TRIVIAL (probably place it
      down pretty low priority so that it doesn't override other stuff)
    - some BUGFIX_SPARSE FPs:
      - 02e10b90cd478bda81b4644102b0009bcd1d14ab
        - 'This produces a very sparse radix tree'
        - This modified 24 files with +209,-120 lines, so it's a HUGE
          patch for SPARSE and likely not to be a true SPARSE fix
      - 02eaba7ffd145ef1389f4adc420f94af20ab3068
        - also there are some BIG patches that are flagged as
          'miscellaneous' (or 'misc') fixes that happen to include SPARSE
          fixes, but we shouldn't classify those as SPARSE per say
      - 'FIXED' by implementing a hacky filter that disallows
        BUGFIX_SPARSE (and BUGFIX_COVERITY) patches over 100 lines

  - my goal is to reduce the number of unclassified ('OTHER') patches
  or at least think about how I can re-classify them under a new
  category ... well I'll call it REGULAR_UPDATES for now ;)
    - 1/3 of patches are 'uncategorized', which isn't too horrible
- randomly-sampled 100 elements out of all categories and counted
  number of falsely-classified reports:
    - 807c453de7c5487d2e5eece76bafdea8f39d249e
    - 142956af525002c5378e7d91d81a01189841a785
    - bead9a3abd15710b0bdfd418daef606722d86282
  - TRIVIAL:         3 FPs
    - ba35e02bdcbd3a25238421a7e20efdb69436d3cf
    - ee1aaacdf386f0f786de11524995b0513a314c8c
    - 8e7a9aae91101916b86de07fafe3272ea8dc1f10
  - REVERT:          1 FPs
    - ecc21ebe603af31f172c43b8b261df79040790ef
  - REFACTORING:     12 FPs
    - 810b38179e9e4d4f57b4b733767bb08f8291a965
    - 0036731c88fdb5bf4f04a796a30b5e445fc57f54
    - 8ce13b01c190a7e663ea146ba89d1153240884bb
    - 5453c1a575df6232f84fad1502b2d559909265cc
    - 5d52013cbb3d39bde9f5a6023193058eeb112e98
    - a3efe8ce9136e9179d5831cd82400da644a9fdca
    - 71efc9104156892215aa6564fd5dbc8ac61e08b8
    - 7d12e780e003f93433d49ce78cfedf4b4c52adc5
    - 83f03fa5adbad0a829424241ad24ef9e4b4ba585
    - 64194c240466500ab7e1c36ac6df402e46c682de
    - 6abd219c6eab92172038e4719e02905acb7b68d4
    - 270acefafeb74ce2fe93d35b75733870bf1e11e7
  - CLEANUP:         7 FPs
    - 9a8d248e2d2e9c880ac4561f27fea5dc200655bd
    - 8f5c7579caba587a72ed91c7d76028efb8adb168
    - 6d4bf8d1ef2c785d7a056c528ea8ea3608fa9439
    - defba1d8f233c0d5cf3e1ea6aeb898eca7231860
    - 5d96551541a8f5521dcc8c634a18d42a3d349ec9
    - f16b34aa13e8c55085f346bcf07afb2312c56c0a
    - 5b057c6b1a25d57edf2b4d1e956e50936480a9ff
    - f212a548a8d378ab5ba3fda265ba3624c670ab72
    - 29dd3609f2fc70e02ce48123485b14d9417b834f
    - 4f4cf51e2d20ca2952ec6d9b5d497dbbb0df7e60
    - 06bf2f495aabfdbe9d5db7b910fa75dd7f72131a
    - 8419dc8a34d765f2ff1aa4051084d54cfeff09cf
    - 897cc188f7f0e402b92a4a6a9e234b45c612eb42
    - 65d3618ccfe686e8d7b3f01a838d0578182406df
    - ab56dbddc8a23ff3f4602855aaf0fcb3c814118b
    - e423b9ecd6aa434ce9ba72a21fdc61079e620e0a
    - a8bbf12ad8a8ad532cea0b67f0127ad90d336b04
    - ef7eeb693e590be8d7087a909daeadd64342d962
    - 684753599afc76aa8f66c731bafb7204b39265b8
    - 5d5d7727a8cde78f798ecf04bac8031eff536f9d

- drats, some patches are COMPOUND - like they combine the effects of
  several changes, e.g., some trivial and some cleanup, etc.

- keep refining classifier until we get fairly low FP rate
- wrote up methodology for categorizing patches in ICSE '10 draft

- filled out NDSEG 3rd-year fellowship report

- updated paper to ICSE '10 LaTeX template

- calculated the relative percentages of each category of patch
  written by prolific vs. normal developers, for several canonical
  splits --- 1%/99%, 10%/90%, 20%/80%, 50%/50%
  - and also how patch category percentages change over time
  • I was funded by the NDSEG and NSF graduate fellowships for five of my six years of Ph.D., which gave me tremendous flexibility in selecting projects. I was very fortunate.
- started typing up notes for related work for ICSE '10 draft

- created a cross-checking HTML reporter for roger's router
  cross-checking project:

- reading some related work on wikipedia visualizations by Fernanda B.
  Viegas et al. and some other random papers to write up some more
  related work in paper draft

- filled out MSR internship relocation form
- updated ICSE '10 draft with stats for developers who have only
  written ONE patch versus those with more than 1 patch
  - seems like a good intuitive split because it answers the question
    of "if you were some dude who just popped your head in and wrote
    ONE patch, what would that patch look like?"

- created:
  to track the aggregate progress of developers over time as they
  contribute more and more to the codebase
  - hmmm, there doesn't seem to be a super strong experience effect -
    in fact, lots of people write lots of patches clustered up-front,
    and then die down and build up again
  - the effect across TIME doesn't seem that regular, so I'm gonna
    explore the effect across the first N patches, for those devs. with
    at least N patches (also in same directory)
    - eh looking at the first 20 patches of devs. with >= 20 patches
      seemed to show some decent patterns
- incorporated Mike's comments for Ardilla slides

- slightly improved organization for ICSE '10 draft in preparation for
  handing a draft to Dawson on Friday
- gave TWO Ardilla (ICSE '09) practice talks, one at Stanford security
  lunch and the other at Samsung - revised slides slightly afterwards

- make plots and tables for:
  in ICSE '10 draft

- created:
  which tracks the average number of OTHER developers that have
  subsequently modified files during someone's first N patches (whoa
  that sounds confusing, I need to clarify better)
  - crap, that effect might just be due to the fact that there is less
    time available to make patches as N gets higher, or maybe not ...
    well I extended N to 50 patches to see that it does LEVEL OUT and
    doesn't really dip towards 1, so the effects of less time being
    available isn't too ridiculous
- try to polish up ICSE '10 draft a bit more before handing it over to
  dawson tomorrow
- for prolific vs. normal devs, find mean/median number of files/dirs
  modified by each developer:
  - great clean signal!

- simplify bugfix categories by smushing BUGFIX_SPARSE and
  BUGFIX_COVERITY all into the general bugfix category, for
  simplifying the exposition

- added a histogram showing relative amount of committer experience
  (in terms of num. days committing patches at the time each patch was

- finished the first proto-draft and emailed it out to dawson before i
  leave for vancouver tomorrow :)
  • travelled to Vancouver to give the Ardilla talk at ICSE 2009
- revised Ardilla presentation based on adam kiezun's comments
- revised Ardilla presentation based on Mike's comments
- delivered Ardilla presentation at ICSE '09
- emailed out a proposal for a Dr. Dobb's Journal article: "The
  benefits of programming with class invariants, and how to implement
  them in Python using decorators and metaclasses"

- skimmed through lots of ICSE '09 papers and typed up notes,
  concentrating on parts that could be relevant to my ICSE '10 draft

- integrated some relevant tidbits from ICSE '09 papers into my ICSE
  '10 draft

- trolled some kernel-related mailing lists looking for qualitative
  insights, but it was skim pickings :/
  - kernel mailing lists usually discuss pretty grungy
    implementation-level issues and nothing really meta-
  • ha, remember when Dr. Dobb's was a thing? I wanted to write a technical programming article for them (unrelated to my main research, but it felt like a productive way to procrastinate). They never got back to me, though.
  • ha, it's trawled, not trolled. I didn't troll the mailing list!
- went over my ICSE '10 draft on paper and made some revisions

- designed and emailed another questionnaire to kernel developers
- met with roger to talk about writing up his end-of-year report on
  his cross-checking work

- started making USENIX '09 presentation
- integrated 3 more related works [[redacted]] into ICSE '10 draft

- downloaded all related work papers and cleaned up bibliography

- sent out questionnaire to 2 more auxiliary kernel mailing lists
  (kernel mentors and kernel newbies)

- met with dawson to talk over my ICSE '10 draft; he gave me lots of
  useful insights that I've transcribed into draft and work-TODO.txt
- revised and delivered USENIX '09 presentation at software lunch

- top 10 most prolific linux patchers and their # patches written (not

  Andrew Morton|7456
  David S. Miller|4279
  Alexander Viro|3744
  Russell King|2547
  Greg Kroah-Hartman|2368
  Adrian Bunk|2299
  Linus Torvalds|2227
  Alan Cox|2058
  Dave Jones|1990
  Stephen Hemminger|1883

- top 10 most prolific committers:

  Linus Torvalds|48463
  David S. Miller|18615
  Greg Kroah-Hartman|10605
  Jeff Garzik|8549
  Ingo Molnar|6321
  Russell King|5858
  Mauro Carvalho Chehab|5349
  James Bottomley|4859
  Jaroslav Kysela|4204
  Paul Mackerras|4188

- hmmm, trends for Top 10 authors don't look particularly interesting
  (in addition to top 1% authors) ... nix it

- addressed some TODOs in ICSE '10 paper from dawson meeting yesterday
- wrote first draft of introduction for ICSE '10 paper
- emailed some kernel devs personally asking whether they could
  respond to my questionnaire

- wrote first draft of related work section for ICSE '10 paper

June 2009

- made one edit pass thru intro and related work for ICSE '10 paper

- emailed the intro/related work draft to a few people for review:
  derek rayside, joel brandt, adam oliner

- wrote methodology section

- added ~/vcs-mining/2009-03-10-dataset/experiments/
  to assess prolific vs. normal dev contributions by module

- re-factored tables and prepared ICSE '10 draft to get it in shape
  for writing the first section of results: prolific vs. normal devs.
- started consolidating developer responses to my questionnaire

- used committers as an explicit class of prolific (elite) developers,
  and re-ran some numbers using them

- started writing up "Section 4: Roles of prolific vs. normal
- finished writing first crappy draft of "Section 4: Roles of prolific
  vs. normal developers"

- attempted to write "Section 5: Socialization of individual
  developers" and re-organize some data tables, with not much avail :(

- sent ICSE '09 stuff to [[admin]] to get reimbursed

- wrote a follow-up email to kernel dev Greg KH who responded to my
  questionnaire with more questions than answers :0
- finished writing first REALLY CRAPPY draft of ICSE '10 paper
  • I spent the past month working pretty much full-time on this ICSE draft (which extended my 2nd-year empirical software measurement work on Linux kernel development patterns) since I wanted to get it done-ish before starting my MSR internship. I planned to submit it for the Sep 2009 deadline. It never ended up really coming together, though, so I didn't submit it.
- made a first-pass edit on really crappy ICSE '10 draft

- knocked off a bunch of small TODOs in paper draft
- made another pass through ICSE '10 draft, making it less crappy this
  time around, and sent it out to dawson for comments
- wrote Limitations section of ICSE '10 draft, and did some other
  minor polishing
- made one more round of on-paper edits in 3 results sections

- refactored and beefed-up the introduction

- sent out a draft to dawson and cristian
- sent out an ICSE '10 draft to adam oliner

- read over roger liao's router cross-checking write-up draft and gave
  him comments on it
- received on-paper comments back from adam oliner and integrated some
  of them

- made a webpage of data sources for ICSE '10 paper and uploaded:
- read over first few (short) chapters of cristi's Ph.D. thesis and
  marked down some on-paper comments
- got feedback from Dawson about USENIX '09 presentation and about
  research in general

- got feedback from cristi about ICSE '10 submission
  • The academic year ended around then, so I went home to Southern California and drove to San Diego to give a talk on my USENIX 2009 paper.
- dreamed dreamed and dreamed about writing a Klee that targets Python
  programs and emailed dawson
- got dawson's tentative approval for a 'Klee for Python' and started
  a Wiki page here:
- read 2 papers on plane ride from LAX -> Seattle, which are the
  closest related work on concolic execution for high-level dynamic
  - Wassermann's ISSTA 2008 paper on dynamic test input generation for
    PHP web apps
  - Artzi's 2009 tech report on Apollo tool (extended version of ISSTA
    2008 paper) on dynamic test input generation for PHP web apps
  • Flying to Seattle to begin my summer 2009 MSR internship marked the official end of Year 3 of my Ph.D.
- read Pacheco's Randoop paper and jotted down notes on it and a few
  other papers in:

- emailed dawson my summer plans of moonlighting on churning on ideas
  about KleePython

- started creating unit tests for EventBrowser application and jotting
  down notes for how that process could be automated:
- thought quite a bit about building an Ardilla-Lite for PHP which
  would be much simpler than regular Ardilla and could still be just
  as effective - see my notes in:
    ~/MyDocuments/Documents - Summer 2009/Ardilla-lite-notes.txt
  Key insight: we don't need to get CONCRETE inputs, so we don't need
  a powerful constraint solver ... all we need is to get down certain
  - the hope is that branches in PHP apps mostly involve strings and
    not really integers and such, so no need for a powerful integer
    constraint solver
- wrote up my plans for Klee-PHP in a semi-coherent design document:
    ~/MyDocuments/Documents - Summer 2009/PHP-attack-gen-design.txt
  • At the start of my MSR internship, I spent some nights and weekends toying with the idea of returning to Klee once more, but this time focused on more dynamic languages like Python and PHP rather than on low-level C/C++, which Klee currently targeted. That dream dissolved pretty quickly as I got deeper into my MSR work, which would directly inspire the start of my Ph.D. dissertation work in Year Four and beyond.
Subscribe to email newsletter
Donate to help pay for web hosting costs

Created: 2019-07-17
Last modified: 2019-07-19
Related pages tagged as Ph.D. student life: