Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

Hey, your Python code is unreadable!

Summary
I argue that code written in Python is not necessarily any more easily readable and comprehensible than code written in other contemporary languages, even though one of Python's major selling points is its concise, clear, and consistent syntax. While Python code might be easier to comprehend 'in-the-small', it is not usually any easier to comprehend 'in-the-large'.

Blasphemy

Let me begin by pointing out that Python is my favorite language to program in, the one I use by far the most frequently, and the one I evangelize most profusely (especially for teaching to beginners). That said, I'm now going to make a blasphemous claim:

Code written in Python is not any more readable 'in-the-large' than code written by the same programmer in other contemporary languages.

Blasphemous! Python evangelists have long touted the virtues of clean Python syntax to brag about how much more readable and easily comprehensible its code is than code written in other contemporary languages (e.g., Perl, Ruby, Java, C++). Although I agree that the simplicity, consistency, and conciseness of Python syntax are among its greatest strengths, what I will argue in this article is that these virtues only help people trying to understand code 'in-the-small' rather than 'in-the-large'.

Understanding code 'in-the-small'

Understanding code 'in-the-small' means being able to look at a screenful of code in an editor and understand what specific detailed task it's supposed to perform. Python evangelists use toy examples such as simple algorithm implementations to demonstrate how easily comprehensible the 20 lines of Python are versus, say, the analogous 100 lines of Java. I don't take issue with these claims at all; in fact, that's one of the reasons why I advocate teaching Python to beginners. I strongly believe that Python code is easier to understand 'in-the-small' than code written in other contemporary languages.

When code is self-contained and can be understood as a whole from start to finish, Python shines. Examples of self-contained code include core algorithms, one-page scripts that do simple file processing, and inane 20-line toy examples that people post on their blogs to show the 'expressive power of Python', which inevitably get posted onto geek news websites (e.g., "look, I made MapReduce in Python in 20 lines!!!" or "check out Y-combinator in Python!!!").

I don't mean to be facetious. Code readability 'in-the-small' is not a trivial virtue: Because Python allows programmers to be so expressive while writing so little code, oftentimes a small 50-line program that fits on one screen is sufficient for performing a useful, real-world scripting task. If you imagine the number of programs with a certain level of complexity as a 'long-tail' distribution, then the vast majority of programs people write (especially in languages like Python that encourage lightweight hacking and scripting) will be extremely short! There are only a dozen or so monolithic programs (e.g., operating systems, office productivity suites, web browsers); there are far, far more programs on the long tail that are short scripts hacked to scratch specific itches, and this is precisely the domain in which Python excels.

Understanding code 'in-the-large'

If you want to extend or tweak any moderately complex program, then you will end up having to understand much more than a single screenful of code in isolation; you will need to understand the program's structure and organization 'in-the-large'. In my daily research work, I sometimes need to modify scripts that my colleagues have written (mostly to do data processing and report generation). I've found through these experiences that Python's concise and clear syntax provides no advantage when trying to understand someone else's code 'in-the-large' (i.e., understanding how an entire program works rather than an isolated code snippet).

Here is an example code snippet from a ~1000-line script that parses data generated by gcov and prints out an HTML report:

# Invert allCovData into a form for easy processing by output
# files. path -> line -> [lines]
def remap((path,data)):
    lineData = {}
    for wf,lines in data:
        for ln in lines:
            lineData[ln] = lineData.get(ln,set())
            lineData[ln].add(wf)
    return (path,lineData)
covMap = dict(map(remap, allCovData.items()))

An experienced Python hacker will be able to understand what this code is doing 'in-the-small' (the comments also help those like myself who are less proficient). After this snippet executes, covMap is some altered version of allCovData. Great, I could fully understand 100% of code snippets such as this one but still have no idea how the program works 'in-the-large'! As a corny analogy, I can easily see and comprehend all the individual trees, but I still won't know the properties of the forest.

What's contained within allCovData and covMap (which I presume are both dicts)? What are the types of the keys? What are the types of the values? More importantly, what is the meaning of the keys, values, and their mapping? What do these objects represent in the grand scheme of the entire program, and how can I best leverage them to do what I want to do? Unfortunately, nothing short of having the programmer write high-level comments and/or personally explain the code to me can possibly provide me with such knowledge.

It's not Python's fault, though; I would've faced the same comprehension barriers with analogous code written in Java or C++. Nothing is wrong with Python in this regard, but unfortunately its clear syntax cannot provide any advantages for me when trying to understand code 'in-the-large'.
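
For anyone who wants to poke at the remap snippet above, here is a minimal Python 3 sketch of it run on made-up data. The tuple-parameter form def remap((path,data)) only works in Python 2, so the pair is unpacked inside the function instead. The sample path and the 'run1'/'run2' labels are invented purely for illustration; the input shape is inferred from the loops alone, and the original code never says what wf stands for.

```python
# Python 3 rendering of the remap snippet. The logic is unchanged;
# only the tuple-parameter unpacking has moved into the function body.
def remap(item):
    path, data = item
    line_data = {}
    for wf, lines in data:
        for ln in lines:
            # Collect every wf that mentions this line.
            line_data.setdefault(ln, set()).add(wf)
    return (path, line_data)


# Made-up input (an assumption): path -> list of (wf, lines) pairs.
all_cov_data = {
    "a.c": [("run1", [10, 11]), ("run2", [11])],
}

cov_map = dict(map(remap, all_cov_data.items()))
print(cov_map)
```

With this input, line 11 of "a.c" ends up mapped to both labels. Running it shows the mechanics 'in-the-small', but it still says nothing about what the keys and values mean in the real program.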

The curse of dynamic typing and language flexibility

It took me a long time to make even a minor change to the gcov-parsing code, not because it wasn't well-written (my colleagues are far better hackers than I am), but simply because it wasn't well-documented. Languages like Python are great for hacking up small scripts, since there's no compiler or strict static type system to get in your way; however, the curse of dynamic typing and extreme flexibility is that you aren't forced to explicitly document your assumptions. For example, here is a function definition from the gcov-parsing code:

def parseGCovFile(gcovPath, baseDir):
  ...

There are no comments telling me what to expect from the arguments and return value (i.e., pre-conditions and post-conditions). At least in Java, I would've been able to ascertain their static types. I can surmise from the variable names that this function expects some path to the gcov-generated file and some base directory path, but when I call this function, in what format can I expect the returned data? I have no freakin' clue! So logically, I grep for parseGCovFile to find its callers, and I find one call site here:

entries = []
for res in kGCovFileAndOutputRE.finditer(data):
  path,output = res.groups()
  path = os.path.abspath(path)
  entries.append((path, parseGCovFile(output, basePath)))

Okay, so it appears that the results of parseGCovFile get appended to some sort of list. What happens to that list?

return GCDAData(os.path.abspath(gcdaPath), entries)

It gets passed into the constructor of the GCDAData class, which itself doesn't have any comments either.

My purpose in bringing up this example was not to disparage the code or programmer; on the contrary, this code is as clean and concise and understandable 'in-the-small' as any talented Python hacker could write. My point is that, without proper comments and documentation, even the cleanest Python code is incomprehensible 'in-the-large'.

(To be fair, static types aren't a panacea either: If I showed you the same Java code filled with type definitions, then it would be easier to understand what this function is doing in terms of its concrete types, but without comments, you still won't be able to understand what this function is doing in terms of its actual underlying purpose, which inevitably involves programmer-intended 'abstract types'.)
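
The gap between concrete and 'abstract' types can be sketched in Python itself using optional type hints. Everything below is hypothetical: the function names, the 'line:count' input format, and the LineNumber and HitCount types are invented for illustration and have nothing to do with the gcov script. The first signature gives only concrete types; the second adds names and a docstring that carry the intended meaning.

```python
from typing import Dict, NewType

# Hypothetical abstract types; at run time both are plain ints.
LineNumber = NewType("LineNumber", int)
HitCount = NewType("HitCount", int)


def parse_counts(text: str) -> Dict[int, int]:
    # Concrete types only: the signature says "int to int" and no more.
    result = {}
    for row in text.splitlines():
        line_no, hits = row.split(":")
        result[int(line_no)] = int(hits)
    return result


def parse_counts_documented(text: str) -> Dict[LineNumber, HitCount]:
    """Map each source line number to how many times it executed.

    `text` uses an invented 'line:count' format, one entry per row.
    """
    raw = parse_counts(text)
    return {LineNumber(k): HitCount(v) for k, v in raw.items()}


print(parse_counts_documented("1:3\n2:0"))
```

Both functions return the same dict; only the second tells a reader what the integers are supposed to mean.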

The curse of Python's ease-of-use

Having a language with the expressiveness and flexibility of Python encourages people to hack and hack without writing many comments, because it's so easy and fast to get code up-and-running bug-free! (Well, don't just take my word for it; read ESR's testimonial!) Hackers are more likely to get the code working properly the first time around, so there's little motivation to comment judiciously.

People (myself included) tend to write Python code like they write blog posts: quickly, furiously, passionately, and without careful up-front planning or down-the-line editing or proofreading. Such 'fast-and-furious' uses of Python are wonderful for expressive creativity and productivity when only one person deals with the code, but whenever someone else tries to work with it, most of the advantages of Python's syntactic clarity are lost. Thus, it is ironic that the ease and rapidity of development in Python works counter to the goal of writing clear and well-thought-out comments.

The cliched remedy: write more comments and asserts

So how can you make your Python programs more understandable 'in-the-large'?

I've personally gotten into the habit of writing copious amounts of comments and especially assertion statements (a.k.a. 'executable comments') in my Python code, sometimes more than actual code itself; the dynamic nature of the language means that I am not forced to document much of anything, most notably types (e.g., "Wanna add a new field to your object? Sure, just do it at run-time!"). For instance, whenever I create a non-trivial dict, I try to write comments describing the keys, values, and relationships between them; whenever I write a function, I try to document the types and assumptions about its parameters and return values.
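
As a minimal sketch of what this habit can look like, here is a small function with its assumptions written down as comments and assertions. The function name, parameter, and data are all hypothetical and are not taken from the gcov script; the point is only the style of documenting a dict's keys and values and a function's pre-conditions and post-conditions.

```python
def invert_coverage(entries):
    """Group file names by the line numbers they mention.

    Pre-conditions:
      entries is a list of (file_name, line_numbers) pairs, where
      file_name is a non-empty str and line_numbers is a list of
      positive ints.
    Post-condition:
      returns a dict mapping each line number (int) to the non-empty
      set of file names (str) whose line_numbers contained it.
    """
    assert isinstance(entries, list)

    # by_line: line number (int) -> set of file names (str)
    by_line = {}
    for file_name, line_numbers in entries:
        assert isinstance(file_name, str) and file_name
        for ln in line_numbers:
            assert isinstance(ln, int) and ln > 0
            by_line.setdefault(ln, set()).add(file_name)

    # Executable comment: no line number maps to an empty set.
    assert all(by_line.values())
    return by_line


print(invert_coverage([("a.c", [1, 2]), ("b.c", [2])]))
```

Note that Python skips assert statements when run with the -O flag, so assertions like these document and check assumptions during development rather than validate untrusted input.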

None of these practices should be at all surprising to an experienced programmer, but I feel that they are even more crucial (albeit more difficult) habits to develop when programming in a language like Python, which provides so much flexibility and freedom, than when programming in a language like Java, which forces programmers to explicitly declare lots of information about program types and structure.

Created: 2008-10-31
Last modified: 2008-11-09