Friday, April 13, 2012

The Naturalness of Computer Code

I am not proud of the gap in this journal. I have been taking the NLP class offered online by Stanford professors Jurafsky and Manning, and the class, together with my teaching and my laziness, has contributed to the gap. But I am compelled to write about my thoughts on the naturalness of computer code. This is in response to two papers I have recently read: Learning a Metric for Code Readability by Buse and Weimer (together with a presentation of a paper by Posnett that extended this work), and On the Naturalness of Software by Hindle, Barr, Su, and Devanbu. They touch on a topic I find I have strong opinions about, and this journal entry is my first attempt to work those opinions out.

The second paper states a conjecture that "most software is also natural, in the sense that it is created by humans at work...". The first attempts to derive a metric that is predictive of a human's ability to comprehend code. I would suggest that neither goes far enough in its thesis. I propose that all software is a specialized type of non-fiction writing that fits into a continuum with other non-fiction writing. That is to say, there is no bright line between computer code and other forms of human utterance committed to paper for the purpose of achieving some economic (as opposed to artistic) end. I must confess I am not entirely convinced of the latter part of this assertion, but I will take on the first part before considering the second.

Casting a piece of computer code in this light, let us consider some of the maxims of non-fiction writing. The most oft-repeated is to know your audience. Most people naively believe that the audience for a piece of computer code, if they think of it in those terms at all, is the computer itself. That assertion fails almost as quickly as it is uttered. It cannot be denied that the code must be "comprehensible" to the compiler and that the language emitted by the compiler must be understood by the machine it targets. But if machine understanding were all that is required, there would not be so much ink spilt over the language debates. In fact, I believe I can support the assertion that human comprehensibility may be the single most important factor in achieving "high-quality" code (whatever that term may mean to you). Computer code has many different human readers. The first is the author themselves. Mechanistically, writing is viewed as some form of transcription from the author's mind to an artifact. But anyone who has proceeded beyond high school English understands that writing is not simple transcription from some internal form to its external expression. Writing is thinking; writing is editing; writing is analysis; writing is synthesis. All of these processes occur when one writes computer code, and to deny that is to deny the human element of the software creation process.

Even once the product has been committed to some final form that is acceptable both to the author and to the machine in which it will become embodied, human eyes are destined to read it again. As our software artifacts live longer lives, the need to modify the code for correction and extension becomes more common. Given the separation in time between the original authorship and the modification, even when the author is the one making the change, they are a different person. Code modifiability is thus a software quality with real economic impact on the owner, and metrics for this quality are worthy of study. This is the justification for the research on code readability, one important component of modifiability.

A central thesis of the SEI is that software structure is dictated not by functionality but by the required qualities. (OK, that is my statement of their thesis.) That thesis is nowhere as clear as when attention is given to code readability. Buse, Weimer, and Posnett identify many traits that are frequently encouraged in texts about good software writing; intelligent use of whitespace, good naming standards, and complexity of statement structure are three that suffice to make my point. Two programs with exactly the same functionality, in the same language, and with the same qualities in use can differ dramatically in readability through the manipulation of these three variables alone, as the sketch below illustrates. I think Posnett showed this quite convincingly with his illustration of a piece of code formatted into the shape of the letter pi.
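
To make the point concrete, here is a contrived pair of routines (my own sketch, not an example from the papers). Both compute the same thing; they differ only in whitespace, naming, and statement complexity.

    # Dense form: one expression, names not worth the ink, and the
    # even-number filter computed three separate times.
    def f(a,b):
     return (sum([x for x in a if x%2==0])/len([x for x in a if x%2==0]) if len([x for x in a if x%2==0])>0 else b)

    # Readable form: same functionality, same language, same behavior.
    def average_of_evens(numbers, default):
        evens = [n for n in numbers if n % 2 == 0]
        if not evens:
            return default
        return sum(evens) / len(evens)

Any readability metric worth its salt ought to separate these two, and a human reader does so instantly.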

It is obvious that non-fiction writing also needs readability and, in many cases, modifiability. Organizational procedure manuals, civil codes, want ads, and web sites are only four of a multitude of human media that depend upon modifiability to achieve their purpose in our society. It is no accident that the structure of these artifacts becomes as important as the text (and images) they contain. Processes and procedures exist to ensure consistency, accuracy, integrity, and availability. Is it just me, or does this begin to sound a lot like computer code? After thinking about this a great deal, I am convinced there is no meaningful distinction to be made here. So, returning to the theme of "know your audience", we can directly confront the challenge of the comprehensibility of code.

Writers strive for clarity in non-fiction writing. Clarity is high when the reader follows the text without confusion, frustration, or the need to put it aside to check some reference. The need is the same in computer code. A complicating factor, especially when teaching how to program, is the invisibility of any other human reader and the lack of temporal distance between the author who writes the code and the author who modifies it. This robs the student of the chance to experience the weaknesses in their own writing. Grader comments help, no doubt, but they are often lost in the race for a final grade and in the absence of meaningful experience in which to assimilate them, assuming of course that the grader has done anything more than enforce a few rubrics given in class. Students rarely learn how to write good code until they have been employed for a time and taught by journeymen in the art of clear computer code.

This art and its maxims are not unknown, at least in part. As the Buse argument goes, complexity plays a part. This cannot be a surprise, since it plays just as large a part in non-fiction text. My own non-fiction prose once had the complexity and rhythm of computer code. Ditty-ditty-dah. Ditty-ditty-dah. The lack of any grace in my text embarrassed me even when it was perfectly functional prose. But that has as much to do with art and the finer aspects of writing as it does with clarity. That text was completely clear and comprehensible, just boring and artless. I was not born a good writer and do not consider myself good now, just better. The tie between complexity and art is that artistic effect comes from the near-infinite variation possible in human language and the way that variation plays upon the reader.

To be clear and artful is a laudable goal for human writing but far beyond what is needed for computer code. Yet the ability to vary complexity exists just as much in computer code as in non-fiction text, and the audience's needs should govern it. If I am writing exemplar code for a beginning programming course, I will use very simple statements that never go beyond ditty-ditty-dah: verb-object constructs in a procedural language. But I am just as capable of railing at code written by very sophisticated programmers who manage to chain a dozen tokens together with dot constructs, making the code all but incomprehensible to anyone but the author, assuming even they remember how it all worked. The complexity of the statement structures is invariably dictated by the needs of both the author (who, after all, is the one reader who is never ignored) and anyone else who has influence over the author. The contrast looks something like the sketch below.
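
A hedged illustration, with the file name and column layout invented for the purpose. First the chained form, then the exemplar form I would show a beginner.

    # A dozen tokens chained with dots and brackets: the total of the
    # third column of a CSV file, minus its header row.
    total = sum(float(line.split(",")[2]) for line in open("orders.csv").read().strip().splitlines()[1:])

    # The same work as verb-object statements, one thought per line.
    file = open("orders.csv")
    text = file.read()
    file.close()
    rows = text.strip().splitlines()
    rows = rows[1:]                      # drop the header row
    total = 0.0
    for row in rows:
        fields = row.split(",")
        total = total + float(fields[2])

Neither is wrong for every audience; the first is wrong for mine.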

I did not see it cited in the papers I have read so far on code readability, but I assume there was some influence from the work on text readability by Flesch and others (William F. Buckley comes to mind). While I do not completely buy into their reduction of text readability to a single number, I must accept its practical use over large corpora, especially when they come from the same or similar authors, as with newspapers. I believe the number gives an indication, but the true behavior is more subtle. Still, given its ease of use, it is a good place to start, and I take the Buse direction in that same spirit. "You cannot manage what you cannot measure" (who originally said that?) throws down a gauntlet to empiricists, and I doubt Buse, Weimer, or Posnett believe they wrote the final chapter in that book. But I do not want to get lost in the thicket of numbers before reflecting on the intuition behind the factors that detract from code readability. For reference, Flesch's formula itself is simple enough to sketch.
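
Here is a minimal sketch of the Flesch Reading Ease score. The syllable count is a crude vowel-group heuristic of my own, which is exactly the sort of approximation these scores tolerate over large corpora.

    import re

    def flesch_reading_ease(text):
        # Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        # Crude heuristic: each run of consecutive vowels counts as one syllable.
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
        n = max(1, len(words))
        return 206.835 - 1.015 * (n / sentences) - 84.6 * (syllables / n)

Higher scores mean easier text; Buckley's prose would score accordingly low.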

I can't resist a digression into a discussion I had with a colleague about a recent article in ACM about Turing. He was derisive of the article, in large part because of the obtuse nature of its prose. It was a direct example of how a writer must not become self-indulgent in style but must act as a servant to the reader. There was no doubt that this author had allowed a stuffy, English-bred Oxford style of prose to permeate the argument to the point where only the most devout reader was likely to extract whatever wisdom existed in the piece. Before we could even have begun to discuss the merits of the argument, we would have been forced to agree on what it was he was trying to say. In the end it just wasn't worth the work to us that day.

This brings me to the present, where I am looking at the successes of NLP and the assertion that code is a form of natural language that will lend itself to the same processes that have worked for other natural languages. I find resistance in myself to completely embracing this research direction, and that resistance is the motivation for this post. Yes, I see the value in the readability metric, but it is not very difficult to construct code that generates a "good" number on that metric yet is utterly incomprehensible. The number is far from complete, and I have reservations about whether comprehensibility can ever be captured in a number, even as I am driven to attempt to find one. Before I pursue the quantitative, I want to do some qualitative work in this direction.

For instance, why does white space matter? We know it does, but I have not seen a good discussion that demonstrates an understanding of something so basic. This brings to mind lessons from layout design and the language of visual communication. Computer science is far too focused on the linear model of a one-dimensional string of text, but code is never comprehended through a linear reading. Our visual mechanism is highly adapted to chunking, and the insertion of white space is no different from punctuation in providing structure to the code. We insert it to make paragraphs of the statements we write, and, as with a paragraph of prose, we almost immediately reduce each one to some abstraction of itself. We could instead reduce it to a subroutine or another linguistic device that explicitly shrinks it to a smaller number of tokens. But that loss of linearity can detract from the flow of thought and decrease, rather than increase, readability. I experienced this when I was illustrating how three numbers could be sorted by a series of if statements and one student introduced subroutines to capture the recurring statements. My instinct rejected that construction: the parallelism of the exercise was lost in the distraction of referring to non-inline code. There are times when cohesion is maximized by the absence of abstraction. The two versions might look like this.
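
My reconstruction of that classroom example, from memory. First the inline cascade, then the student's abstraction.

    # Sorting three numbers with a cascade of if statements. The repetition
    # is deliberate: the same compare-and-swap appears three times and the
    # eye grasps the whole at a glance.
    def sort3(a, b, c):
        if a > b: a, b = b, a
        if b > c: b, c = c, b
        if a > b: a, b = b, a
        return a, b, c

    # The student's version: the recurring statements become a subroutine.
    def ordered(x, y):
        return (y, x) if x > y else (x, y)

    def sort3_abstracted(a, b, c):
        a, b = ordered(a, b)
        b, c = ordered(b, c)
        a, b = ordered(a, b)
        return a, b, c

Both are correct, and in most settings the abstraction would be a virtue. In a lesson whose whole point was the visible parallelism of the comparisons, it was a distraction.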

A pet peeve of mine is the poor use of vertical formatting in many pieces of code. We are all familiar with the need to "pretty print" control structures to clearly show the conditional blocks. The eye naturally sees these as the coherent blocks they are and immediately grasps which are "inside" and which are "outside". But I have often seen, and been complimented on, how a long statement's comprehensibility can be improved by judicious line continuation and vertical, columnar alignment of repeating constructs. Take a complex if statement with many predicates, but one with a parallelism that makes it easy to understand once that parallelism is seen. If you put those predicates in a long list with haphazard line continuation, the communication of that parallelism is all but lost. A text formatter can easily give you proper indentation for control structures, but one that can bring out the parallelism of this kind of statement cannot, simply because the first task is purely syntactic while the second is semantic. The comprehensibility of the two code segments below will be very different even though their numeric readability scores can be made identical. At its core, comprehensibility cannot be divorced from the human communication being performed by these utterances. To focus too quickly on the numeric assessment of a piece of code, in my mind, distracts from the inherently human activity that is the ultimate aim of the metric.
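
Here is what I mean, with every name and threshold invented for illustration.

    MINIMUM, CEILING, FLOOR, LIMIT = 100, 10_000, 5, 500   # hypothetical limits

    # Haphazard continuation: the parallelism of the predicates is buried.
    def eligible_flat(age, balance, total):
        return (age >= 18 and age <= 65 and balance
            >= MINIMUM and balance <= CEILING and total >= FLOOR
                and total <= LIMIT)

    # Columnar alignment: the same predicates, but the three range checks
    # now read as one repeated pattern.
    def eligible_aligned(age, balance, total):
        return (age     >= 18      and age     <= 65      and
                balance >= MINIMUM and balance <= CEILING and
                total   >= FLOOR   and total   <= LIMIT)

No purely syntactic formatter could produce the second from the first; seeing the pattern is a semantic act.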

Given the inaccessibility of the inherently human activity of comprehension, it is too easy to reject a quantitative approach altogether. I have not reached that level of certainty, given the successes of NLP and the low entropy of human utterances. I accept without question that any artifact created by a human will display statistical evidence of authorship. I feel certain it would be possible to determine the authorship of computer code as easily, perhaps even more easily, than it is to assess authorship in fiction or non-fiction prose. What is even more intriguing is that computer code does not exist in a vacuum; it ordinarily exists in the context of many other artifacts. While entity naming may vary from person to person, the shared context of these utterances should produce statistically significant correlations that could tie testing, requirements, and code together mechanistically. My research dreams see the cohesion of all project artifacts, together with the organizational artifacts, creating a transparent and comprehensible system that can demonstrably connect the needs of the organization to the mechanisms that enable the solution. The n-gram machinery behind that entropy observation is simple enough to sketch.
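
A minimal sketch of that machinery; the tokenization and the add-one smoothing here are simplifications of mine, not Hindle et al.'s.

    import math
    from collections import Counter

    def cross_entropy(tokens, corpus_tokens, n=3):
        # Count n-grams and their (n-1)-token prefixes in a background corpus.
        grams  = Counter(tuple(corpus_tokens[i:i+n])   for i in range(len(corpus_tokens) - n + 1))
        prefix = Counter(tuple(corpus_tokens[i:i+n-1]) for i in range(len(corpus_tokens) - n + 2))
        vocab = len(set(corpus_tokens)) or 1
        total, count = 0.0, 0
        for i in range(n - 1, len(tokens)):
            g = tuple(tokens[i-n+1 : i+1])
            # Add-one smoothing so unseen n-grams keep a nonzero probability.
            p = (grams[g] + 1) / (prefix[g[:-1]] + vocab)
            total -= math.log2(p)
            count += 1
        return total / max(1, count)

    # "Natural" code drawn from the same habits as the corpus scores low;
    # surprising code scores high.
    corpus = "if ( a > 0 ) { return a ; } if ( b > 0 ) { return b ; }".split()
    print(cross_entropy("if ( x > 0 ) { return x ; }".split(), corpus))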

With that I'll end this stream of consciousness and get back to some real work.


Thursday, April 5, 2012

Analysis and synthesis in software engineering

"In engineering, as in other creative arts, we must learn to do analysis to support our efforts in synthesis. One cannot build a beautiful and functional bridge without a knowledge of steel and dirt, and a considerable mathematical technique for using this knowledge to compute the properties of structures. Similarly, one cannot build a beautiful computer system without a deep understanding of how to 'previsualize' the process generated by the code one writes."
~Abelson and Sussman


Engineering, unlike art, aims to satisfy a client's needs. This involves tradeoffs among the different qualities of the final artifact. Achieving an acceptable tradeoff requires that the designer be able to predict the qualities the final artifact will exhibit before it is built. Software engineering currently has a poor track record of predictably achieving a balance of qualities in the final software product. We have achieved some success in predictive models for performance and availability, or at least I hope we have. But there are many other qualities for which we still lack good measures, let alone predictive models for assessing a design. End-user usability, code comprehension, modifiability, and traceability are just some that come to mind. I think this quote is one of the best at capturing the essence of the problem.