Friday, May 18, 2012

Is computer code really a language?

I've already discussed the paper that is the main topic of this post (http://www.blogger.com/blogger.g?blogID=8640583147187562680#editor/target=post;postID=7393746261888193647). Here I am trying to refine that post and set up further work in this area.

The title of this post is a complete rip-off of a paper I just finished reading (again), "On the Naturalness of Software" by Hindle, Barr, Gabel, Su, and Devanbu, which has been accepted for ICSE 2012. The hypothesis is that code utterances are amenable to the same kind of simple language models that have worked well for NLP over the past decade. The paper suggests that token completion and suggestion in an IDE can be considerably enhanced by the use of a (relatively) simple language model. What catches my interest is the going-in position that the "naturalness" of the coder's use of the language must be established. I am glad this is being done, but in my mind it is nearly an established fact.
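To make the idea of a "simple language model" for token suggestion concrete, here is a minimal sketch of a bigram model: it counts which token tends to follow which, then proposes the most frequent successors. This is an illustration only and not the paper's implementation; the function names `train_bigrams` and `suggest` and the three-line toy corpus are assumptions made up for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(token_sequences):
    """Count how often each token directly follows each other token."""
    following = defaultdict(Counter)
    for tokens in token_sequences:
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def suggest(following, prev_token, k=3):
    """Return up to k of the most frequent successors of prev_token."""
    successors = following.get(prev_token, Counter())
    return [tok for tok, _ in successors.most_common(k)]

# Hypothetical toy corpus: pre-tokenized lines of C-like code.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
    "int total = 0 ;".split(),
]
model = train_bigrams(corpus)
print(suggest(model, "for"))
print(suggest(model, "="))
```

A real IDE would need a proper lexer, a far larger corpus, and smoothing for unseen contexts; the sketch only shows why repetitive code makes the next token easy to guess.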

Clearly computer languages and the way they are used are not human languages in the sense of primarily serving human-to-human communication. The obvious receiver of the messages is the computer, and we know how limited the language capabilities of compilers and interpreters are. But code is written as much for other humans as it is for the computer. The use of white space mostly complies with industry norms, but when you teach an intro programming class you realize how much of that is a cultural norm and not a requirement of the language. Pretty-print niceties such as the vertical alignment of parallel structures, naming conventions, the decision to commit a temporary result to a variable, and the choice of helper methods are all hallmarks of an individual's style. When that style diverges from norms and is not consistent or clear, reading the code is simply painful. This alone is enough to convince me that human communication is an inherent property of computer code, and therefore that computer code offering expressive capabilities is, in this capacity, a natural language.

My own research interests are less with computer code itself than with the broader context in which computer code is created. In particular I am fascinated by the transformations of language that span the life cycle: problem recognition, project definition, problem statement, requirements definition, specification, code construction, and all the feedback loops that reverse the waterfall. I am all too aware of how difficult the research challenges are outside of code, and I approach them with great caution. I need to start small.

Having asserted that code is still (at least in part) a human language, I am now concerned with asserting another hypothesis: that pre-code artifacts written in a common natural language are in fact more structured than other non-fiction prose. That is, a requirements document can be shown to use a more restricted form of its natural language and may have clues as to how the language can be more narrowly defined so as to improve the subsequent quality of the product to be built. Even this is a challenging assignment.
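One possible way to make "more restricted" measurable is cross-entropy: train a language model on part of a corpus and see how many bits per token it needs to predict held-out text from the same source. Lower values mean more predictable, more constrained language. The sketch below assumes a deliberately crude add-one-smoothed unigram model, and the two tiny "corpora" are invented for illustration; they are not real requirements text or real data.

```python
import math
from collections import Counter

def cross_entropy(train_tokens, test_tokens):
    """Bits per token of test_tokens under an add-one-smoothed unigram model.

    Simplification (an assumption of this sketch): the vocabulary is taken
    to be every token seen in either the training or the test text.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = len(train_tokens) + len(vocab)
    log_prob = sum(math.log2((counts[t] + 1) / total) for t in test_tokens)
    return -log_prob / len(test_tokens)

# Invented requirement-style text: small, repetitive vocabulary.
restricted = cross_entropy(
    "the system shall log the user the system shall notify the user".split(),
    "the system shall log the user".split(),
)
# Invented general prose: little repetition between training and test text.
varied = cross_entropy(
    "rain fell softly while distant engines hummed beneath grey morning clouds".split(),
    "nobody expected the ferry to arrive early".split(),
)
print(restricted < varied)
```

With realistic corpora one would use higher-order n-grams, matched corpus sizes, and cross-validation before reading anything into the difference.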

In preparation for some of the challenges inherent in the above two research directions, I am interested in doing a narrow study to see whether the text artifacts in an OSS project can be mechanically and successfully categorized in a way that predicts, or at least correlates with, some attribute of the later construction. My first stab at this would be to take the text of bug reports, hand-code a sample of them using categories I come up with through intuition and common practice, and train a relatively simple language model on that labeled sample. As I think about this research, it sounds like an exploratory survey of the corpus. I would do this first for one of the larger and well-respected OSS projects (Apache?) to see whether any categorization can achieve a reasonable prediction using that language model.
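As a rough sketch of what such a first experiment might look like, the code below builds one unigram count table per hand-assigned category and labels a new report with the category whose model scores it highest. Everything specific here is an assumption for illustration: the categories "crash" and "ui", the four sample reports, and the function names `train` and `classify`. It also ignores how common each category is, which a real study would need to account for.

```python
import math
from collections import Counter

def train(labeled_reports):
    """Build one unigram count table per hand-assigned category."""
    models = {}
    for text, category in labeled_reports:
        models.setdefault(category, Counter()).update(text.lower().split())
    return models

def classify(models, text):
    """Return the category whose add-one-smoothed unigram model scores text highest."""
    vocab = {tok for counts in models.values() for tok in counts}
    tokens = text.lower().split()

    def score(counts):
        # The extra +1 reserves probability mass for tokens never seen in training.
        total = sum(counts.values()) + len(vocab) + 1
        return sum(math.log((counts[t] + 1) / total) for t in tokens)

    return max(models, key=lambda category: score(models[category]))

# Hypothetical hand-coded bug reports; not taken from any real tracker.
labeled = [
    ("server crashes with null pointer on startup", "crash"),
    ("segfault when parsing config file", "crash"),
    ("button label is misaligned in settings dialog", "ui"),
    ("dialog text is truncated on small screens", "ui"),
]
models = train(labeled)
print(classify(models, "crashes on startup after parsing config"))
print(classify(models, "settings dialog label is truncated"))
```

Whether categories assigned this way correlate with any attribute of the later construction is exactly the open question; the sketch only covers the categorization step.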
