Award Abstract # 1414030
SHF: Large: Collaborative Research: Exploiting the Naturalness of Software

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: June 6, 2014
Latest Amendment Date: September 16, 2016
Award Number: 1414030
Award Instrument: Continuing Grant
Program Manager: Sol Greenspan
sgreensp@nsf.gov
 (703)292-7841
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2014
End Date: June 30, 2018 (Estimated)
Total Intended Award Amount: $666,667.00
Total Awarded Amount to Date: $666,667.00
Funds Obligated to Date: FY 2014 = $489,653.00
FY 2016 = $177,014.00
History of Investigator:
  • William Cohen (Principal Investigator)
    wcohen@cs.cmu.edu
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3815
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie-Mellon University
5000 Forbes Avenue
Pittsburgh
PA  US  15213-3815
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): Information Technology Research,
Software & Hardware Foundation
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVITIES
01001617DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7925, 7944
Program Element Code(s): 164000, 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This inter-disciplinary project has its roots in Natural Language (NL) processing. Languages such as English allow intricate, lovely and complex constructions; yet everyday, "natural" speech and writing is simple, prosaic, and repetitive, and thus amenable to statistical modeling. Once large NL corpora became available, computational muscle and algorithmic insight led to rapid advances in the statistical modeling of natural utterances, and revolutionized tasks such as translation, speech recognition, text summarization, etc. While programming languages, like NL, are flexible and powerful, in theory allowing a great variety of complex programs to be written, we find that "natural" programs that people actually write are regular, repetitive and predictable. This project will use statistical models to capture and exploit this regularity to create a new generation of software engineering tools to achieve transformative improvements in software quality and productivity.

The project will exploit language modeling techniques to capture the regularity in natural programs at the lexical, syntactic, and semantic levels. Statistical modeling will also be used to capture alignment regularities in "bilingual" corpora, such as code paired with comments or explanatory text (e.g., StackOverflow), and in systems developed on two platforms such as Java and C#. These statistical models will help drive novel, data-driven approaches for applications such as code suggestion and completion, and assistive devices for programmers with movement or visual challenges. These models will also be exploited to correct simple errors in programs. Models of bilingual data will be used to build code summarization and code retrieval tools, as well as tools for porting across platforms. Finally, this project will create a large, curated corpus of software and code analysis products, as well as a corpus of alignments within software bilingual corpora, to help create and nurture a research community in this area.
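To make the idea of statistical language modeling over code concrete, the sketch below shows a trigram model over code tokens that suggests the next token. It is a minimal illustration only: the toy corpus, tokenization, and function names are assumptions for this example, not the project's actual tools.

  from collections import Counter, defaultdict

  def train_trigram_model(token_streams):
      """Count trigram continuations over a corpus of tokenized source files."""
      counts = defaultdict(Counter)
      for tokens in token_streams:
          padded = ["<s>", "<s>"] + tokens
          for a, b, c in zip(padded, padded[1:], padded[2:]):
              counts[(a, b)][c] += 1
      return counts

  def suggest_next(counts, prev2, prev1, k=3):
      """Return the k most frequent next tokens given the previous two tokens."""
      return [tok for tok, _ in counts[(prev2, prev1)].most_common(k)]

  # Toy corpus: repetitive, "natural" code makes the next token highly predictable.
  corpus = [
      ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
      ["for", "(", "int", "j", "=", "0", ";", "j", "<", "m", ";", "j", "++", ")"],
  ]
  model = train_trigram_model(corpus)
  print(suggest_next(model, "(", "int"))   # e.g. ['i', 'j']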

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov. "Revisiting Semi-Supervised Learning with Graph Embeddings." ICML 2016, 2017.
Zhilin Yang, Ye Yuan, Yuexin Wu, Ruslan Salakhutdinov, William W. Cohen. "Encode, Review, and Decode: Reviewer Module for Caption Generation." NIPS, 2016.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

What does it mean for a computer to "understand" a technical domain, such as software? To test understanding for a person, one would probably try to measure performance on a number of different tasks.

One test of knowledge would be to explain how entities from the domain are related. For example, a "button", a "link", and an "option" are all similar in that they are names of GUI elements. We looked at ways to infer relationships between software entities from text about software. Specifically, we collected data from the Q&A website StackOverflow, and used NLP methods and statistical machine-learning methods to find a set of relationships that were frequently mentioned and internally consistent. The quality of the top-scoring relationship was good (as much as 90% for certain measures) when evaluated by domain experts.
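As a rough illustration of this kind of relation mining, the sketch below counts candidate (entity, category) pairs from text using a simple "X is a Y" pattern. This is only a stand-in: the project's actual NLP and statistical methods were considerably more sophisticated, and the pattern, sentences, and names here are assumptions for the example.

  import re
  from collections import Counter

  # Very rough "X is a Y" pattern; the category is cut at punctuation or a relative clause.
  IS_A = re.compile(r"\b(\w[\w.+#-]*) is an? (.+?)(?:,|\.| that | which | for )")

  def candidate_relations(sentences):
      """Count (entity, category) pairs matched by the simple is-a pattern."""
      pairs = Counter()
      for s in sentences:
          for entity, category in IS_A.findall(s):
              pairs[(entity, category.strip())] += 1
      return pairs

  sentences = [
      "A button is a GUI element that the user can click.",
      "A link is a GUI element, usually rendered as underlined text.",
      "Django is a web framework for Python.",
  ]
  print(candidate_relations(sentences).most_common(3))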

As another test, one might ask a person to read some computer code and explain in English what it does, i.e., to produce meaningful comments for the code. We built a system that could produce English comments on a new program that it was presented with. This system was developed using machine learning methods: it was trained on actual comments, written by programmers to explain code that they had written. The learning method we used works by reading the program into a neural memory ("encoding" it), then producing the comment one word at a time, updating the neural memory after each word is produced ("decoding"). The best-performing model for comment generation was a novel learning system we called "encoder-reviewer-decoder", which allows some additional neural processing to be performed after the source code is loaded, but before the first word of the comment is produced. The encoder-reviewer-decoder model is intended to model a person's mental process of thinking about the program after reading it, but before explaining it. Although this model cannot produce good-enough program comments on its own, it is accurate enough to help a programmer substantially, filling in enough of a comment automatically to reduce typing by over a third.
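The sketch below is a minimal rendering of that encode-review-decode idea, assuming PyTorch. The layer sizes, number of review steps, and form of attention are illustrative assumptions, not the project's actual architecture.

  import torch
  import torch.nn as nn

  class EncoderReviewerDecoder(nn.Module):
      def __init__(self, vocab_size, hidden=128, review_steps=4):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, hidden)
          self.encoder = nn.GRU(hidden, hidden, batch_first=True)
          # "Reviewer": a few attention passes over the encoded memory before decoding.
          self.review_cell = nn.GRUCell(hidden, hidden)
          self.attn = nn.Linear(hidden, hidden)
          self.review_steps = review_steps
          self.decoder = nn.GRU(hidden, hidden, batch_first=True)
          self.out = nn.Linear(hidden, vocab_size)

      def forward(self, code_tokens, comment_tokens):
          # Encode the source code into a memory of hidden states.
          memory, h = self.encoder(self.embed(code_tokens))              # (B, Tc, H)
          state = h.squeeze(0)                                           # (B, H)
          # Review: refine the state by attending over the memory several times,
          # after the code is read but before the first comment word is produced.
          for _ in range(self.review_steps):
              scores = torch.bmm(self.attn(memory), state.unsqueeze(2))  # (B, Tc, 1)
              context = (memory * torch.softmax(scores, dim=1)).sum(1)   # (B, H)
              state = self.review_cell(context, state)
          # Decode the comment one word at a time, conditioned on the reviewed state.
          dec_out, _ = self.decoder(self.embed(comment_tokens), state.unsqueeze(0))
          return self.out(dec_out)                                       # (B, Tm, vocab)

  # Quick shape check with random token ids (purely illustrative).
  model = EncoderReviewerDecoder(vocab_size=1000)
  code = torch.randint(0, 1000, (2, 20))      # batch of 2 code snippets, 20 tokens each
  comment = torch.randint(0, 1000, (2, 8))    # batch of 2 partial comments, 8 tokens each
  print(model(code, comment).shape)           # torch.Size([2, 8, 1000])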

Another common way to test understanding for people would be to ask them to answer questions about terms from the domain. We looked at answering "fill-in-the-blank" or "cloze" questions, such as


  Django is a free and open-source web framework, written in ____,  which follows the model-view-template (MVT) architectural pattern.


(The correct answer here is "Python").  From StackOverflow, a web site with questions and answers about computer programming, we extracted thousands of definitions, and constructed over 35,000 cloze questions. These questions are very hard: two experienced, senior programmers answered only 46.8% correctly, on average.
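The sketch below shows how one such question can be constructed by blanking out the answer entity in a definition sentence. The helper function is illustrative; the project's pipeline extracted and blanked definitions automatically from StackOverflow.

  def make_cloze(definition, answer, blank="____"):
      """Replace the answer entity in a definition with a blank."""
      assert answer in definition, "the answer must appear in the definition text"
      return definition.replace(answer, blank, 1), answer

  definition = ("Django is a free and open-source web framework, written in Python, "
                "which follows the model-view-template (MVT) architectural pattern.")
  question, answer = make_cloze(definition, "Python")
  print(question)   # ... web framework, written in ____, which follows ...
  print(answer)     # Python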

We explored a number of approaches to answering these questions automatically. The first approach modeled what a human might do. The system first searches for documents that might contain the answer, then "reads" the documents returned by the search, and finally produces the answer. To "read" the documents, we used existing methods called reading comprehension (RC) systems. These methods performed reasonably well, with the best existing question-answering methods reaching an accuracy of 32.1%. However, further experiments showed that one could do about as well (accuracy of 33.6%) using another approach called a "neural language model" (LM), which simply guesses a plausible word to fill in the blank using general statistics about English sentences in the software domain.
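A minimal sketch of that search-then-read pipeline appears below. It substitutes a word-overlap retriever and a mention-counting heuristic for the trained RC and LM systems described above, so the retrieval scheme, documents, and candidate list are all assumptions for illustration only.

  import re
  from collections import Counter

  def words(text):
      """Lowercased word tokens with punctuation stripped."""
      return re.findall(r"\w+", text.lower())

  def retrieve(question, documents, k=1):
      """Rank documents by word overlap with the question (a stand-in for real search)."""
      q = set(words(question))
      return sorted(documents, key=lambda d: -len(q & set(words(d))))[:k]

  def answer(question, documents, candidates):
      """Pick the candidate mentioned most often in the retrieved documents."""
      text = " ".join(retrieve(question, documents)).lower()
      counts = Counter({c: text.count(c.lower()) for c in candidates})
      return counts.most_common(1)[0][0]

  docs = [
      "Django is a high-level Python web framework that encourages rapid development.",
      "Rails is a server-side web application framework for Ruby.",
  ]
  question = "Django is a free and open-source web framework, written in ____."
  print(answer(question, docs, ["Python", "Ruby", "Java"]))   # Python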

Motivated by this, we explored a new approach to answering cloze questions about software. The insight we had was that related software entities (e.g., "Django" and "Python") are often used together as tags for StackOverflow posts. (Every posted question can be tagged by the user with a set of software entities, to make it easier for potential answerers to find relevant questions.) Entities that are semantically related are thus often used together as tags of the same post. We therefore designed a new method specifically for answering cloze (fill-in-the-blank) questions using this sort of text with tagged entities.

Our method is called CASE (Context-Adjusted Syntax Embeddings). It is a hybrid of a neural LM and a co-occurrence model, which predicts an answer based on co-occurrence with the defined entity. This allows a useful "division of labor" between the two models. The LM can predict the appropriate "type" of the answer entity based on the question syntax (in the example, a programming language is needed), while the co-occurrence model picks out the entity of that "type" based on co-occurrence with the term defined. CASE far outperforms the other methods: it achieved an accuracy of 44.9%, which is close to the performance of human experts.
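The sketch below illustrates that division of labor: an LM score says which candidates are the right "type" for the blank, and a tag co-occurrence score says which candidate is most associated with the defined entity. The scoring functions, the interpolation weight, and the toy data are assumptions for this example, not the project's actual CASE model.

  from collections import Counter
  from itertools import combinations

  def cooccurrence_counts(tagged_posts):
      """Count how often two tags appear together on the same post."""
      pairs = Counter()
      for tags in tagged_posts:
          for a, b in combinations(sorted(set(tags)), 2):
              pairs[(a, b)] += 1
              pairs[(b, a)] += 1
      return pairs

  def case_score(candidate, defined_entity, lm_score, pairs, alpha=0.5):
      """Interpolate an LM 'type' score with a tag co-occurrence 'relatedness' score."""
      cooc = pairs[(defined_entity, candidate)]
      total = sum(v for (e, _), v in pairs.items() if e == defined_entity) or 1
      return alpha * lm_score[candidate] + (1 - alpha) * cooc / total

  # Toy data: posts tagged with software entities, plus made-up LM scores saying
  # that a programming language fits the blank better than a database does.
  posts = [["django", "python"], ["django", "python", "orm"],
           ["flask", "python"], ["django", "postgresql"]]
  pairs = cooccurrence_counts(posts)
  lm = {"python": 0.6, "java": 0.5, "postgresql": 0.1}
  for cand in lm:
      print(cand, round(case_score(cand, "django", lm, pairs), 3))   # python scores highest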


Last Modified: 07/26/2018
Modified by: William Cohen
