
NSF Org: | CCF Division of Computing and Communication Foundations |
Recipient: | Carnegie Mellon University |
Initial Amendment Date: | June 6, 2014 |
Latest Amendment Date: | September 16, 2016 |
Award Number: | 1414030 |
Award Instrument: | Continuing Grant |
Program Manager: | Sol Greenspan, sgreensp@nsf.gov, (703) 292-7841, CCF Division of Computing and Communication Foundations, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | July 1, 2014 |
End Date: | June 30, 2018 (Estimated) |
Total Intended Award Amount: | $666,667.00 |
Total Awarded Amount to Date: | $666,667.00 |
Funds Obligated to Date: | FY 2016 = $177,014.00 |
Recipient Sponsored Research Office: | 5000 FORBES AVE PITTSBURGH PA US 15213-3815 (412)268-8746 |
Primary Place of Performance: | 5000 Forbes Avenue Pittsburgh PA US 15213-3815 |
NSF Program(s): | Information Technology Research, Software & Hardware Foundations |
Primary Program Source: | 01001617DB NSF RESEARCH & RELATED ACTIVITIES |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
This inter-disciplinary project has its roots in Natural Language (NL) processing. Languages such as English allow intricate, lovely and complex constructions; yet everyday, "natural" speech and writing is simple, prosaic, and repetitive, and thus amenable to statistical modeling. Once large NL corpora became available, computational muscle and algorithmic insight led to rapid advances in the statistical modeling of natural utterances, and revolutionized tasks such as translation, speech recognition, and text summarization. While programming languages, like NL, are flexible and powerful, in theory allowing a great variety of complex programs to be written, we find that "natural" programs that people actually write are regular, repetitive and predictable. This project will use statistical models to capture and exploit this regularity to create a new generation of software engineering tools to achieve transformative improvements in software quality and productivity.
The project will exploit language modeling techniques to capture the regularity in natural programs at the lexical, syntactic, and semantic levels. Statistical modeling will also be used to capture alignment regularities in "bilingual" corpora, such as code paired with comments or explanatory text (e.g., StackOverflow), and in systems developed on two platforms, such as Java and C#. These statistical models will help drive novel, data-driven approaches for applications such as code suggestion and completion, and assistive devices for programmers with movement or visual challenges. These models will also be exploited to correct simple errors in programs. Models of bilingual data will be used to build code summarization and code retrieval tools, as well as tools for porting across platforms. Finally, this project will create a large, curated corpus of software and code analysis products, as well as a corpus of alignments within software bilingual corpora, to help create and nurture a research community in this area.
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
What does it mean for a computer to "understand" a technical domain, such as software? To test understanding in a person, one would probably try to measure performance on a number of different tasks.
One test of knowledge would be to explain how entities from the domain are related. For example, a "button", a "link", and an "option" are all similar in that they are names of GUI elements. We looked at ways to infer relationships between software entities from text about software. Specifically, we collected data from the Q&A website StackOverflow, and used NLP and statistical machine-learning methods to find a set of relationships that were frequently mentioned and internally consistent. The quality of the top-scoring relationships was good (as much as 90% on certain measures) when evaluated by domain experts.
As another test, one might ask a person to read some computer code and explain in English what it does, i.e., to produce meaningful comments for the code. We built a system that could produce English comments on a new program that it was presented with. This system was developed using machine learning methods: it was trained on actual comments, written by programmers to explain code that they had written. The learning method we used works by reading the program into a neural memory ("encoding" it), then producing the comment one word at a time, updating the neural memory after each word is produced ("decoding"). The best-performing model for comment generation was a novel learning system we called "encoder-reviewer-decoder", which allows some additional neural processing to be performed after the source code is loaded, but before the first word of the comment is produced. The encoder-reviewer-decoder model is intended to mimic a person's mental process of thinking about the program after reading it, but before explaining it. Although this model cannot produce good-enough program comments on its own, it is accurate enough to help a programmer substantially, filling in enough of a comment automatically to reduce typing by over a third.
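To make the encode-review-decode loop concrete, here is a minimal PyTorch sketch of the idea. It is our illustration, not the project's implementation: the layer sizes, the GRU units, and the fixed number of attention-based "review" steps are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class EncoderReviewerDecoder(nn.Module):
        def __init__(self, code_vocab, text_vocab, hidden=128, review_steps=4):
            super().__init__()
            self.embed_code = nn.Embedding(code_vocab, hidden)
            self.embed_text = nn.Embedding(text_vocab, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            # Reviewer: extra "thinking" steps over the encoded program,
            # taken after reading the code but before emitting any word.
            self.reviewer = nn.GRUCell(hidden, hidden)
            self.attn = nn.Linear(hidden, hidden)
            self.decoder = nn.GRUCell(hidden, hidden)
            self.out = nn.Linear(hidden, text_vocab)
            self.review_steps = review_steps

        def _attend(self, memory, state):
            # Dot-product attention over encoder memory (batch, seq, hidden).
            scores = torch.bmm(memory, self.attn(state).unsqueeze(2))
            weights = torch.softmax(scores, dim=1)
            return (memory * weights).sum(dim=1)

        def forward(self, code_tokens, max_len=20):
            memory, h = self.encoder(self.embed_code(code_tokens))
            state = h.squeeze(0)
            for _ in range(self.review_steps):       # the "review" phase
                state = self.reviewer(self._attend(memory, state), state)
            words, inp = [], torch.zeros_like(state)
            for _ in range(max_len):                 # greedy decoding
                state = self.decoder(inp + self._attend(memory, state), state)
                word = self.out(state).argmax(dim=1)
                words.append(word)
                inp = self.embed_text(word)          # feed back emitted word
            return torch.stack(words, dim=1)

    model = EncoderReviewerDecoder(code_vocab=500, text_vocab=300)
    fake_code = torch.randint(0, 500, (1, 30))   # one 30-token code snippet
    print(model(fake_code).shape)                # torch.Size([1, 20])

The key design point is the review loop: it refines the decoder's starting state with several attention passes over the encoded program before the first comment word is emitted.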
Another common way to test understanding in people would be to ask them to answer questions about terms from the domain. We looked at answering "fill-in-the-blank" or "cloze" questions, such as:
Django is a free and open-source web framework, written in ____, which follows the model-view-template (MVT) architectural pattern.
(The correct answer here is "Python".) From StackOverflow we extracted thousands of definitions and constructed over 35,000 cloze questions. These questions are very hard: two experienced, senior programmers answered only 46.8% correctly, on average.
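As a toy reconstruction of how such questions can be built (our sketch, not the project's actual pipeline), one can blank out one mention of a known entity in a definition sentence, never blanking the term being defined. The entity list below is a made-up stand-in for a real inventory of software entities.

    import re

    KNOWN_ENTITIES = {"Python", "Java", "Django", "Flask"}  # assumed list

    def make_cloze(definition, defined_term):
        # Yield (question, answer) pairs, blanking one mention per entity.
        for entity in KNOWN_ENTITIES - {defined_term}:  # keep the subject
            pattern = r"\b" + re.escape(entity) + r"\b"
            if re.search(pattern, definition):
                yield re.sub(pattern, "____", definition, count=1), entity

    definition = ("Django is a free and open-source web framework, written "
                  "in Python, which follows the model-view-template (MVT) "
                  "architectural pattern.")
    for question, answer in make_cloze(definition, "Django"):
        print(question, "->", answer)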
We explored a number of approaches to answering these questions automatically. The first approach modeled what a human might do: the system first searches for documents that might contain the answer, then "reads" the documents returned by the search, and finally produces the answer. To "read" the documents, we used existing methods called reading comprehension (RC) systems. These performed reasonably well, with the best existing question-answering methods reaching an accuracy of 32.1%. However, further experiments showed that one could do about as well (an accuracy of 33.6%) with another approach called a "neural language model" (LM), which simply guesses a plausible word to fill in the blank using general statistics about English sentences in the software domain.
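A schematic of this search-then-read pipeline is sketched below. The word-overlap retriever and the trivial "reader" are our stand-ins; real RC systems use trained neural readers.

    import re

    def tokens(text):
        return set(re.findall(r"\w+", text.lower()))

    def retrieve(question, docs, k=1):
        # Rank documents by word overlap with the question (toy retriever).
        q = tokens(question)
        return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

    def read(question, doc, candidates):
        # Stub "reader": return the first candidate mentioned in the doc.
        words = tokens(doc)
        return next((c for c in candidates if c.lower() in words), None)

    docs = ["Django is a web framework written in Python.",
            "Spring is a Java application framework."]
    q = "Django is a web framework, written in ____."
    print(read(q, retrieve(q, docs)[0], ["Python", "Java"]))  # -> Python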
Motivated by this, we explored a new approach to answering cloze questions about software. Our insight was that related software entities (e.g., "Django" and "Python") are often used together as tags for StackOverflow posts. (Every posted question can be tagged by the user with a set of software entities, to make it easier for potential answerers to find relevant questions.) Since semantically related entities frequently appear as tags of the same post, we designed a new method for answering cloze (fill-in-the-blank) questions that exploits this sort of entity-tagged text.
Our method is called CASE (Context-Adjusted Syntax Embeddings). It is a hybrid of a neural LM and a co-occurrence model, which predicts an answer based on co-occurrence with the defined entity. This allows a useful "division of labor" between the two models: the LM can predict the appropriate "type" of the answer entity from the question syntax (in the example, a programming language is needed), while the co-occurrence model picks out the entity of that "type" based on co-occurrence with the term being defined. CASE far outperforms the other methods: it achieved an accuracy of 44.9%, which is close to the performance of human experts.
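As a toy illustration of this division of labor (our simplification, not CASE's actual learned components), the sketch below mixes a stubbed LM score with a PMI-style co-occurrence score computed from StackOverflow-like tag sets; the log-linear combination and the hand-set LM table are assumptions made for the example.

    import math
    from collections import Counter
    from itertools import combinations

    # Toy StackOverflow-style tag sets: each post is tagged with entities.
    POSTS = [
        {"django", "python"}, {"django", "python", "orm"},
        {"flask", "python"}, {"spring", "java"}, {"hibernate", "java"},
    ]
    N = len(POSTS)
    tag_count = Counter(t for post in POSTS for t in post)
    pair_count = Counter(frozenset(p) for post in POSTS
                         for p in combinations(sorted(post), 2))

    def cooc_score(a, b):
        # PMI-style association between two tags; -inf if never co-tagged.
        joint = pair_count[frozenset((a, b))] / N
        if joint == 0:
            return float("-inf")
        return math.log(joint / ((tag_count[a] / N) * (tag_count[b] / N)))

    def lm_log_prob(question, candidate):
        # Stand-in for the neural LM: rewards candidates of the right
        # "type" for the blank (hard-coded: the syntax calls for a language).
        return -1.0 if candidate in {"python", "java"} else -5.0

    def case_score(question, defined_term, candidate, alpha=0.5):
        # LM term picks the right *type*; co-occurrence term picks the
        # entity most associated with the term being defined.
        return (alpha * lm_log_prob(question, candidate)
                + (1 - alpha) * cooc_score(defined_term, candidate))

    q = "Django is a free and open-source web framework, written in ____."
    candidates = ["python", "java", "flask"]
    print(max(candidates, key=lambda c: case_score(q, "django", c)))  # python

Here "java" has the right type but never co-occurs with "django", while "flask" co-occurs poorly and has the wrong type, so "python" wins; this mirrors the intended division of labor between the two signals.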
Last Modified: 07/26/2018
Modified by: William Cohen