
NSF Org: |
CCF Division of Computing and Communication Foundations |
Recipient: |
|
Initial Amendment Date: | July 20, 2015 |
Latest Amendment Date: | July 20, 2015 |
Award Number: | 1526118 |
Award Instrument: | Standard Grant |
Program Manager: |
Sol Greenspan
sgreensp@nsf.gov (703)292-7841 CCF Division of Computing and Communication Foundations CSE Directorate for Computer and Information Science and Engineering |
Start Date: | September 1, 2015 |
End Date: | August 31, 2019 (Estimated) |
Total Intended Award Amount: | $200,000.00 |
Total Awarded Amount to Date: | $200,000.00 |
Funds Obligated to Date: |
|
History of Investigator: |
|
Recipient Sponsored Research Office: |
800 WEST CAMPBELL RD. RICHARDSON TX US 75080-3021 (972)883-2313 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
800 W. Campbell Rd. Richardson TX US 75080-3021 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Software & Hardware Foundation |
Primary Program Source: |
|
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Software systems contain large amounts of textual information captured in various software artifacts, such as requirements documents, source code, and user manuals. The productivity of software developers and the quality of the software they produce depend directly on their ability to retrieve and understand the textual information present in software. Since humans cannot process and comprehend so much text, researchers have proposed the use of text retrieval techniques to help software developers with many of their daily tasks. To be useful, these techniques need to be properly configured, which requires calibrating many parameters. As most software developers are not experts in text retrieval, they need help in determining the best text retrieval configuration in a given software engineering context. The configuration problem is one of the main obstacles to the adoption of such techniques in the software industry, because many approaches proposed by researchers do not generalize well. The outcomes of this project will transform the way software developers address many of their daily tasks, allowing them to easily adopt text retrieval during software development. The results of this research will also be used in software engineering courses to support students in their projects. The new practices that the students acquire will help them become better software engineers. The proposed research also brings together work from two computing research communities, software engineering and information retrieval, and will contribute new knowledge to both fields. Existing approaches using text retrieval in software engineering will become more practical, rather than just promising, facilitating migration from the lab into industry and academia.
The outcome of this research will be: (1) a novel approach (called TRinSE2.0), which will achieve automatic, runtime, query-based text retrieval configuration; and (2) improvements to important software engineering tasks in practical settings, focusing on feature and bug location, impact analysis, traceability link recovery, and bug triage. TRinSE2.0 will be evaluated on open source data, in the classroom, and in industrial settings. The proposed work will transform the way text retrieval configuration is done in software engineering applications. New, software-specific measures, as well as proven linguistic-based measures, will be used to capture query properties in the context of software engineering tasks and data sets. Machine learning algorithms will find the best configuration for a given query. When writing a query to retrieve information from a software project, developers will get the best results, saving them time and effort and improving their productivity and the quality of their work. The text retrieval configuration problem will no longer be heuristic-based; it will become data-driven.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Text Retrieval approaches, which allow searching in large amounts of text and extracting the most important information in software artifacts, have been applied to more than 20 software engineering tasks.
We found that, when text retrieval engines are used to retrieve source code artifacts, their retrieval performance varies with the query issued by the user. We improved applications of text retrieval in software engineering by addressing the problem of configuring the parameters of text retrieval engines based on the queries issued by developers. We developed a new technique and tool, called QUEST, which is the first step towards automatically determining the best text retrieval (TR) configuration for a given query. QUEST uses a supervised learning approach and the properties of a query to recommend the TR engine and parameter configuration that is most likely to work best for that query. We evaluated QUEST in the context of feature and bug localization, using a data set with more than 1,000 queries, and found that it leads to better results than using a single TR configuration for all queries in a system. Tools and techniques like QUEST can be very useful for leveraging the potential of TR approaches in a practical setting, with high chances of adoption in the open source community as well as in industry.
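To make the idea of query-based configuration concrete, the following is a minimal sketch, not QUEST itself. It assumes two simple query properties (term count and mean inverse document frequency) and a k-nearest-neighbor vote as the supervised learner; the feature choice, the learner, the toy data, and the labels `config_A` and `config_B` are all hypothetical and are not taken from the report.

```python
import math
from collections import Counter


def query_features(query, doc_freq, num_docs):
    """Map a query to (term count, mean smoothed IDF). Assumed features, for illustration."""
    terms = query.lower().split()
    if not terms:
        return (0.0, 0.0)
    idfs = [math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1)) for t in terms]
    return (float(len(terms)), sum(idfs) / len(idfs))


def recommend_config(features, training, k=3):
    """Return the majority configuration label among the k nearest training queries."""
    nearest = sorted(training, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: feature vectors of past queries, each labeled
# with the configuration that performed best for that query.
training = [
    ((2.0, 3.5), "config_A"),
    ((3.0, 3.2), "config_A"),
    ((2.0, 3.8), "config_A"),
    ((12.0, 1.0), "config_B"),
    ((15.0, 0.8), "config_B"),
    ((10.0, 1.2), "config_B"),
]

# Hypothetical corpus statistics: in how many of 100 documents each term occurs.
doc_freq = {"parser": 2, "tokenizer": 1, "error": 40, "null": 30}

short_query = "parser tokenizer"
print(recommend_config(query_features(short_query, doc_freq, 100), training))
```

In this toy setup, short queries made of rare terms land near the `config_A` examples, while long queries made of common terms land near the `config_B` examples. A realistic system would use many more query properties and a learner validated on held-out queries.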
We improved previous work on predicting the quality of queries in the context of TR applications in software engineering by integrating new, post-retrieval quality metrics. We evaluated the new approach on two tasks: bug localization and traceability link recovery. We found that the improved approach predicts the quality of queries better than our previous work, which used only pre-retrieval quality metrics. Detecting the quality of queries is a crucial step towards the query-based configuration of TR approaches, because a poorly formulated query may lead to poor TR results no matter what configuration is chosen for it. Therefore, poor queries need to be detected and reformulated before TR approaches are applied to them.
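The difference between the two kinds of metrics is that post-retrieval metrics are computed from the result list, so the query has to be run first. The sketch below illustrates that idea under stated assumptions: it uses the spread of the top-k retrieval scores and the gap between the best score and the rest, following one common intuition in the information retrieval literature that a nearly flat score list suggests the engine could not separate relevant from irrelevant documents. The report does not say which metrics the project used; the function names, the two metrics, and the threshold are hypothetical.

```python
import statistics


def post_retrieval_metrics(scores, k=10):
    """Compute two illustrative post-retrieval metrics from retrieval scores."""
    top = sorted(scores, reverse=True)[:k]
    if len(top) < 2:
        return {"score_std": 0.0, "top_gap": 0.0}
    return {
        # Population standard deviation of the top-k scores.
        "score_std": statistics.pstdev(top),
        # How far the best score stands above the mean of the other top-k scores.
        "top_gap": top[0] - statistics.fmean(top[1:]),
    }


def looks_low_quality(metrics, min_std=0.05):
    """Flag a query whose top scores are nearly flat (threshold is an assumption)."""
    return metrics["score_std"] < min_std


flat_scores = [0.30, 0.31, 0.29, 0.30]    # no document stands out
peaked_scores = [0.90, 0.30, 0.20, 0.10]  # one document clearly stands out

print(looks_low_quality(post_retrieval_metrics(flat_scores)))
print(looks_low_quality(post_retrieval_metrics(peaked_scores)))
```

In practice, such values would be fed to a trained predictor together with pre-retrieval metrics rather than compared against a fixed threshold.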
One main application of TR in SE is concept location and bug localization, and improving such approaches is one of the main goals of this project. In addition to the settings of TR engines, retrieval in these applications is affected by the vocabulary present in the bug reports, which are often used as queries for bug localization. We investigated the level of vocabulary agreement between users when they report similar issues. Empirical data on more than 13,000 pairs of duplicate bug reports and Stack Overflow questions indicate that 12% of the pairs have no vocabulary in common, while the remaining pairs share, on average, 30% of their vocabulary. In our effort to improve TR-based bug localization, we developed a new query reformulation technique that uses the observed behavior description from bug reports to reformulate low-quality queries. The results indicate substantial improvements over baseline approaches: the reformulated queries improve TR-based bug localization for all approaches by 147.4% and 116.6% on average, in terms of mean reciprocal rank (MRR) and mean average precision (MAP), respectively.
We developed an approach that uses parts of bug descriptions to reformulate queries in order to improve text retrieval-based duplicate bug detection. We found that using the observed behavior and the title of the bug report as a query improves duplicate bug detection, compared to using the entire bug report as a query.
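For readers unfamiliar with the two measures, the sketch below computes MRR (mean reciprocal rank) and MAP (mean average precision) using their standard definitions, along with a relative improvement in percent. The file names and rankings are made-up illustration data, not results from the project, and the helper names are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """Average a per-query metric over (ranked list, relevant set) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


def relative_improvement(baseline, improved):
    """Improvement over the baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Made-up example: two bug reports used as queries, before and after reformulation.
baseline_runs = [
    (["A.java", "B.java", "C.java", "D.java"], {"D.java"}),
    (["X.java", "Y.java"], {"Y.java"}),
]
reformulated_runs = [
    (["D.java", "A.java", "B.java", "C.java"], {"D.java"}),
    (["Y.java", "X.java"], {"Y.java"}),
]

mrr_before = mean_over_queries(reciprocal_rank, baseline_runs)
mrr_after = mean_over_queries(reciprocal_rank, reformulated_runs)
print(mrr_before, mrr_after, relative_improvement(mrr_before, mrr_after))
```

Under this definition of relative improvement, a gain of more than 100% means the measure more than doubled; for example, a 147.4% improvement corresponds to a value about 2.47 times the baseline.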
We developed a technique for summarizing class APIs, to help during software retrieval. Unique to this summarization technique, compared to related work, is that the summaries are specific to the queries issued by the users. The technique is based on the use of software engineering-specific knowledge graphs.
We developed and presented four Technical Briefings in top venues in the field, on "The Use of Text Retrieval and Natural Language Processing in Software Engineering", plus a tutorial and a technical briefing on "Source Code Summarization".
Three graduate students from underrepresented categories (one woman and two Hispanics) worked on topics related to this project. This grant contributed to the training, professional development, and the fostering of networks for the supported graduate students, by allowing the PI to train and involve the students in research and by allowing the students to (i) present their papers at conferences and get feedback on their work from other researchers from academia and industry, (ii) attend research presentations at conferences and enrich their knowledge about the field, and (iii) interact with other students, researchers, and practitioners in the field, therefore building a professional network that can benefit their future careers. One student graduated with his PhD and another graduated with her MS.
Last Modified: 12/29/2019
Modified by: Andrian Marcus