Award Abstract # 1526118
SHF: Small: Collaborative Research: Text Retrieval in Software Engineering 2.0

NSF Org: CCF
Division of Computing and Communication Foundations
Recipient: UNIVERSITY OF TEXAS AT DALLAS
Initial Amendment Date: July 20, 2015
Latest Amendment Date: July 20, 2015
Award Number: 1526118
Award Instrument: Standard Grant
Program Manager: Sol Greenspan
sgreensp@nsf.gov
 (703)292-7841
CCF
 Division of Computing and Communication Foundations
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2015
End Date: August 31, 2019 (Estimated)
Total Intended Award Amount: $200,000.00
Total Awarded Amount to Date: $200,000.00
Funds Obligated to Date: FY 2015 = $200,000.00
History of Investigator:
  • Andrian Marcus (Principal Investigator)
    amarcus7@gmu.edu
Recipient Sponsored Research Office: University of Texas at Dallas
800 WEST CAMPBELL RD.
RICHARDSON
TX  US  75080-3021
(972)883-2313
Sponsor Congressional District: 24
Primary Place of Performance: University of Texas at Dallas
800 W. Campbell Rd.
Richardson
TX  US  75080-3021
Primary Place of Performance Congressional District: 24
Unique Entity Identifier (UEI): EJCVPNN1WFS5
Parent UEI:
NSF Program(s): Software & Hardware Foundation
Primary Program Source: 01001516DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7923, 7944
Program Element Code(s): 779800
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Software systems contain large amounts of textual information captured in various software artifacts, such as requirements documents, source code, and user manuals. The productivity of software developers and the quality of the software they produce depend directly on their ability to retrieve and understand the textual information present in software. Since humans cannot process and comprehend so much text, researchers have proposed the use of text retrieval techniques to help software developers with many of their daily tasks. To be useful, these techniques need to be properly configured, which requires calibrating many parameters. As most software developers are not experts in text retrieval, they need help in determining the best text retrieval configuration in a given software engineering context. The configuration problem is one of the main obstacles to the adoption of such techniques in the software industry, because many approaches proposed by researchers do not generalize well. The outcomes of this project will transform the way software developers address many of their daily tasks, allowing them to easily adopt text retrieval during software development. The results of this research will also be used in software engineering courses to support students in their projects. The new practices that the students acquire will help them become better software engineers. The proposed research also brings together work from two computing research communities, software engineering and information retrieval, and it will bring new knowledge to both fields. Existing approaches using text retrieval in software engineering will become more practical, rather than just promising, facilitating migration from the lab into industry and academia.

The outcomes of this research will be: (1) a novel approach (called TRinSE2.0), which will achieve automatic, runtime query-based text retrieval configuration; and (2) improvements to important software engineering tasks, in practical settings, focusing on feature and bug location, impact analysis, traceability link recovery, and bug triage. TRinSE2.0 will be evaluated on open source data, in the classroom, and in industrial settings. The proposed work will transform the way text retrieval configuration is done in software engineering applications. New, software-specific measures, as well as proven linguistic-based measures, will be used to capture query properties in the context of software engineering tasks and data sets. Machine learning algorithms will find the best configuration for a given query. When writing a query to retrieve information from a software project, developers will get the best results, saving them time and effort and improving their productivity and the quality of their work. The text retrieval configuration problem will no longer be heuristic-based; it will become data-driven.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

(Showing: 1 - 10 of 21)
Chaparro, O., Florez, J. M., Marcus, A. "On the Vocabulary Agreement in Software Issue Descriptions" 32nd IEEE International Conference on Software Maintenance and Evolution (ICSME 2016) , 2016 , p.448 10.1109/ICSM.2013.70
Chaparro, O., Florez, J. M., Marcus, A., "Using bug descriptions to reformulate queries during text-retrieval-based bug localization" Empirical Software Engineering , v.24 , 2019 , p.2947 10.1007/s10664-018-9672-z
Chaparro, O., Florez, J. M., Singh, U., Marcus, A. "Reformulating Queries for Duplicate Bug Report Detection" 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, (SANER 2019) , 2019 , p.218 10.1109/SANER.2019.8667985
Chaparro, O., Marcus, A. "On the reduction of verbose queries in text retrieval based software maintenance" 38th ACM/IEEE International Conference on Software Engineering (ICSE) , 2016 , p.716 10.1145/2889160.2892647
C. Mills and S. Haiduc "A Machine Learning Approach for Determining the Validity of Traceability Links" 39th ACM/IEEE International Conference on Software Engineering (ICSE'17) , 2017 , p.121 10.1109/ICSE-C.2017.86
C. Mills and S. Haiduc "The Impact of Retrieval Direction on IR-based Traceability Link Recovery" 39th ACM/IEEE International Conference on Software Engineering (ICSE'17) , 2017 , p.51 10.1109/ICSE-NIER.2017.14
C. Mills, G. Bavota, S. Haiduc, R. Oliveto, A. Marcus, and A. de Lucia "Predicting Query Quality for Applications of Text Retrieval to Software Engineering Tasks" ACM Transactions on Software Engineering Methodology (TOSEM) , v.26 , 2017 , p.1 10.1145/3078841
J. Escobar-Avila, E. Parra, and S. Haiduc "Text Retrieval-based Tagging of Software Engineering Video Tutorials" 39th ACM/IEEE International Conference on Software Engineering (ICSE'17) , 2017 , p.341 10.1109/ICSE-C.2017.121
Laura Moreno and Andrian Marcus "Automatic software summarization: the state of the art" Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings , 2018 , p.530 10.1145/3183440.3183464
Liu, M., Peng, X., Marcus, A., Xing, Z., Xie, W., Xing, S., Liu, Y. "Generating query-specific class API summaries" ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, (ESEC/FSE 2019) , 2019 , p.120 10.1145/3338906.3338971
L. Moreno, A. Marcus "Automatic software summarization: the state of the art" 39th IEEE/ACM International Conference on Software Engineering (ICSE'17) , 2017 , p.511 10.1109/ICSE-C.2017.169

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Text Retrieval approaches, which allow searching in large amounts of text and extracting the most important information in software artifacts, have been applied to more than 20 software engineering tasks. 

 

We found that, when text retrieval engines are used to retrieve source code artifacts, their retrieval performance varies with the query issued by the user. We improved applications of text retrieval in software engineering by addressing the problem of configuring the parameters of text retrieval engines based on the queries issued by developers. We developed a new technique and tool, called QUEST, which is the first step towards automatically determining the best text retrieval (TR) configuration for a given query. QUEST uses a supervised learning approach and the properties of a query to recommend the TR engine and parameter configuration that is most likely to work best for that query. We evaluated QUEST in the context of feature and bug localization, using a data set with more than 1,000 queries, and found that it leads to better results than using a single TR configuration for all queries in a system. Tools and techniques like QUEST can be very useful for leveraging the potential of TR approaches in a practical setting, with high adoption chances in the open source community as well as in industry.
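
The idea of recommending a TR configuration per query with supervised learning can be illustrated with a minimal sketch. This is not QUEST itself: the query properties (length and unique-term ratio), the 1-nearest-neighbor learner, the training queries, and the configuration labels are all assumptions chosen for illustration only.

```python
from math import dist
from typing import List, Tuple


def query_features(query: str) -> Tuple[float, float]:
    """Hypothetical query properties: term count and unique-term ratio."""
    terms = query.lower().split()
    if not terms:
        return (0.0, 0.0)
    return (float(len(terms)), len(set(terms)) / len(terms))


def recommend_configuration(
    query: str, training_set: List[Tuple[str, str]]
) -> str:
    """Return the configuration label of the most similar past query (1-NN)."""
    target = query_features(query)
    _, best_config = min(
        training_set,
        key=lambda example: dist(query_features(example[0]), target),
    )
    return best_config


# Made-up training data: (past query, configuration that worked best for it).
# The configuration strings are illustrative labels, not real tool settings.
TRAINING_SET = [
    ("crash", "engine=VSM, stemming=on"),
    ("null pointer on save", "engine=VSM, stemming=on"),
    (
        "application freezes when the user opens a very large project file "
        "from the recent files menu",
        "engine=LDA, topics=50",
    ),
]

if __name__ == "__main__":
    print(recommend_configuration("crash on exit", TRAINING_SET))
```

A realistic setup would presumably use many more query properties and a stronger learner; the sketch only shows the shape of the mapping from query properties to a recommended configuration.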

 

We improved previous work on predicting the quality of queries in the context of TR applications in software engineering by integrating new, post-retrieval quality metrics. We evaluated the new approach on two tasks: bug localization and traceability link recovery. We found that this improved approach leads to better results in predicting the quality of queries than our previous work, which made use only of pre-retrieval quality metrics. Detecting the quality of queries is a crucial step towards the query-based configuration of TR approaches, because a poorly formulated query may lead to poor TR results no matter what configuration is chosen for it. Therefore, poor queries need to be detected and reformulated before TR approaches are applied to them.
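
The distinction between pre-retrieval and post-retrieval quality metrics can be sketched as follows. The two measures shown, average inverse document frequency of the query terms (computed before retrieval) and the spread of the top retrieval scores (computed after retrieval), are common examples from the information retrieval literature; treating them as representative of the metrics used in this project is an assumption, and the function names are hypothetical.

```python
import math
from statistics import pstdev
from typing import List


def average_idf(query_terms: List[str], documents: List[str]) -> float:
    """Pre-retrieval metric: mean smoothed IDF of the query terms.

    Needs only the query and corpus statistics, not a ranked result list.
    """
    n_docs = len(documents)
    doc_term_sets = [set(doc.lower().split()) for doc in documents]
    idfs = []
    for term in query_terms:
        doc_freq = sum(1 for terms in doc_term_sets if term in terms)
        # Add-one smoothing avoids division by zero for unseen terms.
        idfs.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return sum(idfs) / len(idfs) if idfs else 0.0


def score_spread(top_scores: List[float]) -> float:
    """Post-retrieval metric: population standard deviation of top-k scores.

    Needs the scores that the TR engine assigned to the retrieved documents.
    """
    return pstdev(top_scores) if len(top_scores) > 1 else 0.0


if __name__ == "__main__":
    corpus = ["open file dialog", "save file", "parse config"]
    print(average_idf(["file"], corpus), average_idf(["parse"], corpus))
    print(score_spread([0.9, 0.1]))
```

In a quality predictor, values like these would serve as input features to a classifier that labels a query as high or low quality.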

 

One main application of TR in software engineering (SE) is concept location/bug localization, and improving such approaches is one of the main goals of this project. In addition to the settings of TR engines, retrieval in these applications is impacted by the vocabulary present in the bug reports, which are often used as queries for bug localization. We investigated the level of vocabulary agreement between users when they report similar issues. Empirical data on more than 13,000 pairs of duplicate bug reports and Stack Overflow questions indicate that 12% of them do not have any common vocabulary, while the remaining pairs share, on average, 30% of their vocabulary. In our quest for improving TR-based bug localization, we developed a new query reformulation technique that utilizes the observed behavior description from bug reports to reformulate low-quality queries. The results indicate massive improvements over baseline approaches, as the reformulated queries improve TR-based bug localization for all approaches by 147.4% and 116.6% on average, in terms of MRR and MAP, respectively.
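
For readers unfamiliar with the two measures, MRR (mean reciprocal rank) and MAP (mean average precision) can be computed as in the sketch below, which follows their standard definitions. The file names and the layout of a "run" (a ranked list paired with the set of truly relevant files) are assumptions for illustration, not data from this project.

```python
from typing import Iterable, List, Set, Tuple

Run = Tuple[List[str], Set[str]]  # (ranked results, relevant items)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank holding a relevant result."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


def _mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    return _mean(reciprocal_rank(ranked, rel) for ranked, rel in runs)


def mean_average_precision(runs: List[Run]) -> float:
    return _mean(average_precision(ranked, rel) for ranked, rel in runs)


if __name__ == "__main__":
    # Two hypothetical bug-localization queries with made-up file names.
    runs = [
        (["A.java", "B.java", "C.java"], {"B.java"}),
        (["D.java", "E.java"], {"D.java", "E.java"}),
    ]
    print(mean_reciprocal_rank(runs), mean_average_precision(runs))
```

Both measures range from 0 to 1, and higher values mean that the buggy files appear closer to the top of the ranked list.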

 

We developed an approach that uses parts of the bug descriptions to reformulate the queries used in text retrieval-based duplicate bug detection. We found that using the observed behavior and the title of the bug report as a query improves duplicate bug detection, compared to using the entire bug report as a query.

 

We developed a technique for summarizing class APIs, to help during software retrieval. Unique to this summarization technique, compared to related work, is that the summaries are specific to the queries issued by the users.  The technique is based on the use of software engineering-specific knowledge graphs.

 

We developed and presented four Technical Briefings in top venues in the field, on "The Use of Text Retrieval and Natural Language Processing in Software Engineering", plus a tutorial and a technical briefing on "Source Code Summarization".

 

Three graduate students from underrepresented categories (one woman and two Hispanics) worked on topics related to this project.  This grant contributed to the training, professional development, and the fostering of networks for the supported graduate students, by allowing the PI to train and involve the students in research and by allowing the students to (i) present their papers at conferences and get feedback on their work from other researchers from academia and industry, (ii) attend research presentations at conferences and enrich their knowledge about the field, and (iii) interact with other students, researchers, and practitioners in the field, therefore building a professional network that can benefit their future careers.  One student graduated with his PhD and another graduated with her MS.

 


Last Modified: 12/29/2019
Modified by: Andrian Marcus

Please report errors in award information by writing to: awardsearch@nsf.gov.
