Award Abstract # 2031736
RAPID: Advanced Topic Modeling Methods to Analyze Text Responses in COVID-19 Survey Data

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: UNIVERSITY OF MARYLAND, COLLEGE PARK
Initial Amendment Date: May 4, 2020
Latest Amendment Date: July 17, 2020
Award Number: 2031736
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky
IIS (Division of Information & Intelligent Systems)
CSE (Directorate for Computer and Information Science and Engineering)
Start Date: May 15, 2020
End Date: April 30, 2023 (Estimated)
Total Intended Award Amount: $176,785.00
Total Awarded Amount to Date: $191,785.00
Funds Obligated to Date: FY 2020 = $191,785.00
History of Investigator:
  • Philip Resnik (Principal Investigator)
    resnik@umd.edu
Recipient Sponsored Research Office: University of Maryland, College Park
3112 LEE BUILDING
COLLEGE PARK
MD  US  20742-5100
(301)405-6269
Sponsor Congressional District: 04
Primary Place of Performance: University of Maryland College Park
MD  US  20742-5103
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): NPU8ULVAAS23
Parent UEI: NPU8ULVAAS23
NSF Program(s): COVID-19 Research, Robust Intelligence
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
010N2021DB R&RA CARES Act DEFC N
Program Reference Code(s): 096Z, 7495, 7914
Program Element Code(s): 158Y00, 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
Note: This Award includes Coronavirus Aid, Relief, and Economic Security (CARES) Act funding.

ABSTRACT

As the COVID-19 pandemic continues, public and private organizations are deploying surveys to inform responses and policy choices. Survey designs using multiple-choice responses are by far the most common -- "open-ended" questions, where survey participants provide a longer-form written response, are used far less. This is true despite the fact that unconstrained spoken or text responses can yield richer, fine-grained information clarifying the other responses, as well as useful "bottom up" information that the survey designers did not know to ask for. A key problem is that analyzing the unstructured language in open-ended responses is a labor-intensive process, creating obstacles to using them, especially when speedy analysis is needed and resources are limited. Computational methods can help, but they often fail to provide coherent, interpretable categories, or they can fail to do a good job of connecting the text in the survey with the closed-end responses. This project will develop new computational methods for fast and effective analysis of survey data that includes text responses, and it will apply these methods to support organizations doing high-impact survey work related to COVID-19 response. This will improve these organizations' ability to understand and mitigate the impact of the COVID-19 pandemic.

This project's technical approach builds on recent techniques bringing together deep learning and Bayesian topic models. Several key technical innovations will be introduced that are specifically geared toward improving the quality of information available in surveys that include both closed- and open-ended responses. A common element in these approaches is the extension of methods commonly used in supervised learning settings, such as task-based fine-tuning of embeddings and knowledge distillation, to unsupervised topic modeling, with a specific focus on producing diverse, human-interpretable topic categories that are well aligned with discrete attributes such as demographic characteristics, closed-end responses, and experimental condition. Project activities include assisting in the analysis of organizations' survey data, conducting independent surveys aligned with their needs to obtain additional relevant data, and the public release of a clean, easy-to-use computational toolkit facilitating more widespread adoption of these new methods.
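As a rough illustration of the alignment with discrete attributes described above, the sketch below shows a minimal VAE-style neural topic model with an auxiliary head that predicts a closed-end response from the inferred topic proportions. This is an assumption about how such alignment could be implemented, not the project's actual model or code; the class name, layer sizes, and loss weighting are invented for the example.

    # Illustrative sketch only (PyTorch): a minimal VAE-style neural topic model with
    # an auxiliary head that predicts a discrete attribute (e.g., a closed-end response)
    # from the inferred topic proportions, so that the learned topics align with that
    # attribute. Names and settings here are invented for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopicModelWithCovariate(nn.Module):
        def __init__(self, vocab_size, num_topics, num_classes, hidden=200):
            super().__init__()
            # Encoder: bag-of-words counts -> variational parameters over the topic space
            self.encoder = nn.Sequential(
                nn.Linear(vocab_size, hidden), nn.Softplus(),
                nn.Linear(hidden, hidden), nn.Softplus(),
            )
            self.mu = nn.Linear(hidden, num_topics)
            self.logvar = nn.Linear(hidden, num_topics)
            # Decoder: topic proportions -> distribution over words
            self.beta = nn.Linear(num_topics, vocab_size, bias=False)
            # Auxiliary head: topic proportions -> discrete attribute
            self.clf = nn.Linear(num_topics, num_classes)

        def forward(self, bow):
            h = self.encoder(bow)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            theta = F.softmax(z, dim=-1)                             # document-topic proportions
            return self.beta(theta), self.clf(theta), mu, logvar

    def loss_fn(bow, labels, word_logits, class_logits, mu, logvar, align_weight=1.0):
        # Reconstruction of the bag of words from the topic proportions
        recon = -(bow * F.log_softmax(word_logits, dim=-1)).sum(dim=-1).mean()
        # KL divergence to a standard normal prior over the latent topic space
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        # Alignment term tying topics to the discrete attribute (e.g., closed-end response)
        align = F.cross_entropy(class_logits, labels)
        return recon + kl + align_weight * align

The knowledge distillation mentioned above (see the EMNLP 2020 publication listed under Publications) additionally transfers information from a pretrained language model into the topic model; that component is omitted from this sketch for brevity.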

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Hoyle, A. and Goel, P. and Hian-Cheong, A. and Peskov, D. and Boyd-Graber, J. and Resnik, P. "Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence" Advances in Neural Information Processing Systems, 2021
Hoyle, Alexander Miserlis and Goel, Pranav and Resnik, Philip "Improving Neural Topic Models using Knowledge Distillation" Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020 https://doi.org/10.18653/v1/2020.emnlp-main.137
Squires, Allison and Clark-Cutaia, Maya and Henderson, Marcus and Arneson, Gavin and Resnik, Philip "'Should I stay or should I go?' Nurses' Perspectives About Working During the Covid-19 Pandemic in the United States: A Summative Content Analysis Combined with Topic Modelling" International Journal of Nursing Studies, 2022 https://doi.org/10.1016/j.ijnurstu.2022.104256

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Survey data has been, and continues to be, essential for public and private organizations in informing responses and policy choices related to the COVID-19 pandemic, and it is equally important for researchers seeking to understand the pandemic's impact. In general, surveys tend to use closed-end responses like multiple-choice questions or ratings on a 1-to-5 scale; open-ended questions, eliciting a written response, are used far less. This is true despite the fact that unconstrained text responses can yield rich and nuanced information as well as bottom-up information that the survey designers did not know to ask about.

One of the central obstacles to using open-ended questions, though, is that analyzing people's language in survey responses is a costly and labor-intensive enterprise. Generally, analysts read all the responses in a slow, systematic process in order to manually identify themes or categories of response. In more rigorous analyses, additional effort is expended to establish that multiple analysts agree on what the categories are and that they make sense. As an alternative, there are automated ways to identify categories in bodies of text, and such methods are highly scalable; in practice, however, there is no single accepted way to automate the process, and many survey analysts are dissatisfied with the ability of fully automatic techniques to produce sufficiently high-quality results. In addition, there tends to be a sociological gap between the survey research and technological research communities, resulting in automated techniques not being widely adopted.

This project developed new computational methods for addressing limitations of open-ended responses in surveys, and it has applied these methods to support partners conducting and analyzing surveys related to the pandemic. On the technical side, the project focused on topic models, a category of computational methods that can extract human-interpretable categories from large collections of text. One of the project's surprising findings was that, although newer "neural network" topic models had been claimed to improve on the earlier topic models introduced in the early 2000s, flaws in the way that neural models are evaluated cast significant doubt on the validity of those claims. This led to a new approach to computational model evaluation that is much more directly tied to the processes that survey researchers actually follow when they analyze text responses in the traditional, labor-intensive way. Careful and comprehensive experimentation found that the earlier "classical" models are actually superior to the newer neural methods on the criteria that survey analysts care about. Building on that finding to address, more generally, the crucial issue of analyst trust in the results of such computational methods, the project developed TOPCAT (Topic-Oriented Protocol for Content Analysis of Text), a software-enabled, human-centered protocol designed around traditional qualitative content analysis, with the goal of widespread utility for "qual" researchers who analyze open-ended responses and other collections of text.
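To make that comparison concrete, the sketch below shows the kind of "classical" topic-modeling step referred to above: a standard LDA model fit to open-ended responses, with each topic's top words printed for analyst review. The toy responses, parameter settings, and library choice (scikit-learn) are assumptions for illustration; this is not the TOPCAT software itself, which wraps this kind of modeling in a fuller human-centered protocol.

    # Illustrative sketch only (scikit-learn): fit a classical LDA topic model to
    # open-ended survey responses and print each topic's top words for analyst review.
    # The toy responses and parameter choices below are assumptions for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    responses = [
        "I worry about exposure at work and bringing the virus home to my family",
        "Childcare has been the hardest part since the schools closed",
        "My biggest concern is losing income if my hours get cut",
        # ... in practice, thousands of open-ended responses
    ]

    # Bag-of-words representation of the responses
    vectorizer = CountVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(responses)

    # Classical LDA with a small number of topics (chosen with analyst input in practice)
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(doc_term)

    # Print the highest-weight words in each topic for the analyst to label
    vocab = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top_words = [vocab[i] for i in topic.argsort()[::-1][:8]]
        print(f"Topic {k}: {', '.join(top_words)}")

In a TOPCAT-style workflow, output like this is a starting point: analysts review, label, merge, and discard candidate topics following a documented protocol, rather than accepting the model's categories as-is.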

In addition to its technical aims, the project aimed from the very beginning to use its methods, and the knowledge being developed, on an ongoing basis to help organizations "in the trenches" with their surveys during the pandemic. This included successfully assisting in the analysis of open-ended responses for a number of surveys: one in collaboration with the CDC/National Center for Health Statistics; one conducted by the Pandemic Crisis Services Response Coalition, a national organization focused on mental health issues; one conducted by the NYU School of Nursing looking at the experiences of front-line healthcare providers; a nationally representative survey conducted by Westat and the Stanford Medical School on the social and economic impact of COVID-19; a large survey by the Global Consortium of Nursing & Midwifery Studies with both English and Spanish responses from the U.S. and Latin America (with further plans to continue working with surveys in multiple languages being conducted in the Caribbean, Europe, the Middle East, Asia, and Africa); and a second nationally representative survey on COVID impact designed and deployed in collaboration with Westat. Although not directly related to COVID-19, TOPCAT was also used to analyze 16,000 responses in a Reddit thread asking formerly suicidal Redditors what had gotten them through the dark times, with two suicide prevention experts providing the necessary subject matter expertise; the analysis yielded insights that will be valuable to suicidologists and clinicians encountering people in suicidal crisis.

This project supported the training of several graduate students, and the TOPCAT protocol, including detailed analyst instructions and supporting software for the end-to-end process, is being made publicly available.

Last Modified: 08/07/2023
Modified by: Philip Resnik
