Award Abstract # 2007656
NSF-BSF: Collaborative Research: RI: Small: Multilingual Language Generation via Understanding of Code Switching

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: UNIVERSITY OF PITTSBURGH - OF THE COMMONWEALTH SYSTEM OF HIGHER EDUCATION
Initial Amendment Date: August 3, 2020
Latest Amendment Date: August 3, 2020
Award Number: 2007656
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky
IIS Division of Information & Intelligent Systems
CSE Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2024 (Estimated)
Total Intended Award Amount: $154,193.00
Total Awarded Amount to Date: $154,193.00
Funds Obligated to Date: FY 2020 = $154,193.00
History of Investigator:
  • Melinda Fricke (Principal Investigator)
    melinda.fricke@pitt.edu
Recipient Sponsored Research Office: University of Pittsburgh
4200 FIFTH AVENUE
PITTSBURGH
PA  US  15260-0001
(412)624-7400
Sponsor Congressional District: 12
Primary Place of Performance: University of Pittsburgh
PA  US  15213-2303
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): MKAGLD59JRL1
Parent UEI:
NSF Program(s): Robust Intelligence
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 014Z, 7495, 7923
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Human language technology has recently matured to the extent that computational systems can generally interact with users in ways that are natural to humans, not just to machines. However, most people in the world today are multilingual, and current approaches to language technology do not reflect the reality that multilingual communication is ubiquitous; that is, current technology can interact naturally with monolingual speakers, but not with multilingual ones. Computational systems should be able to generate language that sounds equally natural to these users, and this includes being able to accommodate nonnative speakers. This project first creates a large-scale, broad-coverage dataset, reflecting conversations between humans and an automatic system that is sophisticated enough to generate fluent multilingual (i.e., 'code-switched') utterances, but is simple enough for controlled experiments. The dataset is far larger than those currently available, and is based on a much more detailed understanding of language-switching strategies. Second, this dataset is used to develop new methods to incorporate code-switching into contemporary deep-learning language generation, including dialogue systems, question answering, assistive technologies, summarization and machine translation. This innovation should benefit a large number of multilingual computer users, including less privileged users who are currently required to interact with machines in a language they do not speak fluently. Successful completion of the research program will pave the way for the development of natural language technologies that are more accommodating to such users, building bridges over the digital divide.

The overarching goal of this project is to develop multilingual and contextualized language generation technologies that are more controllable and more adaptable to multilingual users. The project achieves this goal by completing the following objectives. (1) It develops psycholinguistically grounded, scalable approaches to collecting corpora for studying how multilingual speakers adapt to each other's linguistic choices in text conversations. These methodologies are employed to collect large-scale, rich datasets of multilingual human-machine conversations. These datasets, as well as additional corpora of human code-switched interactions, should shed new light on the theoretical understanding of cross-lingual usage patterns, allowing for better understanding of how people employ code-switching in written language. (2) It uses the linguistic insights obtained through this endeavor to define classifiers that predict code-switching. (3) It develops novel approaches for efficient, large-vocabulary neural language generation that incorporate these classifiers, allowing generation systems to introduce code-switching in a way that sounds natural to multilingual users. Consequently, this project should substantially advance our understanding of code-switching, especially in the relatively unexplored territory of written dialogue. In addition, its contributions benefit a broad range of applications that rely on language generation, including dialogue systems, question answering, assistive technologies, summarization and machine translation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Berríos, Juan and Swain, Angela and Fricke, Melinda "Implementing the map task in applied linguistics research: What, how, and why" Research Methods in Applied Linguistics, v.2, 2023 https://doi.org/10.1016/j.rmal.2023.100081
Ostapenko, Alissa and Wintner, Shuly and Fricke, Melinda and Tsvetkov, Yulia "Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching" Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022 https://doi.org/10.18653/v1/2022.acl-long.267

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The primary goal of this project was to improve language generation technology for multilingual users. Most people are multilingual, yet language generation technologies (like chat bots) assume that users will prefer to communicate in a single language - often English, which is not even the most common native language in the world.

In the first phase of the project, we aimed to better understand why bilinguals switch languages (or "code-switch") when conversing with other bilinguals, and to improve computers' ability to accurately predict when code-switches will take place. We trained state-of-the-art deep-learning models to predict code-switch points within an already-existing corpus of Spanish-English conversations, testing several new approaches to improving such predictions (see Figure 1 from Ostapenko, Wintner, Fricke, & Tsvetkov, 2022). Our main findings were that (1) predictions were better when models were given information about speakers' language background and preferences, and (2) our speaker-aware models were also able to learn useful code-switching-related linguistic cues (which the non-speaker-aware models missed).
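
To make the prediction task concrete, the sketch below shows one simple way a "code-switch point" could be labeled in a language-tagged utterance: a token is marked 1 if the following token is in a different language. This is an illustrative assumption only; the function name, the tag set ("en", "es"), and the example utterance are hypothetical, and the labeling scheme actually used in Ostapenko et al. (2022) may differ.

```python
from typing import List, Tuple

def switch_point_labels(tagged: List[Tuple[str, str]]) -> List[int]:
    """Return one 0/1 label per token.

    A token is labeled 1 if the NEXT token carries a different
    language tag (i.e., a switch occurs right after it), else 0.
    The final token is always 0 because nothing follows it.
    """
    labels = []
    for i in range(len(tagged)):
        has_next = i + 1 < len(tagged)
        if has_next and tagged[i][1] != tagged[i + 1][1]:
            labels.append(1)
        else:
            labels.append(0)
    return labels

# Hypothetical Spanish-English utterance with per-token language tags.
utterance = [
    ("I", "en"), ("want", "en"), ("to", "en"),
    ("ir", "es"), ("a", "es"), ("la", "es"),
    ("store", "en"),
]
labels = switch_point_labels(utterance)
print(labels)  # [0, 0, 1, 0, 0, 1, 0]
```

Under this framing, a model would be trained to predict each label from the preceding context, and a speaker-aware model would additionally receive information about the speaker's language background and preferences.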

Next, we began developing a new method for collecting our own large-scale data sets of multilingual conversation. To collect data related to the needs and preferences of multilingual users, we adapted a well-known task from the linguistic literature, the Map Task, for use in text conversations with Spanish-English bilinguals. The Map Task has often been used by linguists for studying patterns in human conversations, because it results in natural, conversational speech while still being somewhat constrained. In the standard version of the task, one participant (the "direction giver") has a map with a path passing through various landmarks, while the other participant (the "direction follower") has a map with no path. The goal is for the direction giver to successfully direct the follower along the correct path, without either participant being able to see the other's map.

In Berríos, Swain, and Fricke (2023), we explained how we developed our maps using extensive testing with English and Spanish speakers. We verified that our maps resulted in linguistically interesting dialogs, and also documented the types of vocabulary items they elicited. We provided a thorough overview of methodological considerations for researchers wishing to develop their own Map Task, and described the task's utility for researchers working in a variety of research areas.

In the next phase, we used our maps to collect a large corpus of English-Spanish human-machine text dialogs (see Figure 1 from Geckt, Fricke, & Wintner, under review). To this end, we also developed a chat bot capable of completing the Map Task while conversing in code-switched English and Spanish (see Figure 2a from Geckt et al.). In the course of developing our bot, we tested how participants responded to different "strategies" of code-switching; that is, we programmed our bot to code-switch following different linguistic patterns, to see whether such differences affected participants' task performance or satisfaction.

We found that users strongly preferred code-switching patterns that aligned with their previous experience and expectations. When our bot switched languages randomly, users reported lower task enjoyment and more difficulty communicating with the bot. When our bot consistently used a grammatical structure that has previously been described in linguistic studies of Spanish-English code-switching, users completed the task faster and more accurately; when the bot used an analogous structure that has not been attested (and is therefore quite unexpected), users reported lower task success and enjoyment, and greater difficulty communicating.

Our findings underscore the importance of language technology "getting it right". If chat bots are unable to produce the linguistic patterns expected by their users, then user satisfaction, ability to communicate and understand, and even non-linguistic task performance will suffer. Language technologies must therefore strive to be maximally adaptable to their users, and, importantly in our highly multilingual world, this must include the ability to adapt to multilingual users. Our work has already produced a large data set (1,556 dialogs) that can help researchers understand multilingual users' language preferences during human-machine interactions, and we have made this data set publicly available, along with the computer code and documentation necessary to implement our multilingual chat bot (https://github.com/HaifaCLG/MapTask). Our corpus alone makes a valuable contribution to the fields of bilingualism, corpus linguistics, and language technology; by documenting and sharing our methods, we also enable future researchers to build on our progress and develop their own multilingual chat interfaces and large, multi-purpose corpora.

Finally, by taking steps to improve language generation methods for multilingual speakers and those who use English as a second language, our project not only helps improve technological accessibility for potentially marginalized populations; it also advances the state of the art more generally by moving towards more adaptable, person-aware language technology that can benefit all users.


Last Modified: 02/05/2025
Modified by: Melinda Fricke

Please report errors in award information by writing to: awardsearch@nsf.gov.
