Award Abstract # 1760475
STREAMLInED: Shared Tasks for Rapid, Efficient Analysis of Many Languages in Emerging Documentation

NSF Org: BCS
Division of Behavioral and Cognitive Sciences
Recipient: UNIVERSITY OF WASHINGTON
Initial Amendment Date: June 8, 2018
Latest Amendment Date: April 12, 2024
Award Number: 1760475
Award Instrument: Standard Grant
Program Manager: Rachel M. Theodore
rtheodor@nsf.gov
(703)292-4770
BCS Division of Behavioral and Cognitive Sciences
SBE Directorate for Social, Behavioral and Economic Sciences
Start Date: June 15, 2018
End Date: September 30, 2024 (Estimated)
Total Intended Award Amount: $124,985.00
Total Awarded Amount to Date: $124,985.00
Funds Obligated to Date: FY 2018 = $124,985.00
History of Investigator:
  • Gina-Anne Levow (Principal Investigator)
    levow@uw.edu
  • Emily Bender (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Washington
4333 BROOKLYN AVE NE
SEATTLE
WA  US  98195-1016
(206)543-4043
Sponsor Congressional District: 07
Primary Place of Performance: University of Washington
4333 Brooklyn Ave NE
Seattle
WA  US  98195-0001
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): HD1WMN6945W6
Parent UEI:
NSF Program(s): Robust Intelligence, DEL
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1311, 7495, 7556, 7719, 9179
Program Element Code(s): 749500, 771900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.075

ABSTRACT

This project aligns the research interests of two separate scientific and engineering communities in order to push the boundaries of automatic speech processing technology and bring its benefits to the urgent task of endangered language documentation. Automatic speech processing technology has become familiar in the everyday lives of many speakers of English and other widely spoken languages through tools such as automatic captioning and voice-driven personal assistants. Meanwhile, linguists are rushing to document and analyze the thousands of languages that, by the end of this century, will no longer be acquired by children. Such work would be greatly assisted by automatic processing of recorded spoken endangered language data. Modern automatic speech processing tools, however, require training data sets orders of magnitude larger than what is available for endangered languages. This project will advance scientific knowledge on this problem by structuring a "shared task evaluation challenge" around language documentation-based data sets. Better language documentation puts communities in a better position to undertake language revitalization, which in turn can be a key component of community development for marginalized populations. Broader impacts also include the benefits of bringing speech technology that works with small data sets to widely spoken but understudied languages, which are often languages of communication in regions of geopolitical and economic importance to national interests.


Language documentation projects typically begin with large quantities of recorded speech. Turning that spoken signal into a transcribed form is a major bottleneck in the language documentation process. Similarly, language archives house recorded, unanalyzed data from many languages that have no living fluent speaker but whose communities are interested in revitalizing their heritage languages. At the same time, the development of technology that can work effectively with very small training data sets is an open and interesting challenge for speech researchers. The shared task evaluation challenge framework provides the structure of a friendly competition in which different research groups can explore and compare approaches that are evaluated with standardized data and metrics. This strategy for focusing research effort has advanced the frontiers of language technology for decades. This project will apply it for the first time to the specific challenges of endangered language documentation: working with truly low-resource languages, often with noisy or otherwise imperfect recording conditions. The specific tasks the challenge will focus on include: identifying the language and speaker of each segment of a recording, identifying the genre (e.g., storytelling vs. dialogue) of segments of recordings, and aligning short partial transcriptions to the spoken recordings. The researchers will prepare the data (based on existing data sets identified in language archives), set up functioning baseline systems that task participants can use for comparison and/or build on further, establish evaluation metrics, and execute the shared task. The shared task structure will encourage and support participants in making their contributions open source, with an eye towards ensuring they are available to language documentation researchers. The project will also include outreach to the language documentation community in order to train such researchers in the use of the technology developed.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Levow, Gina-Anne. "Assessing Pre-Built Speaker Recognition Models for Endangered Language Data." 2024.
Levow, Gina-Anne. "Investigating Speaker Diarization of Endangered Language Data." Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, 2023.
Levow, Gina-Anne and Ahn, Emily and Bender, Emily M. "Developing a Shared Task for Speech Processing on Endangered Languages." 4th Workshop on Computational Methods for Endangered Languages, 2021.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Language description and documentation efforts are currently at a critical stage. There are probably not enough linguist-hours left to assist all communities interested in documenting languages not currently being transmitted inter-generationally. At the same time, recent advances in natural language processing and speech technology hold the promise of being able to automate many of the more repetitive tasks taken on by field linguists, greatly speeding up the process of documenting a particular language and allowing more thorough description of more languages before they lose their last fully fluent speakers.

Based on discussions among researchers in endangered languages and those in speech and language technologies at the Shared Task Evaluation Campaigns with Endangered Language Data (EL-STEC) workshop held in 2016, the project team identified key pain points in endangered language research as well as tasks in speech and language processing that address these issues. The current project focused on creating shared tasks that support the development of speech and language technologies to accelerate the processing of new speech materials for language documentation. The project addressed a cascade of speech processing tasks that take raw recordings of endangered language speech through steps such as speaker diarization, speaker identification, language identification, and transcription alignment, in order to support semi-automatic annotation and enrichment of endangered language recordings.
 
For the various subtasks, the project created corresponding shared tasks to focus research effort in the speech and language processing community. The project developed a pipeline to transform recordings and annotations from existing endangered language archival material into standardized training and test data sets. The project further implemented baseline systems for the shared tasks and developed a methodology for encapsulating these baselines in virtual machines, to lower the barrier to entry for task participants and to facilitate the dissemination of resulting systems to various users. As a final step, evaluation suites of test data and metrics were also created, and a methodology for deployment to public shared-task platforms was devised. These strategies were developed and tested on a range of data sets drawn from multiple geographic regions and language typologies, providing a rich and challenging basis to push the state-of-the-art. The project also demonstrated integration of baseline systems with interactive annotation tools widely used in language documentation, making the outcomes of the project available to a larger user base.
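
As a rough illustration of what "standardized training and test data sets" and "metrics" can mean in practice, the following minimal Python sketch shows one common way to split archival recordings and score a language identification system. It is not the project's actual pipeline or metric, which this report does not specify. The `Segment` fields, the hash-based split, the 20% test fraction, and the duration-weighted accuracy measure are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated stretch of a recording (all field names are hypothetical)."""
    recording_id: str  # identifier of the source recording in the archive
    start: float       # segment start time, in seconds
    end: float         # segment end time, in seconds
    language: str      # reference language label for the segment


def assign_split(recording_id: str, test_fraction: float = 0.2) -> str:
    """Assign a whole recording to 'train' or 'test' by hashing its identifier.

    Splitting by recording (rather than by segment) keeps segments from the
    same recording out of both sets. The 0.2 default is an assumed value.
    """
    digest = hashlib.sha256(recording_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def weighted_accuracy(reference, predicted_labels):
    """Fraction of annotated time whose predicted language matches the reference."""
    total = sum(seg.end - seg.start for seg in reference)
    if total == 0:
        return 0.0
    correct = sum(
        seg.end - seg.start
        for seg, label in zip(reference, predicted_labels)
        if label == seg.language
    )
    return correct / total


if __name__ == "__main__":
    # Toy reference annotations with placeholder labels, not real project data.
    reference = [
        Segment("rec1", 0.0, 3.0, "lang_a"),
        Segment("rec1", 3.0, 4.0, "lang_b"),
    ]
    print(assign_split("rec1"))
    print(weighted_accuracy(reference, ["lang_a", "lang_a"]))
```

Hashing the recording identifier makes the split reproducible across participants without distributing a separate split file. Weighting by duration keeps many very short segments from dominating the score.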

The project further conducted detailed comparisons and analysis of baseline systems, encompassing both classical machine learning approaches and current state-of-the-art, typically more data-hungry neural network methods. Experiments demonstrated successful application of the methods to the project's newly created multilingual endangered language data sets and also demonstrated that the strongest neural network models significantly outperformed the classical models on this new data, despite the small size of the data sets. Importantly, however, those systems' effectiveness on the project's novel endangered language data sets lagged significantly behind that observed on the high-resource language data on which they had been developed and previously evaluated. The findings highlight both the current limitations and the potential strength of these speech processing tools for endangered language data, and they demonstrate the value of these new data sets and tasks in further propelling the state-of-the-art.

By creating shared tasks in the context of endangered language archive materials, the project contributes to the development of systems that are useful in language documentation and language reclamation projects. The creation of a structured shared task draws the attention of system developers and thus can help guide the research community towards identifying methodologies and technologies that are applicable to the specific needs of language documentation projects. The development of these shared tasks has laid the foundation for future improvements in spoken language processing for endangered languages, with the goal of engaging participants in speech and language technology to push forward the state-of-the-art in several language processing tasks that can be integrated into tools to support the needs of end users.


Last Modified: 02/18/2025
Modified by: Gina-Anne Levow
