Award Abstract # 1464553
RI: Small: Language Induction meets Language Documentation: Leveraging bilingual aligned audio for learning and preserving languages

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: UNIVERSITY OF NOTRE DAME DU LAC
Initial Amendment Date: October 29, 2014
Latest Amendment Date: April 30, 2015
Award Number: 1464553
Award Instrument: Continuing Grant
Program Manager: D. Langendoen (IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering)
Start Date: September 1, 2014
End Date: September 30, 2018 (Estimated)
Total Intended Award Amount: $470,000.00
Total Awarded Amount to Date: $470,000.00
Funds Obligated to Date: FY 2014 = $470,000.00
History of Investigator:
  • David Chiang (Principal Investigator)
    dchiang@nd.edu
Recipient Sponsored Research Office: University of Notre Dame
940 GRACE HALL
NOTRE DAME
IN  US  46556-5708
(574)631-7432
Sponsor Congressional District: 02
Primary Place of Performance: University of Notre Dame
940 Grace Hall
Notre Dame
IN  US  46556-5708
Primary Place of Performance Congressional District: 02
Unique Entity Identifier (UEI): FPU6XGFXMBE9
Parent UEI: FPU6XGFXMBE9
NSF Program(s): Robust Intelligence, DEL
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7923, 7495
Program Element Code(s): 749500, 771900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Thousands of the world's languages are in danger of dying out before they have been systematically documented. Many other languages have millions of speakers, yet they exist only in spoken form, and minimal documentary records are available. As a consequence, important sources of knowledge about human language and culture are inaccessible, and at risk of being lost forever. Moreover, it is difficult to develop technologies for processing these languages, leaving their speech communities on the far side of a widening digital divide. The first step to solving these problems is language documentation, and so the goal of this project is to develop computational methods based on automatic speech recognition and machine translation for documenting endangered and unwritten languages on an unprecedented scale.

To be successful, any approach must guarantee both the sufficiency and the interpretability of the documentation it produces. This project ensures sufficiency by combining community outreach, crowdsourcing techniques, and mobile/web technologies to collect hundreds of hours (millions of words) of speech. It ensures interpretability by augmenting the original speech recordings with careful verbatim repetitions and with translations into a well-resourced language. Finally, computational models are developed to automate transcription of the recordings and their alignment with the translations, resulting in bilingual aligned text. The result is a kind of digital Rosetta Stone: a large-scale key for interpreting the world's languages even if they are not written, or no longer even spoken.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Alexis Michaud, Oliver Adams, Trevor Cohn, Graham Neubig, and Séverine Guillaume "Integrating automatic transcription into the language documentation workflow: experiments with Na data and the Persephone Toolkit" Language Documentation and Conservation, v.12, 2018
Antonios Anastasopoulos and David Chiang "A case study on using speech-to-translation alignments for language documentation" Second Workshop on Computational Methods for Endangered Languages, 2017
Antonios Anastasopoulos and David Chiang "Leveraging translations for speech transcription in low-resource settings" INTERSPEECH 2018, 2018
Antonios Anastasopoulos and David Chiang "Tied Multitask Learning for Neural Speech Translation" NAACL HLT 2018, 2018
Antonios Anastasopoulos, Marika Lekakou, Josep Quer, Eleni Zimianiti, Justin DeBenedetto, and David Chiang "Part-of-Speech Tagging in an Endangered Language: a Parallel Griko-Italian Resource" COLING 2018, 2018
Antonios Anastasopoulos, Sameer Bansal, David Chiang, Sharon Goldwater, and Adam Lopez "Spoken Term Discovery for Language Documentation using Translations" Workshop on Speech Centric Natural Language Processing, 2017
Antonios Anastasopoulos, Long Duong, and David Chiang "An Unsupervised Probability Model for Speech-to-Translation Alignment of Low-Resource Languages" EMNLP, 2016
Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, and Philipp Koehn "Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation" WMT 2018, 2018
Long Duong, Trevor Cohn, Steven Bird, and Paul Cook "A neural network model for low-resource universal dependency parsing" EMNLP, 2015
Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn "An Attentional Model for Speech Translation Without Transcription" NAACL, 2016
Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn "Learning Crosslingual Word Embeddings without Bilingual Corpora" EMNLP, 2016

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of the Lido project was to develop technologies for creating interpretable archives of endangered-language data at scale. More specifically, the aim was to transform a collection of parallel speech (speech in an endangered language together with its spoken translation into English or another contact language) into something resembling interlinear glossed text, i.e., a text transcription of both languages with a word-by-word gloss or word alignment between the two transcriptions.
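
As a concrete illustration (not the project's actual data format), one utterance in such an archive might be represented along the following lines, with both transcriptions and a word-by-word alignment between them; every field name and value below is hypothetical:

    # Hypothetical record for one utterance in an interlinear-glossed-text archive.
    utterance = {
        "audio": "recordings/utt_0001.wav",      # original endangered-language speech
        "source_tokens": ["s0", "s1", "s2"],     # transcription in the endangered language
        "translation_tokens": ["t0", "t1"],      # transcription of the spoken translation
        "alignment": [(0, 0), (1, 0), (2, 1)],   # (source index, translation index) word links
    }

    # A word-by-word gloss can be read off the alignment:
    gloss = {s: [utterance["translation_tokens"][t]
                 for src, t in utterance["alignment"] if src == s]
             for s in range(len(utterance["source_tokens"]))}
    # gloss == {0: ['t0'], 1: ['t0'], 2: ['t1']}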

Intellectual Merit: We demonstrated that in the absence of transcribed speech data, we can directly align speech to its translation. A possible application would be to produce word glosses for speech data, helping linguists to interpret the data. We also showed that with a relatively small amount of transcribed speech data and a small amount of parallel text, we can use a speech transcription model to improve a translation model and vice versa.
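
One simplified way to picture this direct speech-to-translation alignment (a sketch only, not the exact models published under the award) is to train an attentional sequence-to-sequence model whose decoder emits translation words while attending over speech frames, and then read rough word-to-time alignments off the attention weights:

    import numpy as np

    def spans_from_attention(attention, frame_dur=0.01):
        """Turn an attention matrix (translation words x speech frames) into rough
        time spans by assigning each frame to the word that attends to it most
        strongly. frame_dur is seconds per frame (a 10 ms hop is assumed)."""
        frame_to_word = attention.argmax(axis=0)          # hard assignment per frame
        spans = {}
        for word in range(attention.shape[0]):
            frames = np.nonzero(frame_to_word == word)[0]
            if frames.size:
                spans[word] = (frames.min() * frame_dur, (frames.max() + 1) * frame_dur)
        return spans

    # Toy example: 3 translation words attending over 8 speech frames.
    attn = np.array([
        [0.7, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.2, 0.3, 0.8, 0.9, 0.6, 0.1, 0.0, 0.0],
        [0.1, 0.1, 0.1, 0.1, 0.4, 0.9, 1.0, 1.0],
    ])
    print(spans_from_attention(attn))
    # {0: (0.0, 0.02), 1: (0.02, 0.05), 2: (0.05, 0.08)}, up to float rounding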

We also developed a new approach to transcription that replaces onerous phonemic transcription, which can only be performed by trained linguists, with a method called sparse lexical transcription, which is more scalable and can be performed by minimally trained speakers.
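
To make the contrast concrete, a sparse lexical transcript might record only the words a speaker confidently recognizes, each with a rough time span, instead of a dense phoneme-by-phoneme transcription; the sketch below uses hypothetical field names and values:

    # Hypothetical sparse lexical transcript: only recognized words are labeled.
    sparse_transcript = {
        "audio": "recordings/utt_0042.wav",
        "duration": 12.5,                  # seconds
        "labels": [                        # (start, end, word), covering only part of the audio
            (0.8, 1.3, "word_a"),
            (4.1, 4.9, "word_b"),
            (9.0, 9.6, "word_c"),
        ],
    }

    # Fraction of the recording covered by word labels, a rough measure of how
    # much of the utterance has been (sparsely) transcribed.
    covered = sum(end - start for start, end, _ in sparse_transcript["labels"])
    print(f"coverage: {covered / sparse_transcript['duration']:.1%}")  # coverage: 15.2%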

Additionally, we developed new state-of-the-art models for a wide range of other tasks related to low-resource and endangered languages: unsupervised word-spotting, speech transcription of tonal languages, induction of bilingual lexicons and word embeddings, part-of-speech tagging, and dependency parsing.

Broader Impacts: The project has addressed the problem of scaling up the urgent work of documenting the world's languages by: (a) lowering the amount of training data required for computational methods; (b) increasing the productivity of linguists on existing tasks; and (c) developing new documentary methods that can be performed by speakers who have no specialised training.

We have released many of the systems developed in this project as open-source software, including: Persephone, an open-source automatic transcription tool intended for use by linguists working in endangered-language situations; Zahwa, a mobile web app designed for speakers of endangered languages to document procedural knowledge; and Penne, an open-source deep learning toolkit designed for ease of use and clarity of implementation.

Finally, we created and released datasets of audio with translations in several languages, and non-native English text with Spanish translations.


Last Modified: 01/14/2019
Modified by: David W Chiang

