
NSF Org: IIS Division of Information & Intelligent Systems
Recipient: University of Notre Dame
Initial Amendment Date: October 29, 2014
Latest Amendment Date: April 30, 2015
Award Number: 1464553
Award Instrument: Continuing Grant
Program Manager: D. Langendoen, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: September 1, 2014
End Date: September 30, 2018 (Estimated)
Total Intended Award Amount: $470,000.00
Total Awarded Amount to Date: $470,000.00
Recipient Sponsored Research Office: 940 Grace Hall, Notre Dame, IN, US 46556-5708, (574) 631-7432
Primary Place of Performance: 940 Grace Hall, Notre Dame, IN, US 46556-5708
NSF Program(s): Robust Intelligence, DEL
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
Thousands of the world's languages are in danger of dying out before they have been systematically documented. Many other languages have millions of speakers, yet they exist only in spoken form, and minimal documentary records are available. As a consequence, important sources of knowledge about human language and culture are inaccessible, and at risk of being lost forever. Moreover, it is difficult to develop technologies for processing these languages, leaving their speech communities on the far side of a widening digital divide. The first step to solving these problems is language documentation, and so the goal of this project is to develop computational methods based on automatic speech recognition and machine translation for documenting endangered and unwritten languages on an unprecedented scale.
To be successful, any approach must guarantee both the sufficiency and the interpretability of the documentation it produces. This project ensures sufficiency by using a combination of community outreach, crowdsourcing techniques, and mobile/web technologies to collect hundreds of hours (millions of words) of speech. It ensures interpretability by augmenting the original speech recordings with careful verbatim repetitions and with translations into a well-resourced language. Finally, computational models are developed to automate the transcription of the recordings and their alignment with the translations, yielding bilingual aligned text. The result is a kind of digital Rosetta Stone: a large-scale key for interpreting the world's languages even if they are not written, or no longer even spoken.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The goal of the Lido project was to develop technologies for creating interpretable archives of endangered-language data at scale. More specifically, the goal was to transform a collection of parallel speech (speech in an endangered language paired with its spoken translation into English or another contact language) into something resembling interlinear glossed text, i.e., a text transcription of both languages together with a word-by-word gloss or word alignment between the two transcriptions.
Intellectual Merit: We demonstrated that in the absence of transcribed speech data, we can directly align speech to its translation. A possible application would be to produce word glosses for speech data, helping linguists to interpret the data. We also showed that with a relatively small amount of transcribed speech data and a small amount of parallel text, we can use a speech transcription model to improve a translation model and vice versa.
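As a loose illustration of speech-to-translation alignment, the sketch below runs IBM-Model-1-style EM between discretized acoustic units (e.g., cluster labels from an unsupervised unit-discovery step) and the words of the translation. This is a minimal pedagogical sketch, not the project's published model; all names and the toy data are hypothetical.

    # Minimal sketch, not the project's actual model: IBM-Model-1-style EM
    # aligning discretized acoustic units to translation words. All names
    # and data are hypothetical.
    from collections import defaultdict

    def train_unit_word_alignment(pairs, iterations=10):
        """pairs: list of (units, words); units are acoustic-unit IDs for an
        utterance, words is the word list of its translation.
        Returns t[(unit, word)], an estimate of P(unit | word)."""
        t = defaultdict(lambda: 1.0)  # uniform start
        for _ in range(iterations):
            count = defaultdict(float)
            total = defaultdict(float)
            for units, words in pairs:
                for u in units:
                    # E-step: spread each unit's mass over candidate words.
                    z = sum(t[(u, w)] for w in words)
                    for w in words:
                        count[(u, w)] += t[(u, w)] / z
                        total[w] += t[(u, w)] / z
            # M-step: renormalize counts into new probabilities.
            t = defaultdict(float,
                            {(u, w): c / total[w] for (u, w), c in count.items()})
        return t

    # Toy usage: unit "u3" comes to prefer "dog", and "u5" prefers "cat".
    pairs = [(["u7", "u3", "u9"], ["the", "dog"]),
             (["u7", "u5"], ["the", "cat"])]
    t = train_unit_word_alignment(pairs)
    gloss_u3 = max(["the", "dog"], key=lambda w: t[("u3", w)])

Linking each unit to its most probable translation word in this way yields the kind of word gloss described above.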
We also developed a new approach to transcription, called sparse lexical transcription, which replaces onerous phonemic transcription. Phonemic transcription can only be performed by trained linguists; sparse lexical transcription is more scalable and can be performed by minimally trained speakers.
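To make the contrast concrete, here is a small hypothetical sketch of what a sparse lexical transcript might look like as data: rather than an exhaustive phoneme string, an annotator contributes only time-stamped spottings of the words they recognize. The structure and names are illustrative assumptions, not the project's actual format.

    # Hypothetical sketch of a sparse lexical transcript: time-stamped
    # spottings of recognized words rather than a full phonemic transcript.
    from dataclasses import dataclass

    @dataclass
    class Spotting:
        word: str     # lexical item from the growing lexicon
        start: float  # seconds into the recording
        end: float

    def coverage(spottings, duration):
        """Rough completeness measure: fraction of the recording
        covered by spotted words (ignores overlaps)."""
        return sum(s.end - s.start for s in spottings) / duration

    transcript = [Spotting("imas", 3.2, 3.7), Spotting("nuhu", 8.1, 8.6)]
    print(f"{coverage(transcript, 60.0):.1%} of this minute is transcribed")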
Additionally, we developed new state-of-the-art models for a wide range of other tasks related to low-resource and endangered languages: unsupervised word-spotting, speech transcription of tonal languages, induction of bilingual lexicons and word embeddings, part-of-speech tagging, and dependency parsing.
Broader Impacts: The project has addressed the problem of scaling up the urgent work of documenting the world's languages by: (a) lowering the amount of training data required for computational methods; (b) increasing the productivity of linguists on existing tasks; and (c) developing new documentary methods that can be performed by speakers who have no specialised training.
We have released many of the systems developed in this project as open-source software, including: Persephone, an automatic transcription tool intended for use by linguists working in endangered-language situations; Zahwa, a mobile web app designed to let speakers of endangered languages document procedural knowledge; and Penne, a deep learning toolkit designed for ease of use and clarity of implementation.
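For a sense of the intended workflow, here is a hedged usage sketch of Persephone based on its public tutorial around the time of this project; the calls shown (corpus.ReadyCorpus, run.train_ready) are assumptions from that tutorial and may have changed in later versions.

    # Hedged sketch, assuming the Persephone tutorial API circa 2018;
    # consult https://github.com/persephone-tools/persephone for the
    # current interface.
    from persephone import corpus, run

    # A "ready" corpus: WAV files plus label files laid out as the
    # tutorial describes (e.g., its bundled Na example data).
    corp = corpus.ReadyCorpus("data/na_example")

    # Train an acoustic model and report error rates on held-out data.
    run.train_ready(corp)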
Finally, we created and released datasets of audio with translations in several languages, and non-native English text with Spanish translations.
Last Modified: 01/14/2019
Modified by: David W Chiang