
Award Abstract # 1160639
Language Preservation 2.0: Crowdsourcing Oral Language Documentation using Mobile Devices

NSF Org: BCS (Division of Behavioral and Cognitive Sciences)
Recipient: TRUSTEES OF THE UNIVERSITY OF PENNSYLVANIA, THE
Initial Amendment Date: June 18, 2012
Latest Amendment Date: June 18, 2012
Award Number: 1160639
Award Instrument: Standard Grant
Program Manager: Shobhana Chelliah
BCS, Division of Behavioral and Cognitive Sciences
SBE, Directorate for Social, Behavioral and Economic Sciences
Start Date: July 1, 2012
End Date: December 31, 2014 (Estimated)
Total Intended Award Amount: $101,501.00
Total Awarded Amount to Date: $101,501.00
Funds Obligated to Date: FY 2012 = $101,501.00
History of Investigator:
  • Mark Liberman (Principal Investigator)
    myl@unagi.cis.upenn.edu
  • Steven Bird (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Pennsylvania
3451 WALNUT ST STE 440A
PHILADELPHIA
PA  US  19104-6205
(215)898-7293
Sponsor Congressional District: 03
Primary Place of Performance: Linguistic Data Consortium
3600 MARKET ST STE 810
PHILADELPHIA
PA  US  19104-2653
Primary Place of Performance Congressional District: 03
Unique Entity Identifier (UEI): GM1XX56LEP58
Parent UEI: GM1XX56LEP58
NSF Program(s): DEL
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7719, SMET
Program Element Code(s): 771900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.075

ABSTRACT

Language Preservation 2.0

The purpose of this pilot project is to demonstrate the feasibility of a new approach to documenting endangered languages.

To allow wide-ranging investigation of a language even after it is no longer spoken, we need the equivalent of the million words of extant biblical Hebrew texts, or the five million words of extant classical Latin. But for endangered languages without a significant culture of literacy, diverse text collections on this scale seem out of reach.

Given typical speaking rates of about 10,000 word-equivalents per hour, a hundred hours of recorded speech -- conversations, narratives, or oral histories -- would give us the equivalent of a million words of text. With community involvement, hundreds of hours of such recordings are easily within reach.
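The corpus-size estimate above can be checked directly; the figures below are the abstract's own round numbers, not measurements:

```python
# Corpus-size arithmetic from the abstract's round numbers.
WORDS_PER_HOUR = 10_000   # typical speaking rate, in word-equivalents per hour
hours_recorded = 100      # a community-scale recording effort

word_equivalents = WORDS_PER_HOUR * hours_recorded
print(word_equivalents)   # -> 1000000, comparable to the extant biblical Hebrew corpus
```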

However, transcribing such large audio collections is a daunting task, given the small number of literate native speakers and the time-consuming nature of the work, which can take 200 hours for every hour of audio. We propose to solve this problem by substituting respeaking and oral translation: one or more native speakers repeats each phrase of a recording, speaking slowly and carefully, and then translates it into a better-documented language.
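Using the abstract's cited ratio, the scale of the transcription bottleneck is easy to quantify (the person-year conversion is an illustrative assumption, not a figure from the abstract):

```python
# Transcription effort at the abstract's cited ratio of 200 hours
# of expert work per hour of audio.
TRANSCRIPTION_RATIO = 200   # hours of transcription work per hour of audio
audio_hours = 100           # the target corpus from the previous estimate

expert_hours = TRANSCRIPTION_RATIO * audio_hours
print(expert_hours)         # -> 20000, roughly ten person-years at 2,000 working hours/year
```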

The utility of translated passages as a way to analyze otherwise-unknown languages has been demonstrated many times, starting with the Rosetta Stone. This aspect of our task is easier, since at least a grammatical sketch will in general be available.

Our goal in this project is to demonstrate the utility of respeaking. We believe that linguists starting out with relatively little knowledge of a language can use these careful respeakings to produce phonetic transcriptions good enough to support subsequent analysis into coherent texts, in a process analogous to (but easier than) the one that allowed previous generations of scholars to learn to read ancient Egyptian or Sumerian.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Thousands of the world's languages are not adequately documented, and the languages are falling out of use more rapidly than linguists can record and transcribe them. This project investigated the problem of scaling up the language documentation effort through crowdsourcing, engaging the members of speech communities to record, respeak, and orally translate their linguistic heritage.

The software, Aikuma, is available from aikuma.org, and won the Open Source Software World Challenge Grand Prize 2013. Field tests were conducted in Papua New Guinea, Brazil, and Nepal. Laboratory experiments demonstrated that the audio collected by the phones is of sufficient quality to support later scientific study.

The project established an effective new way to avoid the usual transcription bottleneck which prevents linguists from transcribing more than a few hours of recordings for any language studied. Instead, the method relies on a protocol known as "careful respeaking", in which someone listens to a previously made recording and carefully repeats what was said, phrase by phrase. Aikuma permits the user to start respeaking at any stage during playback and records what was said, aligning it with the original source. Oral translation works in the same way. Accordingly, each source is associated with additional recordings that can be used by future linguists to perform their transcription and translation work, even once no speakers of the language remain.
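The alignment idea described above can be sketched as a simple data model: each phrase of a source recording is linked, by its time span in the original, to a respoken and an orally translated segment. This is a hypothetical illustration only, not Aikuma's actual data model; all names below are invented.

```python
# Hypothetical sketch of phrase-level alignment (not Aikuma's real schema).
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float           # offset into the source recording, in seconds
    end: float
    respeak_file: str      # careful respeaking of this phrase
    translation_file: str  # oral translation into a better-documented language

@dataclass
class Recording:
    source_file: str
    segments: list = field(default_factory=list)

    def segment_at(self, t: float):
        """Return the aligned segment covering time t, or None."""
        return next((s for s in self.segments if s.start <= t < s.end), None)

rec = Recording("story.wav")
rec.segments.append(Segment(0.0, 4.2, "story_resp_0.wav", "story_trans_0.wav"))
print(rec.segment_at(1.5).respeak_file)  # -> story_resp_0.wav
```

The point of keeping the time spans is that a future linguist, even with no remaining speakers, can replay any phrase of the original alongside its slow, careful repetition and its translation.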

The app is being used in a variety of ongoing language documentation work, more effectively leveraging the human resources of local speech communities, and contributing significantly to the preservation of endangered languages.

Last Modified: 05/19/2015
Modified by: Steven G Bird