NSF Award Search: Award # 1160274

Award Abstract # 1160274

AGGREGATION: Automatic Generation of Grammars for Endangered Languages from Glosses and Typological Information [ctn, ing, inh]

NSF Org:	BCS Division of Behavioral and Cognitive Sciences
Recipient:	UNIVERSITY OF WASHINGTON
Initial Amendment Date:	July 26, 2012
Latest Amendment Date:	February 6, 2014
Award Number:	1160274
Award Instrument:	Standard Grant
Program Manager:	Colleen M. Fitzgerald BCS Division of Behavioral and Cognitive Sciences SBE Directorate for Social, Behavioral and Economic Sciences
Start Date:	September 15, 2012
End Date:	June 30, 2015 (Estimated)
Total Intended Award Amount:	$224,039.00
Total Awarded Amount to Date:	$228,071.00
Funds Obligated to Date:	FY 2012 = $224,039.00 FY 2014 = $4,032.00
History of Investigator:	Emily Bender (Principal Investigator) ebender@u.washington.edu Fei Xia (Co-Principal Investigator)
Recipient Sponsored Research Office:	University of Washington 4333 BROOKLYN AVE NE SEATTLE WA US 98195-1016 (206)543-4043
Sponsor Congressional District:	07
Primary Place of Performance:	University of Washington Dept of Linguistics/Box 354340 Seattle WA US 98195-4340
Primary Place of Performance Congressional District:	07
Unique Entity Identifier (UEI):	HD1WMN6945W6
Parent UEI:
NSF Program(s):	IIS Special Projects, DEL
Primary Program Source:	01001213DB NSF RESEARCH & RELATED ACTIVIT 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1311, 7484, 7719, 9251, SMET
Program Element Code(s):	748400, 771900
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.075

ABSTRACT

The world's linguistic diversity is diminishing at an alarming rate, and there are not enough resources (trained field linguists or funding for them) to document all the endangered languages before they are gone. Thus there is a critical need for software tools that make field linguists more efficient. This project will develop software tools to assist in the documentation of endangered languages by merging two types of resources: collections of linguistic examples curated by linguists and a cross-linguistic computational grammar resource called the Grammar Matrix. The result will be a system for creating machine-readable, or implemented, grammars from data collected and annotated by field linguists.

Implemented grammars can contribute to endangered language documentation in several ways: The grammars themselves provide a very rich resource, allowing linguists to explore analyses at a level of precision not usually achieved in prose descriptions. Furthermore, implemented grammars can be used to create treebanks, that is, collections of utterances associated with syntactic and semantic structures. The process of creating the treebank can provide important feedback to the field linguist about aspects of the linguistic data not covered by current analyses. The resulting treebanks can be used to create further computational tools and are also a rich source of comparable data for qualitative and quantitative work in linguistic typology, grounding higher-level linguistic abstractions in actual utterances in a computationally tractable fashion.

While building an implemented grammar is typically not within the scope of a field linguistics project, field linguists do routinely create collections of examples of glossed, translated text (called interlinear glossed text, or "IGT"), which encapsulate the result of extensive linguistic analysis. This project will further develop computational methods for extracting typological information from IGT, like those pioneered by the RiPLes project (Xia & Lewis 2007, Lewis & Xia 2008), and combine that information with the cross-linguistic resource produced by the Grammar Matrix project (Bender et al. 2002, 2010) to create implemented grammars for endangered languages.

The Division of Information & Intelligent Systems of the Directorate for Computer & Information Science & Engineering is funding this award as part of its commitment to support the development of computational tools and methods for the documentation of endangered languages.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Goodman, Michael Wayne, Joshua Crowgey, Fei Xia and Emily M. Bender. "Xigt: Extensible Interlinear Glossed Text for Natural Language Processing." Language Resources and Evaluation, v.49, 2015, p.455.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of the AGGREGATION project is to support language documentation efforts by automatically creating computational grammars of endangered languages. Computational grammars themselves form a rich, detailed type of documentation while also facilitating further language description through automatically identifying sentences of interest in the collected data. The development of computational grammars by hand is time-consuming, however, and so we seek to provide these benefits through automatic grammar creation on the basis of analyses already performed by field linguists. In particular, we are building on a cross-linguistic computational grammar resource called the LinGO Grammar Matrix customization system. This system includes a cross-linguistic core grammar, meant to be useful for any language, and a series of "libraries" of analyses of cross-linguistically variable phenomena. Users of the system can access the libraries by filling out a web-based questionnaire on the basis of which the system produces a customized grammar.

In Phase I, the AGGREGATION project has developed prototype software to answer the LinGO Grammar Matrix's questionnaire directly on the basis of endangered language data collected and annotated by field linguists. These annotations include both a translation into English and a "gloss" line that indicates the meaning or grammatical function of each word (or meaningful part of a word). Our prototype system was tested on an annotated collection of data from Chintang, an endangered language of Nepal. It is able to create a grammar that can be used to analyze a portion of the sentences in that corpus, relating the sentences to representations of their semantics.
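To make the annotation format concrete, here is a minimal Python sketch of a single glossed example and of pairing each morpheme with its gloss. It is an illustration only, not the project's Xigt data model: the `IGT` class and the `align` function are hypothetical names, the sketch assumes the common convention that hyphens separate morphemes and that the language and gloss lines match word for word, and the sample sentence is a simple Japanese example rather than data from the project's Chintang corpus.

```python
from dataclasses import dataclass


@dataclass
class IGT:
    """One glossed example (hypothetical structure, not the Xigt model)."""
    text: str         # language line, morphemes separated by hyphens
    gloss: str        # gloss line, aligned word by word with the language line
    translation: str  # free English translation


def align(igt):
    """Pair each morpheme with its gloss label, word by word."""
    words = igt.text.split()
    glosses = igt.gloss.split()
    if len(words) != len(glosses):
        raise ValueError("language and gloss lines have different word counts")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphs = word.split("-")
        labels = gloss.split("-")
        if len(morphs) != len(labels):
            raise ValueError(f"morpheme/gloss mismatch in {word!r}")
        aligned.append(list(zip(morphs, labels)))
    return aligned


# Illustrative Japanese example (not project data).
example = IGT(
    text="inu-ga neko-o mi-ta",
    gloss="dog-NOM cat-ACC see-PST",
    translation="The dog saw the cat.",
)
pairs = align(example)
```

Here `pairs` holds one list per word, each containing (morpheme, gloss) tuples. Real field data often breaks the one-to-one assumption, which is why the sketch raises an error rather than guessing at an alignment.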

Building this prototype system required the development of several component pieces, including: (i) Xigt: A data model and XML serialization for storing annotated data in a way that is both efficient to work with computationally and flexible enough that we can add enrichments to the annotations (such as grammatical analyses of the English translations); (ii) Algorithms for inferring linguistic properties of a language from annotated data, such as what type of case marking system it has (if any) or how it orders the major constituents of a sentence (subject, object and verb); (iii) Algorithms for inferring the internal structure of words in a language on the basis of annotated data; (iv) Methodologies for evaluating the success of these algorithms.
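As a rough illustration of item (ii), the Python sketch below guesses the dominant order of subject, object and verb by counting the orders observed across sentences. It is not the project's algorithm. It assumes a hypothetical preprocessing step has already labeled each word with a role ("S", "O", "V", or something else), and the function name `infer_word_order` and the `threshold` parameter are inventions for this example.

```python
from collections import Counter


def infer_word_order(role_sequences, threshold=0.5):
    """Guess the dominant order of subject (S), object (O) and verb (V).

    role_sequences: one list of role labels per sentence (assumed input).
    Returns an order such as "SOV", "free" if no single order accounts for
    more than `threshold` of the usable sentences, or None if no sentence
    contains exactly one S, one O and one V.
    """
    counts = Counter()
    for roles in role_sequences:
        # Keep only the core constituents, in the order they appear.
        core = [r for r in roles if r in ("S", "O", "V")]
        # Use the sentence only if it has exactly one of each.
        if sorted(core) == ["O", "S", "V"]:
            counts["".join(core)] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    order, n = counts.most_common(1)[0]
    return order if n / total > threshold else "free"


# Hypothetical role-labeled sentences.
sentences = [
    ["S", "O", "V"],
    ["S", "ADV", "O", "V"],
    ["O", "S", "V"],
    ["S", "V"],  # skipped: no object
]
dominant = infer_word_order(sentences)
```

Counting with a threshold is a deliberately simple design choice: it separates languages with one clearly dominant order from those with flexible order, but it says nothing about how the role labels are obtained, which is the harder part of the problem.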

Last Modified: 09/16/2015
Modified by: Emily M Bender
