Award Abstract # 1160274
AGGREGATION: Automatic Generation of Grammars for Endangered Languages from Glosses and Typological Information [ctn, ing, inh]

NSF Org: BCS
Division of Behavioral and Cognitive Sciences
Recipient: UNIVERSITY OF WASHINGTON
Initial Amendment Date: July 26, 2012
Latest Amendment Date: February 6, 2014
Award Number: 1160274
Award Instrument: Standard Grant
Program Manager: Colleen M. Fitzgerald
BCS Division of Behavioral and Cognitive Sciences
SBE Directorate for Social, Behavioral and Economic Sciences
Start Date: September 15, 2012
End Date: June 30, 2015 (Estimated)
Total Intended Award Amount: $224,039.00
Total Awarded Amount to Date: $228,071.00
Funds Obligated to Date: FY 2012 = $224,039.00
FY 2014 = $4,032.00
History of Investigator:
  • Emily Bender (Principal Investigator)
    ebender@u.washington.edu
  • Fei Xia (Co-Principal Investigator)
Recipient Sponsored Research Office: University of Washington
4333 BROOKLYN AVE NE
SEATTLE
WA  US  98195-1016
(206)543-4043
Sponsor Congressional District: 07
Primary Place of Performance: University of Washington
Dept of Linguistics/Box 354340
Seattle
WA  US  98195-4340
Primary Place of Performance Congressional District: 07
Unique Entity Identifier (UEI): HD1WMN6945W6
Parent UEI:
NSF Program(s): IIS Special Projects, DEL
Primary Program Source: 01001213DB NSF RESEARCH & RELATED ACTIVIT
01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1311, 7484, 7719, 9251, SMET
Program Element Code(s): 748400, 771900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.075

ABSTRACT

The world's linguistic diversity is diminishing at an alarming rate, and there are not enough resources (trained field linguists or funding for them) to document all the endangered languages before they are gone. Thus there is a critical need for software tools that make field linguists' work more efficient. This project will develop software tools to assist in the documentation of endangered languages by merging two types of resources: collections of linguistic examples curated by linguists and a cross-linguistic computational grammar resource called the Grammar Matrix. The result will be a system for creating machine-readable, or implemented, grammars from data collected and annotated by field linguists.

Implemented grammars can contribute to endangered language documentation in several ways: The grammars themselves provide a very rich resource, allowing linguists to explore analyses at a level of precision not usually achieved in prose descriptions. Furthermore, implemented grammars can be used to create treebanks, that is, collections of utterances associated with syntactic and semantic structures. The process of creating the treebank can provide important feedback to the field linguist about aspects of the linguistic data not covered by current analyses. The resulting treebanks can be used to create further computational tools and are also a rich source of comparable data for qualitative and quantitative work in linguistic typology, grounding higher-level linguistic abstractions in actual utterances in a computationally tractable fashion.

While building an implemented grammar is typically not within the scope of a field linguistics project, field linguists do routinely create collections of examples of glossed, translated text (called interlinear glossed text, or "IGT"), which encapsulate the result of extensive linguistic analysis. This project will further develop computational methods for extracting typological information from IGT, like those pioneered by the RiPLes project (Xia & Lewis 2007, Lewis & Xia 2008), and combine that information with the cross-linguistic resource produced by the Grammar Matrix project (Bender et al. 2002, 2010) to create implemented grammars for endangered languages.

The Division of Information & Intelligent Systems of the Directorate for Computer & Information Science & Engineering is funding this award as part of its commitment to support the development of computational tools and methods for the documentation of endangered languages.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Goodman, Michael Wayne, Joshua Crowgey, Fei Xia and Emily M. Bender. "Xigt: Extensible Interlinear Glossed Text for Natural Language Processing." Language Resources and Evaluation, v.49, 2015, p.455

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The goal of the AGGREGATION project is to support language
documentation efforts by automatically creating computational grammars
of endangered languages.  Computational grammars themselves form a
rich, detailed type of documentation while also facilitating further
language description through automatically identifying sentences of
interest in the collected data.  The development of computational
grammars by hand is time-consuming, however, and so we seek to provide
these benefits through automatic grammar creation on the basis of
analyses already performed by field linguists. In particular, we are
building on a cross-linguistic computational grammar resource called
the LinGO Grammar Matrix customization system.  This system includes a
cross-linguistic core grammar, meant to be useful for any language,
and a series of 'libraries' of analyses of cross-linguistically
variable phenomena.  Users of the system can access the libraries by
filling out a web-based questionnaire on the basis of which the system
produces a customized grammar.

In Phase I, the AGGREGATION project has developed prototype software
to answer the LinGO Grammar Matrix's questionnaire directly on the
basis of endangered language data collected and annotated by field
linguists.  These annotations include both a translation into English
and a 'gloss' line that indicates the meaning or grammatical function
of each word (or meaningful part of a word).  Our prototype system was
tested on an annotated collection of data from Chintang, an endangered
language of Nepal.  It is able to create a grammar that can be used to
analyze a portion of the sentences in that corpus, relating the
sentences to representations of their semantics.
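
As a rough illustration of the annotated data described above, the
following Python sketch pairs each morpheme on the language line with
its gloss. The `IGT` class, its field names, and the example item are
assumptions made for illustration (the example words are invented and
do not come from Chintang or any other language); this is not the
project's own data model.

```python
from dataclasses import dataclass


@dataclass
class IGT:
    """One hypothetical item of interlinear glossed text."""
    words: list[str]      # language line; morphemes separated by "-"
    glosses: list[str]    # gloss line, aligned word by word with `words`
    translation: str      # free English translation

    def aligned_morphemes(self):
        """Pair each morpheme with its gloss; raise if the lines do not align."""
        if len(self.words) != len(self.glosses):
            raise ValueError("word and gloss counts differ")
        pairs = []
        for word, gloss in zip(self.words, self.glosses):
            morphs, tags = word.split("-"), gloss.split("-")
            if len(morphs) != len(tags):
                raise ValueError(f"cannot align {word!r} with {gloss!r}")
            pairs.extend(zip(morphs, tags))
        return pairs


# Invented toy item, not data from any real language.
item = IGT(
    words=["kala-ma", "pisu", "toru-n"],
    glosses=["dog-ERG", "fish", "eat-PST"],
    translation="The dog ate the fish.",
)
print(item.aligned_morphemes())
```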

The development of this prototype system required the development of
several component pieces, including: (i) Xigt: A data model and XML
serialization for storing annotated data in a way that is both
efficient to work with computationally and flexible enough that we can
add enrichments to the annotations (such as grammatical analyses of
the English translations); (ii) Algorithms for inferring linguistic
properties of a language from annotated data, such as what type of case
marking system it has (if any) or how it orders the major
constituents of a sentence (subject, object and verb); (iii)
Algorithms for inferring the internal structure of words in a language
on the basis of annotated data; (iv) Methodologies for evaluating the
success of these algorithms.
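
Component (ii) can be illustrated with a minimal Python sketch that
guesses the dominant order of subject, object and verb by counting.
It assumes each sentence has already been reduced to a sequence of
role labels (`S`, `O`, `V`, or anything else), which is a
simplification; the function name and the toy input are hypothetical,
and the project's actual algorithms are not reproduced here.

```python
from collections import Counter


def dominant_order(sentences):
    """Return the most frequent relative order of S, O and V, or None.

    `sentences` is an iterable of role-label sequences (an assumed input
    format). Sentences without exactly one S, one O and one V are skipped.
    """
    counts = Counter()
    for roles in sentences:
        core = [r for r in roles if r in ("S", "O", "V")]
        if sorted(core) == ["O", "S", "V"]:  # exactly one of each role
            counts["".join(core)] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented toy input: "X" stands for any word that is not S, O or V.
toy = [
    ["S", "O", "V"],
    ["S", "X", "O", "V"],
    ["O", "S", "V"],
    ["S", "V"],          # skipped: no object
]
print(dominant_order(toy))
```

A simple majority count like this ignores ties and sparse data, so a
practical system would presumably need a threshold or a "no dominant
order" outcome.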

Last Modified: 09/16/2015
Modified by: Emily M Bender

Please report errors in award information by writing to: awardsearch@nsf.gov.
