Award Abstract # 1829290
Building a corpus of phonemic lexicons to study information theoretic universals

NSF Org: BCS
Division of Behavioral and Cognitive Sciences
Recipient: BROWN UNIVERSITY
Initial Amendment Date: August 24, 2018
Latest Amendment Date: February 2, 2024
Award Number: 1829290
Award Instrument: Continuing Grant
Program Manager: Rachel M. Theodore
rtheodor@nsf.gov
 (703)292-4770
SBE
 Directorate for Social, Behavioral and Economic Sciences
Start Date: September 1, 2018
End Date: May 31, 2024 (Estimated)
Total Intended Award Amount: $391,055.00
Total Awarded Amount to Date: $391,055.00
Funds Obligated to Date: FY 2018 = $191,117.00
FY 2019 = $199,938.00
History of Investigator:
  • Uriel Cohen Priva (Principal Investigator)
    uriel_cohen_priva@brown.edu
Recipient Sponsored Research Office: Brown University
1 PROSPECT ST
PROVIDENCE
RI  US  02912-9100
(401)863-2777
Sponsor Congressional District: 01
Primary Place of Performance: Brown University
Office of Sponsored Projects
Providence
RI  US  02912-9093
Primary Place of Performance Congressional District: 01
Unique Entity Identifier (UEI): E3FDXZ6TBHW3
Parent UEI: E3FDXZ6TBHW3
NSF Program(s): Linguistics
Primary Program Source: 01001819DB NSF RESEARCH & RELATED ACTIVIT
01001920DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1311, 9251, SMET
Program Element Code(s): 131100
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.075

ABSTRACT

Human language use reflects the nature of human communication. For instance, frequent words tend to have fewer sounds than infrequent ones, which facilitates quick production and understanding. However, little is known about more fine-grained distinctions. For example, English has more /k/ than /p/ sounds. Does that reflect a property of human language and its physiological and perceptual nature, or is it a historical accident? Answering such questions requires comparative data on the frequency and phonological makeup of words in many languages. This project will build on existing textual sources and word frequency lists to provide the phonological makeup of words in close to 200 low-resource languages. The phonological word lists will be an invaluable resource for understanding human language and will provide much-needed linguistic resources for low-resource languages. The outputs of the project will be made public and easily accessible, thereby assisting in documenting and teaching the processed languages, and in building computational linguistic resources such as text-to-speech engines.

The research team, including trained undergraduate and graduate students, will create rules to translate alphabets to phonemic representations for multiple languages. The team will then collect textual resources and word frequency lists from publicly available sources such as online Bibles, newspapers, and movie subtitles. The rules will be applied separately to each source, and the resulting phonological representations will be made publicly available, so that not only researchers but also the general public will be able to use and interact with the data. The researchers will then use the data to investigate whether the information theoretic properties of sounds are distributionally universal: do sounds tend to provide similar amounts of information cross-linguistically, and if so, does their information content correlate with their phonetic properties? Universality is an age-old question, and the similarities and differences of properties across languages can provide new insights into language use. Specifically, the researchers will use information theoretic properties to predict whether low information or other previously studied phonological properties are likely to promote consonant weakening in those languages.
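As an illustration of what "the amount of information a sound provides" can mean, the following minimal Python sketch computes the frequency-weighted mean surprisal of each segment given the preceding segment. It is an assumption-laden toy and not the project's actual measure: the function name segment_information, the choice of a one-segment context, the "#" word-boundary symbol, and the two-word lexicon are all hypothetical.

```python
import math
from collections import Counter


def segment_information(lexicon):
    """Mean surprisal (in bits) of each segment given the preceding segment.

    lexicon maps a tuple of segments (one word) to that word's token
    frequency. The one-segment context and the "#" word-boundary symbol
    are illustrative assumptions, not the project's definition.
    """
    context_counts = Counter()  # how often each context occurs
    pair_counts = Counter()     # how often each (context, segment) pair occurs
    for segments, freq in lexicon.items():
        prev = "#"  # hypothetical word-boundary marker
        for seg in segments:
            context_counts[prev] += freq
            pair_counts[(prev, seg)] += freq
            prev = seg

    total_bits = Counter()
    seg_counts = Counter()
    for (prev, seg), n in pair_counts.items():
        # Surprisal of seg in this context: -log2 P(seg | prev)
        surprisal = -math.log2(n / context_counts[prev])
        total_bits[seg] += n * surprisal
        seg_counts[seg] += n

    # Average over all token occurrences of each segment
    return {seg: total_bits[seg] / seg_counts[seg] for seg in seg_counts}


# Invented toy lexicon: word-initial /k/ is three times as frequent as /p/
toy_lexicon = {("k", "a"): 3, ("p", "a"): 1}
info = segment_information(toy_lexicon)
```

In this toy lexicon the rarer word-initial /p/ carries more information than /k/, and /a/ carries none because it is fully predictable from its context. With phonemic lexicons for many languages, the same kind of per-segment value could be compared cross-linguistically.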

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cohen Priva, Uriel and Strand, Emily "Schwa's duration and acoustic position in American English" Journal of Phonetics, v.96, 2023 https://doi.org/10.1016/j.wocn.2022.101198
Cohen Priva, Uriel and Yang, Shiying and Strand, Emily "The stability of segmental properties across genre and corpus types in low-resource languages" Proceedings of the Society for Computation in Linguistics, v.3, 2020 https://doi.org/10.7275/fttf-fq95

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

Over the past two decades, there has been an explosion of word-level data availability, even for low-resource languages. However, utilizing such resources at the sound level has been significantly hindered by the challenge of translating written language to sound-level annotations. This grant aimed to extend the availability of written data to the sound level by providing translation engines for languages with transparent alphabets, making the translation of written language to underlying sounds possible.


The output is the XPF Corpus, which consists of:

  • Two translation engines, in Python and in JavaScript.
  • Over two hundred translation schemes for languages from varied language families. Many of the languages included in the corpus have very limited resources, and their inclusion makes it possible to draw generalizations that rely less heavily on the few languages that benefit from high resource availability.
  • Expansive documentation for the code, translation schemes, and the rationale behind specific translation choices.
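
To give a sense of how a rule-based translation scheme for a transparent alphabet might work, here is a minimal Python sketch. It is not the XPF engine or its rule format: the RULES list, the translate function, and the invented orthography are hypothetical, and the real schemes are documented in the project repository.

```python
import re

# Hypothetical ordered rules for an invented transparent orthography.
# Each rule is (regular expression, phoneme). Earlier rules take priority,
# so multi-letter and context-sensitive rules come before general ones.
RULES = [
    (r"ch", "tʃ"),        # digraph
    (r"c(?=[ei])", "s"),  # <c> before a front vowel
    (r"c", "k"),          # <c> elsewhere
    (r"a", "a"),
    (r"e", "e"),
    (r"i", "i"),
    (r"n", "n"),
]


def translate(word, rules=RULES):
    """Translate a written word into a list of phonemic segments.

    Scans left to right and applies the first rule that matches at the
    current position. Raises ValueError for any letter no rule covers.
    """
    compiled = [(re.compile(pattern), phoneme) for pattern, phoneme in rules]
    word = word.lower()
    segments = []
    pos = 0
    while pos < len(word):
        for pattern, phoneme in compiled:
            match = pattern.match(word, pos)
            if match and match.end() > pos:
                segments.append(phoneme)
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches {word[pos]!r} in {word!r}")
    return segments


print(translate("chica"))
print(translate("cena"))
```

Ordering the rules from most to least specific is one simple way to handle digraphs and context-dependent letters. A letter that no rule covers raises an error rather than passing through silently, which helps catch gaps in a scheme.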

The corpus is already used by many researchers and will continue to facilitate such research in the future. Moreover, all the products of the corpus are available via a repository on GitHub (github.com/CohenPr-XPF/XPF) and to non-experts through an online interface at cohenpr-xpf.github.io/XPF/. Others can extend and alter the corpus as they see fit, ensuring its continued support and development.

The XPF Corpus is the result of collaborative efforts at Brown University building on NSF's support. This includes contributions from many undergraduate students who gained valuable experience in writing regular expressions, conducting phonological analyses, and processing data. Two honors theses were completed as part of this project. Similarly, the project shaped the training of two PhD students and their dissertations, supported a post-baccalaureate research assistant, and helped extend and widen the training of a postdoctoral fellow.

Because the corpus now exists as a resource, its contribution is not limited to the funding period. It will help the linguistics community, the speakers of the languages included in the corpus, and companies that aim to provide text-to-speech engines in the future.


Last Modified: 09/27/2024
Modified by: Uriel Cohen Priva