
NSF Org: |
BCS Division of Behavioral and Cognitive Sciences |
Recipient: |
|
Initial Amendment Date: | August 24, 2018 |
Latest Amendment Date: | February 2, 2024 |
Award Number: | 1829290 |
Award Instrument: | Continuing Grant |
Program Manager: |
Rachel M. Theodore
rtheodor@nsf.gov (703)292-4770 BCS Division of Behavioral and Cognitive Sciences SBE Directorate for Social, Behavioral and Economic Sciences |
Start Date: | September 1, 2018 |
End Date: | May 31, 2024 (Estimated) |
Total Intended Award Amount: | $391,055.00 |
Total Awarded Amount to Date: | $391,055.00 |
Funds Obligated to Date: |
FY 2019 = $199,938.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: |
1 PROSPECT ST PROVIDENCE RI US 02912-9100 (401)863-2777 |
Sponsor Congressional District: |
|
Primary Place of Performance: |
Office of Sponsored Projects Providence RI US 02912-9093 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | Linguistics |
Primary Program Source: |
01001920DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.075 |
ABSTRACT
Human language use reflects the nature of human communication. For instance, frequent words tend to have fewer sounds than infrequent ones, which facilitates quick production and understanding. However, little is known about more fine-grained distinctions. For instance, English has more /k/ than /p/ sounds. Does that reflect a property of human language and its physiological and perceptual nature, or is it a historical accident? Answering such questions requires comparative data on the frequency and phonological makeup of words in many languages. This project will build on existing textual sources and word frequency lists to provide the phonological makeup of words in close to 200 low-resource languages. The phonological word lists will be an invaluable resource for understanding human language and will supply much-needed linguistic resources for low-resource languages. The outputs of the project will be made public and easily accessible, thereby assisting in documenting and teaching the processed languages, and in building computational linguistic resources such as text-to-speech engines.
The research team, including trained undergraduate and graduate students, will create rules to translate alphabets to phonemic representations for multiple languages. The team will then collect textual resources and word frequency lists from publicly available sources such as online Bibles, newspapers, and movie subtitles. The rules will be applied separately to each source and the resulting phonological representations will be made publicly available, such that not only researchers but also the general public will be able to use and interact with the data. The researchers will proceed to use the data to investigate whether the information-theoretic properties of sounds have distributional universality: do sounds tend to provide similar amounts of information cross-linguistically, and if so, does their information content correlate with their phonetic properties? Universality is an age-old question, and the similarities and differences of properties across languages can provide new insights into language use. Specifically, the researchers will use information-theoretic properties to predict whether low information or other previously studied phonological properties are likely to promote consonant weakening in those languages.
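As a rough illustration of what "amount of information" can mean for a sound, the sketch below computes a frequency-weighted unigram surprisal, -log2 p(sound), from a phonemically transcribed word list. This is a deliberately simplified measure chosen for illustration; the abstract does not specify which information-theoretic measures the project uses, and the function name `phoneme_surprisal` and the toy lexicon are hypothetical.

```python
import math
from collections import Counter


def phoneme_surprisal(lexicon):
    """Return the unigram surprisal (in bits) of each phoneme.

    `lexicon` maps a tuple of phonemes to a word frequency.
    This is a simplification: it ignores context and word position.
    """
    counts = Counter()
    for phonemes, freq in lexicon.items():
        for phoneme in phonemes:
            counts[phoneme] += freq  # weight each sound by word frequency
    total = sum(counts.values())
    return {p: -math.log2(c / total) for p, c in counts.items()}


# Invented toy data, not drawn from any real language or corpus.
toy_lexicon = {("k", "a"): 3, ("p", "a"): 1}
surprisal = phoneme_surprisal(toy_lexicon)
for phoneme, bits in sorted(surprisal.items()):
    print(f"/{phoneme}/: {bits:.3f} bits")
```

In this toy lexicon /k/ is more frequent than /p/, so it carries less information per occurrence. Comparing such values for the same sound across many languages is one simple way to frame the universality question.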
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Over the past two decades, there has been an explosion in the availability of word-level data, even for low-resource languages. However, using such resources at the sound level has been significantly hindered by the challenge of translating written language to sound-level annotations. This grant aimed to extend the availability of written data to the sound level by providing translation engines for languages with transparent alphabets, making it possible to translate written language to its underlying sounds.
The output is the XPF Corpus, which consists of:
- Two translation engines, in Python and in JavaScript.
- Over two hundred translation schemes for languages from varied language families. Many of the languages included in the corpus have very limited resources, and their inclusion makes it possible to draw generalizations that rely less heavily on the few languages that benefit from high resource availability.
- Expansive documentation for the code, translation schemes, and the rationale behind specific translation choices.
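To give a sense of how a rule-based translation from spelling to sounds can work, here is a minimal sketch using ordered regular-expression rules. It is not the XPF engine and does not use the XPF rule format: the function `translate`, the rule list `TOY_RULES`, and the toy orthography are all assumptions made for illustration.

```python
import re

# Hypothetical rules for an invented toy orthography (not from the XPF Corpus).
# Each rule is (regex, phoneme); earlier rules take priority over later ones.
TOY_RULES = [
    (r"ch", "tʃ"),          # digraph before single letters
    (r"sh", "ʃ"),
    (r"c(?=[ei])", "s"),    # context-sensitive: "c" before "e" or "i"
    (r"c", "k"),            # "c" elsewhere
    (r"[a-z]", None),       # fallback: the letter stands for itself
]


def translate(word, rules=TOY_RULES):
    """Translate a written word into a list of phonemes, left to right."""
    compiled = [(re.compile(pattern), phoneme) for pattern, phoneme in rules]
    word = word.lower()
    phonemes = []
    position = 0
    while position < len(word):
        for pattern, phoneme in compiled:
            match = pattern.match(word, position)
            if match:
                phonemes.append(match.group(0) if phoneme is None else phoneme)
                position = match.end()
                break
        else:
            # No rule applies: fail loudly instead of guessing a sound.
            raise ValueError(f"no rule matches {word[position]!r} in {word!r}")
    return phonemes


print(translate("chic"))
print(translate("cent"))
```

Ordering the rules from most to least specific keeps digraphs and context-sensitive spellings from being consumed by the single-letter fallback. An approach like this is only adequate for transparent alphabets, where spelling predicts pronunciation reliably.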
The corpus is already used by many researchers and will continue to facilitate such research in the future. Moreover, all the products of the corpus are available via a repository on GitHub (github.com/CohenPr-XPF/XPF) and to non-experts through an online interface at cohenpr-xpf.github.io/XPF/. Others can extend and alter the corpus as they see fit, ensuring its continued support and development.
The XPF Corpus is the result of collaborative efforts at Brown University building on NSF's support. This includes contributions from many undergraduate students who gained valuable experience in writing regular expressions, conducting phonological analyses, and processing data. Two honors theses were completed as part of this project. Similarly, the project shaped the training of two PhD students and their dissertations, supported a post-baccalaureate research assistant, and helped extend and widen the training of a postdoctoral fellow.
Because the corpus is an existing resource, its contribution is not limited to the funding period. It will help the linguistics community, the speakers of the languages included in the corpus, and companies that aim to provide text-to-speech engines in the future.
Last Modified: 09/27/2024
Modified by: Uriel Cohen Priva