NSF Award Search: Award # 0745914

Award Abstract # 0745914

CAREER: Robust Parsing for New Domains and Languages

NSF Org:	IIS Division of Information & Intelligent Systems
Recipient:	UNIVERSITY OF PITTSBURGH - OF THE COMMONWEALTH SYSTEM OF HIGHER EDUCATION
Initial Amendment Date:	April 3, 2008
Latest Amendment Date:	June 13, 2012
Award Number:	0745914
Award Instrument:	Continuing Grant
Program Manager:	Tatiana Korelsky, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date:	July 1, 2008
End Date:	June 30, 2014 (Estimated)
Total Intended Award Amount:	$499,975.00
Total Awarded Amount to Date:	$499,975.00
Funds Obligated to Date:	FY 2008 = $297,682.00; FY 2011 = $94,279.00; FY 2012 = $108,014.00
History of Investigator:	Rebecca Hwa (Principal Investigator) rebecca.hwa@gwu.edu
Recipient Sponsored Research Office:	University of Pittsburgh, 4200 FIFTH AVENUE, PITTSBURGH, PA, US 15260-0001, (412)624-7400
Sponsor Congressional District:	12
Primary Place of Performance:	University of Pittsburgh, 4200 FIFTH AVENUE, PITTSBURGH, PA, US 15260-0001
Primary Place of Performance Congressional District:	12
Unique Entity Identifier (UEI):	MKAGLD59JRL1
Parent UEI:
NSF Program(s):	Robust Intelligence
Primary Program Source:	01000809DB NSF RESEARCH & RELATED ACTIVIT; 01001112DB NSF RESEARCH & RELATED ACTIVIT; 01001213DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1045, 1187, 7495, 9102, 9215, HPCC
Program Element Code(s):	749500
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.070

ABSTRACT

To facilitate linguistic communications, natural language processing (NLP) technologies must be applicable to different languages across different domains. A limitation of many NLP systems is that they do not perform as well on data types that diverge from their training examples. The objective of this CAREER project is to increase the robustness and coverage of a fundamental NLP component, the syntactic parser.

Specifically, this project explores adaptation methods to extend a standard English parser for processing different domains (e.g., scientific literature, emails) and different languages (e.g., Chinese). Three types of correspondences are considered. First, if coarse-level correspondences are explicit in the data (e.g., bilingual documents), finer-grained correspondences at the word or phrase level may be inferred, and semi-supervised learning may be used to transfer domain knowledge across the inferred correspondences. Second, if the correspondences are inexact (e.g., multiple translations of varying quality), the mismatched portions may be identified and transformed to achieve a closer mapping. Third, if the correspondences are indirect, methods for inducing correspondences from non-parallel corpora may be appropriate.

Parser adaptation stands to increase the range of NLP applications; examples include data mining from medical documents and automatic tutoring for non-English speakers. As the project aims to bring together several strands of research, it offers ample research opportunities to graduate and undergraduate students. The algorithmic aspects encourage a synthesis of ideas from semi-supervised learning, relational data modeling, grammar induction, and machine translation; the empirical aspects give students an arena in which to hone their skills in sound scientific methodology.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project tackles the challenges of using computers to automatically parse written sentences. While advances in natural language processing (NLP) have led to high-quality parsers for standard newspaper English, computer systems face substantial obstacles when processing sentences from more diverse sources, including the writings of English-as-a-Second-Language (ESL) learners, informal posts on social media, writings from specialized domains such as legal documents and scientific literature, and computer-generated sentences such as the outputs of machine translation (MT) systems. The main problem is that these sentences diverge significantly from the example sentences used to develop the system. This project addresses the problem by framing it as a machine translation task: how should we model the relationship between a wider population of English expressions and "newspaper" sentences? The investigation has focused especially on two domains: the writings of ESL learners, which often contain grammar mistakes and usage errors, and the outputs of MT systems, which often contain a wider variety of garbled phrases and disfluencies than ESL writing does. The work has led to the following three main outcomes.

The Chinese Room System: A visualization interface has been developed to bridge between an imperfect MT system and a human user who cannot read the source language. Through a visual display of various linguistic resources, the system helps the user to correct and improve MT outputs. (The system currently supports Chinese-English and Arabic-English, but it can be extended to arbitrary language pairs.) In addition to providing a service to users who wish to understand a document in a foreign language they cannot read, the Chinese Room System is also an instrument for collecting and analyzing the relationship between garbled MT outputs and the intended translation expressed in well-formed English.

Computational Models of Common Writing Problems of ESL Learners: Several systems have been developed to identify common errors that non-native learners of English make. One is a predictor of preposition usage; it differs from existing systems in that its development requires fewer training examples. Another is a predictor of redundant words and phrases (e.g., in the phrase "ruby red slippers," the word "red" is redundant); leveraging translations to other languages, the system identifies those words whose meanings have already been conveyed by other words. A third is a model of correction detection: given an ill-formed sentence and its corresponding revision, it detects the locations of the mistakes and the reasons for them. By modeling the relationship between neighboring individual changes, the system segmented corrections more accurately and improved upon a previous system's published results.
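
As a rough illustration of the correction-detection task described above (locating the changes between an ill-formed sentence and its revision, and grouping neighboring changes into one correction), the sketch below aligns the two sentences at the token level with Python's standard difflib module. It is not the project's model: the function name detect_corrections and the max_gap parameter are hypothetical, the merging rule is a simple heuristic assumed for illustration, and the reasons for the mistakes are not modeled at all.

```python
from difflib import SequenceMatcher


def detect_corrections(original, revised, max_gap=1):
    """Return (original_span, revised_span) pairs of token-level corrections.

    Hypothetical heuristic: edits separated by at most `max_gap` unchanged
    tokens of the original sentence are merged into a single correction.
    """
    src = original.split()
    tgt = revised.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)

    spans = []  # each entry: (src_start, src_end, tgt_start, tgt_end)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged tokens are not corrections
        if spans and i1 - spans[-1][1] <= max_gap:
            # Close enough to the previous edit: extend that correction.
            prev = spans[-1]
            spans[-1] = (prev[0], i2, prev[2], j2)
        else:
            spans.append((i1, i2, j1, j2))

    return [(" ".join(src[a:b]), " ".join(tgt[c:d])) for a, b, c, d in spans]


if __name__ == "__main__":
    print(detect_corrections("He go to school yesterday",
                             "He went to school yesterday"))
```

Setting max_gap=0 keeps every individual edit separate, while larger values treat nearby edits as parts of the same correction; a learned model of how neighboring changes relate would replace this fixed threshold.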

Computational Models of Relationships between MT and the writings of ESL Learners: Current machine translation systems and second-language learners share some similarities: they both have an imperfect grasp of the target language; some aspects of a learner's native language (or an MT system's source language) might get transmitted and appear as disfluent artifacts in their English expressions. In this project, several computational models (as well as relevant data sets) have been designed and developed to help us gain a better understanding of the relationships between the types of mistakes made by MT systems and those made by ESL learners. First, a quasi-synchronous grammar model, a mathematical model previously used for MT, has been adapted for "translating" from problematic ESL sentences to their corrections. Mathematically, the model treats "ESL English" as a foreign language like French or Chinese. The implication is that mistakes mad...

