NSF Award Search: Award # 1420785 - Doctoral Dissertation: Investigating the role of grammatical representation in language learnability

Award Abstract # 1420785

Doctoral Dissertation: Investigating the role of grammatical representation in language learnability

NSF Org:	BCS Division of Behavioral and Cognitive Sciences
Recipient:	MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Initial Amendment Date:	July 8, 2014
Latest Amendment Date:	July 8, 2014
Award Number:	1420785
Award Instrument:	Standard Grant
Program Manager:	William Badecker, BCS Division of Behavioral and Cognitive Sciences, SBE Directorate for Social, Behavioral and Economic Sciences
Start Date:	July 15, 2014
End Date:	December 31, 2015 (Estimated)
Total Intended Award Amount:	$11,710.00
Total Awarded Amount to Date:	$11,710.00
Funds Obligated to Date:	FY 2014 = $11,710.00
History of Investigator:	Edward Gibson (Principal Investigator), egibson@mit.edu; Leon Bergen (Co-Principal Investigator)
Recipient Sponsored Research Office:	Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139-4301, US, (617) 253-1000
Sponsor Congressional District:	07
Primary Place of Performance:	Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139-4301, US
Primary Place of Performance Congressional District:	07
Unique Entity Identifier (UEI):	E2NYLCDML6V1
Parent UEI:	E2NYLCDML6V1
NSF Program(s):	Linguistics
Primary Program Source:	01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s):	1311, 9179, SMET
Program Element Code(s):	131100
Award Agency Code:	4900
Fund Agency Code:	4900
Assistance Listing Number(s):	47.075

ABSTRACT

Technologies that process natural language have become ubiquitous in the last decade. Web search engines, for example, process billions of pages of text in order to determine which of those pages best match a user's search query. Many interfaces for interacting with computers -- for example, Apple's Siri personal assistant -- take voice-issued commands from their users, and must process these commands in order to follow the users' instructions. Finally, machine translation technologies have become available for many of the world's most common languages, allowing users to automatically translate text that they find in foreign books or websites. These technologies mostly rely on simple models of language, known as n-gram models or context-free grammars, which were developed in the 1950s and 1960s and refined in later decades. These simple models of language have many advantages, most notably that they can be used to process large amounts of data very quickly. Because of their simplicity, however, these models are not able to capture many aspects of meaning in natural language. This has resulted in limitations for the technologies discussed above: virtual personal assistants are only able to process very simple types of instructions, and machine translation is still far from being as accurate as human translation. In the current project, Leon Bergen and Dr. Edward Gibson will be investigating more sophisticated kinds of language models, with the goal of increasing the ability of computers to understand language.

Under the direction of Dr. Gibson, Mr. Bergen will be studying language models known as mildly context-sensitive grammars. These grammars are able to express certain types of linguistic knowledge that humans have, but which cannot be expressed using simpler types of grammatical formalisms. For example, native speakers of English know that a declarative sentence like "Mary kicked the ball" is closely related in meaning to the question "What did Mary kick?" Although this fact seems obvious, it is difficult (or impossible) to express using simple types of grammars. However, mildly context-sensitive grammars can be used to express this knowledge in a very natural way. Mr. Bergen and Dr. Gibson will be studying whether mildly context-sensitive grammars can be automatically learned from examples of grammatical sentences. To do this, they will be using techniques from machine learning, a branch of computer science and statistics that develops algorithms that can automatically learn from data. The researchers will integrate these learning algorithms with their grammatical formalism, and will test whether their method learns an accurate grammar. The accuracy of the grammar will be evaluated using a corpus -- a collection of sentences -- in which every sentence has been manually annotated with its correct grammatical structure. If accurate mildly context-sensitive grammars can be learned in this manner, this would provide a potential method for improving the natural language processing technologies discussed above. In particular, because this method does not require an expert to write down the complete grammar for a language, it has the potential to be deployed without tremendous engineering effort, and may be deployed easily in other languages.

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project aimed to investigate the relationship between grammatical representations and language learning. As an example of this relationship, consider the following sentence:

1) John put the socks in the drawer.

In example (1), there are three phrases associated with the verb "put": "John" (who is doing the putting), "the socks" (which are being put), and "in the drawer" (which is the destination for the socks). If any of these phrases is removed, the sentence becomes ungrammatical. For example, if "in the drawer" is removed, the sentence becomes:

1a) #John put the socks.

This indicates that "put" requires three phrases for it to be grammatical. In linguistic terminology, this means that there are three parts to its argument structure: something doing the putting, something being put, and a destination for the putting. More generally, the argument structure for a word is a specification of what types of phrases are required for it to be used grammatically.

Not all phrases are part of the argument structure of a word. Consider the following example:

2) John put the socks in the drawer before dinner.

In this sentence, the phrase "before dinner" is grammatically optional; the sentence is still grammatical if it is removed. This phrase is therefore not part of the argument structure of the sentence, but rather is called a modifier of the sentence.
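
The distinction between arguments and modifiers in examples (1), (1a), and (2) can be illustrated with a minimal Python sketch. The lexicon, the role labels ("agent", "theme", "destination", "time"), and the function names are hypothetical and chosen only for illustration; they are not taken from the project's model.

```python
# Hypothetical toy lexicon: each verb lists the roles it requires.
# The role names are illustrative assumptions, not the project's notation.
ARGUMENT_STRUCTURE = {
    "put": ("agent", "theme", "destination"),
}


def missing_arguments(verb, phrases):
    """Return the required roles of `verb` that no phrase in the sentence fills."""
    return [role for role in ARGUMENT_STRUCTURE[verb] if role not in phrases]


def is_grammatical(verb, phrases):
    """A sentence is acceptable here if every required role is filled.

    Extra roles (modifiers such as "time") are ignored, so they stay optional.
    """
    return not missing_arguments(verb, phrases)


# (1) John put the socks in the drawer.
sentence_1 = {"agent": "John", "theme": "the socks", "destination": "in the drawer"}
# (1a) John put the socks.
sentence_1a = {"agent": "John", "theme": "the socks"}
# (2) John put the socks in the drawer before dinner.
sentence_2 = dict(sentence_1, time="before dinner")

print(is_grammatical("put", sentence_1))      # all three arguments present
print(missing_arguments("put", sentence_1a))  # the destination is missing
print(is_grammatical("put", sentence_2))      # the modifier does not matter
```

In this sketch the argument structure is written down by hand; the research question discussed in this report is how such entries could be learned from sentences instead.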

Fluent speakers of English know the argument structure for "put", and for the other words in the language. For the different phrases in a sentence, they can determine whether that phrase is a mandatory argument, or whether it is an optional modifier. In our research, we wanted to address two questions. First, how might people learn the argument structures in their language? Second, if we wanted to design a computer system which could speak English or another language, how could that system learn the language's argument structures?

In order to address these questions, we developed a computational model of this argument-structure learning task. This model combines grammatical representations developed in linguistics with Bayesian machine learning techniques. It classifies a phrase as an argument when doing so leads to greater compression of the sentences that have been observed. We found that this computational model learns to substantially improve its classification of argument structure.
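
The idea of choosing the classification that compresses the observed sentences best can be sketched with a toy description-length comparison. This is not the project's model: the function names, the fixed lexicon cost of 8 bits, the 1% noise rate, and the coin-flip treatment of modifiers are all assumptions made only to keep the example small.

```python
import math


def description_length_bits(present_count, total_count, as_argument,
                            lexicon_cost_bits=8.0, noise=0.01):
    """Bits needed to encode, for each sentence with the verb, whether the
    phrase type is present, under one of two hypotheses (toy assumptions)."""
    absent_count = total_count - present_count
    if as_argument:
        # Argument hypothesis: pay once to store the requirement in the
        # lexicon; afterwards the phrase is expected almost every time.
        p_present = 1.0 - noise
        model_cost = lexicon_cost_bits
    else:
        # Modifier hypothesis: nothing is stored; presence is treated as
        # a fair coin flip, costing one bit per sentence.
        p_present = 0.5
        model_cost = 0.0
    data_cost = -(present_count * math.log2(p_present)
                  + absent_count * math.log2(1.0 - p_present))
    return model_cost + data_cost


def classify(present_count, total_count):
    """Pick the hypothesis that gives the shorter total description."""
    as_arg = description_length_bits(present_count, total_count, True)
    as_mod = description_length_bits(present_count, total_count, False)
    return "argument" if as_arg < as_mod else "modifier"


print(classify(20, 20))  # phrase always present in 20 sentences
print(classify(5, 20))   # phrase present in only 5 of 20 sentences
print(classify(3, 3))    # always present, but only 3 sentences observed
```

Under these assumptions, a phrase that always accompanies the verb is classified as an argument only once enough sentences have been seen to pay for storing the requirement, which is the general trade-off behind compression-based learning.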

In a second part of the project, we investigated how people and computers can learn to understand questions. Consider the following sequence of questions:

3a) Who did the man see?

3b) Who did the man who plays the trumpet see?

3c) Who did the man who plays the trumpet which was stolen see?

In each of these questions, the speaker is asking about who was seen. Intuitively, the wh-term "Who" is linked to the verb "see" at the end of the sentence, and serves as the argument of this verb. As the sequence (3a-c) illustrates, the wh-term in a question and the verb it is linked to can be separated by linguistic material of arbitrarily large complexity. For this reason, the dependency between the wh-term and the verb is known as a long-distance dependency.
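
The growing separation in (3a-c) can be made concrete with a short Python sketch that counts the words between the wh-term and the verb it is linked to. The helper name `dependency_distance` is hypothetical, and the linked verb is supplied by hand rather than discovered, so this only measures the dependency and does not resolve it.

```python
def dependency_distance(question, wh_term="who", verb="see"):
    """Count the words between the first `wh_term` and the given `verb`.

    The verb is passed in explicitly: working out which verb the wh-term
    is linked to is the hard problem, and this helper does not solve it.
    """
    words = [w.strip("?.,").lower() for w in question.split()]
    return words.index(verb) - words.index(wh_term) - 1


questions = [
    "Who did the man see?",
    "Who did the man who plays the trumpet see?",
    "Who did the man who plays the trumpet which was stolen see?",
]

for q in questions:
    print(dependency_distance(q), q)
```

Each added relative clause pushes "see" further from "Who", and nothing in the grammar of English puts an upper bound on that count.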

There is a challenge in understanding these sentences: how does one determine that the wh-term "who" is associated with the verb "see," rather than the other verbs in the sentence? More generally, how do people learn to resolve these long-distance dependencies, and how can we build computer systems which learn to resolve them?

In order to try to address these questions, we developed a computational model for learning the structure of questions, and long-distance dependencies more generally. This model uses a grammatical formalism for representing long-distance dependencies known as minimalist grammars, and combines this with ...

Please report errors in award information by writing to: awardsearch@nsf.gov.
