Award Abstract # 0836431
SGER: Scaling up unsupervised grammar induction

NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: CARNEGIE MELLON UNIVERSITY
Initial Amendment Date: July 11, 2008
Latest Amendment Date: June 30, 2009
Award Number: 0836431
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky
IIS (Division of Information & Intelligent Systems)
CSE (Directorate for Computer and Information Science and Engineering)
Start Date: July 1, 2008
End Date: December 31, 2009 (Estimated)
Total Intended Award Amount: $0.00
Total Awarded Amount to Date: $212,721.00
Funds Obligated to Date: FY 2008 = $146,721.00; FY 2009 = $66,000.00
History of Investigator:
  • Noah Smith (Principal Investigator)
    noah@allenai.org
Recipient Sponsored Research Office: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3815
(412)268-8746
Sponsor Congressional District: 12
Primary Place of Performance: Carnegie-Mellon University
5000 FORBES AVE
PITTSBURGH
PA  US  15213-3815
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): U3NKNFLNQ613
Parent UEI: U3NKNFLNQ613
NSF Program(s): Robust Intelligence
Primary Program Source: 01000809DB NSF RESEARCH & RELATED ACTIVITIES
01000910DB NSF RESEARCH & RELATED ACTIVITIES
Program Reference Code(s): 7495, 9215, 9237, HPCC
Program Element Code(s): 749500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

This SGER project seeks to determine the scalability of computationally intensive, iterative statistical learning algorithms on a MapReduce architecture. Such algorithms underlie much research in natural language processing, yet their scalability to even moderately large training datasets (text corpora) has been under-explored. On the surface, scaling to more data appears to be a good fit for the MapReduce paradigm, and this exploratory project aims to identify whether such algorithms benefit from larger and more complex training data than have been used in prior work. A special emphasis is given to unsupervised learning algorithms, such as the Expectation-Maximization (EM) algorithm, which have been widely studied on small problems but rarely on large ones. The same techniques apply to many other learning methods as well.
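
To make the decomposition concrete (a minimal sketch for exposition, not code from the project): the E step of EM computes expected sufficient statistics independently for each training example, which corresponds to MapReduce's map phase; summing those statistics is a reduce; and the M step renormalizes the totals. The toy two-component word-mixture model, the data, and all names below are assumptions for illustration.

```python
# Sketch: one EM iteration for a two-component mixture of categorical
# word distributions, structured as map (per-document expected counts),
# reduce (count summation), and an M step (renormalization).
import math
import random
from collections import Counter

K = 2  # number of mixture components (an assumption for this toy model)

def e_step_map(doc, log_prior, log_theta):
    """Map: posterior responsibilities and expected counts for one document."""
    log_post = [log_prior[k] + sum(log_theta[k][w] for w in doc) for k in range(K)]
    m = max(log_post)
    norm = m + math.log(sum(math.exp(lp - m) for lp in log_post))
    post = [math.exp(lp - norm) for lp in log_post]
    counts = Counter()
    for k in range(K):
        counts["prior", k] += post[k]
        for w in doc:
            counts["word", k, w] += post[k]
    return counts

def reduce_sum(partial_counts):
    """Reduce: sum expected counts across documents."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

def m_step(total, vocab):
    """M step: renormalize expected counts into new parameters."""
    z = sum(total["prior", k] for k in range(K))
    log_prior = [math.log(total["prior", k] / z) for k in range(K)]
    log_theta = []
    for k in range(K):
        zk = sum(total["word", k, w] for w in vocab) or 1.0
        log_theta.append({w: math.log((total["word", k, w] or 1e-12) / zk)
                          for w in vocab})
    return log_prior, log_theta

if __name__ == "__main__":
    docs = [["the", "cat"], ["the", "dog"], ["a", "cat"], ["a", "dog"]]
    vocab = sorted({w for d in docs for w in d})
    random.seed(0)
    raw = [{w: random.uniform(0.5, 1.5) for w in vocab} for _ in range(K)]
    log_theta = [{w: math.log(r[w] / sum(r.values())) for w in vocab} for r in raw]
    log_prior = [math.log(1.0 / K)] * K
    for _ in range(20):
        total = reduce_sum(e_step_map(d, log_prior, log_theta) for d in docs)
        log_prior, log_theta = m_step(total, vocab)
    print([math.exp(lp) for lp in log_prior])
```

In a real MapReduce job the reduce would itself be distributed, keyed on the count identifiers, rather than run as a single serial loop as here.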

At the same time, the project seeks to explore how to leverage supercomputers and MapReduce to make these learning algorithms faster, permitting a faster research cycle. Concretely, the "E step" (or its analogue) is the most computationally demanding part of an iteration, but the standard assumption that the training data are independently and identically distributed permits parallelization. To the extent that this parallelization is not dominated by network and input-output overhead, each iteration of training may be made faster, perhaps reducing training time from days or weeks to hours. This project explores this tradeoff and others like it.
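
A minimal sketch of that data-parallel E step, under stated assumptions: `expected_counts` is a hypothetical stand-in for a real E step (it emits whole counts rather than fractional ones), and the shard layout and worker count are illustrative. The serial summation at the end is exactly where network and input-output overhead would bite at cluster scale.

```python
# Sketch: shard the training data, compute per-shard statistics in
# parallel (map phase), then sum the partial counts (reduce phase).
from collections import Counter
from multiprocessing import Pool

def expected_counts(shard):
    """Per-shard "map": compute placeholder sufficient statistics."""
    c = Counter()
    for sentence in shard:
        for w in sentence:
            c[w] += 1.0  # a real E step would add fractional expected counts
    return c

def parallel_e_step(shards, workers=4):
    with Pool(workers) as pool:
        partials = pool.map(expected_counts, shards)  # parallel map phase
    total = Counter()
    for p in partials:  # reduce phase: sum the per-shard statistics
        total.update(p)
    return total

if __name__ == "__main__":
    data = [["the", "cat", "sat"], ["a", "dog", "ran"]] * 1000
    shards = [data[i::4] for i in range(4)]  # 4 disjoint shards
    print(parallel_e_step(shards).most_common(2))
```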

This work leverages a resource donated by Yahoo for use by the PI's research group: a 4,000-node supercomputer running Hadoop (an open-source implementation of MapReduce).
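
For flavor only: Hadoop Streaming lets the mapper and reducer be arbitrary executables that read stdin and write tab-separated key/value lines, so one iteration's count aggregation could look roughly like the sketch below. The file name, the single-file map/reduce dispatch, and the placeholder count of 1.0 are assumptions, not the project's actual code.

```python
#!/usr/bin/env python
# em_counts.py -- hypothetical Hadoop Streaming job for one iteration's
# count aggregation. Run with "map" or "reduce" as the sole argument,
# e.g. via: hadoop jar hadoop-streaming.jar \
#            -mapper "em_counts.py map" -reducer "em_counts.py reduce" ...
import sys

def mapper():
    # One training sentence per input line; emit (word, expected_count).
    for line in sys.stdin:
        for w in line.split():
            print(f"{w}\t1.0")  # placeholder for a fractional E-step count

def reducer():
    # Hadoop sorts map output by key, so each word's counts arrive together.
    current, total = None, 0.0
    for line in sys.stdin:
        key, val = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0.0
        total += float(val)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```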

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

André F. T. Martins, Noah A. Smith, and Eric P. Xing. "Polyhedral Outer Approximations with Application to Natural Language Parsing." Proceedings of the International Conference on Machine Learning, 2009.
Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. "Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation." Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference, 2009.
Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. "Wider Pipelines: N-Best Alignments and Parses in MT Training." Proceedings of the Conference of the Association for Machine Translation in the Americas, 2008.
Kevin Gimpel and Noah A. Smith. "Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings." Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, 2009.
Shay B. Cohen and Noah A. Smith. "Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction." Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference, 2009.
Shay B. Cohen, Kevin Gimpel, and Noah A. Smith. "Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction." Advances in Neural Information Processing Systems 21, 2008.
Tae Yano, William W. Cohen, and Noah A. Smith. "Predicting Response to Political Blog Posts with Topic Models." Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference, 2009.
