
NSF Org: IIS (Division of Information & Intelligent Systems)
Recipient: Carnegie Mellon University
Initial Amendment Date: July 11, 2008
Latest Amendment Date: June 30, 2009
Award Number: 0836431
Award Instrument: Standard Grant
Program Manager: Tatiana Korelsky, IIS Division of Information & Intelligent Systems, CSE Directorate for Computer and Information Science and Engineering
Start Date: July 1, 2008
End Date: December 31, 2009 (Estimated)
Total Intended Award Amount: $0.00
Total Awarded Amount to Date: $212,721.00
Funds Obligated to Date: FY 2009 = $66,000.00
Recipient Sponsored Research Office: 5000 Forbes Ave, Pittsburgh, PA 15213-3815, US; (412) 268-8746
Primary Place of Performance: 5000 Forbes Ave, Pittsburgh, PA 15213-3815, US
NSF Program(s): Robust Intelligence
Primary Program Source: 01000910DB NSF RESEARCH & RELATED ACTIVITIES
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070
ABSTRACT
This SGER project seeks to determine the scalability of computationally intensive, iterative statistical learning algorithms on a MapReduce architecture. Such algorithms underlie much research in natural language processing, yet their scalability to even moderately large training datasets (text corpora) has been under-explored. On the surface, scaling to more data appears to be a good fit for the MapReduce paradigm, and this exploratory project aims to identify whether such algorithms benefit from more, and more complex, data than were used in prior work. Special emphasis is given to unsupervised learning algorithms, such as the Expectation-Maximization (EM) algorithm, which have been widely studied on small problems but rarely on large ones; the approach applies to many other methods as well.
At the same time, the project seeks to explore how to leverage supercomputers and MapReduce to make these learning algorithms faster, permitting a faster research cycle. Concretely, the "E step" (or its analogue) is the most computationally demanding part of each iteration, but the standard assumption that the training examples are independently and identically distributed permits it to be parallelized across the data. To the extent that this parallelization is not swamped by network and input/output overhead, each iteration of training may be made faster, perhaps reducing training time from days or weeks to hours. This project explores this tradeoff and others like it.
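As a concrete illustration (a minimal sketch, not code from the project), the fragment below parallelizes the E step of EM for a toy two-coin Bernoulli mixture. Because the examples are assumed i.i.d., each worker computes expected sufficient statistics for its shard of the data (the "map"), and the statistics are summed (the "reduce") before the closed-form M step. Python's multiprocessing stands in for a MapReduce cluster, and all names and parameter values are illustrative.

```python
from multiprocessing import Pool

def e_step_shard(args):
    """Map: expected sufficient statistics for one shard of the data."""
    shard, (pi, p_a, p_b) = args
    n_a = ha = fa = hb = fb = 0.0
    for heads, flips in shard:
        like_a = pi * p_a**heads * (1 - p_a)**(flips - heads)
        like_b = (1 - pi) * p_b**heads * (1 - p_b)**(flips - heads)
        r = like_a / (like_a + like_b)  # posterior responsibility of coin A
        n_a += r
        ha += r * heads
        fa += r * flips
        hb += (1 - r) * heads
        fb += (1 - r) * flips
    return (n_a, ha, fa, hb, fb)

def em(data, params=(0.5, 0.6, 0.4), shards=4, iters=25):
    chunks = [data[i::shards] for i in range(shards)]
    with Pool(shards) as pool:
        for _ in range(iters):
            # "Map" the E step over shards, then "reduce" by summing statistics.
            parts = pool.map(e_step_shard, [(c, params) for c in chunks])
            n_a, ha, fa, hb, fb = (sum(t) for t in zip(*parts))
            # M step: closed-form parameter updates from the pooled statistics.
            params = (n_a / len(data), ha / fa, hb / fb)
    return params

if __name__ == "__main__":
    # Toy data: (heads, flips) counts from two coins with biases near 0.8 and 0.2.
    data = [(8, 10), (9, 10), (7, 10), (2, 10), (3, 10), (1, 10)] * 50
    print(em(data))  # expect roughly (0.5, 0.8, 0.2)
```

Per-iteration wall-clock time in such a scheme is the slowest shard's E step plus the cost of shipping parameters out and statistics back, which is exactly the overhead-versus-speedup tradeoff the project investigates.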
This work leverages a resource donated by Yahoo for use by the PI's research group: a 4,000-node supercomputer running Hadoop (an open-source implementation of MapReduce).
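On Hadoop, the same decomposition is commonly expressed as one streaming job per EM iteration: a mapper emits partial expected counts keyed by statistic name, and a reducer sums them; a driver script then runs the M step and submits the next iteration. The sketch below shows that pattern for the same toy mixture; it is an assumed setup, not the project's code, and shipping parameters via hard-coded constants stands in for a side file or job configuration.

```python
#!/usr/bin/env python
# mapper.py -- hypothetical Hadoop Streaming mapper for one EM iteration.
# Reads "heads<TAB>flips" lines from stdin.
import sys

PI, P_A, P_B = 0.5, 0.6, 0.4  # assumed: broadcast from the previous iteration

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    heads, flips = map(int, line.split())
    like_a = PI * P_A**heads * (1 - P_A)**(flips - heads)
    like_b = (1 - PI) * P_B**heads * (1 - P_B)**(flips - heads)
    r = like_a / (like_a + like_b)
    # Emit partial expected counts keyed by statistic name; Hadoop groups
    # and sorts by key, so the reducer can sum each statistic in turn.
    for key, val in (("n_a", r),
                     ("heads_a", r * heads), ("flips_a", r * flips),
                     ("heads_b", (1 - r) * heads), ("flips_b", (1 - r) * flips)):
        print("%s\t%f" % (key, val))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the partial statistics for each key (input arrives
# sorted by key, as Hadoop Streaming guarantees).
import sys
from itertools import groupby

def parse(lines):
    for line in lines:
        key, val = line.strip().split("\t")
        yield key, float(val)

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%f" % (key, sum(v for _, v in group)))
```

A driver would recompute the parameters from the reduced totals and resubmit; since every iteration rescans the full corpus, per-iteration overhead (job launch, shuffle, disk I/O) largely determines whether the MapReduce formulation pays off.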
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH