
NSF Org: | CNS Division Of Computer and Network Systems |
Recipient: |
|
Initial Amendment Date: | April 12, 2011 |
Latest Amendment Date: | January 15, 2015 |
Award Number: | 1131889 |
Award Instrument: | Continuing Grant |
Program Manager: | Marilyn McClure, mmcclure@nsf.gov, (703)292-5197, CNS Division Of Computer and Network Systems, CSE Directorate for Computer and Information Science and Engineering |
Start Date: | April 15, 2011 |
End Date: | January 31, 2016 (Estimated) |
Total Intended Award Amount: | $321,780.00 |
Total Awarded Amount to Date: | $321,780.00 |
Funds Obligated to Date: | FY 2011 = $79,853.00; FY 2012 = $82,299.00; FY 2013 = $85,078.00 |
History of Investigator: |
|
Recipient Sponsored Research Office: | 520 LEE ENTRANCE STE 211, AMHERST, NY, US 14228-2577, (716)645-2634 |
Sponsor Congressional District: |
|
Primary Place of Performance: | 501 Capen Hall, Buffalo, NY, US 14260-1600 |
Primary Place of Performance Congressional District: |
|
Unique Entity Identifier (UEI): |
|
Parent UEI: |
|
NSF Program(s): | CSR-Computer Systems Research |
Primary Program Source: | 01001112DB NSF RESEARCH & RELATED ACTIVIT; 01001213DB NSF RESEARCH & RELATED ACTIVIT; 01001314DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): |
|
Program Element Code(s): |
|
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
CAREER: Data-aware Distributed Computing for Enabling Large-scale Collaborative Science
PI: Tevfik Kosar, Louisiana State University
Applications and experiments in all areas of science are becoming increasingly complex and more demanding in terms of their computational and data requirements. Some applications generate data volumes reaching petabytes. Sharing, disseminating, and analyzing these large data sets becomes a big challenge, especially when distributed resources are used.
This Faculty Early Career Development (CAREER) project proposes a new distributed computing paradigm called "data-aware distributed computing", which will include a diverse set of algorithms, models, and tools for mitigating the data bottleneck in distributed computing systems, and will support a broad range of data-intensive as well as dynamic data-driven applications. As part of this project, research and development will be performed on three main components: i) a data-aware scheduler that provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; ii) integration of these capabilities with the other layers of distributed computing, such as workflow planning, resource brokering, and storage management; and iii) further optimization of data movement tasks via dynamic tuning of the underlying protocol transfer parameters.
Research will be integrated into all levels of education, including science projects, seminars, and summer camps on data-intensive computing with K-12 students (where 99% are minority students); curriculum development, mentoring, and international student/intern exchange programs for undergraduate and graduate students; and summer internships and workshops specifically for the HBCU community, including faculty members.
The tools and software developed in this project will be available to the public via open-source distribution.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Intellectual Merit:
Applications in a variety of spaces — scientific, industrial, and personal — now generate more data than ever before. As data has become more abundant and data resources become more heterogeneous, accessing, sharing, analyzing and disseminating these data sets has become a bigger challenge.
This research brings the concept of "data-awareness" to several of the most crucial petascale and distributed computing components, such as end-to-end workflow management, resource discovery and brokering, and data storage management.
Specific contributions of the project within the discipline include:
1) Data-aware scheduling: We developed a) novel algorithms for efficient planning, scheduling, and execution of data transfer tasks; b) a data scheduling algorithm for advanced reservation and provisioning of resources; c) a new data scheduling framework with early error detection, classification, and recovery capabilities; d) a semantically-aware data discovery and placement algorithm for collaborative computing environments; e) asynchronous replication models for multi-master metadata replication in data-aware distributed storage.
2) Data-aware workflows: We developed a) models to choose the best data access method (i.e., staging vs remote I/O) specific to the application; b) data-aware workflow scheduling algorithms for heterogeneous distributed computing environments; and c) a novel algorithm for locality and network-aware reduce task scheduling of data-intensive applications in a cloud setting.
3) End-to-end data throughput optimization: We developed a) application-level models to predict the best combination of protocol parameters for optimal network performance; b) a novel hysteresis-based technique to optimize the transfer parameters based on real-time as well as historical data analysis; c) an end-to-end throughput optimization model which includes disk and CPU striping for end-to-end data-flow parallelism; d) novel data transfer algorithms which aim to achieve high data transfer throughput while keeping the energy consumption during the transfers at minimal levels; and e) a cloud-hosted data transfer scheduling and optimization service called StorkCloud.
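To make the idea in item 3b more concrete, the following is a minimal illustrative sketch of hysteresis applied to transfer-parameter tuning. It is not the project's algorithm, whose details are not given in this report. It assumes the tuned parameter is the number of parallel streams, that the starting value comes from historical logs, and that a switch happens only when a freshly measured alternative beats the current setting by a fixed relative margin. The class name `HysteresisTuner`, the `margin` value, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class HysteresisTuner:
    """Keep the current parallel-stream count unless another is clearly better."""

    current: int          # stream count in use (assumed seeded from historical data)
    margin: float = 0.10  # hypothetical: relative gain required before switching

    def update(self, samples: Mapping[int, float]) -> int:
        """Take real-time samples (stream count -> throughput) and return the choice."""
        best = max(samples, key=lambda n: samples[n])
        baseline = samples.get(self.current)
        # Switch if the current setting was not sampled at all, or if the best
        # candidate exceeds it by more than the hysteresis margin.
        if baseline is None or samples[best] > baseline * (1 + self.margin):
            self.current = best
        return self.current


if __name__ == "__main__":
    tuner = HysteresisTuner(current=4)
    # 8 streams is only 5% faster than 4, so the tuner stays at 4.
    print(tuner.update({2: 400.0, 4: 800.0, 8: 840.0}))
    # 8 streams is now 25% faster than 4, so the tuner switches to 8.
    print(tuner.update({4: 800.0, 8: 1000.0}))
```

The margin keeps the setting from flipping back and forth on small measurement noise, which is the general motivation for hysteresis in a control loop.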
These techniques, models, algorithms, and tools enable a new computing paradigm called "data-aware distributed computing", which not only impacts computer science research by changing the way petascale distributed computing is performed, but also changes how domain scientists perform their research by facilitating rapid analysis and sharing of raw data and results. The cloud-hosted StorkCloud data transfer scheduling and optimization service has the potential to become a key component of the information resources that form the national infrastructure.
This project resulted in a) one edited book titled "Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management"; b) 12 journal papers in top CS journals; c) 21 conference and workshop papers; and d) 3 book chapters in different edited volumes. The PI has received two "best paper awards" for these publications.
Broader Impact:
This project integrated research to different levels of education through science projects, seminars, workshops, curriculum deve...