
NSF Org: | OAC Office of Advanced Cyberinfrastructure (OAC) |
Recipient: | |
Initial Amendment Date: | August 25, 2014 |
Latest Amendment Date: | August 25, 2014 |
Award Number: | 1443040 |
Award Instrument: | Standard Grant |
Program Manager: | Amy Walton, awalton@nsf.gov, (703)292-4538, OAC Office of Advanced Cyberinfrastructure (OAC), CSE Directorate for Computer and Information Science and Engineering |
Start Date: | January 1, 2015 |
End Date: | July 31, 2018 (Estimated) |
Total Intended Award Amount: | $1,485,021.00 |
Total Awarded Amount to Date: | $1,485,021.00 |
Funds Obligated to Date: | |
History of Investigator: | |
Recipient Sponsored Research Office: | 240 FRENCH ADMINISTRATION BLDG PULLMAN WA US 99164-0001 (509)335-9661 |
Sponsor Congressional District: | |
Primary Place of Performance: | Pullman WA US 99164-6414 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | ADVANCES IN BIO INFORMATICS, Data Cyberinfrastructure |
Primary Program Source: | |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.070 |
ABSTRACT
Scientific community databases fulfill an important research need by offering curated information to audiences with shared basic and applied research goals. They serve as clearinghouses for community information and communication. Researchers need easy-to-use analytical workflows in an accessible and familiar location, but community databases typically lack the infrastructure to support these needs, and more data exchange is needed between sites. Additionally, community databases require the ability to easily incorporate results from analytical workflows for public dissemination, and the capacity to transfer large datasets quickly between computational resources and the database. Tripal, an open-source toolkit for constructing online genomic and genetic databases, is uniquely positioned to provide solutions to these challenges because it has been adopted by multiple community databases and thus provides a common infrastructure.
This project creates Tripal Gateway: a set of modules (extensions) to be incorporated into Tripal to foster greater data dissemination, collaboration, and research. The team develops three modules that integrate Tripal with Galaxy (a popular workflow system), interconnect Tripal sites for data sharing, and utilize emerging technologies for faster data exchange:
- Tripal Galaxy - a module integrating Galaxy workflows into a Tripal site, providing both next-generation analytical workflows and seamless transition of results into the community database.
- Tripal Exchange - a module to provide capabilities for cross-site querying, enabling collation and viewing of data from multiple sites, and integration of data into workflows.
- Tripal SDN - a module incorporating software-defined networking (SDN) technology, providing mechanisms to improve the speed of data exchange.
These new modules are developed, implemented, and tested in conjunction with six data sites (the Citrus Genome Database, Cool Season Food Legumes, CottonGen, the Genome Database for Rosaceae, Hardwood Genomics, and TreeGenes). Integration of the Tripal Gateway is also anticipated for four additional databases (GrainGenes, KnowPulse, LegumeInfo, and PeanutBase). After implementation, this effort will interlink and allow cross-querying across a major Arabidopsis resource, four legume genomics sites, the primary cotton community site, GrainGenes, and four different tree genomic sites covering fruit trees and forest trees. Implementation of Tripal Gateway into the community databases servicing these extensive research communities will support basic and applied research that is both crop-specific and broadly useful across crop agriculture.
PROJECT OUTCOMES REPORT
Disclaimer
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
The Tripal Gateway project is intended to provide computational infrastructure for online data repositories housing genomic, genetic and breeding data. An increasing number of online data repositories use the open-source software package Tripal (http://tripal.info/) to power their sites, which typically offer data for one or more related species of plants or animals. Examples include several tree databases: TreeGenes (https://treegenesdb.org/), the Hardwood Genomics Project (https://hardwoodgenomics.org/), the Citrus Genome Database (https://www.citrusgenomedb.org/) and the Genome Database for Rosaceae (https://www.rosaceae.org/). These sites house data for trees of ecological and agricultural importance. Moreover, these and other sites provide tools to visualize, mine and analyze the data they house.
DNA sequencing and other technologies are lowering the cost of data collection in biology, yet creating websites to house the data remains a time-consuming and expensive task. Often groups that can collect the data do not have the resources to create a fully searchable online data repository with computational tools that can support large datasets. Tripal, therefore, is meant to ease the burden by providing a common infrastructure on which new sites can be built, and the Tripal Gateway project is meant to provide improved support for large data. By using a common infrastructure, Tripal-based sites can share software, ideas and tools, which in turn decreases costs.
The Tripal Gateway project has expanded the tools available to Tripal-based sites, allowing them to provide additional infrastructure to their end-user scientists. First, this project now allows related sites, such as the tree sites listed previously, to exchange data among themselves without the need to duplicate the data housed in the other sites. Tree researchers can more easily locate tree-specific data (such as genes of interest) by visiting just one of the integrated tree sites, where they can find data across them all. This makes it easier for scientists to find important data without visiting multiple sites. The same functionality is now available to any Tripal-based site that installs the Tripal Gateway tools and wishes to exchange and share data with others.
Second, scientists often want to analyze data in these repositories in combination with their own data. Yet the large datasets required for these analyses are difficult to process. Often these analyses require multiple data processing steps that occur in a specific order. These multi-step analyses, known as workflows, are best handled by software dedicated to automating their execution. The Galaxy Project (https://galaxyproject.org/) is one such tool; it is popular with biology researchers and can use high-performance computing data centers. The Tripal Gateway project has integrated Galaxy with Tripal to allow for seamless execution of workflows using data from the website and data provided by the scientist. The scientist can now provide parameters for execution of the workflow within a Tripal site, and Tripal manages the workflow. This reduces the complexity for some scientists, who need not worry about computational infrastructure or workflow design. Data and results are accessible within the same website.
The third benefit provided by this project is a set of new resources to support movement of large data. Often the data analysis needed for workflows occurs on computers that are not in the same data center as the online Tripal database. Moving large amounts of data across the internet for analysis can be challenging. Internet2 is a national-level research network designed to support fast movement of data between research institutions. The Tripal Gateway project provides software tools that support as-fast-as-possible movement of data using Internet2 (or standard commodity internet). Thus, data needed for scientific workflows can be moved between Tripal sites and other data repositories using smarter internet pathways.
In summary, the Tripal Gateway project provides three major sets of tools meant to help Tripal-based sites better deal with increasing quantities of data. Because all Tripal sites are built on a common framework, the Tripal Gateway tools can be installed by any Tripal-based site, providing a "plug-and-play" experience for site developers. Such computational infrastructure will allow scientists who use these databases to more easily find, move and analyze large datasets.
Last Modified: 10/31/2018
Modified by: Stephen Ficklin