Award Abstract # 1443040
CIF21 DIBBS: Tripal Gateway, a Platform for Next-Generation Data Analysis and Sharing

NSF Org: OAC
Office of Advanced Cyberinfrastructure (OAC)
Recipient: WASHINGTON STATE UNIVERSITY
Initial Amendment Date: August 25, 2014
Latest Amendment Date: August 25, 2014
Award Number: 1443040
Award Instrument: Standard Grant
Program Manager: Amy Walton
awalton@nsf.gov
 (703)292-4538
OAC
 Office of Advanced Cyberinfrastructure (OAC)
CSE
 Directorate for Computer and Information Science and Engineering
Start Date: January 1, 2015
End Date: July 31, 2018 (Estimated)
Total Intended Award Amount: $1,485,021.00
Total Awarded Amount to Date: $1,485,021.00
Funds Obligated to Date: FY 2014 = $1,485,021.00
History of Investigator:
  • Stephen Ficklin (Principal Investigator)
    stephen.ficklin@wsu.edu
  • Doreen Main (Co-Principal Investigator)
  • Frank Feltus (Co-Principal Investigator)
  • Margaret Staton (Co-Principal Investigator)
  • Jill Wegrzyn (Co-Principal Investigator)
Recipient Sponsored Research Office: Washington State University
240 FRENCH ADMINISTRATION BLDG
PULLMAN
WA  US  99164-0001
(509)335-9661
Sponsor Congressional District: 05
Primary Place of Performance: Washington State University
Pullman
WA  US  99164-6414
Primary Place of Performance Congressional District: 05
Unique Entity Identifier (UEI): XRJSGX384TD6
Parent UEI:
NSF Program(s): ADVANCES IN BIO INFORMATICS, Data Cyberinfrastructure
Primary Program Source: 01001415DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 8048, 7433
Program Element Code(s): 116500, 772600
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Scientific community databases fulfill an important research need by offering curated information to audiences with shared basic and applied research goals. They serve as clearinghouses for community information and communication. Researchers need easy-to-use analytical workflows in an easily accessible and familiar location, but community databases typically lack the infrastructure to support them, and more data exchange is needed between sites. Additionally, community databases need to incorporate results from analytical workflows easily for public dissemination, and to transfer large datasets quickly between computational resources and the database. Tripal, an open-source toolkit used for construction of online genomic and genetic databases, is uniquely positioned to provide solutions to these challenges because it has been adopted by multiple community databases and thus provides a common infrastructure.

This project creates Tripal Gateway: a set of modules (extensions) to be incorporated into Tripal to foster greater data dissemination, collaboration, and research. The team develops three modules that integrate Tripal with Galaxy (a popular workflow system), interconnect Tripal sites for data sharing, and use emerging technologies for faster data exchange:
- Tripal Galaxy - a module integrating Galaxy workflows into a Tripal site, providing both next-generation analytical workflows and seamless transition of results into the community database.
- Tripal Exchange - a module to provide capabilities for cross-site querying, enabling collation and viewing of data from multiple sites, and integration of data into workflows.
- Tripal SDN - a module incorporating software defined networking (SDN) technology, providing mechanisms to improve speed of data exchange.

These new modules are developed, implemented, and tested in conjunction with six data sites (the Citrus Genome Database, Cool Season Food Legumes, CottonGen, the Genome Database for Rosaceae, Hardwood Genomics, and TreeGenes). Integration of the Tripal Gateway is also anticipated for four additional databases (GrainGenes, KnowPulse, LegumeInfo, and PeanutBase). After implementation, this effort will interlink and allow cross-querying across a major Arabidopsis resource, four legume genomics sites, the primary cotton community site, GrainGenes, and four different tree genomic sites covering fruit trees and forest trees. Implementation of Tripal Gateway into the community databases servicing these extensive research communities will support basic and applied research that is both crop-specific and broadly useful across crop agriculture.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Connor Wytko, Brian Soto, Stephen P. Ficklin "blend4php: a PHP API for Galaxy" Database, 2017 10.1093/database/baw154
Watts NW & Feltus FA "Big Data Smart Socket (BDSS): A System that Abstracts Data Transfer Habits from End Users" Bioinformatics, 2016 10.1093/bioinformatics/btw679
Ming Chen, Nathan Henry, Abdullah Almsaeed, Xiao Zhou, Jill Wegrzyn, Stephen Ficklin, and Margaret Staton "New extension software modules to enhance searching and display of transcriptome data in Tripal databases" Database, 2017 10.1093/database/bax052
Jung, S., Ficklin, S. P., Lee, T., Cheng, C.-H., Blenda, A., Zheng, P., Yu, J., Humann, J., Ficklin, S. P., Ksenika, G., Scott, K., Frank, M., Ru, S., Hough, H., Evans, K., Peace, C., Olmstead, M., DeVetter, L. W., McFerson, J., Coe, M., Wegrzyn, J. L., S "15 years of GDR: New data and functionality in the Genome Database for Rosaceae" Nucleic Acids Research, 2018, p.gky1000 10.1093/nar/gky1000
Harper, L., Campbell, J., Cannon, E., Jung, S., Poelchau, M., Walls, R.L., Andorf, C.M., Arnaud, E., Berardini, T., Birkett, C., Cannon, S., Carson, J., Condon, B., Cooper, L., Dunn, N., Elsik, C., Farmer, A., Ficklin, S.P., Grant, D., Grau, E., Herndon, "AgBioData Consortium Recommendations for Sustainable Genomics and Genetics Databases for Agriculture" Database (Oxford), 2018, p.bay088 10.1093/database/bay088
Frank A. Feltus, Joseph R. Breen III, Juan Deng, Ryan S. Izard, Christopher A. Konger, Walter B. Ligon III, Don Preuss and Kuang-Ching Wang "The Widening Gulf between Genomics Data Generation and Consumption: A Practical Guide to Big Data Transfer Technology." Bioinformatics and Biology Insights, 2015 10.4137/BBI.S28988
Condon, B., Almsaeed, A., West, J., Chen, M., Staton, M "Tripal Developer Toolkit" Database (Oxford), 2018, p.bay099 10.1093/database/bay099
Falk, T., Herndon, N., Grau, E., Buehler, S., Richter, P., Zaman, S., Baker, E., Ramnath, R., Ficklin, S., Staton, M., Feltus, F., Jung, S., Main, D., Wegrzyn, J. "Growing and cultivating the forest genomics database, TreeGenes" Database (Oxford), 2018, p.bay084 10.1093/database/bay084

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

The Tripal Gateway project is intended to provide computational infrastructure for online data repositories housing genomic, genetic and breeding data. An increasing number of online data repositories are using the open-source software package Tripal (http://tripal.info/) to power their sites, which typically offer data for one or more related species of plants or animals. Examples include several tree databases: TreeGenes (https://treegenesdb.org/), the Hardwood Genomics Project (https://hardwoodgenomics.org/), the Citrus Genome Database (https://www.citrusgenomedb.org/) and the Genome Database for Rosaceae (https://www.rosaceae.org/). These sites house data for trees of ecological and agricultural importance. Moreover, these and other sites provide tools to visualize, mine and analyze the data they house.

DNA sequencing and other technologies are lowering the cost of data collection in biology, yet creating websites to house the data remains a time-consuming and expensive task. Often groups that can collect the data do not have the resources to create a fully searchable online data repository with computational tools that can support large datasets. Tripal, therefore, is meant to ease the burden by providing a common infrastructure on which new sites can be built, and the Tripal Gateway project is meant to provide improved support for large data. By using a common infrastructure, Tripal-based sites can share software, ideas and tools, which in turn decreases costs.

The Tripal Gateway project has expanded the tools available to Tripal-based sites, allowing them to provide additional infrastructure to their end-user scientists. First, this project now allows related sites, such as the tree sites listed previously, to exchange data among themselves without the need to duplicate the data housed in the other sites. Tree researchers can more easily find tree-specific data (such as genes of interest) by visiting just one of the integrated tree sites, where they can find data across them all. This increases the ability of scientists to find data of importance without the need to visit multiple sites. The same functionality is now available to any Tripal-based site that installs the Tripal Gateway tools and wishes to exchange and share data with others.
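
The idea of searching several sites from one place, without copying their data, can be illustrated with a minimal Python sketch. It is not the Tripal Gateway implementation: the function names, the in-memory stand-ins for remote sites, and the record names (`abc1`, `xyz2`) are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

# Each site is modeled as a function that searches its own records.
SiteSearch = Callable[[str], List[Dict[str, str]]]


def cross_site_search(sites: Dict[str, SiteSearch], term: str) -> List[Dict[str, str]]:
    """Query every site and collate the hits, tagging each with its source site."""
    hits = []
    for site_name, search in sites.items():
        for record in search(term):
            hits.append({**record, "source": site_name})
    return hits


def make_site(records: List[Dict[str, str]]) -> SiteSearch:
    # Hypothetical in-memory stand-in for a remote site's search service.
    def search(term: str) -> List[Dict[str, str]]:
        return [r for r in records if term.lower() in r["name"].lower()]
    return search


# Placeholder sites and records; each site keeps its own data.
sites = {
    "site_a": make_site([{"name": "abc1"}, {"name": "xyz2"}]),
    "site_b": make_site([{"name": "abc1-like"}]),
}
results = cross_site_search(sites, "ABC1")
print([(r["source"], r["name"]) for r in results])
```

Each record stays with the site that owns it; only the matching hits are gathered and labeled with their origin at query time.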

Second, scientists often want to analyze data in these repositories in combination with their own data. Yet the large datasets required for these analyses are difficult to process. Often these analyses require multiple data processing steps that occur in a specific order. These multi-step analyses, known as workflows, are best handled by software dedicated to automating execution. The Galaxy Project (https://galaxyproject.org/) is one such tool; it is popular with biology researchers and can use high-performance computing data centers. The Tripal Gateway project has integrated Galaxy with Tripal to allow for seamless execution of workflows using data from the website and data provided by the scientist. The scientist can now provide parameters for execution of the workflow within a Tripal site, and Tripal manages the workflow. This reduces the complexity for scientists, who need not worry about computational infrastructure or workflow design. Data and results are accessible within the same website.
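
A workflow in this sense is a fixed sequence of steps in which each step consumes the output of the one before it. The Python sketch below shows only that general pattern; it does not use Galaxy or Tripal, and the step names, the `min_length` parameter, and the sample reads are hypothetical.

```python
from typing import Callable, Dict, List

# A step takes the shared data dictionary and returns an updated copy.
Step = Callable[[Dict[str, object]], Dict[str, object]]


def run_workflow(steps: List[Step], inputs: Dict[str, object]) -> Dict[str, object]:
    """Run each step in order, feeding every step the output of the previous one."""
    data = dict(inputs)
    for step in steps:
        data = step(data)
    return data


# Hypothetical steps, for illustration only.
def trim_reads(data):
    # Keep reads at or above the user-supplied minimum length.
    kept = [r for r in data["reads"] if len(r) >= data["min_length"]]
    return {**data, "reads": kept}


def count_reads(data):
    # Record how many reads survived the previous step.
    return {**data, "read_count": len(data["reads"])}


result = run_workflow(
    [trim_reads, count_reads],
    {"reads": ["ACGT", "AC", "ACGTAC"], "min_length": 4},
)
print(result["read_count"])
```

The user supplies only the inputs and parameters; the order of the steps is fixed by the workflow definition.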

The third benefit provided by this project is a set of new resources to support movement of large data. Often data analysis needed for workflows occurs on computers that are not in the same data center as the online Tripal database. Moving large amounts of data across the internet for analysis can be challenging. Internet2 is a national-level research network designed to support fast movement of data between research institutions. The Tripal Gateway project provides software tools that support as-fast-as-possible movement of data using Internet2 (or standard commodity internet). Thus, data needed for scientific workflows can be moved between Tripal sites and other data repositories using smarter internet pathways.
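
As a rough illustration of why the choice of network path matters, the Python sketch below estimates transfer time as size divided by bandwidth and picks the faster route. The route names, bandwidth figures, and dataset size are hypothetical assumptions, not measurements from the project, and the project's own tools may select routes quite differently.

```python
from typing import Dict


def estimate_seconds(size_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Ideal transfer time: total bits divided by link bandwidth."""
    return size_bytes * 8 / bandwidth_bits_per_s


def pick_route(size_bytes: int, routes: Dict[str, float]) -> str:
    """Return the name of the route with the lowest estimated transfer time."""
    return min(routes, key=lambda name: estimate_seconds(size_bytes, routes[name]))


# Hypothetical bandwidths in bits per second, for illustration only.
routes = {"research_network": 10e9, "commodity_internet": 1e9}
size = 50 * 10**9  # a 50 GB dataset (assumed)

best = pick_route(size, routes)
print(best, estimate_seconds(size, routes[best]))
```

Real transfers rarely reach the ideal figure, since protocol overhead, disk speed, and congestion all reduce throughput.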

In summary, the Tripal Gateway project provides three major sets of tools meant to help Tripal-based sites better deal with increasing quantities of data. Because all Tripal sites are built on a common framework, the Tripal Gateway tools can be installed by any Tripal-based site, providing a “plug-and-play” experience for site developers. Such computational infrastructure will allow scientists who use these databases to more easily find, move and analyze large datasets.

Last Modified: 10/31/2018
Modified by: Stephen Ficklin

Please report errors in award information by writing to: awardsearch@nsf.gov.
