Award Abstract # 0850205
Collaborative Research - Biochemically-Constrained Genomic Signal Processing (BioGSP): A Multi-Scale Interdisciplinary Approach to Regulatory Network Inference

NSF Org: DBI
Division of Biological Infrastructure
Recipient: REGENTS OF THE UNIVERSITY OF CALIFORNIA, THE
Initial Amendment Date: August 31, 2009
Latest Amendment Date: July 6, 2011
Award Number: 0850205
Award Instrument: Continuing Grant
Program Manager: Peter McCartney
DBI
 Division of Biological Infrastructure
BIO
 Directorate for Biological Sciences
Start Date: September 15, 2009
End Date: August 31, 2012 (Estimated)
Total Intended Award Amount: $127,427.00
Total Awarded Amount to Date: $127,427.00
Funds Obligated to Date: FY 2009 = $84,049.00
FY 2011 = $43,378.00
History of Investigator:
  • Adam Arkin (Principal Investigator)
    aparkin@lbl.gov
Recipient Sponsored Research Office: University of California-Berkeley
1608 4TH ST STE 201
BERKELEY
CA  US  94710-1749
(510)643-3891
Sponsor Congressional District: 12
Primary Place of Performance: University of California-Berkeley
1608 4TH ST STE 201
BERKELEY
CA  US  94710-1749
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): GS3YEVSS12N6
Parent UEI:
NSF Program(s): ADVANCES IN BIO INFORMATICS
Primary Program Source: 01000910DB NSF RESEARCH & RELATED ACTIVIT
01001112DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1165, 9179, 9183, 9184, BIOT
Program Element Code(s): 116500
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.074

ABSTRACT

Columbia University and the University of California, Berkeley are awarded grants for the development of novel analytical tools that bridge interdisciplinary gaps between engineering and the biological sciences. The tools integrate top-down approaches from statistical signal processing theory with bottom-up methods that characterize biological networks as collections of basic biomolecular interactions. The former are the subject of the emerging engineering discipline of Genomic Signal Processing (GSP), while the latter are the domain of classical biochemistry and biophysics. The investigators recognize that the number of putative signal processing mechanisms that need to be analyzed by GSP for a given biological system could be significantly reduced by requiring consistency with biochemical and biophysical laws. The resulting Biochemically-constrained GSP (BioGSP) approach is thus able to produce results on par with traditional GSP methods, but is significantly more efficient and is assured to comply with key molecular properties of biological mechanisms.

Biological systems consist of molecules and molecular complexes, whose interactions comprise intricate circuits and networks. Knowledge of their structure and function can lead to powerful new ways of controlling biological mechanisms, which may enable new approaches to remedying faults in natural biological processes as well as to engineering de novo synthetic biomolecular designs. Recent advancements in experimental techniques have allowed us an unprecedented view of how these systems are structured. However, a detailed understanding of their function remains a challenge, due in large part to the scale and complexity of the networks involved as well as the nonlinear nature of biochemical interactions among the various molecular species. This issue is particularly acute for genetic networks, both because of their importance to the development and operation of biological systems and because of the often complex regulatory patterns they employ. Further information about the project may be found at the PI web sites at http://www.ee.columbia.edu/~wangx/ and http://genomics.lbl.gov/index.html.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Guido H. Jajamovich, Xiaodong Wang, Adam P. Arkin, and Michael S. Samoilov, "Bayesian Multiple-Instance Motif Discovery with BAMBI: Inference of Recombinase and Transcription Factor Binding Sites," Nucleic Acids Research, v.39, 2011, p.e146. doi:10.1093/nar/gkr745
Liming Wang, Xiaodong Wang, Adam P. Arkin, and Michael S. Samoilov, "Inference of gene regulatory networks from genome-wide knockout fitness data," Bioinformatics, v.29, 2013, p.338. doi:10.1093/bioinformatics/bts634

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In this project we invented a number of new computational approaches to understanding how cells regulate their behavior in different environments to survive in these variable conditions.

(1) Gene expression underlies most essential cellular processes and is typically controlled by networks of regulatory interactions. Two basic mechanisms directly involved in regulating gene expression are transcription factor binding and site-specific recombination. In both cases, the proteins involved often attach to highly specific DNA segments (conserved regulatory motifs), which leads to activation or repression of gene expression, due either to epigenetic interactions between transcription factors and components of the RNA polymerase machinery or to recombinase-mediated genetic and genomic modifications of the underlying DNA sequences. Thus, discovery of such motifs represents one of the essential problems in bioinformatics and computational biology.

However, as individual binding sites are subject to context-specific optimizations of protein affinities as well as neutral alterations by random mutations, nucleotide sequences of various motif instances can display a significant degree of heterogeneity. Motif discovery from sequences thus becomes a computationally challenging task that has been the subject of much research in recent years. Furthermore, along with performance, one of the essential requirements for a practically useful motif discovery algorithm is input flexibility.

To address these problems:

  1. We have developed BAMBI – a sequential Monte Carlo algorithm based on the position weight matrix model that has the flexibility to also estimate motif length, number of instances, as well as their locations within each sequence.
  2. Using BAMBI, we have shown that the proposed approach can be used to find binding sites in synthetic data as well as in the DNA sequence database containing multiple binding site instances of cAMP receptor protein (CRP) – a major prokaryotic transcription factor.
  3. The problem of discovering multiple motif instances of unknown length is particularly significant in the case of recombinases, whose target sites tend both to be long (on the order of 30 bp or more) and to occur in multiple instances within relevant genomic loci (at least two are needed for DNA strand exchange). Using BAMBI, we were able to successfully identify these sites within the sequences containing the compiled list of Din-family recombinase sites.

Results obtained reveal that BAMBI demonstrates better statistical performance in the described applications than four of the widely-used profile-based motif discovery algorithms.
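
As a rough illustration of the position weight matrix (PWM) model that BAMBI builds on, the sketch below estimates a PWM from a few aligned motif instances and scans a sequence for the best-scoring window. It is not the BAMBI algorithm: it assumes a fixed, known motif length and uses a plain exhaustive scan in place of sequential Monte Carlo inference. The function names, the pseudocount, the uniform background, and the toy sequences are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def build_pwm(instances, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length instances."""
    length = len(instances[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep nonzero probability.
        counts = {b: pseudocount for b in BASES}
        for inst in instances:
            counts[inst[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def log_odds_score(pwm, window, background=0.25):
    """Sum of log2(P_motif / P_background) over the window (uniform background assumed)."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(window))

def best_site(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    start = max(starts, key=lambda s: log_odds_score(pwm, sequence[s:s + width]))
    return start, log_odds_score(pwm, sequence[start:start + width])

# Toy, made-up instances and sequence (not real CRP or recombinase sites).
toy_instances = ["TGTGA", "TGTGA", "TGAGA", "TGTGC"]
toy_pwm = build_pwm(toy_instances)
toy_start, toy_score = best_site(toy_pwm, "ACCATGTGATTCG")
```

Estimating the motif length and the number of instances per sequence, which the report lists as BAMBI's main flexibility, is exactly what this fixed-width scan leaves out.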

(2)  Genome-wide fitness is an emerging type of high throughput biological data generated for individual organisms by creating libraries of knockouts, subjecting them to broad ranges of environmental conditions, and measuring the resulting clone-specific fitnesses. Since fitness is an organism-scale measure of gene regulatory network behavior, it may offer certain advantages when insights into such phenotypical and functional features are of primary interest over individual gene expression. In particular, contribution of practically irrelevant genes may be effectively filtered out if they do not contribute substantially to fitness state—regardless of their statistical significance or dynamic state.

To address these problems we have developed a model and proposed an inference algorithm for using fitness data from knockout libraries to identify underlying gene regulatory networks. Unlike most prior methods, the presented approach captures not only structural, but also dynamical and non-linear nature of biomolecular systems involved. A state–space model with non-linear basis is used for dynamically describing gene regulatory networks. Network structure is then elu...
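
To make the idea of a state-space model with a non-linear basis more concrete, here is a minimal simulation sketch. It is not the model from the project: the sigmoid basis, the update rule x_i(t+1) = sum_j W[i][j] * sigmoid(x_j(t)) + noise, the treatment of a knockout as clamping a gene to zero and removing its influence, and all names and parameters are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    """Assumed non-linear basis function."""
    return 1.0 / (1.0 + math.exp(-x))

def step(state, weights, knocked_out=(), noise_sd=0.0, rng=random):
    """One state update: x_i(t+1) = sum_j W[i][j] * sigmoid(x_j(t)) + noise."""
    # A knocked-out gene contributes nothing to the genes it regulates (assumption).
    basis = [0.0 if j in knocked_out else sigmoid(x) for j, x in enumerate(state)]
    new_state = []
    for i, row in enumerate(weights):
        if i in knocked_out:
            new_state.append(0.0)  # knocked-out gene is held at zero (assumption)
            continue
        drive = sum(w * b for w, b in zip(row, basis))
        new_state.append(drive + rng.gauss(0.0, noise_sd))
    return new_state

def simulate(weights, initial, steps, **kwargs):
    """Return the list of states visited, starting from `initial`."""
    trajectory = [list(initial)]
    for _ in range(steps):
        trajectory.append(step(trajectory[-1], weights, **kwargs))
    return trajectory

# Hypothetical two-gene network: gene 1 activates gene 0, gene 0 represses gene 1.
toy_weights = [[0.0, 1.0], [-1.0, 0.0]]
wild_type = simulate(toy_weights, [0.0, 0.0], steps=3)
knockout = simulate(toy_weights, [0.0, 0.0], steps=3, knocked_out=(1,))
```

Comparing `wild_type` with `knockout` shows how removing one gene changes the trajectories of the others, which is the kind of signal an inference procedure would have to explain. Mapping those states to a measured fitness value is not modelled here.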

Please report errors in award information by writing to: awardsearch@nsf.gov.
