IEEE Computer Society
Stanford University, Palo Alto, CA
Fast and Sensitive Alignment of Large Genomic Sequences
Michael Brudno and Burkhard Morgenstern*
Motivation: Comparative analysis of syntenic genome sequences can be used to identify functional sites such as exons and regulatory elements. Here, the first step is to align two or more evolutionarily related sequences and, in recent years, a number of computer programs have been developed for the alignment of large genomic sequences. Some of these programs are extremely fast, but time-efficiency is often achieved at the expense of sensitivity. One way of combining speed and sensitivity is to use an anchored-alignment approach. In a first step, a fast heuristic identifies a chain of strong sequence similarities that serve as anchor points. In a second step, regions between these anchor points are aligned using a slower but more sensitive method. Results: We present CHAOS, a novel algorithm for rapid identification of chains of local sequence similarities among large genomic sequences. Similarities identified by CHAOS are used as anchor points to improve the running time of the DIALIGN alignment program. Systematic test runs show that this method can reduce the running time of DIALIGN by more than 93% while affecting the quality of the resulting alignments by only 1%. Availability: The source code for CHAOS is available at http://www.stanford.edu/~brudno/chaos/. An integrated program package containing CHAOS and DIALIGN will be available at http://bibiserv.techfak.uni-bielefeld.de/dialign/.
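The anchoring step described above can be illustrated with a toy seed-and-chain sketch in Python. This is an assumption-laden simplification: it uses exact k-mer seeds and a quadratic chaining DP, whereas CHAOS itself uses degenerate seeds and a faster chaining strategy; all function names are illustrative.

```python
from collections import defaultdict

def find_seeds(a, b, k):
    """Return (pos_a, pos_b) pairs where a and b share an exact k-mer."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            seeds.append((i, j))
    return seeds

def chain_seeds(seeds, k):
    """Longest chain of non-overlapping seeds collinear in both sequences."""
    seeds = sorted(seeds)
    best = [1] * len(seeds)   # best[t] = length of the best chain ending at seed t
    back = [-1] * len(seeds)  # back-pointers to reconstruct the chain
    for t, (i, j) in enumerate(seeds):
        for s in range(t):
            pi, pj = seeds[s]
            if pi + k <= i and pj + k <= j and best[s] + 1 > best[t]:
                best[t], back[t] = best[s] + 1, s
    if not seeds:
        return []
    t = max(range(len(seeds)), key=lambda x: best[x])
    chain = []
    while t != -1:
        chain.append(seeds[t])
        t = back[t]
    return chain[::-1]
```

The returned chain of seed coordinates would then serve as the anchor points between which a sensitive aligner (here, DIALIGN) fills in the remaining regions.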
An Efficient Branch-and-Bound Algorithm for the Assignment of Protein Backbone NMR Peaks
Guohui Lin*, Dong Xu+, Zhi-Zhong Chen~, Tao Jiang%, and Ying Xu
NMR resonance assignment is one of the key steps in solving an NMR protein structure. The assignment process links resonance peaks to individual residues of the target protein sequence, providing the prerequisite for establishing intra- and inter-residue spatial relationships between atoms. The assignment process is tedious and time-consuming, and can take many weeks. Though a number of computer programs exist to assist the assignment process, many NMR labs still do the assignments manually to ensure quality. This paper presents a new computational method based on our recent work towards automating the assignment process, particularly the process of backbone resonance peak assignment. We formulate the assignment problem as a constrained weighted bipartite matching problem. While the problem, in the most general situation, is NP-hard, we have developed a rigorous (exact) algorithm based on the branch-and-bound method, using efficient and effective bounding techniques. Our experimental results on 70 instances of (pseudo) real NMR data derived from 14 proteins demonstrate that the algorithm runs much faster than the (exhaustive) two-layer algorithm introduced recently, and often recovers as many correct peak assignments as the two-layer algorithm.
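The branch-and-bound idea can be sketched for the plain (unconstrained) weighted assignment of peaks to residues. The optimistic bound used here, the sum of the remaining rows' maxima, is a generic textbook illustration, not the paper's actual bounding technique, and the score matrix is hypothetical.

```python
import math

def assign(weights):
    """Max-weight assignment of peaks (rows) to residues (columns)
    by branch-and-bound with a simple upper bound."""
    n = len(weights)
    best = {"score": -math.inf, "perm": None}

    def row_max_tail(r, used):
        # Optimistic completion: each remaining row takes its best free column,
        # ignoring that columns must not be reused.
        return sum(max(w for c, w in enumerate(weights[i]) if c not in used)
                   for i in range(r, n))

    def branch(r, used, score, perm):
        if r == n:
            if score > best["score"]:
                best["score"], best["perm"] = score, perm[:]
            return
        # Prune: even the optimistic completion cannot beat the incumbent.
        if score + row_max_tail(r, used) <= best["score"]:
            return
        for c in range(n):
            if c not in used:
                used.add(c)
                perm.append(c)
                branch(r + 1, used, score + weights[r][c], perm)
                perm.pop()
                used.remove(c)

    branch(0, set(), 0.0, [])
    return best["score"], best["perm"]
```

For instance, `assign([[3, 1], [1, 2]])` assigns peak 0 to residue 0 and peak 1 to residue 1 for a total weight of 5; the second subtree is pruned because its bound cannot exceed the incumbent.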
Bayesian Network and Nonparametric Heteroscedastic Regression for Nonlinear Modeling of Genetic Network
Seiya Imoto, Kim Sunyong, Takao Goto, Sachiyo Aburatani* , Kousuke Tashiro*, Satoru Kuhara*, and Satoru Miyano
We propose a new statistical method for constructing a genetic network from microarray gene expression data by using a Bayesian network. An essential point of Bayesian network construction is the estimation of the conditional distribution of each random variable. We consider fitting nonparametric regression models with heterogeneous error variances to the microarray gene expression data to capture the nonlinear structures between genes. A problem still remains in selecting an optimal graph, which gives the best representation of the system among genes. We theoretically derive a new graph selection criterion from a Bayesian approach in general situations. The proposed method subsumes previous methods based on Bayesian networks. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae gene expression data newly obtained by disrupting 100 genes.
A Literature Based Method for Identifying Gene-Disease Connections
Lada A. Adamic, Dennis Wilkinson, Bernardo A. Huberman, Eytan Adar
We present a statistical method that can swiftly identify, from the literature, sets of genes known to be associated with given diseases. It offers a comprehensive way to treat alias symbols, a statistical method for computing the relevance of the gene to the query, and a novel way to disambiguate gene symbols from other abbreviations. The method is illustrated by finding genes related to breast cancer.
Some Histograms Generated from the Counting of K-tuples in Genomes
Huimin Xie* and Bailin Hao
The K-histograms obtained by counting the distinct K-tuples that occur the same number of times in a genome are theoretically explored. A relationship between the K-histograms and the probability functions of occurrence of K-tuples in sequences is established. The sequences discussed are generated from three classes of models: equal-probability independent identically distributed (iid) models, nonequal-probability iid models, and Markov models. Three approaches are used: (1) exact computation by the Goulden-Jackson cluster method, (2) the Monte Carlo method, and (3) the Poisson approximation method.
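The counting that yields a K-histogram is compact to state: H[m] is the number of distinct K-tuples that occur exactly m times, with overlapping occurrences counted. A minimal sketch:

```python
from collections import Counter

def k_histogram(seq, k):
    """H[m] = number of distinct K-tuples occurring exactly m times in seq."""
    tuple_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(tuple_counts.values())
```

For example, in "ACGTACG" with k = 3 the tuple ACG occurs twice and CGT, GTA, TAC once each, giving the histogram {1: 3, 2: 1}.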
ΦLOG: a Domain Specific Language for Solving Phylogenetic Inference Problems
E. Pontelli, D. Ranjan, B. Milligan, and G. Gupta*
Domain experts think and reason at a high level of abstraction when they solve problems in their domain of expertise. We present the design and motivation behind a domain specific language called ΦLOG to enable biologists (domain experts) to program solutions to phylogenetic inference problems at a very high level of abstraction. The implementation infrastructure (interpreter, compiler, debugger) for the DSL is automatically obtained through a software engineering framework based on Denotational Semantics and Logic Programming.
PheGe, the Platform for Exploring Genotype/Phenotype Relations on Cellular and Organism Level
One major challenge of bioinformatics is to extract biological information into a form that gives access to analyses and predictive models and that sheds new light on cellular and organism function. To approach automated network analysis at the organism level, the relational platform PheGe was generated. PheGe enables a) presentation of cell-specific regulatory and metabolic pathways, b) sorting and coordination of the various molecules, genes, and reactions to their particular signaling systems, c) visualization of signaling par distance, d) organization of downstream events on a multi-cellular level, e) recording and evaluation of pathologically relevant data, f) coordination of the aberrant genes and gene products into the various regulatory pathways balancing phenotypic patterns, g) modeling of cellular differentiation, and finally h) tracing of network components that balance differentiation programs.
A Multi-level Text Mining Method to Extract Biological Relationships
Mathew Palakal, Matthew Stephens, Snehasis Mukhopadhyay, Rajeev Raje,
Accurate and computationally efficient approaches to discovering relationships between biological objects from text documents are important for biologists developing biological models. The term object refers to any biological entity such as a protein, gene, or cell cycle, and relationship refers to any dynamic action one object has on another, e.g., a protein inhibiting another protein, or one object belonging to another, such as the cells composing an organ. This paper presents a novel approach to extract relationships between multiple biological objects that are present in a text document. The approach involves object identification, reference resolution, ontology and synonym discovery, and extraction of object-object relationships. Hidden Markov Models (HMMs), dictionaries, and N-gram models are used to set the framework for the complex task of extracting object-object relationships. Experiments were carried out using a corpus of one thousand Medline abstracts. Intermediate results were obtained for the object identification process, synonym discovery, and finally the relationship extraction. For the thousand abstracts, 53 relationships were extracted, of which 43 were correct, giving a specificity of 81%. These results are promising for multi-object identification and relationship finding from biological documents. The approach is both adaptable and scalable to new problems, as opposed to rule-based methods.
Constrained Multiple Sequence Alignment Tool Development and Its Application to RNase Family Alignment
Chuan Yi Tang, Chin Lung Lu*, Margaret Dah-Tsyr Chang, Yin-Te Tsai~,
Yuh-Ju Sun, Kun-Mao Chaoš, Jia-Ming Chang, Yu-Han Chiou, Chia-Mao Wu,
Hao-Teng Chang, Wei-I Chou
In this paper, we design an algorithm for computing a constrained multiple sequence alignment (CMSA for short) that guarantees the generated alignment satisfies user-specified constraints requiring that particular residues be aligned together. If the number of residues that need to be aligned together is a constant a, then the time complexity of our CMSA algorithm for aligning K sequences is O(aKn^4), where n is the maximum of the lengths of the sequences. In addition, we have built such a CMSA software system and performed several experiments on RNase sequences, which mainly function in RNA processing, such as RNA maturation and turnover. The resulting alignments illustrate the practicability of our method.
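The effect of alignment constraints can be conveyed by a pairwise sketch: forcing the alignment through anchor pairs decomposes the problem into independent segment alignments. The paper's algorithm handles K sequences and full column constraints; this two-sequence, score-only version is an illustrative simplification with assumed scoring parameters.

```python
def nw(a, b, match=1, mismatch=-1, gap=-1):
    """Plain Needleman-Wunsch global alignment score (rolling rows)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def constrained_score(a, b, anchors):
    """Score an alignment forced through anchor pairs (i, j): align the
    segments between consecutive anchors independently and sum."""
    cuts = [(0, 0)] + sorted(anchors) + [(len(a), len(b))]
    total = 0
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        total += nw(a[i0:i1], b[j0:j1])
    return total
```

Each constraint splits the DP matrix, which is why fixing a constrained residues keeps the overall cost polynomial rather than multiplying the search space.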
Parallelizing a DNA Simulation Code for the Cray MTA-2
Shahid H. Bokhari*, Matthew A Glaser, Harry F. Jordan,
Yves Lansac, Jon R. Sauerš, Bart Van Zeghbroeckš
Computational experiments will be an important part of bioinformatics research in the coming decades. Convenient access to high performance (i.e., parallel) computers will be essential for carrying out such experiments. The Cray MTA-2 (Multithreaded Architecture) is an unusual parallel supercomputer that promises ease of use and high performance. We describe our experience on the MTA-2 with a molecular dynamics code, SIMU-MD, that we are using to simulate the translocation of DNA through a nanopore in a silicon-based ultrafast sequencer. Our sequencer is constructed using standard VLSI technology and consists of a nanopore surrounded by four Field Effect Transistors (FETs). We propose to use the FETs to sense variations in charge as a DNA molecule translocates through the pore and thus differentiate between the four building block nucleotides of DNA. We plan to use SIMU-MD to assist in the design of the geometry of our sequencer. We were able to port SIMU-MD, a serial code written in C without any regard for parallelism, to the MTA with only a modest effort and with good performance. Our porting process needed neither a parallelism support platform nor attention to the intimate details of parallel programming and interprocessor communication, as would have been the case with more conventional supercomputers. We describe our code, the architecture of the MTA and key aspects of our porting process. We present preliminary measurements of performance on the MTA-2.
Automated Identification of Single Nucleotide Polymorphisms from Sequencing Data
Masazumi Takahashi, Fumihiko Matsuda, Nino Margetic, and Mark Lathrop
Centre National de Genotypage, 2, rue Gaston Cremieux-CP5721, 91057 Evry Cedex, France
The single nucleotide polymorphism (SNP) provides abundant information about genetic variation. High-frequency SNPs are being discovered by various methods on a large scale. However, publicly available SNP data are not always accurate, so they should be verified. If only a particular gene locus is of concern, locus-specific polymerase chain reaction amplification may be useful. The problem is that the secondary peak of a sequence trace has to be measured. We analyzed sequence trace data from conventional sequencing equipment and found an applicable rule to discern SNPs from noise. We developed software that integrates this rule to automatically identify SNPs. The software works accurately for high-quality sequences and also readily finds SNPs in low-quality sequences with high background noise. It can also count allele frequencies at the same time and display them in a bar graph. It is very useful for identifying de novo SNPs in a DNA fragment of interest.
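A toy version of a secondary-peak rule might look as follows; the threshold value, the input representation, and the function name are purely illustrative assumptions, not the rule derived in the paper.

```python
def call_position(primary, secondary, het_ratio=0.3):
    """Call one trace position given (base, height) for the primary and
    secondary peaks: if the secondary peak is tall enough relative to the
    primary, report a heterozygous SNP; otherwise report the primary base."""
    (base1, h1), (base2, h2) = primary, secondary
    if h1 > 0 and h2 / h1 >= het_ratio:
        return "het:" + "".join(sorted(base1 + base2))
    return base1
```

With this rule, a position whose secondary peak reaches half the primary height would be called a heterozygous A/G site, while a 10% secondary peak would be treated as background noise.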
Fast and Sensitive Algorithm for Aligning ESTs to Human Genome
Jun Ogasawara and Shinichi Morishita
There is a pressing need to align the growing set of expressed sequence tags (ESTs) to the newly sequenced human genome. The problem, however, is complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and millions of repetitive sequences in genomic sequences. Algorithms that use dynamic programming (DP) have been proposed to solve this problem, but in reality these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we developed software that fully utilizes a lookup table for the efficient detection of the start- and endpoints of an EST within a given DNA sequence, and subsequently the prompt identification of exons and introns. In addition, high sensitivity and accuracy must be achieved by calculating the locations of all splice sites correctly for more ESTs while retaining high computational efficiency. This goal is hard to accomplish in practice, owing to misread nucleotides in ESTs and repetitive sequences in the genome, but we present a couple of heuristics effective in settling this issue. Experimental results confirm that our technique improves the overall computation time by orders of magnitude compared with common tools such as sim4 and BLAT, while attaining high sensitivity and accuracy on datasets of clean and documented genes.
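The lookup-table idea can be sketched as k-mer indexing plus voting on diagonal offsets, which roughly locates the genomic region an EST comes from. The k value and the voting heuristic are illustrative assumptions; the actual software does much more work to place exact exon boundaries and splice sites.

```python
from collections import defaultdict, Counter

def locate_est(genome, est, k=11):
    """Estimate the genomic offset of an EST: index genome k-mers, then let
    every shared k-mer vote for the offset (genome_pos - est_pos). Exons of
    the same gene cluster around nearby offsets."""
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    votes = Counter()
    for j in range(len(est) - k + 1):
        for i in table.get(est[j:j + k], []):
            votes[i - j] += 1
    if not votes:
        return None
    offset, _ = votes.most_common(1)[0]
    return offset
```

Because each EST k-mer is resolved by a constant-time table probe, this anchoring step avoids running the full DP over the entire genome.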
Towards Automatic Clustering of Protein Sequences
Jiong Yang and Wei Wang
Analyzing protein sequence data has become increasingly important in recent years. Most previous work in this area has focused on building classification models. In this paper, we investigate the problem of automatically clustering unlabeled protein sequences. As a widely recognized technique in statistics and computer science, clustering has proven very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed directly on protein sequences is the lack of an effective similarity measure that can be computed efficiently. We therefore propose a novel model for protein sequence clustering that exploits significant statistical properties of the sequences. The concept of imprecise probabilities is introduced into the original probabilistic suffix tree to monitor the convergence of the empirical measurement and to guide the clustering process. We demonstrate that the proposed method can successfully discover meaningful families without having to learn models of different families from pre-labeled "training data".
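The statistical machinery can be hinted at with a fixed-order context model, a deliberate simplification of the probabilistic suffix tree: a real PST uses variable-length contexts, and the paper further adds imprecise probabilities on top. The class name and the add-one smoothing are assumptions for this sketch.

```python
import math
from collections import defaultdict

class ContextModel:
    """Fixed-order sketch of PST-style counting: estimates
    P(symbol | previous k symbols) with add-one smoothing."""

    def __init__(self, k, alphabet):
        self.k = k
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq):
        for i in range(self.k, len(seq)):
            self.counts[seq[i - self.k:i]][seq[i]] += 1

    def log_likelihood(self, seq):
        """Higher means seq matches the trained family's statistics better."""
        ll = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts[seq[i - self.k:i]]
            total = sum(ctx.values()) + len(self.alphabet)
            ll += math.log((ctx[seq[i]] + 1) / total)
        return ll
```

Clustering with such models amounts to assigning each unlabeled sequence to the cluster whose context model gives it the highest likelihood, then re-estimating the models.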
A Maximum Entropy Algorithm for Rhythmic Analysis of Genome-Wide Expression Patterns
Christopher James Langmead*, C. Robertson McClungy~, Bruce Randall Donald*,#,%
We introduce a maximum entropy-based analysis technique for extracting and characterizing rhythmic expression profiles from DNA microarray hybridization data. These patterns are clues to discovering genes implicated in cell-cycle, circadian, and other periodic biological processes. The algorithm, implemented in a program called enrage (Entropy-based Rhythmic Analysis of Gene Expression), treats the task of estimating an expression profile's periodicity and phase as a simultaneous bicriterion optimization problem. Specifically, a frequency domain spectrum is reconstructed from a time-series of gene expression data, subject to two constraints: (a) the likelihood of the spectrum and (b) the Shannon entropy of the reconstructed spectrum. Unlike Fourier-based spectral analysis, maximum entropy spectral reconstruction is well suited to signals of the type generated in DNA microarray experiments. Our algorithm is optimal, running in linear time in the number of expression profiles. Moreover, an implementation of our algorithm runs an order of magnitude faster than previous methods. Finally, we demonstrate that enrage is superior to other methods at identifying and characterizing periodic expression profiles on both synthetic and actual DNA microarray hybridization data.
DNA sequence compression using the Burrows-Wheeler Transform
Adjeroh D.A* , Zhang Y* , Mukherjee A# , Zhang N# , Powell M$ , and Bell T.C$
We propose an off-line, dictionary-oriented approach to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating sequences is an important phenomenon in biological sequences. These are often classified as random repeats, and hence ignored, since they should not lead to any significant compression. Here, we propose different methods to compress DNA sequences that exploit this preponderance of short repetition structures. Repetition analysis is performed based on the relationship between the BWT and important pattern matching data structures, such as the suffix tree and suffix array. We discuss how the proposed approach can be incorporated into the BWT compression pipeline.
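The transform itself is compact to state: sort all rotations of the text and take the last column, which groups together the characters that precede repeated substrings. This is the naive O(n² log n) construction for illustration; practical implementations build the BWT from a suffix array.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (naive version).
    The sentinel marks the end of the text and must not occur in it."""
    assert sentinel not in s
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)
```

For example, bwt("banana") yields "annb$aa": the repeated "an"/"na" structure of the input shows up as runs of identical characters in the output, which is exactly what downstream compressors exploit.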
Rapid Large-Scale Oligonucleotide Selection for Microarrays
Sven Rahmann
Computational Molecular Biology, MPI for Molecular Genetics, Ihnestrasse 63-73, D-14195 Berlin, Germany
We present the first algorithm that can select suitable custom oligonucleotide probes (e.g. 25-mers) for microarray experiments on a truly large scale. For example, oligos for human genes can be found within 50 hours. This becomes possible by using the longest common substring as a specificity measure for candidate oligos. We present an algorithm based on a suffix array with additional information that is efficient both in terms of memory usage and running time to rank all candidate oligos according to their specificity. We also introduce the concept of master sequences to describe the sequences from which oligos are to be selected. Constraints such as oligo length, melting temperature, and self-complementarity are incorporated in the master sequence at a preprocessing stage and thus kept separate from the main selection problem. As a result, custom oligos can now be designed for any sequenced genome, just as the technology for on-site chip synthesis is becoming increasingly mature.
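The specificity measure can be sketched as a longest-common-substring computation between a candidate oligo and one background sequence: the shorter the longest exact match to any non-target sequence, the more specific the oligo. This quadratic DP is for clarity only; the paper computes the measure for all candidates simultaneously via an enhanced suffix array.

```python
def lcs_substring(oligo, background):
    """Length of the longest common substring of the two strings.
    cur[j] = length of the common substring ending at oligo[i-1], background[j-1]."""
    n = len(background)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, len(oligo) + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if oligo[i - 1] == background[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

A selection loop would keep, for each gene, the candidate oligo whose maximum lcs_substring value over all other transcripts is smallest.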
Protein-based analysis of alternative splicing in the human genome
Ann E. Loraine, Gregg A. Helt, Melissa Cline, and Michael A. Siani-Rose
Understanding the functional significance of alternative splicing and other mechanisms that generate RNA transcript diversity is an important challenge facing modern-day molecular biology. Using homology-based, protein sequence analysis methods, it should be possible to determine how transcript diversity impacts protein structure and function. To test this, a data-mining technique was developed which identifies cases where protein isoforms arising from the same gene exhibit distinct profiles of conserved protein motifs. We found that out of a test set of over 1,300 alternatively spliced genes with solved genomic structure, over 30% exhibited a differential profile of conserved InterPro and/or Blocks protein motifs across distinct isoforms. These results suggest that motif databases such as Blocks and InterPro are potentially useful tools for investigating how alternative transcript structure affects gene function.
A Bi-Recursive Neural Network Architecture for the Prediction of Protein Coarse Contact Maps
Alessandro Vullo, and Paolo Frasconi
Prediction of contact maps may be seen as a strategic step towards the solution of fundamental open problems in structural genomics. In this paper we focus on coarse grained maps that describe the spatial neighborhood relation between secondary structure elements (helices, strands, and coils) of a protein. We introduce a new machine learning approach for scoring candidate contact maps. The method combines a specialized noncausal recursive connectionist architecture and a heuristic graph search algorithm. The network is trained using candidate graphs generated during search. We show how the process of selecting and generating training examples is important for tuning the precision of the predictor.
Prediction of protein function using protein-protein interaction data
Minghua Deng, Kui Zhang, Shipra Mehta, Ting Chen, Fengzhu Sun
Assigning functions to novel proteins is one of the most important problems in the post-genomic era. Several approaches have been applied to this problem, including analyzing gene expression patterns, phylogenetic profiles, protein fusions, and protein-protein interactions. We develop a novel approach that applies the theory of Markov random fields to infer a protein's functions using protein-protein interaction data and the functional annotations of its interaction partners. For each function of interest and each protein, we predict the probability that the protein has that function using Bayesian approaches. Unlike other available approaches to protein annotation, in which a protein either has or does not have a function of interest, we give the probability of the protein having the function. We apply our method to predict cellular functions for yeast proteins defined in the Yeast Proteome Database (YPD), using the protein-protein interaction data from the Munich Information Center for Protein Sequences (MIPS, http://mips.gsf.de). We show that our approach outperforms other available methods for function prediction based on protein interaction data.
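A single-node caricature of the idea: the probability that a protein has a function is obtained by shifting a prior log-odds according to the annotations of its interaction partners. The parameter values, and the reduction to one node, are illustrative assumptions; the paper uses a full Markov random field with Bayesian parameter estimation.

```python
import math

def function_probability(neighbors, prior=0.3, alpha=1.0, beta=0.5):
    """neighbors: list of booleans, one per annotated interaction partner,
    True if the partner has the function. Each positive partner adds +alpha
    to the log-odds, each negative partner subtracts beta."""
    n_with = sum(1 for has in neighbors if has)
    n_without = len(neighbors) - n_with
    logit = math.log(prior / (1 - prior)) + alpha * n_with - beta * n_without
    return 1 / (1 + math.exp(-logit))
```

A protein whose annotated partners all share the function is pushed well above the prior, while one whose partners all lack it is pushed below; the output is a probability rather than a hard yes/no label, matching the abstract's emphasis.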