Sequence alignment and molecular evolution

Representing uncertainty in multiple sequence alignments

It has become increasingly clear in recent years that downstream analyses are often highly sensitive to the specific choice of alignment. There may be many plausible but suboptimal alignments within the vicinity of the optimum, containing additional---often complementary---information regarding the evolutionary relationships between the sequences; selecting a single point estimate results in the loss of this additional information, and fails to account for the statistical uncertainty associated with different regions of the alignment.

We developed a graph-based framework whereby a set of alignments sampled according to a particular model can be represented in an efficient form that allows many types of analysis to be easily carried out on the whole ensemble rather than just a single alignment. If a single representative of the ensemble is required, this framework also allows for the efficient computation of the single alignment that maximises the expected value of a variety of different accuracy scores. 

JL Herman, Á Novák, R Lyngsø, A Szabó, I Miklós and J Hein (2015). Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics, 16(1), 108.

Incorporating alignment uncertainty into sequence annotation

Using the DAG representation of ensembles of alignments allows for various standard algorithms to be easily adapted to operate on many alignments rather than just one. This provides a way to account for alignment uncertainty into downstream analysis without requiring simultaneous alignment and sampling of annotations, which can be complicated and computationally very expensive. Incorporating uncertainty in this way can help to improve prediction accuracy when predicting the locations of conserved motifs in non-coding regions.

JL Herman, Á Novák and J Hein (in preparation), “Incorporating alignment uncertainty into sequence annotation”


Incorporating alignment uncertainty into prediction of RNA secondary structure

A similar approach can be used to adapt the familiar inside-outside algorithm for RNA structure prediction to operate on ensembles of alignments via the graph representation of the posterior, allowing alignment uncertainty to be incorporated into the analysis.

JL Herman, A Novak, JWJ Anderson, MS Astefanoaei, D-E Gratie, C Rich, R Lyngsø and J Hein (submitted), “RNA structure prediction in the presence of alignment uncertainty”


Detecting variable CpG effects using neighbour-dependent evolutionary models

For computational reasons, when modelling molecular evolution it is common to assume that sites in a biological sequence evolve independently of each other. However, in many cases this approximation is highly inaccurate. One such case is when modelling promoter regions containing CpG islands: in order for the CpG island to be preserved, molecular evolution must necessarily occur in a context-dependent fashion.

Although inference in context-dependent evolutionary models can be challenging, it is often possible to use computational techniques to sum over unknown evolutionary histories, allowing for numerical computation of marginal likelihoods and posterior probabilities (Hobolth, 2008). We have been developing a data-augmentation based approach for mapping mutations under a context-dependent evolutionary model in order to permit Gibbs sampling of the model parameters, and are using this in combination with a hidden Markov model framework to search for variable levels of CpG-based neighbour dependency in non-coding sequences. 

Hobolth, A. (2008), "A Markov chain Monte Carlo expectation maximization algorithm for statistical analysis of DNA sequence evolution with neighbour-dependent substitution rates", Journal of Computational and Graphical Statistics,17, 138-164.
JL Herman, PC Tataru, J Chan, J Xiong and J Hein (in preparation), “Bayesian inference of site-heterogeneous neighbor-dependent substitution rates using data augmentation”


Evolution of metabolic networks

With ever-increasing computational power models of biological networks, such as protein interaction networks and metabolic networks, have been extensively developed and studied in recent years. In this project, we focus on the evolution of metabolic networks, comprising the sets of highly regulated and correlated reactions that take place between metabolites within a cell. We describe this evolution as a continuous time discrete space Markov process, with rate dependencies between two reactions in a network if they share reactants or products.

The aim of this work is to determine whether there exists a canonical way of classing the reactions within a metabolic network in terms of the rates at which reactions are added and removed. Using a hidden Markov model within a Potts-type interaction framework, the evolutionary parameters of the reactions can be sampled using Metropolis-Hastings MCMC. The method was used to estimate evolutionary rates on an Archaeal phylogeny, using three sub-networks of the whole metabolome on the KEGG database.

JL Herman, S Bouremoum, KL Keys, H Miao and J Hein (2011), “Metabolic random fields: stochastic models for the loss and gain of reactions in metabolic networks related by a phylogeny”, Technical Report, Department of Statistics, University of Oxford