Statistical methodology

Modelling high-dimensional correlation structures with mixtures of forests

Biological and evolutionary processes often produce low dimensional signals embedded in high-dimensional data observations. For example, in genetics after genotyping individuals at 100,000s of markers genome-wide, principle components analysis (PCA) can often reveal striking population sub-structure, clearly separating out clusters of related individuals within just the first few principle components. Alternatively, methods based on Hidden Markov Models (HMMs), are widely used in genetics to model linear dependent state switching processes along genomes, while latent factor models (LFMs) are popular for characterising unknown linear modes of dependence particularly in gene expression data. However, while LFMs and HMMs scale well to the high dimensions of genomic datasets, they both suffer from implicit unrealistic assumptions. For instance, LFMs assume that the data arises from the linear projection of independent latent scores while HMMs presume a known (artificial) direction and ordering across the genome.

As an alternative to these methods, we have been investigating mixtures of graphical models, restricted to spanning forests, as a model for sparse signals. Forest models have a number of advantages: no artificial directed ordering of the genome is presumed; sparsity is modelled through interpretable conditional independence structures; mixture representations (rather than linear projections) are permitted, and fast learning is possible. As such, these models have great potential to improve upon HMMs and LFMs used in applications in genetics and genomics.

JL Herman and C Holmes (2010), “Bayesian inference for low-dimensional signals in high-dimensional genetic and genomic data using mixtures of trees”, Technical Report, University of Oxford


Algorithms for parallel MCMC

We have been considering how to most efficiently reduce the run time of Markov Chain Monte Carlo simulations by using speculative computation over multiple processors. Results for the optimal acceptance probability size can be extended to the case of parallel MCMC using speculative computation to develop an algorithm which gives the maximum possible speedup for an arbitrary target distribution. Applying this to an example case of estimating mutation rates using pseudomarginal MCMC, a speedup of 2.70 is achieved using four cores.



H Jónsson, J Lees, T Madsen, JL Herman, J Hein and G Nicholls (2012), “Algorithms for parallelising Markov chain Monte Carlo", Technical Report, Department of Statistics, University of Oxford

Approximate inference in continuous-time Bayesian networks

Models of this type in a variety of different circumstances, including in many areas of statistical physics. The particular motivating example is that of biological sequence analysis, whereby the joint system consists of a DNA or amino acid sequence, and the constituent subsystems are the individual letters or sites in the sequence.

The evolution of individual characters in such sequences as a result of mutation events has been successfully modelled using the continuous-time Markov chain framework, facilitating a wide range of algorithms for sequence analysis, ranging from sequence comparison and alignment to phylogenetic inference. However, in most cases since the state space of the joint system typically increases exponentially with the number of subsystems, and naive methods often scale very poorly with system size, such that joint systems are often modelled instead as an ensemble of non-interacting subsystems. In the context of biological sequence analysis this involves assuming that each site in a sequence evolves independently. Although such independent-site models and their extensions have proved highly useful, they are often insufficient to capture many of the important details of the evolutionary process. 


Even if subsystems evolve in a conditionally independent fashion, over a finite time interval all parts of the system become inseparably couple with each other. However, for weakly coupled subsystems, a factorised approximation may still give rise to a low error.


In this work, I explore the use of such approximations for particular classes of processes whose neighbour dependencies can be cast into the form of a Gibbs random field, such that explicit error bounds can be formulated. The approximate likelihood is used inside an MCMC algorithm on spanning forests, allowing dependency structures to be sampled according to the approximate model.

X Boyen and D Koller (1999), "Exploiting the Architecture of Dynamic Systems", AAAI-99

U Nodelman, D Koller and CR Shelton (2005) "Expectation propagation for continuous time Bayesian networks",  Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

JL Herman (in preparation), “Factorisation approximations for inference in continuous-time Bayesian networks”