Information Theoretic Approaches to Biology

David Dowe


This tutorial covers applications of minimum encoding length inference in biology, particularly Minimum Message Length (MML) inference (Wallace and Boulton, 1968). MML provides a Bayesian framework for estimating the most probable theory to explain some observed data. A range of theoretical and empirical results attest to the success of MML in machine learning, statistics and "data mining".
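As a rough illustration of the two-part message idea behind MML, the sketch below scores each candidate theory by the length in bits of a message that first states the theory and then states the data given that theory, and prefers the shortest message. The candidate coin biases, their prior probabilities and the observed counts are hypothetical values chosen only for illustration; they do not come from the tutorial.

```python
import math


def two_part_length_bits(prior, likelihood):
    """Length in bits of a two-part message: the theory, then the data given the theory."""
    return -math.log2(prior) - math.log2(likelihood)


def bernoulli_likelihood(p, heads, tails):
    """Probability of one particular sequence with the given head and tail counts."""
    return (p ** heads) * ((1.0 - p) ** tails)


# Hypothetical candidate theories: coin bias -> prior probability (assumed values).
hypotheses = {0.5: 0.90, 0.8: 0.10}

# Hypothetical observed data: 16 heads and 4 tails.
heads, tails = 16, 4

lengths = {
    p: two_part_length_bits(prior, bernoulli_likelihood(p, heads, tails))
    for p, prior in hypotheses.items()
}

# The preferred theory is the one giving the shortest total message.
best = min(lengths, key=lengths.get)

for p, bits in sorted(lengths.items()):
    print(f"bias {p}: {bits:.2f} bits")
print("shortest message uses bias", best)
```

Minimising the total length is the same as maximising prior times likelihood, which is why the shortest message corresponds to the most probable theory among the candidates. With continuous parameters MML must also decide how precisely to state the estimate, which this discrete sketch does not attempt.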

We present the Minimum Message Length (MML) principle in its general form, and show its successes in estimating the parameter of the von Mises circular distribution. This distribution is highly suited to modelling protein dihedral angles.
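For readers unfamiliar with the von Mises distribution, the following minimal sketch evaluates its density and estimates the mean direction of a set of angles by their circular mean. It is not the MML estimator discussed in the tutorial: the circular mean is the standard maximum likelihood estimate of the mean direction, and estimation of the concentration parameter is not shown. The function names and the sample angles are hypothetical.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0, by its power series.

    Fifty terms are ample for moderate x; this is an illustrative choice.
    """
    total = 0.0
    term = 1.0  # the k = 0 term of the series
    for k in range(1, terms + 1):
        total += term
        term *= (x / 2.0) ** 2 / (k * k)
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of the von Mises distribution with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def circular_mean(angles):
    """Mean direction of angles in radians, computed from summed sines and cosines."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


# Hypothetical dihedral angles in degrees, clustered near -60 (illustrative only).
angles_deg = [-57.0, -63.0, -60.0, -65.0, -55.0]
angles = [math.radians(a) for a in angles_deg]

mu_hat = circular_mean(angles)
print("estimated mean direction (degrees):", round(math.degrees(mu_hat), 2))
print("density at the mean for kappa = 4:", von_mises_pdf(mu_hat, mu_hat, 4.0))
```

Working with sines and cosines rather than averaging raw angles matters for circular data: angles of 170 and -170 degrees have a mean direction of 180 degrees, not 0.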

We present work by Dowe et al. (PSB96) and subsequent work by Edgoose et al. (PSB98) applying the MML clustering program, Snob, to cluster protein conformation classes from protein databases. Both pieces of work uncover protein classes, but the more recent work takes better account of the serial correlation in the data. The more recent work also finds a class whose presence seems to suggest something about the order in which parts of a protein fold.

We also investigate a simple Boolean theory of secondary structure conformation (Extended, Helix or Other) as a function of the amino acids surrounding a site.

We also look at work by Powell et al. (1998) on using MML and related information-theoretic methods for finding significant strings in DNA, and compare this with earlier work by Milosavljevic and Jurka (1993) and Milosavljevic (1995).

Other work will also be presented.

Biographical Sketch

I work primarily with Lloyd Allison, Trevor Dix and Chris Wallace. Most of my work is in the theory and applications of the (information-theoretic) Minimum Message Length (MML) principle of statistical and inductive inference and machine learning (and "knowledge discovery" and "data mining"), dating back to Wallace and Boulton (1968).

I was Program Chair of the Information, Statistics and Induction in Science (ISIS) conference, held in Melbourne, Australia, on 20-23 August 1996 and attended by R. J. Solomonoff, C. S. Wallace, J. J. Rissanen and others.

Chris Wallace and I are the authors of the Snob program for unsupervised clustering and mixture modelling. Snob performs Minimum Message Length (MML) mixture modelling of Gaussian, discrete multi-state (Bernoulli or categorical), Poisson and von Mises circular distributions. The Snob software is available (subject to conditions) for private, academic use.

Dr. David L Dowe
Senior Lecturer
School of Computer Science and Software Engineering
Monash University, Clayton, Victoria 3168
e-mail :
Tel:+61 3 9905-5776 Fax:+61 3 9905-5146
