Phylogeny Estimation: Hypothesis Testing

Maximum Likelihood: Phylogeny Estimation

For: Prof. Partensky

By: Neelima Lingareddy

August.3.2003

Introduction to Maximum likelihood……………………………………………..3

Introduction to phylogenetics……………………………………………………..5

Phylogenetic trees…………………………………………………………………6

Tree building methods…………………………………………………………….8

Maximum Likelihood procedure for estimating phylogenetic trees………………8

References………………………………………………………………………..15

Introduction to Maximum likelihood

The method of maximum likelihood was first proposed by the English statistician and population geneticist R.A.Fisher in 1922. The method of maximum likelihood depends

on the complete specification of the data and a probability model to describe the data.

The probability of observing the data under the assumed model will change depending on

the parameter values of the model. The aim of maximum likelihood is to choose the value of the parameter that maximizes the probability of finding the data.

A Coin Tossing Experiment

We consider the simple procedure of tossing a coin with the goal of estimating the probability of heads for the coin. The probability of heads for a fair coin is 0.5. However, for this example we will assume that the probability of heads is unknown (maybe the coin is strange in some way or that we are testing whether or not the coin is fair). The act of tossing the coin n times forms an experiment, a procedure that, in theory, can be repeated an infinite number of times and has a well - defined set of possible outcomes.

The three main components of the maximum likelihood approach are (i) the data, (ii) a model describing the probability of observing the data, and (iii) a criterion that allows us to move from the data and model to an estimate of the parameters of the model.

Data: Assume that we have actually performed the coin flip experiment, tossing a coin n = 10 times. We observe that the sequence of heads and tails was {H, H, H, T, H, T, T, H, T, H.} In tossing the coin, we note that heads appeared 6 times and tails appeared 4 times.

Model: An appropriate model that describes the probability of observing h heads out of n tosses of a coin is the binomial distribution. The Binomial distribution has the following form:

P[h|p,n] = Cn,h ph(1-p)n-h

where p is the probability of heads, the binomial coefficient Cn,h gives the number of ways to order h successes out of n trials.

Criterion: Now that we have specified the data and the model we need a criterion to estimate the parameters of the model. Here, the parameter to be estimated is p. The likelihood function is simply the joint probability of observing the data under the model assuming independence of the individual and discrete outcomes. The likelihood function for the coin tossing experiment becomes

L[p|h,n] = Cn,h ph(1-p)n-h

Taking the natural log of the function does not change the value of p that maximizes the likelihood. The log-likelihood can be written as

logL[p|h,n] = log(n!) – log(h!) – log((n-h)!) + hlog p +(n-h)log(1-p)

The reason this is often done is that it makes calculations much easier. In fact because the factorials in the equation never change (for different values of p) they can be dropped (and usually are!) when evaluating the equation.

The following graphs show plots of the likelihood (L) as a function of probability p for four different possible outcomes when tossing a coin ten times. The likelihood appears to be maximized when p is the proportion of the time that heads appear in the experiment.

The maximum likelihood estimate of p can be found analytically by taking the derivative of the likelihood function with respect to p and finding where the slope is zero. The estimate of p is h/n.

The maximum likelihood estimate of p for different possible outcomes is given below.

Outcomes / Estimate of p / Maximum likelihood
3 Heads, 7 Tails / 0.3 / 0.266828
5 Heads, 5 Tails / 0.5 / 0.246491
8 Heads, 2 Tails / 0.8 / 0.30199
9 Heads, 1 Tail / 0.9 / 0.38742

The example discussed here is a fairly simple example. But not all problems are this simple. In this paper we will review a strong statistical method – the maximum likelihood method applied in phylogenetics.

Introduction to Phylogenetics

Phylogenetics is that field of biology which deals with identifying and understanding the relationships between the many different kinds of life on earth. This includes methods for collecting and analyzing data as well as interpretations of those results as new biological information. It is a part of a larger field called systematics, which also includes the fields of taxonomy.

The field of phylogenetics is rather recent and has received a huge push forward because of the development of stronger and faster computers. Phylogenetics addresses biological fields such as ecology, developmental biology, population genetics and epidemiology.

Phylogenetic analysis of DNA or protein sequences has become an important tool for studying the evolutionary history of organisms from bacteria to humans. Since the rate of sequence evolution varies extensively with gene or DNA segment, the evolutionary relationships of all levels of classification of organisms(e.g., Kingdoms, phyla, families, species) can be studied by using different genes or DNA segments. Experience tells that closely related organisms have similar sequences, more distantly related organisms have more dissimilar sequences.

A realistic and major obstacle that the field of phylogenetics is struggling with is reaching an accepted answer to the process of evolution. The evolutionary biologist is often uncertain which method of analysis should be used to explain the data. The outcomes may be different when the same data is examined by different phylogenetic methods.

The purpose of phylogenetics is to reconstruct the evolutionary relationship between species and to estimate the time of divergence between two organisms since they shared a last common ancestor. The evolutionary relationships can be represented using phylogenetic trees.

Phylogenetic trees

Phylogenetic relationships of genes or organisms are usually represented in a tree-like form with either with a root, a rooted tree or without any root, an unrooted tree.

Figure: The tree terminology

For a rooted tree, the root is the common ancestor. For each species there is unique path that leads from the root to the species. The direction of each path corresponds to the evolutionary time. An unrooted tree specifies the relationships among species and does not define the evolutionary path. Most of the phylogenetics methods construct unrooted trees.

The branching pattern of a tree whether rooted or unrooted is called a topology. There are many possible rooted and unrooted trees for a sizable number of taxa. The number of taxa (m) is four in the figure below.

A C

B D A B C D

A. Unrooted tree B. Rooted tree

The number of possible topologies for a rooted tree of m taxa for m>= 2 is given by

1*3*5……….(2m-3) = [(2m-3)!] / [2m-2(m-2)!]

The number of possible topologies for an rooted tree of m taxa for m>= 2 is given by 1*3*5……….(2(m-1)-3) = [(2m-5)!] / [2m-3(m-3)!]

The number of possible topologies for a bifurcating rooted and bifurcating unrooted tree for a given number of taxa (m) is shown in the table below.

m / Rooted tree / Unrooted tree
2 / 1 / 1
3 / 3 / 1
4 / 15 / 3
5 / 105 / 15
6 / 945 / 105
7 / 10395 / 945
8 / 135135 / 10395
9 / 2027025 / 135135
10 / 34459425 / 2027025

Tree-Building methods

There are many statistical methods that can be used for reconstructing phylogenetic trees from molecular data. Commonly used methods are classified into three major groups:

(1)distance methods, (2) parsimony methods, and (3) likelihood methods.

In distance methods or distance matrix methods, evolutionary distances are computed for all pairs of taxa, and a phylogenetic tree is constructed by considering the relationships among these distance values.

In Parsimony methods, all possible trees are evaluated and are given a score based on the number of evolutionary changes needed to produce the observed sequence changes. The most parsimonious tree is the one with the fewest evolutionary changes for all sequences to derive from a common ancestor. This is a more time consuming method than the distance methods.

Maximum likelihood (ML) method was first used by Cavalli-Sforza and Edwards(1967)for gene frequency data, but they encountered a number of problems in implementing the method. Later Felsenstein(1981) developed an algorithm for constructing a tree by the Maximum likelihood method for nucleotide sequence data. Kishino et al. (1990) extended this method to protein sequence data using Dayhoff et al.’s (1978) transition matrix. In ML methods, the likelihood of observing a given set of sequence data for a specific substitution model is maximized for each topology, and the topology that gives the maximum likelihood is chosen as the final tree. The parameters to be considered are not the topologies but the branch lengths for each topology and the likelihood is maximized to estimate branch lengths.

Maximum likelihood procedure for estimating phylogenetic trees

The Maximum Likelihood procedure requires three elements, the tree, the model and the observed data in phylogenetic tree estimation. The data is the alignment of sequences, the tree is the splitting sequence and the branch lengths and the model is the mechanism by which we think things work.

There are two main challenges in estimating phylogenetic trees: (1) for a given topology which branch lengths make the data most likely, (2) which of all the possible topologies is most likely.

We will outline the basic strategy for obtain maximum likelihood estimates of phylogeny, given a set of aligned nucleotide or amino acid sequences using simple examples. We will then find out how the basic strategy can be extended to realistic data.

Calculation of likelihood of molecular sequences:

For DNA sequence comparison the model has 2 parts, the base composition and the process. The composition is just the proportion of the four nucleotides A, C, G, T. Example1: Likelihood of a single sequence with two nucleotides AC

If the model is Jukes – Cantor model, which has a base composition of ¼ for each nucleotide then the likelihood will be 1/4 X 1/4 = 1/16. If the model has a composition of 40%a and 10%c the likelihood of the sequence will be 0.4 x 0.1=0.04

If we take the 16 possible nucleotide combinations and calculate the sum of all of them the sum of those likelihoods is 1. For any model, the sum of the likelihoods of all the different data possibilities should be 1.

Example2: Likelihood of a one branch tree between two aligned sequences

The other part of the model, the process part is needed if we have more than one sequence related by a tree. The process might be described by sentences, or by a matrix of numbers, describing how the nucleotides change from one to another. Let the composition part of the model be denoted by  = [0.1, 0.4, 0.2, 0.3]. The order of the bases is A, C, G, and T. There are 16 possible changes from one nucleotide to the other. The changes can be represented as a 4 X 4 matrix.

A C G T

P = 0.976 0.01 0.007 0.007 A

0.002 0.983 0.005 0.01 C

0.003 0.01 0.979 0.007 G

0.002 0.013 0.005 0.979 T

Likelihood of going from sequence1 to sequence 2 is:

= c Pc-c c Pc-c a Pa-g t Pt-t

= 0.4 * 0.983 * 0.4 * 0.983 * 0.1* 0.007 * 0.3 * 0..979

= 0.0000300

In the above example we did not consider branch lengths. Intuitively for short branch lengths the probability of a base change is low and for long branch lengths it is high.

Let’s assume the matrix we have chosen describes a branch with a Certain Evolutionary Distance (CED). The likelihood we calculated was for 1 CED.

The likelihood for the same alignment for 2 CED units is found by multiplying matrix P by itself.

P2 = 0.953 0.02 0.013 0.015 A

0.005 0.966 0.01 0.02 C

0.007 0.02 0.959 0.015 G

0.005 0.026 0.01 0.959 T

Likelihood for 2 CED units is:

= c Pc-c c Pc-c a Pa-g t Pt-t

= 0.4 * 0.966 * 0.4*0.966 * 0.1* 0.013 * 0.3 * 0..959

=0.0000559

As the branch length increases the values on the diagonal decrease and the other values increase because change becomes more likely than being the same.

The table lists the likelihoods for increasing branch lengths.

Branch length
(CED units) / Likelihood
1 / 0.0000300
2 / 0.0000559
3 / 0.0000782
10 / 0.000162
15 / 0.000177
20 / 0.000175
30 / 0.000152

The likelihood rises to a maximum somewhere between 15 and 20 ced units.

Example3: Likelihood of a tree with four taxa

We consider a simple tree of four taxa and assume that DNA sequences are n nucleotides long and are aligned with no insertions /deletions. We observe the nucleotides for sequences 1,2,3, and 4 at a given site (Kth site) and denote them by x1 , x2, x3, x4 respectively. We do not know the nucleotides at nodes 0,5,6 but assume they are x0, x5, x6.

(A) 1 2 3 4 (B) 3

v1 v2 v4 1 v1 v3

5 v3 v2 5 v5 6

v5 6 2 v4 4

Figure: Rooted and unrooted phylogenetic trees for four taxa. vi = riti, where r is the rate ofnucleotide substitution and ti is the evolutionary time for branch i. In tree B v5 represents the sum of v5 and v6 in tree A.

Consider a nucleotide site and let Pij(t) be the probability that nucleotide i at time 0 becomes nucleotide j at time t at a given site. Here i and j refer to any A, G, C, T. In the ML method, the rate of substitution (r) is allowed to vary from branch to branch so that it is convenient to measure evolutionary time in terms of expected number of substitutions (v=rt). The expected number of substitutions for the I-th branch is vi=riti . In the maximum likelihood method the branch length vi ‘s are regarded as the parameters.

Each site has a likelihood. This differs depending on the model and tree. The likelihood function for a nucleotide (k-th site) for a rooted tree is given by

L = gx0Px0x5(v5)Px5x1(v1)Px5x2(v2)Px0x6(v6)Px6x3(v3)Px6x4(v4)

where gx0 is the prior probability that node 0 has nucleotide x0.

A specific substitution model needs to be used to compute the values. If we use a reversible model there is no need to consider the root. A reversible model means that the process of nucleotide substitutions between time 0 and time t remains the same whether we consider the evolutionary process backward or forward.

The likelihood function for the unrooted tree

L =gx5Px5x1(v1)Px5x2(v2)Px5x6(v5)Px6x3(v3)Px6x4(v4)

Since in practice, we do not know x5 and x6 the likelihood is the sum of the above quantity over all possible nucleotides at nodes 5 and 6. Since nodes 5 and 6 can take 4 nucleotides each, there are 4 * 4 = 16 possible combinations.

Lk = gx5Px5x1(v1)Px5x2(v2)Px5x6(v5)Px6x3(v3)Px6x4(v4) (1a)

x5 x6

=  gx5Px5x1(v1)Px5x2(v2) Px5x6(v5)Px6x3(v3)Px6x4(v4) (1b)

x5 x6

Felsenstein pointed out that it is possible to reduce the computational time considerably if Equation (1a) is written as Equation (1b). Felsenstein described this efficient method by taking advantage of the tree topology when summing over all possible assignments of nucleotides to interior nodes.

We considered only one nucleotide site in the above example. We must consider all nucleotide sites in practice. The likelihood (L) for the entire sequence is the product of Lk’s for all sites m, the likelihood becomes

L =  L k

k=1

The log likelihood of the entire tree becomes

lnL = lnLk

k=1

We can maximize lnL by changing parameters vi’s. The computation can be done using Newton’s method or some other numerical computation. We find the maximum likelihood value for this topology and record it. The ML values are computed for the two remaining topologies that are possible for 4 sequences. The ML tree is the topology that has the highest ML value. The branch lengths for the topology are of course given by the ML estimates of Vi ,s obtained for this topology.

In the above formulation a simple model of nucleotide substitution was used. In general the likelihood function L for a given topology maybe written as

L = f(x;)

where x is a set of observed nucleotide sequences and  is a set of parameters such as branch lengths, nucleotide frequencies, and substitution parameters in the mathematical model used.

In principle, to find the maximum likelihood tree, all possible topologies need to be considered. For each tree one finds the combination of branch lengths and other parameters that maximizes the likelihood of the tree. The maximum likelihood estimate of phylogeny is the tree with the greatest likelihood. Efficient algorithms are used and these algorithms concentrate on only those trees that have a good chance of maximizing the likelihood function.

The basic algorithm for protein ML methods is the same as that for DNA ML methods, but we need a 20 x 20 matrix of transition probabilities, Pij(v), because there are 20 different amino acids. Kishino et al used Dayhoff et al.’s transition matrix to compute the likelihood values. An element of this matrix is the probability of change of amino acid i to amino acid j for a given evolutionary time or branch length v. The likelihood function is the same as above and we estimate vi’s by maximizing the likelihood and evaluate the ML value for each topology. The same computation can be done using Jones et al.’s substitution matrix.

It is very time consuming to construct the ML tree because we have to consider all possible nucleotides at each interior node. The number of nucleotide combinations to be examined for a tree of m taxa of DNA sequences is given by 4 m-2 since there are m-2 interior nodes. If m= 10 we need to consider 65,356 different combinations of nucleotides and 2,027,025 topologies. Since the number of topologies and the number of nucleotide combinations increases with m, the amount of computation is substantial even if efficient algorithms are used.

The maximization of lnL is done numerically and thus the actual ML value depends on the numerical method used. Therefore different computer programs may give different ML values. When a large number of sequences are used, the differences in ML value between different topologies can be small and so the accuracy of the method for computing ML values becomes very important. The existence of multiple peaks becomes a problem when a large number of sequences are analyzed.

The DNA likelihood methods work well when the DNA sequences are closely related to one another. If they are distantly related, many complications arise because the rate of synonymous substitution is generally much higher than that of non synonymous. Substitutions that result in amino acid replacements are said to be nonsynonymous while substitutions that do not cause an amino acid replacement (such as a GGG to GGC change - both codons still encode glycine) are said to be synonymous substitutions.

References

1. Ewens, W.J.&Grant G.R.2002.Statistical Methods in Bioinformatics, An

introduction. New York: Springer.

2. Huelsenbeck, J.P &Crandall, K.A.1997. Phylogeny estimation and hypothesis testing

using maximum likelihood. Annu.Rev.Ecol.Syst.28: 437-445

3. Nei, Masatoshi & Kumar, Sudhir 2000. Molecular Evolution and Phylogenetics.

Oxford University Press

4. Foster, Peter G. 2001. The Idiot’s Guide to the Zen of Likelihood in a Nutshell in

Seven Days for Dummies

5. Zhu Jimin, Sharma Rama, Sravanthi Polsani, Xin Gong, Shlomit klopman. Phylogeny

Estimation and Hypothesis Testing using Maximum Likelihood