Maximum Likelihood in Phylogenetics

Maximum likelihood in Phylogenetics

Group 3

JAN JONES

SHANTHI IYANPERUMAL

SYUSANNA KOYFMAN

Project for

Probability and Statistics

Dr. M. Partensky

4/19/04

Table of Contents

Maximum likelihood in Phylogenetics

P&S Project Part 1:

Maximum Likelihood Overview

P&S Project Part II:

General Model of DNA Substitution

Calculation of likelihood of molecular sequences:

Example1:

Example2:

Example 3 :

DNA substitution Models

Jukes-Cantor (JC):

Felsenstein 1981(F81)

Kimura 2-parameter(K80)

Hasegawa-Kishino-Yano (HKY)

Tamura-Nei (TrN):

Kimura 3-parameter (K3P)

Transition Model (TIM)

Transversion Model (TVM)

Symmetrical Model (SYM)

General Time Reversible (GTR)

Gamma Distribution (G)

Proportion of Invariable Sites (I)

Amino acid Substitution Models

Empirical substitution models

PAM matrices

Dayhoff matrices

JTT matrices

Other empirical models

Blosum (Block substitution matrices)

Poisson models

P&S Project Part III:

Phylogenetic trees

Maximum parsimony

Maximum likelihood

LIKELIHOOD RATIO TESTS IN PHYLOGENETICS

Reference:

P&S Project Part 1:

Maximum Likelihood Overview

By: Janice Jones

04/19/04

This section of the project presents an overview of the Maximum Likelihood method and introduces its application to the construction of phylogenetic trees. The two other sections of this project, submitted by Shanthi Iyanperumal and Syusanna Koyfman, provide additional information about the method’s application, and explore the topics of models of DNA substitution, the testing of these models, likelihood ratio tests, and a comparison with the maximum parsimony method.

Phylogenics is an area of high interest in bioinformatics. Construction of a phylogenic tree provides insights into the origins of genes and their protein products. A tree can assist in formulation of questions to ask about the behavior of a gene or protein in various organisms. A number of statistical methodologies have been devised to construct a phylogenic tree from sequence data for a set of related genes/proteins in one or more organisms. One of these, the maximum likelihood method, is reviewed by Huelsenbeck and Crandall(1). What follows is an interpretation of the methodology as they describe it, supplemented by an additional reference and two examples in Mathematica.

The maximum likelihood method was said (1) to have been first described in 1922 by English statistician RA Fisher. It was apparently not widely used until after 1990, when increased computing power and optimizations of the method itself made its application more practical. It is said (3) to be one of the most popular sequence-based criteria for evaluating trees, along with parsimony and compatibility. It is inherently more computationally expensive than parsimony but various newer optimizations have been proposed to further address this problem.

General Approach

The job of the maximum likelihood method is to construct a probability model that best describes a set of known data. As model parameters are changed, the method seeks those parameter values that maximize the probability of observing the actual data. Felsenstein (2) notes that when the method is applied to an evolutionary tree, the likelihood is the probability of evolving the observed sequences given the hypothesis (a proposed tree). It was emphasized that the likelihood is not the probability that the tree itself is correct.

To explain how a likelihood calculation works, Huelsenbeck and Crandall (1) start with the now-familiar binomial distribution to describe a coin tossing experiment. The distribution formula specifies the probability of getting h successes, given n trials and a probability of success of p.

p(h|n,p) = C(h, n) * ph * (1-p)(n-h)

They then proceed to turn this calculation around, even though the right hand is the same. In the conventional binomial probability calculation, the p of success is constant and the number of successes in n trials is what we want. In a likelihood calculation, the number of success and trials are kept constant and the probability of a single success is what we are trying to find. The left hand side of the equation above becomes L(p|h,n): the likelihood that the value of p results in an outcome of h successes in n trials. The calculation is iterated for various reasonable values of p and the result, the likelihood L of p, is plotted against p. The value of p for which L reaches a maximum is presumed to be the “best” value of p.

Data from a hypothetical coin toss experiment is then used to illustrate the method. In this example, the goal is to determine the probability of heads, given an outcome of h heads in n trials. I have provided Example 1 in the accompanying Mathematica file, Proj_Group3JaniceJones.nb, to demonstrate this use of maximum likelihood. The authors mention that in this simple experiment, an alternate to the computational solution is to determine the value of p for which the slope of the likelihood L is zero. This works in the simple experiment but may be more difficult to apply when the probability model becomes more complex.

Second Simple Example

I have prepared a second example that has a slightly more complex probability model that is based on a permutations. Here, we choose 10 marbles from a jar that has a total of 17 marbles of three different colors. We are told the colors of the marbles we picked. The problem is to determine how many marbles of each color still remain in the jar. The goal of the probability function used in the solution was not to calculate the probability that a random marble would be of a particular color, but the probability that 10 marbles would be of the specified colors. The test parameters (colors of the 7 remaining marbles) that resulted in the highest probability for the 10 marble colors picked were considered to be the “most likely” values. See Example 2 in Proj_JaniceJones.nb for additional comments and code.

One interesting aspect of this problem is that the solution does not require the likelihood measure to be a probability value; a value that is directly proportional to the probability works just as well. When the probability function is simplified to return a permutation count rather than a probability, the “most likely colors” part of the result is the same, even though the maximum likelihood value itself has changed. Another interesting aspect is that since any two of the parameters (number of marbles of each color) can vary independently, a simple graphical solution is not an option.

Application to Phylogenetics

How is this approach used in the construction of a phylogenic tree? The data we are starting with, instead of a count of heads in n trials or marbles of different colors, is a set of “s” sequences aligned in “n” positions. For the next example, to keep the possibilities simple, say the sequences are of nucleic acids. Then a data point is the set of bases in a given aligned position. For the following tiny sample alignment,

there are two data points: {A,A,T} for position 1 and {G,C,G} for position 2. For each of these positions, there are r = 4s possible values, where s in the number of sequences being aligned. Since there are 3 sequences in this example, there are 64 possible values for each position: {A,A,A},{A,A,T},{A,A,G},{A,A,C},{A,T,A}...etc. Each of the 64 possible patterns can be assigned a id number (1 to 64) to distinguish it from the other patterns. Disregarding data point order for now, the probability of observing particular values for the collection of data points (positions) in the alignment can be described by the multinomial distribution, copied from (1):

The authors compare the 64 possible outcomes for each data point with the coin toss that had only two possible outcomes. Although maximum likelihood was successfully applied to the marble problem, its solution did not require that this same model be used. The marble problem can be said to have only one data point, but with many possible outcomes. Each outcome was one of the possible permutations of the choice of 10 marbles.

But back to the sequence. The probability question changes from “What is the probability of h heads in n tosses?” or “What is the probability of drawing 5R, 3G, and 2Y marbles?” The new question is “What is the probability of getting n1 occurrences of data point (1), n2 occurrence of data point (2),... n64 of data point (64), given 3 sequences of 2 bases each?” As was done with the coin toss and marble examples, the likelihood function is a probability function that estimates the probability for data that is already known.

A trivial probability function estimates the probability (of getting the sequence data points) based on 64 parameters. Each parameter is the probability of getting a particular base combination for the three sequences in a random position. The most likely value for each parameter turns out to match the proportion of the time that the sequence combination occurs over all the positions in the alignment. In the small 2-residue, 3-sequence example, the parameters pi are 0 for all but two of the 64 possibilities. For each of these, the maximum likelihood estimate for pi is ni/n or ½.

Need for More Complex Model

While this type of likelihood equation does call attention to sequences, the probability question that it addresses is not biologically interesting. It only differs from the coin toss in the number of possible outcomes for each experiment. To become interesting, the model that defines the probability of observing the given site patterns must become much more complex. One interesting model is a definition of a phylogenetic tree that contains nodes, branch lengths, and tips. Each tip represents a sequence and each node represents a convergence of the sequences on its adjacent branches. The resulting tree, in most circumstances, reflects the expected “tree of life” in its picture of how various species are related. (Humans would be seen as most closely related to chimpanzee, then other primates, then mouse/rat, etc. ) For this discussion, all the sequences are assumed to be nucleotides. The same principles could be applied to protein sequences but the model becomes more complex.

For the likelihood probability function, branch lengths of the tree are specified as a function of the expected number of changes per site and a model of sequence change(1). A given site pattern can be represented as the tips of branches that are connected directly or indirectly. At each branch node, any of the four nucleotides is possible. Assuming this model, there are 64 possible nodes that may contribute to a site pattern that represents one position in a hypothetical alignment of four sequences.

Such a node arrangement is illustrated in the following figure copied from (1). It represents a single position in which the aligned sequences have bases G, G, T, and T respectively. The nodes i, j, and k may each have any of the four nucleotides, allowing a total of 64 different trees that have the leaves shown.

Figure 1 – copied from (1)

Instead of using a simple probability value for the site pattern, as is implied by the multinomial probability model, the phylogenetic model calculates the site pattern probability as a sum of terms, one term for each possible configuration of nodes that could lead to the pattern. Here, a configuration of nodes refers to the assignment of nucleotides to the nodes. Each term is itself dependent on the “equilibrium frequency” of the “base” node, the presumed length of all the branches that lead (directly or indirectly) to the leaves, the probability of observing two given nucleotides at the ends of the branches, and other unspecified parameters. The “base” node (my term) is the one furthest from the “leaves”; in the above diagram, it is node k. Its equilibrium frequency has a different value for each of the four nucleotides that is assigned to the node. It is typically determined by the overall base composition of the sequences and is intended to estimate the probability that the nucleotide, going back in evolutionary time, was of that base.

The following equation, copied from (1), represents a probability value for the sequences observed in a single position, illustrated in the diagram above on this page. The values v, as in the diagram, represent branch lengths; the value (pi) represents the equilibrium frequency for node k; the values (theta) represent “additional parameters” of the nucleotide substitution model. It is presented here just to illustrate the complexity and potential computational expense of a tree likelihood calculation.

As extensive as this equation appears, it only represents one position in an alignment that contains just four sequences. As the number of sequences increases, the number of summations required by the above probability equation goes up exponentially. According to Felsenstein (2), the number of terms becomes 22n-2, where n is the number of sequences; if we are working with 10 sequences (a fairly modest number), we would have 218 terms for just one aligned position.

The likelihood of a tree for the entire sequence alignment is the product of these estimated probability values taken across all the sequence positions. Each sequence contributes equally and it is assumed that all positions are independent. This is to say that the above probability function for a single candidate tree involves a very large number of computations. If all the possible trees are considered, the computation becomes impossibly long.

For the maximum likelihood method to be a useable phylogenetic tool, some optimizations were needed. Several interesting ones were proposed by Felsenstein (2) in 1981.

Felsenstein’s first optimization reduced the number of summations in the probability function for the tree represented by a single sequence position. It is best explained by referring to the tree diagram (Figure 1). In the “full tree” calculation shown above, the left side of the tree is kept static while all possible combinations of the right side are calculated. Then one change is made to the left side, and all possible combinations on the right are calculated again. This continues for all possible combinations on the left side. Then the base at node k is changed and all the calculations on both sides are repeated.

In the optimized calculation, the likelihood (probability function) is first calculated separately for each of the “outermost” nodes. In Figure 1, these are nodes labeled i and j. For each of these nodes, four likelihood values are calculated, one for each possible value of its parent node k. For node i, a likelihood value would be based on its two leaves and the base of node k. Then the likelihood for the parent node is the summation, for each of its possible values (A, T, G, or C) of the product of its child nodes and the equilibrium constant for node k. If is the equilibrium value of the base at node k, andLik and Ljk represent likelihood values of nodes i and j for a given value of k, then the likelihood of the tree would calculated as the summation over k ofk * Lik * Ljk. Calculation of the L values at each of nodes i and j would require 16 summations and the final likelihood calculation would require 4 summations, for a total of 36 summations.

Felsenstein’s second optimization involves the choice of tree topologies to test. As stated in his paper, two million topologies are possible for a tree that represents just 10 sequences. The optimization was to first build the tree with only two sequences, and then to add just one sequence at a time without altering the arrangement of the existing nodes. A stated disadvantage of this approach was that the final topology could vary, depending on the order in which sequences were added.

A third optimization from the same paper involved the “Pulley Principle”. This was an observation that the probability of a tree did not change if the total length of two branches attached to the same node was kept equal. For example, in Figure 1, the probability of the tree will not change as long as the sum of distances v5 and v6 are equal. This permitted incremental changes to be made to a branch length until no further improvement appeared in the probability value.

Citations

1. Hulsenbeck J., Crandall, K. Phylogeny Estimation and Hypothesis Testing Using Maximum Likelihood. Annu. Rev. Ecol. Syst., 1997, 28:437-66.

2. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-76

3. Kim J, Warnow T. Tutorial on Phylogenetic Tree Estimation.

P&S Project Part II:

DNA Substitution Models

By: Shanthi Iyanperumal

Phylogeny

Phylogeny is the evolution of a genetically related group of organisms. It is the study of relationships between collection of "things" (genes, proteins, organs..) that are derived from a common ancestor.

General Model of DNA Substitution

Maximum likelihood evaluates the probability that the choosen evolutionary model will have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood.

The rate matrix for a general model of DNA substitution is given by

.r2pCr4pGr6pT

r1pA.r8pGr10pT

Q = q(i,j) =

r3pAr7pC.r12pT

r5pAr9pCr11pG.

The rows and columns are ordered A, C, G and T. The matrix gives the rate of change from nucleotide i(arranged along the rows) to nucleotide j(along the columns).

For example r2pC gives the rate of change from A to C.

Let P(v,s) be the transition probability matrix where pi,j(v,s) is the probability that nucleotide i changes into j over branch length v. The vector s contains the parameters of the substitution model(eg. pA, pC, pG, pT, r1,r2…).

For two-state case, to calculate the probability of observing a change over a branch of length v, the following matrix calculation is performed:

P (v,s) = eQv

Calculation of likelihood of molecular sequences:

For DNA sequence comparison the model has 2 parts, the base composition and the process. The composition is just the proportion of the four nucleotides A, C, G, T.

Example1:

Likelihood of a single sequence with two nucleotides AC

If the model is Jukes – Cantor model, which has a base composition of ¼ for each nucleotide then the likelihood will be 1/4 X 1/4 = 1/16. If the model has a composition of 40%a and 10%c the likelihood of the sequence will be 0.4 x 0.1=0.04