Introduction to Likelihood

Maximum likelihood and Bayesian analysis have become important alternatives to parsimony. These methods have the advantage that they offer a clear statistical framework, making it possible to evaluate alternative phylogenetic hypotheses more precisely. They share with parsimony the challenges of navigating tree space and assessing the confidence of particular clades on the tree.

The maximum likelihood criterion

Maximum likelihood was first proposed as a more robust and consistent alternative to parsimony for phylogenetic inference (Felsenstein 1973, 1978). This approach aims to find the tree that is most likely to have given rise to the observed data under some specific model of evolution. Before delving into the application of likelihood to biological data, we will begin by exploring what likelihood means with a coin example.

Suppose you have a sack of coins and you know that half of the coins are fair (50% chance of a head) and half of the coins are biased (75% chance of a head). You draw one coin from the sack and wish to consider two alternative hypotheses: the coin is fair vs. the coin is biased. The coin is tossed 10 times and each time falls heads-up. You may now apply likelihood to ask whether the observed data (10 heads) supports one of the hypotheses and, furthermore, whether the data are decisive enough to make it reasonable to reject the alternative hypothesis.

The first stage is to select a model of how coin tossing works. Let us assume that there are only two sides to the coin, that each toss is independent of previous tosses, and that you are 100% accurate in distinguishing heads and tails. Under this model we can calculate the likelihood, the probability of the data under the hypothesis that the coin was fair. With 10 heads, the likelihood for a fair coin is $0.5^{10}$ or ~0.00098.

This result does not mean that there is a 0.1% chance that the coin is fair. It just means that this specific outcome is unlikely for a fair coin. The question is, how likely are the observed data under the alternative hypothesis? The likelihood of the data under the biased-coin hypothesis is $0.75^{10}$ or ~0.056. This is still a low number, which tells us that even under the alternative hypothesis this particular outcome is unlikely. But what counts is not the absolute value of the likelihood but a comparison of the likelihoods of the two hypotheses. The data are about 58 times as likely under the biased-coin hypothesis as under the fair-coin hypothesis ($1.5^{10} \approx 57.7$). This likelihood ratio (usually presented as a natural logarithm, in this case 4.05) is a measure of the evidential support for one hypothesis over the other. Because 4.05 is well above 2.0, a commonly used threshold of significance for such a situation, we would say that the data strongly support the conclusion that the coin is biased.
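The arithmetic of the coin example can be checked with a few lines of Python. This is a minimal sketch under the model stated above (independent tosses, a fixed chance of heads); the function name `likelihood_all_heads` is ours and is used only for illustration.

```python
import math


def likelihood_all_heads(p_head: float, n_tosses: int) -> float:
    """Probability of n_tosses heads in a row when tosses are independent."""
    return p_head ** n_tosses


lik_fair = likelihood_all_heads(0.5, 10)     # hypothesis: the coin is fair
lik_biased = likelihood_all_heads(0.75, 10)  # hypothesis: the coin is biased

ratio = lik_biased / lik_fair   # how much better the biased hypothesis fits
log_ratio = math.log(ratio)     # natural log of the likelihood ratio

print(f"L(fair)   = {lik_fair:.5f}")
print(f"L(biased) = {lik_biased:.5f}")
print(f"ratio     = {ratio:.1f}")
print(f"ln(ratio) = {log_ratio:.2f}")
```

Note that neither likelihood is a probability that the hypothesis is true; only their ratio is used to compare the two hypotheses.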

Now we can consider how likelihood is applied to phylogenetic inference. In this case, the observed data are the characters for each taxon (the character state matrix) and the hypotheses are all the possible tree topologies. Our aim is to determine the likelihood of the data arising under each possible tree, on the principle that the maximum likelihood tree is the best estimate of the true tree. In addition, the likelihood score of a suboptimal tree topology can help us decide whether the data significantly reject that tree. So how can we calculate the likelihood of a set of data given a particular tree?

The key to calculating the likelihood of data given a hypothesis is to have a mathematical model of how data would arise if that hypothesis were true. In the case of coin tossing, the model was simple: each toss was assumed to be independent with either a 0.5 or a 0.75 chance of being heads. It is normal to assume independence of phylogenetic characters (such as positions in the DNA alignment), meaning that changes of state in one character do not alter the probability of changes at another character. Nonetheless, even when we assume characters are independent, the evolution of characters along a tree is much more complicated than the coin-tossing scenario.

The first issue is that one needs to be able to list all the potential states that a character can have. In the coin case there were two, heads and tails. For DNA sequence data there are four, A, C, G, and T (we generally view a gap as being a position whose base is unknown rather than as a fifth state). But for morphology it can be difficult to decide a priori how many different states a trait could manifest. This is one of the reasons why likelihood methods have proved difficult to adapt to morphological data. For our discussion here, we will focus on the application of maximum likelihood to DNA sequences.

Having specified the universe of possible character states, we need to make assumptions about the relative rates at which evolution jumps between these states. The simplest set of assumptions is that all possible changes tend to occur at the same rate. However, we know enough about the molecular mechanisms of DNA sequence evolution to think that more sophisticated models are often justified. For instance, abundant molecular data support the idea that we should recognize two kinds of mutations: transitions, in which a purine (A or G) changes into another purine or a pyrimidine (C or T) changes into another pyrimidine, and transversions, in which a purine changes into a pyrimidine or a pyrimidine changes into a purine. Transitions tend to occur more often than transversions, and thus many likelihood models allow these two processes to have different rates. Also, we can observe that DNA sequences rarely have equal (25%) representation of each kind of base. As the rate of going to a state is influenced by the frequency of this state in the entire matrix, we often need to incorporate base frequencies into the model.
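To make these assumptions concrete, the sketch below builds a table of relative rates in which the rate of change to a base is proportional to that base's frequency and transitions are multiplied by an extra factor. This is one common parameterization (often called HKY85), offered as an illustration rather than as the model used in this chapter. The names `rate_matrix` and `kappa`, and the frequencies shown, are hypothetical.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def rate_matrix(kappa: float, freqs: dict) -> dict:
    """Relative rates of change from base x (row) to base y (column).

    Each off-diagonal rate is the frequency of the target base, multiplied
    by kappa when the change is a transition. Each diagonal entry is set
    so that its row sums to zero.
    """
    q = {}
    for x in BASES:
        row = {}
        for y in BASES:
            if x != y:
                factor = kappa if is_transition(x, y) else 1.0
                row[y] = freqs[y] * factor
        row[x] = -sum(row.values())
        q[x] = row
    return q


# Hypothetical values, chosen only for illustration.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q = rate_matrix(kappa=2.0, freqs=freqs)
for x in BASES:
    print(x, [round(q[x][y], 3) for y in BASES])
```

Setting `kappa` to 1 and all four frequencies to 0.25 recovers the simplest model, in which every change occurs at the same rate.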

Another factor we might consider is the rate at which different characters change. Thinking back to the coin example, we imagined tossing the same coin ten times, so whatever rate of turning up heads (0.5 or 0.75) applied to the first data point applied to all of them. The simplest models of evolution make the same assumption: while different characters in the matrix might have experienced a different number of changes in their history, all characters had the same underlying rate of change. However, we know rates of change vary. For example, regions of DNA that are not transcribed will tend to evolve faster than regions that code for proteins because selection filters out many mutations in the latter. It can become tempting to continue elaborating the model to include more and more possible factors affecting sequence evolution. However, we typically aim to use the least complicated model necessary to capture the basic factors shaping the variation in our dataset.

Having fully specified the model by which characters evolve, it is possible to calculate the likelihood of the data given one particular tree topology. In broad outline, likelihood calculations can be divided into four steps.

First, we need to propose a particular length for each branch in the tree. Branch length is scaled relative to the rate of substitution so that it becomes possible to calculate the probability of starting at one state (A, C, G, or T) and ending at the same or another state.
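As an illustration of how a branch length translates into probabilities, the sketch below uses the simplest model, in which all changes occur at the same rate (commonly called Jukes-Cantor or JC69). The function name `jc69_prob` is ours; the branch length is assumed to be in expected substitutions per site.

```python
import math


def jc69_prob(start: str, end: str, branch_length: float) -> float:
    """Probability that a branch beginning in `start` ends in `end`.

    Assumes equal rates for all changes and equal base frequencies;
    branch_length is the expected number of substitutions per site.
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    if start == end:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


for d in (0.0, 0.1, 1.0, 10.0):
    same = jc69_prob("A", "A", d)
    diff = jc69_prob("A", "C", d)
    print(f"d={d:5.1f}  P(A->A)={same:.4f}  P(A->C)={diff:.4f}")
```

On a branch of length zero the state cannot change, while on a very long branch every ending state approaches a probability of 0.25, regardless of the starting state.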

Second, we have to consider each character in turn and assign a state to each internal node of the tree. For example, we could set all internal nodes to A. This amounts to specifying one of the possible histories that could have resulted in the observed tip states for this character. We have to repeat this for all possible assignments of states to the internal nodes. In a tree with 12 taxa, for example, there are 10 internal nodes and thus $4^{10}$, over a million, different combinations of ancestral states. For each of these combinations we need to calculate the probability that evolution would pass through these ancestral states and end up at the states seen in the tips. Then, to obtain the total likelihood for this character given this tree and set of branch lengths, we sum the probabilities associated with each set of ancestral states. (We add these probabilities together because the outcome could arise via any one of these sets of ancestral states.)
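The summation over ancestral states can be sketched for the smallest interesting case, a four-taxon tree with two internal nodes and therefore only 16 combinations. The code assumes the equal-rates (JC69) model with equal base frequencies; `site_likelihood` and its arguments are hypothetical names. Enumeration is used here only because it mirrors the description above; it would be impractical for larger trees.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(start, end, d):
    """Equal-rates probability of going from `start` to `end` on a branch of length d."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def site_likelihood(tips, tip_lengths, internal_length):
    """Likelihood of one character on the four-taxon tree ((1,2),(3,4)).

    tips            -- observed bases at tips 1 to 4, e.g. "AACC"
    tip_lengths     -- lengths of the four terminal branches
    internal_length -- length of the branch joining the two internal nodes
    """
    total = 0.0
    # x is the state at the node joining tips 1 and 2, y at the other node.
    for x, y in product(BASES, repeat=2):
        p = 0.25  # probability of state x at the node treated as the root
        p *= jc69_prob(x, tips[0], tip_lengths[0])
        p *= jc69_prob(x, tips[1], tip_lengths[1])
        p *= jc69_prob(x, y, internal_length)
        p *= jc69_prob(y, tips[2], tip_lengths[2])
        p *= jc69_prob(y, tips[3], tip_lengths[3])
        total += p  # sum over every possible history
    return total


print(site_likelihood("AACC", [0.1, 0.1, 0.1, 0.1], 0.05))
print(4 ** 10)  # combinations of ancestral states with 10 internal nodes
```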

Third, we multiply the likelihoods of each character. This follows because, to get this exact matrix of DNA sequences, all characters had to end up in their observed state. Because the likelihood of any one character achieving the observed pattern is low, the product is even lower. To facilitate communication it is normal to report the negative of the natural logarithm of the raw likelihood. Thus, a raw likelihood of $10^{-100}$ would be given as 230.259. This is the likelihood score.
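Because the logarithm of a product is the sum of the logarithms, the likelihood score can be computed by adding per-character log-likelihoods rather than multiplying very small numbers. A minimal sketch follows; the function name `likelihood_score` and the example values are hypothetical.

```python
import math


def likelihood_score(site_likelihoods):
    """Negative natural log of the product of per-character likelihoods."""
    return -sum(math.log(lik) for lik in site_likelihoods)


# 100 hypothetical characters, each with likelihood 0.1,
# so the raw likelihood of the whole matrix is 10**-100.
score = likelihood_score([0.1] * 100)
print(round(score, 3))
```

A lower score corresponds to a higher raw likelihood, so the best tree is the one with the lowest likelihood score.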

Fourth, we need to recall that we have done all these calculations under a single set of branch lengths (and other parameters of the model, but we will ignore these). The likelihood score of a topology is taken to be the likelihood score of the set of branch lengths that has the highest raw likelihood (or lowest likelihood score). Therefore, before we can say we know the likelihood of the data given a particular tree topology, we need to slide the branch lengths up and down to find the optimal set of branch lengths.
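The idea of sliding a branch length to find its optimum can be illustrated in the simplest possible setting: two sequences joined by a single branch under the equal-rates (JC69) model. This is a simplified stand-in for optimizing every branch of a tree, not a description of how any particular program does it. The search used here is a plain ternary search; the function names and the counts of sites are hypothetical.

```python
import math


def log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay   # site shows one specific different base
    n_same = n_sites - n_diff
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_diff))


def optimize_branch_length(n_sites, n_diff, lo=1e-6, hi=5.0, tol=1e-8):
    """Ternary search for the branch length with the highest log-likelihood."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, n_sites, n_diff) < log_likelihood(m2, n_sites, n_diff):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return (lo + hi) / 2.0


# Hypothetical data: 1000 aligned sites, 100 of which differ.
best = optimize_branch_length(n_sites=1000, n_diff=100)
print(f"best branch length = {best:.4f}")
print(f"likelihood score   = {-log_likelihood(best, 1000, 100):.3f}")
```

For this two-sequence case the optimum can also be written in closed form, $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$ with $p$ the proportion of differing sites, which provides a check on the numerical search. On a real tree no such formula is available and all branch lengths must be adjusted numerically.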

Even without considering the details, it should be clear that it is tremendously challenging to calculate the likelihood of even a modest-sized tree. This is why there was such a lag between the 1970s, when the maximum likelihood (ML) method of phylogenetic inference was first laid out (Felsenstein, 1973), and the mid-1990s, when it became feasible to apply the method to realistic data sets. In the intervening years computers became much faster, and theoreticians developed various ways to arrive more quickly at a good estimate of the likelihood score. Even so, large data sets typically take much longer to analyze with likelihood than with parsimony.

Given how complicated it is to use likelihood to estimate trees, you may wonder if it is worth it. What does likelihood gain us over parsimony? There has been considerable debate over this point, but we would highlight two main advantages of likelihood. First, because parsimony does not take account of branch lengths, it can be led astray. Parsimony will often yield an incorrect tree when the true tree has branches that have very different lengths (Felsenstein, 1978). Indeed, with true trees that have some very long and some very short branches, parsimony will tend to provide very strong support for an incorrect tree (supplemental materials). Specifically, there is a tendency for parsimony to support a tree in which the long branches are clustered in one part of the tree and the short branches in another part of the tree. This problem with parsimony is called long-branch attraction. For example, if the upper tree in the figure were true, the characters, when analyzed with parsimony, would tend to yield a tree like that shown below. In contrast, so long as you pick a reasonable model of evolution, maximum likelihood is not prone to long-branch attraction.

The second advantage we see with maximum likelihood over parsimony is that it is a fully statistical method. In likelihood, we use explicit assumptions about how the data evolve to arrive at the best estimate of the true tree and to statistically compare alternative trees. Given that we often know a great deal about how characters evolve, it seems appealing to incorporate that information into a statistical framework for phylogenetic inference. Furthermore, within a likelihood framework it is possible to statistically evaluate alternative models of evolution and make an informed decision as to which is most appropriate for our data. In contrast, the assumptions that parsimony makes are rather obscure. And, while there are ways to tweak parsimony to accommodate the expectation of more transitions than transversions, there is no statistical framework for evaluating such alternatives. That being said, when speed of analysis is important, parsimony provides a fairly robust and useful alternative to maximum likelihood.

Finding maximum likelihood trees proceeds much as it does under parsimony, almost always entailing heuristic search strategies. The only difference is that the computer program calculates the likelihood of the data given each tree it encounters rather than the tree length. Because likelihoods are calculated on a continuous scale whereas tree length is a discrete count of the number of character state changes, likelihood searches tend to recover only one or very few optimal trees. Because it takes a long time to calculate a likelihood score, these searches could take days or weeks on even moderately large datasets. However, new “genetic” algorithms, which use biological ideas about mutation and selection to move through tree space, have made likelihood searches much faster.


Using the simplest model of molecular evolution and a standard heuristic search, the following is the maximum likelihood tree for the carnivore molecular data. Its topology is identical to that found with parsimony. The numbers on the branches are the branch lengths in units of the average number of substitutions per site. The likelihood score of this tree is a little over 5282. This means that the likelihood of the data arising given this tree is $e^{-5282}$.
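A raw likelihood this small cannot be represented as an ordinary floating-point number, which is one practical reason for working with likelihood scores. The short sketch below shows the underflow and converts the score to a power of ten; the value 5282 is rounded from the score quoted above.

```python
import math

score = 5282.0                        # likelihood score (negative ln likelihood)

raw = math.exp(-score)                # underflows to zero in double precision
log10_lik = -score / math.log(10.0)   # the same likelihood as a power of ten

print(raw)
print(f"likelihood is about 10^{log10_lik:.0f}")
```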

©Baum and Smith 1/12/09. Draft. Do not circulate. Page 5