Supplemental Material for “CodonPhyML: Fast Maximum Likelihood Phylogeny Estimation under Codon Substitution Models”

Manuel Gil*1,2,4,+, Marcelo Serrano Zanetti1,3,+, Stefan Zoller1,2 and Maria Anisimova1,2

1 Department of Computer Science, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland;

2 Swiss Institute of Bioinformatics (SIB), Switzerland;

3 Department of Management, Technology, and Economics, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland;

4Center for Integrative Bioinformatics Vienna, Max F. Perutz Laboratories, University of Vienna, Medical University Vienna, Vienna, Austria

+ Authors contributed equally

*Corresponding author:

Address: Manuel Gil

Center for Integrative Bioinformatics Vienna (CIBIV)

Max F. Perutz Laboratories (MFPL)

Dr. Bohr Gasse 9

A-1030 Wien, Austria

Phone: ++43 +1 / 4277-24029

Email:

Running head: ML Phylogeny Estimation under Codon Models

Key words: Maximum likelihood, phylogeny inference, codon model, evolution, selection

Correspondence may also be addressed to

Supplemental Figures

Figure S1. Statistical consistency: asymptotic convergence of topology and parameter estimates for simulated 25-taxon datasets. With increasing sequence length the estimates of model parameters and topology (under true model) become closer to the true values: Shown are the absolute differences between the estimates and the truth (as known from simulated data with 25 taxa, dataset S2) under true models M0 or M3 and with search heuristics NNI and SPR.

Figure S2. Likelihood difference of various options to the default settings. Simulated datasets (S3 in Table 1) with increasing number of taxa (60 to 480) have been analyzed under the correct models (M0 and M3). The histograms show the average difference of the log-likelihood between the default settings (once with NNI and once with SPR) and a given method. Error bars correspond to ± 1.96 s.d. A positive values means that the corresponding method converged on average on trees with lower likelihood than the default setting.


Figure S3. Timings of various options. Simulated datasets (S3 in Table 1) with increasing number of taxa (60 to 480) have been analyzed under the correct models (M0 and M3) with various options. The histograms show the average running time of a given method (measured on machines of type M1 in Table S2). Error bars correspond to ± 1.96 s.d. Depicted is the CPU time, with exception of the options using multiple cores (denoted by +OMP). For those we show the total real time, measured on a machine which was not running any additional processes.


Figure S4. Difference between parameters estimated on the best tree found and on the trees found by various options. Simulated datasets (S3 in Table 1) with increasing number of taxa (60 to 480) have been analyzed under the correct models (M0 and M3) with various options. We have monitored the corresponding log-likelihood, tree topology, transition/transversion rate ratio k, synonymous/nonsynonymous substitution rate w, and the tree length (sum of ML branch lengths). The histograms show the average difference between a given method and BestML (defined in the main text). Topological distance has been quantified by the Robinson-Foulds distance (Robinson, Foulds 1981). Error bars correspond to ± 1.96 s.d.


Figure S5. Likelihood-ranks for various options. Simulated datasets (S3 in Table 1) with increasing number of taxa (60 to 480) have been analyzed under the correct models (M0 and M3) with various options. The histograms show the average likelihood-ranks.


Figure S6. Correlation of branch support values for trees inferred using an amino-acid (LG+G) and a codon model (M3). Shown are the average support values (aLRT, Anisimova and Gascuel 2006) for branches inferred using both models (“common” branches) and branches inferred exclusively by one of the two models (“non-common” branches).

Supplemental Tables

Table S1. Comparison of parameters estimated by CodonPhyML and CodeML on all quintet topologies. Shown are the absolute differences between ML estimates obtained by CodonPhyML and CodeML for the fifteen possible 5-taxa topologies.

For models M3 and M5, shown is the difference of omega averages over the discrete categories.

Model / Topology / Log-likelihood / Tree length / k / w
M0 / 1 / 0 / 0 / 0 / 0
2 / 0.0002 / 0.0001 / 0 / 0.0001
3 / 0.0001 / 0.0002 / 0.0001 / 0.0001
4 / 0 / 0 / 0 / 0
5 / 0 / 0 / 0 / 0
6 / 0.0002 / 0.0001 / 0 / 0
7 / 0 / 0 / 0 / 0
8 / 0 / 0 / 0 / 0
9 / 0 / 0 / 0 / 0
10 / 0.0002 / 0.0001 / 0 / 0.0001
11 / 0.0001 / 0.0002 / 0.0001 / 0
12 / 0 / 0 / 0 / 0
13 / 0 / 0 / 0 / 0
14 / 0 / 0 / 0 / 0
15 / 0.0002 / 0.0001 / 0 / 0.0001
M3 / 1 / 0.0001 / 0 / 0 / 0
2 / 0.0001 / 0.0001 / 0 / 0
3 / 0.0001 / 0.0002 / 0.0001 / 0.0001
4 / 0 / 0 / 0 / 0
5 / 0 / 0.0001 / 0 / 0
6 / 0.0001 / 0.0001 / 0 / 0
7 / 0.0125 / 0.0004 / 0.0006 / 0.0007
8 / 0.0002 / 0.0003 / 0.0001 / 0.0001
9 / 0.0097 / 0.0005 / 0.0006 / 0.0007
10 / 0.0002 / 0.0001 / 0.0001 / 0.0001
11 / 0.01 / 0.0003 / 0.0007 / 0.0007
12 / 0.0098 / 0.0002 / 0.0005 / 0.0006
13 / 0.0096 / 0.0003 / 0.0005 / 0.0006
14 / 0.0118 / 0.0003 / 0.0009 / 0.0004
15 / 0.0006 / 0.0002 / 0 / 0
M5 / 1 / 0.0188 / 0.0002 / 0.0003 / 0
2 / 0.0015 / 0.0002 / 0.0002 / 0
3 / 0.2313 / 0.0002 / 0.0003 / 0.0015
4 / 0.018 / 0.0002 / 0.0003 / 0
5 / 0.0188 / 0.0002 / 0.0003 / 0
6 / 0.0001 / 0.0002 / 0 / 0
7 / 0.0007 / 0.0002 / 0 / 0
8 / 0.0008 / 0.0002 / 0 / 0
9 / 0.0006 / 0.0002 / 0 / 0
10 / 0.5015 / 0.0002 / 0 / 0.0018
11 / 0.36 / 0.0001 / 0.0001 / 0.0019
12 / 0.0006 / 0.0001 / 0 / 0
13 / 0.0006 / 0.0001 / 0 / 0
14 / 0.0007 / 0.0321 / 0.0033 / 0
15 / 0.0013 / 0.0001 / 0 / 0

Table S2. Machines used for computation. All machines have a 64-bit architecture and are operated by a Ubuntu Linux distribution.

ID / Model name / Clock speed [GHz] / No. cores / RAM [GB]
M1 / Intel(R) Xeon(R) E5520 / 2.27 / 8 / 24
M2 / AMD Opteron(tm) Processor 6174 / 2.2 / 6 / 128
M3 / AMD Opteron(tm) Processor 2372HE / 2.1 / 4 / 16

Table S3. RF distances and AIC differences calculated under a codon model (M3) between trees inferred using the codon model and trees inferred using either a nucleotide (HKY+G) or an amino-acid model (LG+G). The searches were conducted under default parameter settings (which includes NNI). For entries marked with a cross (+) the tree search under M3 converged to a local optimum below the log-likelihood of the tree found by HKY+G or LG+G, respectively. In such cases the search was repeated starting from the corresponding HKY+G or LG+G tree. AIC differences exceeding two points can be considered significant (Burnham and Anderson, 2003, Model selection and multimodel inference: A practical information-theoretic approach, 2nd ed. Springer-Verlag, page 70) and are indicate by *. A positive difference indicates that the tree found by the codon model fits the data better. Data is shown for the datasets from Table 2.

Codon vs amino acid / Codon vs nucleotide
ID / AIC difference / RF distance / AIC difference / RF distance
R1 / 0 / 0.00 / -3.38* / 0.15
R2 / -15.42* / 0.48 / -5.84* / 0.48
R3 / -27.92*+ / 0.67 / -9.96*+ / 0.30
R4 / -0.32 / 0.11 / -3.86* / 0.21
R5 / -27.72* / 0.56 / -11.84* / 0.14
R6 / -0.48 / 0.15 / -7.38* / 0.77
R7 / -2.52* / 0.22 / -2.42* / 0.22
R8 / -1.58 / 0.53 / -0.08 / 0.13
R9 / 0 / 0.00 / 0 / 0.00
R10 / -104.22* / 0.43 / -217.5* / 0.53
R11 / -28.82* / 0.60 / -655* / 0.72
R12 / -26.06* / 0.47 / -26.88* / 0.37
R13 / -181.5* / 0.56 / -5.96* / 0.15
R14 / -132.2* / 0.41 / -2.16* / 0.05
R15 / -93.62* / 0.51 / -0.2+ / 0.10
R16 / -102.64* / 0.51 / -6.84* / 0.15
R17 / -307.74* / 0.62 / -2.44* / 0.10
R18 / -17.08* / 0.21 / -3.56* / 0.10
R19 / -92.1* / 0.67 / 0 / 0.00
R20 / -165.1* / 0.87 / -0.46+ / 0.05
R21 / -442.52*+ / 0.42 / -29.98*+ / 0.35
R22 / -470.54* / 0.97 / -9.64* / 0.45
R23 / 0 / 0.44 / 0 / 0.38

11

Table S4. Pairwise likelihood differences for starting trees obtained with different methods and models. Starting trees were obtained using maximum parsimony (MP), or with the BioNJ algorithm under one of three codon models: the equal-rate model (N61), and the empirical models ECMS05 and ECMK07. Log-likelihoods of every starting tree obtained by these methods were subsequently re-optimized on a fixed tree under two models: the parametric model GYM3 and the semi-parametric model ECMK07+wM3+k. Data is shown for the first 10 datasets from table 2.

Data set / GY M3 model variant / ECMS05 vs. N61 / ECMK07 vs. N61 / MP vs. N61 / ECMS05 vs. ECMK07 / MP vs. ECMK07 / MP vs. ECMS05
R1 / parametric / 132.64 / 139.13 / -70.41 / -6.5 / -209.54 / -203.04
semi- parametric / 81.97 / 52.88 / -127.25 / 29.09 / -180.13 / -209.21
R2 / parametric / 77.73 / 62.76 / 0.41 / 14.97 / -62.35 / -77.31
semi- parametric / 75.44 / 66.95 / 3.84 / 8.49 / -63.11 / -71.61
R3 / parametric / 128.24 / 75.13 / 42.86 / 53.11 / -32.28 / -85.39
semi- parametric / 201.83 / 208.87 / 53.62 / -7.04 / -155.25 / -148.21
R4 / parametric / 709.96 / 566.93 / 203.66 / 143.03 / -363.27 / -506.31
semi- parametric / 542.55 / 612.84 / 194.01 / -70.3 / -418.83 / -348.53
R5 / parametric / 691.87 / 428.74 / -68.66 / 263.12 / -497.41 / -760.53
semi- parametric / 594.32 / 568.08 / 146.53 / 26.24 / -421.55 / -447.79
R6 / parametric / 105.47 / 102.51 / -54.43 / 2.95 / -156.94 / -159.9
semi- parametric / 41.49 / 38.23 / -75.05 / 3.26 / -113.28 / -116.54
R7 / parametric / 37.94 / 16.08 / -71.07 / 21.86 / -87.15 / -109.01
semi- parametric / 15.65 / 25.25 / -54.77 / -9.6 / -80.01 / -70.42
R8 / parametric / 23.75 / 15.98 / -12.28 / 7.77 / -28.26 / -36.03
semi- parametric / 10.45 / 10.96 / -8.58 / -0.5 / -19.54 / -19.03
R9 / parametric / 94.28 / 84.36 / -61 / 9.92 / -145.36 / -155.27
semi- parametric / 49.74 / 64.24 / -84.99 / -14.5 / -149.23 / -134.73
R10 / parametric / 7602.09 / 7391.99 / 1032.96 / 210.1 / -6359.03 / -6569.13
semi- parametric / 4332.94 / 4573.12 / 1388.51 / -240.18 / -3184.61 / -2944.43

11

Table S5. Ranks of the methods to obtain a starting tree: obtained from real data sets (table 2) and computed as a number of times the method produced the best initial tree in a pairwise comparisons of log-likelihoods (e.g. see table S3).

Method / Score / Rank
ECMS05 / 53 / 1
ECMK07 / 47 / 2
N61 / 11 / 3
MP / 9 / 4

Table S6. Comparison of branch support values between trees inferred using (A) a nucleotide (HKY+G) and a codon model (M3), and (B) an amino acid (LG+G) and a codon model (M3). Shown are the average support values (approximate likelihood ratio test Anisimova and Gascuel 2006) for branches inferred using both models (“common” branches), branches inferred exclusively by one of the two models (“non-common” branches), and for all branches inferred by a model. A dash indicates that the trees did not share a common branch or that the tree topologies were identical. Data is shown for the datasets from table 2. Correlation of support values between pairs of models were computed separately for cases where codon model fits best (in bold face) or amino acid model fits best (not bold).

A.

Dataset / Common branches / Non-common branches / All Branches
M3 / HKY+G / M3 / HKY+G / M3 / HKY+G
R1 / 0.78 / 0.75 / ¾ / ¾ / 0.78 / 0.75
R2 / 0.66 / 0.47 / 0.48 / 0.19 / 0.56 / 0.31
R3 / 0.62 / 0.64 / 0.59 / 0.45 / 0.60 / 0.50
R4 / 0.93 / 0.95 / 0.23 / 0.17 / 0.67 / 0.66
R5 / 0.70 / 0.77 / 0.59 / 0.00 / 0.69 / 0.65
R6 / ¾ / ¾ / 0.47 / 0.48 / 0.47 / 0.48
R7 / 0.84 / 0.74 / 0.46 / 0.00 / 0.71 / 0.49
R8 / 0.76 / 0.61 / 0.00 / 0.00 / 0.63 / 0.51
R9 / ¾ / ¾ / 0.87 / 0.86 / 0.87 / 0.86
R10 / 0.82 / 0.75 / 0.73 / 0.50 / 0.76 / 0.59
R11 / 0.84 / 0.76 / 0.57 / 0.45 / 0.65 / 0.54
R12 / 0.83 / 0.86 / 0.54 / 0.24 / 0.73 / 0.64
R13 / 0.83 / 0.85 / 0.16 / 0.11 / 0.72 / 0.73
R14 / 0.98 / 0.88 / 0.39 / 0.10 / 0.85 / 0.71
R15 / 0.93 / 0.67 / 0.74 / 0.47 / 0.86 / 0.60
R16 / 0.84 / 0.81 / 0.32 / 0.28 / 0.72 / 0.69
R17 / 0.85 / 0.82 / 0.54 / 0.25 / 0.80 / 0.72
R18 / 0.77 / 0.77 / 0.07 / 0.00 / 0.73 / 0.73
R19 / 0.75 / 0.75 / ¾ / ¾ / 0.75 / 0.75
R20 / 0.73 / 0.74 / 0.56 / 0.30 / 0.68 / 0.62
R21 / 0.81 / 0.80 / 0.55 / 0.44 / 0.72 / 0.68
R22 / 0.14 / 0.18 / 0.03 / 0.02 / 0.09 / 0.11
R23 / 0.11 / 0.11 / 0.02 / 0.00 / 0.07 / 0.06

B.

Dataset / Common branches / Non-common branches / All Branches
M3 / LG+G / M3 / LG+G / M3 / LG+G
R1 / 0.96 / 0.90 / 0.07 / 0.36 / 0.78 / 0.79
R2 / 0.63 / 0.65 / 0.50 / 0.39 / 0.56 / 0.51
R3 / 0.87 / 0.86 / 0.56 / 0.39 / 0.60 / 0.45
R4 / 0.81 / 0.86 / 0.24 / 0.00 / 0.67 / 0.64
R5 / 0.89 / 0.86 / 0.55 / 0.41 / 0.69 / 0.59
R6 / 0.56 / 0.61 / 0.33 / 0.00 / 0.47 / 0.37
R7 / ¾ / ¾ / 0.71 / 0.57 / 0.71 / 0.57
R8 / 0.87 / 0.91 / 0.51 / 0.38 / 0.63 / 0.56
R9 / 0.87 / 0.97 / ¾ / ¾ / 0.87 / 0.97
R10 / 0.82 / 0.77 / 0.72 / 0.40 / 0.76 / 0.55
R11 / 0.82 / 0.78 / 0.56 / 0.50 / 0.65 / 0.60
R12 / 0.91 / 0.83 / 0.69 / 0.52 / 0.73 / 0.58
R13 / 0.72 / 0.97 / 0.72 / 0.43 / 0.72 / 0.55
R14 / 0.94 / 0.91 / 0.72 / 0.42 / 0.85 / 0.72
R15 / 0.96 / 0.97 / 0.79 / 0.24 / 0.86 / 0.56
R16 / 0.95 / 1.00 / 0.68 / 0.56 / 0.72 / 0.64
R17 / 0.97 / 0.98 / 0.72 / 0.42 / 0.80 / 0.61
R18 / 0.87 / 0.84 / 0.23 / 0.09 / 0.73 / 0.68
R19 / 0.89 / 0.92 / 0.69 / 0.28 / 0.75 / 0.46
R20 / 1.00 / 1.00 / 0.66 / 0.49 / 0.68 / 0.51
R21 / 0.83 / 0.82 / 0.56 / 0.42 / 0.72 / 0.66
R22 / ¾ / ¾ / 0.09 / 0.11 / 0.09 / 0.11
R23 / 0.05 / 0.05 / 0.09 / 0.05 / 0.07 / 0.05

Table S7. Example profiling of PhyML on simulated data. Shown is the relative amount of time (measured with the program Gprof (Graham, Kessler, Mckusick 1982)) PhyML spends on a 100 taxa, 1000 codon dataset (S2 in Table 1) to calculate (1) the marginal distribution of ancestral states, (2) partial likelihoods of subtrees, (3) the transition probability matrix as a function of the eigenvalue/eigenvector decompositions of the rate matrix, (4) the decomposition itself, (5) all the other operations.