Determination of appropriateness of separate-parameters and joint-parameters models

To investigate whether the additional parameterisation of the separate-parameters model was justified, we calculated AIC scores for both approaches. Under the joint-parameters approach, WS2010 calculated model parameters a single time from a global alignment of 521 species (including several duplicate sequences), and then applied these parameters to pruned trees that included only the four taxa from Gillman et al. (2009). As we were unable to re-estimate the precise model parameters – and therefore the results – reported by WS2010 these were obtained directly from the authors. AIC scores were manually calculated from log-likelihood tree scores calculated by PAUP* as follows:

AIC = 2k – 2 ln(L)

Where k is the number of free parameters, and ln(L) is the log-likelihood score of the tree. The number of free parameters differs between the separate- and joint-parameters models in that the only free parameters in joint-parameters model are the branch lengths, while an additional 10 parameters are free to vary in the separate-parameters GTR+I+Γ model (the rate matrix, base frequencies, the proportion of invariable sites and the gamma rate parameter).

Because the two approaches are applied to identical four-taxon alignments, AIC scores are comparable for the two methods. The resulting AIC scores for each pair of results (separate- and joint-parameters models) we used to determine in each instance which model was more suitably parameterised.

Effect of overparameterisation on branch length ratios

Sequence evolution was simulated on a standard tree under a simple model of evolution to generate a sequence alignment. Phylogenetic trees were estimated twice from this alignment, once with the same simple model and once using a parameter-rich model. On each tree the branch lengths for a focal pair of sister taxa were recorded and the ratio of their lengths computed.

The basic tree was a 4-taxon tree comprising two in-group taxa (in1 and in2) and two out-group taxa (out1 and out2) with the topology ((in1:ab,in2:b):b,(out1:2b,out2:2b):2b). Trees of different scale were obtained by giving branch length b the values 0.01, 0.05 and 0.1. The relative lengths of the branches leading to in1 and in2 were adjusted by giving the branch length ratio a the values 1, 1.2, 1.5 and 2. In all, 12 trees were used.

Sequence evolution was simulated using the seq-gen software (Rambaut and Grassly, 1997) on each of the trees. Alignments were 1000bp long, and were simulated using the HKY85 model with equal nucleotide frequencies and transition/transversion ratios (Ti/Tv) of 1.5, 2 and 5. For each of the 36 combinations of tree and sequence parameters, 1000 replicate sequence alignments were generated.

For each sequence alignment, two trees were estimated using PAUP* (Swofford, 2002) using a heuristic search to find the maximum likelihood tree. The first tree was estimated using the the HKY85 model with the default Ti/Tv ratio of 2 and empirical nucleotide frequencies. The second tree was estimated under the general time reversible model with site-wise rate variation modelled with a gamma distribution (GTR+G). Nucleotide frequencies, substitution rate matrix and gamma shape parameter were all estimated using maximum likelihood. From each tree the lengths of the branches (br1, br2) leading to in1 and in2 were recorded and the ratio br1/br2 calculated.

References

Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Computer applications in the biosciences 13:235-238.

Swofford, D. L. 2002. PAUP*: Phylogenetic analysis using parsimony (* and other methods). version 4. Sinauer Associates.