In this tutorial you will be guided in using PhyML and its extension, CodonPhyML, to solve common phylogenetic problems. For some of the following exercises there might be more than one single solution.
These exercises were prepared by Maria Anisimova
Dataset file: primates-nt.phy
GOAL: In this exercise you are asked to run PhyML twice in order to compare the effect of estimating nucleotide frequencies from the used dataset vs. optimising them with a maximum likelihood (ML) approach.
1. First run
Here is the list of the parameters to change:
From 2nd menu
[M] ................. Model of nucleotide substitution HKY85
[F] ................. Optimise equilibrium frequencies yes
[T] .................... Ts/tv ratio (fixed/estimated) estimated
[C] ........... Number of substitution rate categories 4
[G] ............. Gamma distributed rates across sites yes
[A] ... Gamma distribution parameter (fixed/estimated) estimated
2. Second run
Here is the list of the parameters to change:
From 2nd menu
[M] ................. Model of nucleotide substitution HKY85
[F] ................. Optimise equilibrium frequencies no
[T] .................... Ts/tv ratio (fixed/estimated) estimated
[C] ........... Number of substitution rate categories 4
[G] ............. Gamma distributed rates across sites yes
[A] ... Gamma distribution parameter (fixed/estimated) estimated
1. Do you see much difference in the tree?
The topology does not change, however we can observe a change in the computed branch lengths.
Nt-frequencies optimised | Nt-frequencies estimated |
---|---|
2. In the likelihood value (stat file)?
The likelihood value of the tree inferred using the nucleotide frequencies estimated empirically from the dataset is lower than the likelihood value of the tree inferred using the nucleotide frequencies estimated via ML. This is due to dataset dimensions (the number of the sequences used).
run 1: -6172.58045
run 2: -6173.49655
3. Which option is best and why do you think so?
The best option in this case is to use the ML approach since the ML optimisation technique provides better estimates for the nucleotide frequencies, which, in turn, affect the likelihood of the inferred tree. This is due to the nucleotide frequencies which are used in the calculation of the likelihood of each site of the alignment (, where is the vector of the nucleotide frequencies and the rate matrix).
GOAL: In this exercise you are asked to optimise the tree topology on the substitution parameters obtained using ML performing a tree search (i.e. NNI, SPR, TBR) on the initial tree topology.
1. Run
Here is the list of the parameters to change from the PhyML menu:
From 2nd menu
[M] ................. Model of nucleotide substitution HKY85
[F] ................. Optimise equilibrium frequencies yes
[T] .................... Ts/tv ratio (fixed/estimated) estimated
[C] ........... Number of substitution rate categories 4
[G] ............. Gamma distributed rates across sites yes
[A] ... Gamma distribution parameter (fixed/estimated) estimated
From 3rd menu
[O] ........................... Optimise tree topoLOGy no
1. Compare the trees obtained with and without tree-search. What do you observe and why?
The topology inferred without tree search presents a variation in the internal node attribution for the clades (DLangur, Colobus) and (Saki,Titi).
With tree-search | Without tree-search |
---|---|
2. Compare the model estimates with and without tree-search. What do you observe and why?
The transition/transversion ratio differs slightly between the two runs. As the matter of fact, parameter appears to have a higher value when tree-search is not performed.
3. Compare the likelihood of the ML and NJ trees. What do you observe and why?
The likelihood value obtained without tree-search is lower than the value obtain performing the topology optimisation. This value is expected since the substitution parameters are optimised over the first inferred tree topology which did not undergo any refinement.
with tree-search: -6172.58045
without tree-search: -6173.00555
GOAL: In this exercise you are asked to compare different substitution models (and their variations) on the same dataset.
Model | Log-likelihood | Parameters | AIC |
---|---|---|---|
JC | -6379.66383 | 38 | 12835.32766 |
JC_I | -6311.41650 | 39 | 12700.833 |
JC_G | -6304.38084 | 39 | 12686.76168 |
JC_G_I | -6304.35263 | 40 | 12688.70526 |
HKY | -6251.43051 | 42 | 12586.86102 |
HKY_I | -6180.09689 | 43 | 12446.19378 |
HKY_G | -6172.58045 | 43 | 12431.1609 |
HKY_G_I | -6172.52896 | 44 | 12433.05792 |
GTR | -6241.48788 | 46 | 12574.97576 |
GTR_I | -6171.17472 | 47 | 12436.34944 |
GTR_G | -6163.87291 | 47 | 12421.74582 |
GTR_G_I | -6172.52896 | 48 | 12441.05792 |
The number of parameters are computed in the following way for a rooted
tree:
substitution models: JC = 0 parameters | HKY = 4 parameters | GTR = 8
parameters
invariant sites = +1 parameter
gamma rates = +1 parameter
We compute the AIC value as follows:
1. Which model is the best (including HKY+Gamma), based on the AIC criterion?
We select the model with the lowest AIC value. In this case, GTR + Gamma
This exercise can also be solved using jmodeltest which performs automatically phylogenetic tree reconstructions with every model and extra parameters (i.e. invariant sites, gamma).
GOAL: In this exercise you are asked to compute the branch support using different statistical methods. We are going to evaluate it on the phylogenetic tree evaluated with the following settings:
. Tree topology search : NNIs
. Initial tree: BioNJ
. Model of nucleotides substitution: GTR
. Number of taxa: 20
. Log-likelihood: -6163.87291
. Unconstrained likelihood: -5060.29926
. Parsimony: 741
. Tree size: 0.54568
. Discrete gamma model: Yes
- Number of categories: 4
- Gamma shape parameter: 0.565
. Nucleotides frequencies:
- f(A)= 0.29130
- f(C)= 0.21021
- f(G)= 0.25260
- f(T)= 0.24589
. GTR relative rate parameters :
A <-> C 1.62195
A <-> G 4.11843
A <-> T 0.80648
C <-> G 1.47358
C <-> T 3.64931
G <-> T 1.00000
. Instantaneous rate matrix :
[A---------C---------G---------T------]
-1.00085 0.21604 0.65916 0.12565
0.29937 -1.10379 0.23585 0.56858
0.76015 0.19627 -1.11222 0.15580
0.14885 0.48607 0.16005 -0.79498
We are going to run PhyML twice: once using the non parametric bootstrap and the second time using the SH-aLRT method.
1. First run
[B] ................ Non parametric bootstrap analysis yes (100 replicates)
[A] ................ Approximate likelihood ratio test no
We perform a bootstrap analysis using 100 replicates (4^th^ menu).
Bootstrap analysis assumes that:
- sites are independently and identically distributed (i.i.d.)
- Amplifies the biases of the tree inference method
- Often conservative (>70% is considered strong support, but no guarantee)
- Slow for large datasets
- Difficult to interpret
2. Second run
[B] ................ Non parametric bootstrap analysis no
[A] ................ Approximate likelihood ratio test yes / aLRT statistics
We perform an sh-aLRT analysis (4^th^ menu).
SH-aLRT method assumes that:
- the branch being studied provides a significant likelihood gain.
- per each branch two alternative hypotheses are tested: a null hypothesis where the branch is collapsed and the rest of the topology is maintained vs. an alternative hypothesis where the branch length is kept and the topology is intact.
- aLRT is much faster than bootstrap
1. Are the results compatible?
Both the methods used to infer the branch support appear to agree on the support provided by long branches, however they do not agree for short ones. These two methods infer branch support using completely different strategies.
With bootstrap support | With sh-aLRT support |
---|---|
For linux/Mac Os X
We assume PhyML binary is included in the user PATH variable and it can be called with phyml
Positioning in the directory containing the dataset:
cd path/to/tutorial01_phyml
ex. 1
run1> phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' f 'm'
run2> phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'e'
ex. 2
run1> phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'm' -o 'lr'
ex. 3
JC > phyml -i primates-nt.phy -m 'JC69' -t 'e' -a 'e' -f 'm' -c 1
JC_I > phyml -i primates-nt.phy -m 'JC69' -t 'e' -a 'e' -f 'm' -c 1 -v 'e'
JC_G > phyml -i primates-nt.phy -m 'JC69' -t 'e' -a 'e' -f 'm'
JC_I_G > phyml -i primates-nt.phy -m 'JC69' -t 'e' -a 'e' -f 'm' -v 'e'
HKY > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'm' -c 1
HKY_I > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'm' -c 1 -v 'e'
HKY_G > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'm'
HKY_I_G > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'm' -v 'e'
GTR > phyml -i primates-nt.phy -m 'GTR' -t 'e' -a 'e' -f 'm' -c 1
GTR_I > phyml -i primates-nt.phy -m 'GTR' -t 'e' -a 'e' -f 'm' -c 1 -v 'e'
GTR_G > phyml -i primates-nt.phy -m 'GTR' -t 'e' -a 'e' -f 'm'
GTR_I_G > phyml -i primates-nt.phy -m 'GTR' -t 'e' -a 'e' -f 'm' -v 'e'
ex. 4
Bootstrap(100) > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'e' -b 100
SH-aLRT > phyml -i primates-nt.phy -m 'HKY85' -t 'e' -a 'e' -f 'e' -b -4
GOAL:
Preprocessing
The dataset should not include any stop codon, therefore we modify the dataset deleting the last 3 columns (the last codon) from the MSA. In order to do so, you can open primates-nt.phy with AliView and after selecting the last 3 columns press delete on the keyboard. Save the new file as Phylip with the following name primates-nt-nostop.phy.
For models M0, M0+Gamma and M5
From 1st menu
[D] ................. Data type (DNA/AA/CODON/Generic) CODON
[E] ............................... Matrix exponential eigenvalues
[H] ... Optimization heuristic for tree reconstruction original
[O] .................. Parameter optimization strategy multivariate (BFGS)
[I] ...... Input sequences interleaved (or sequential) interleaved
[M] ....................... Analyze multiple data sets no
M0
From 2nd menu
[m] ...... Model type (parametric/empirical/semi-emp.) parametric
[M] ...................... Model of codon substitution GY
[i] ..... Initial pairwise distances using codon model KOSI07
[g] ..................................... Genetic code Standard
[W] ................. Model for Dn/Ds ratio (M0/M3/M5) M0
[P] ...................... Equilibrium frequency model CF3x4
[F] ............................ Origin of frequencies ML estim.
[T] .................... Ts/Tv ratio (fixed/estimated) estimated
[D] .................... Dn/Ds ratio (fixed/estimated) estimated
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) yes
M0+Gamma
From 2nd menu
[m] ...... Model type (parametric/empirical/semi-emp.) parametric
[M] ...................... Model of codon substitution GY
[i] ..... Initial pairwise distances using codon model KOSI07
[g] ..................................... Genetic code Standard
[W] ................. Model for Dn/Ds ratio (M0/M3/M5) M0
[P] ...................... Equilibrium frequency model CF3x4
[F] ............................ Origin of frequencies ML estim.
[T] .................... Ts/Tv ratio (fixed/estimated) estimated
[D] .................... Dn/Ds ratio (fixed/estimated) estimated
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) no
[C] ........... Number of substitution rate categories 4
[A] ... Gamma distribution parameter (fixed/estimated) estimated
[G] .........'Middle' of each rate class (mean/median) mean
M5
[m] ...... Model type (parametric/empirical/semi-emp.) parametric
[M] ...................... Model of codon substitution GY
[i] ..... Initial pairwise distances using codon model KOSI07
[g] ..................................... Genetic code Standard
[W] ................. Model for Dn/Ds ratio (M0/M3/M5) M5
[w] ................. Number of Dn/Ds ratio categories 3
[P] ...................... Equilibrium frequency model CF3x4
[F] ............................ Origin of frequencies ML estim.
[T] .................... Ts/Tv ratio (fixed/estimated) estimated
[D] .................... Dn/Ds ratio (fixed/estimated) gamma distribution
[a] .. Gamma distribution alpha/beta (fixed/estimated) estimated
[G] .........'Middle' of each rate class (mean/median) mean
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
For models LG, WAG, LG+Gamma and WAG+Gamma
[D] ................. Data type (DNA/AA/CODON/Generic) AA
[d] ............. Convert NT sequence into AA sequence yes
[G] ..................................... Genetic code Standard
[I] ...... Input sequences interleaved (or sequential) interleaved
[M] ....................... Analyze multiple data sets no
LG
[M] ................ Model of amino-acids substitution LG
[F] .. Amino acid frequencies (model/empiric/optimize) model
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) yes
WAG
[M] ................ Model of amino-acids substitution WAG
[F] .. Amino acid frequencies (model/empiric/optimize) model
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) yes
LG+Gamma
[M] ................ Model of amino-acids substitution LG
[F] .. Amino acid frequencies (model/empiric/optimize) model
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) no
[C] ........... Number of substitution rate categories 4
[A] ... Gamma distribution parameter (fixed/estimated) estimated
[G] .........'Middle' of each rate class (mean/median) mean
WAG+Gamma
[M] ................ Model of amino-acids substitution WAG
[F] .. Amino acid frequencies (model/empiric/optimize) model
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) no
[C] ........... Number of substitution rate categories 4
[A] ... Gamma distribution parameter (fixed/estimated) estimated
[G] .........'Middle' of each rate class (mean/median) mean
GTR+Gamma
From 1st menu
[D] ................. Data type (DNA/AA/CODON/Generic) DNA
[I] ...... Input sequences interleaved (or sequential) interleaved
[M] ....................... Analyze multiple data sets no
From 2nd menu
[M] ................. Model of nucleotide substitution GTR
[F] ................. Optimise equilibrium frequencies yes
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) no
[C] ........... Number of substitution rate categories 4
[A] ... Gamma distribution parameter (fixed/estimated) estimated
[G] .........'Middle' of each rate class (mean/median) mean
The number of parameters are computed in the following way for a rooted tree:
substitution models: LG = 0 parameters | WAG = 0 parameters | GTR = 8
parameters | GY = 2 | M0 = 2 | M5 = 2
gamma rates = +1 parameter
We compute the AIC value as follows:
Model | Log-likelihood | Parameters | AIC |
---|---|---|---|
GTR_G | -6158.95 | 47 | 12411.90 |
LG | -4826.14 | 38 | 9728.28 |
LG_G | -4735.17 | 39 | 9548.34 |
M0 | -6163.53 | 40 | 12407.06 |
M0_G | -6075.69 | 41 | 12233.38 |
M5 | -6057.39 | 40 | 12194.78 |
WAG | -4785.42 | 38 | 9646.84 |
WAG_G | -4702.97 | 39 | 9483.94 |
1. Based on AIC, which model fits your dataset best?
WAG + Gamma
2. Are the trees inferred using the best AA and best codon model different?
According to the AIC ranking, the best codon model should be M5.
Herewith the two trees.
Using codon model | Using AA model |
---|---|
3. Are they different to the tree inferred using the DNA model?
Yes, they are also different to the tree inferred using the GTR+Gamma model.
GOAL: Testing different amino acids substitution models on a protein dataset
Dataset: primates-aa.phy
1. Using PhyML
We performed the analyses using PhyML and combining the following
settings accordingly to the requests:
[M] ................ Model of amino-acids substitution LG
[F] . Amino acid frequencies (empirical/model defined) model
[V] . Proportion of invariable sites (fixed/estimated) fixed (p-invar = 0.00)
[R] ....... One category of substitution rate (yes/no) yes
From the analyses we retrieved:
Model | Log-likelihood | Parameters | AIC |
---|---|---|---|
LG_G | -5199.88631 | 39 | 10477.77262 |
LG_G_I | -5199.55701 | 40 | 10479.11402 |
LG_G_F | -5230.44821 | 58 | 10576.89642 |
LG_G_F_I | -5230.08104 | 59 | 10578.16208 |
2. Using ProtTest webservice
You can find the results in the file ex7/prottest_analysis.txt
1. Which is the best model based on AIC criterion? What about using invariant sites?
We selected the LG + Gamma model according to the AIC criterion. Using the invariant sites does not influence the model selection. Adding a proportion of invariant sites increases the number of parameters, therefore the AIC penalises the inferred log-likelihood value.
2. Compare the trees, likelihood and AIC values to those of LG+Gamma.
Value | Tree estimated with LG+Gamma | Tree estimated with HIVw+G+F |
---|---|---|
Log-Likelihood: | -5199.89 | -5013.50 |
AIC: | 10477.77 | 10141.01 |
Tree |
GOAL: Inferring a phylogenetic tree on a protein dataset using an appropriate substitution model and detect an event of HGT.
1. Using PhyML
We used a selection of candidate substitution models (JTT, LG, WAG) that are suitable for the protein datasets.
dataset: Prot_biob.phy
Model | Log-likelihood | Parameters | AIC |
---|---|---|---|
JTT_G_I | -4593.61928 | 40 | 9267.23856 |
LG_G_I | -4546.51747 | 40 | 9173.03494 |
WAG_G_I | -4564.41245 | 40 | 9208.8249 |
We ranked the tested models according to the AIC statistics and we selected the LG+G+I model which best fits our dataset.
2. Using ProtTest webservice
You can find the results in the file ex8/prottest_analysis.txt
The best model in output from the model selection analysis is WAG+G+F
1. What can you tell based on the ML estimates?
Parameter | Using LG+G+I model | Using WAG+G+F model |
---|---|---|
Log-likelihood: | -4546.52 | -4542.71 |
Gamma shape parameter: | 0.953 | 0.633 |
Proportion of invariant: | 0.182 | 0 |
When LG+G+I model is used, the alpha parameter of the gamma distribution is skewed towards which in turns indicate that the rate variability is centred on one category.
2. Display trees
We coloured the branches according to their support (blue = 1, red=0).
Using LG+G+I model | Using WAG+G+F model |
---|---|
3. Are you able to recover the HGT (horizontal gene transfer) published by Lerat et al.?
Document last updated on 02.02.2016
© 2016 Lorenzo Gatti – Applied Computational Genomic Team (ACGT) @
Institute of Applied Simulations (ZHAW) | Wädenswil | Zürich