Home | Download repository


Tutorial 2: Evaluating selection pressure with codon models


In this tutorial you will be guided in using PaML to detect natural selection on protein-coding datasets. For some of the following exercises there might be more than one single solution.

These exercises were prepared by Maria Anisimova


1. The simple codon model with constant

GOAL: In this exercise we apply the simplest codon model with constant parameter over tree branches and MSA sites. We run codeml twice since we want to compare the log-likelihood of the model when the parameters are optimised on a fixed tree topology (provided with the exercise) and when the branch lengths are re-estimated using the selected substitution model.

Execution

We are going to use the following dataset:
- bglobin.nuc (MSA)
- bglobin.tree (Tree)

Run 1 – branch lengths fixed
We create a new control file for codeml using the template provided in PaML folder. Then, we modify the following settings in order to run the analysis according to the exercise requests:

seqfile = /path/to/bglobin.nuc       * sequence data filename
treefile =/path/to/bglobin.trees     * tree structure file name
outfile = /path/to/output_m0_fixed   * main result file name

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs

model = 0    * models for codons: 0
             *  0:one, 1:b, 2:2 or more dN/dS ratios for branches   

NSsites = 0  * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
                * 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ
                * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                * 13:3normal>0

fix_blength = 2  * 0: ignore, -1: random, 1: initial, 2: fixed

Run 2 – branch lengths re-estimated

seqfile = /path/to/bglobin.nuc       * sequence data filename
treefile =/path/to/bglobin.trees     * tree structure file name
outfile = /path/to/output_m0_ML      * main result file name

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs

model = 0    * models for codons: 0
             *  0:one, 1:b, 2:2 or more dN/dS ratios for branches   

NSsites = 0  * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;
                * 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ
                * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;
                * 13:3normal>0

fix_blength = 1  * 0: ignore, -1: random, 1: initial, 2: fixed
Questions

1. Do you observe the codon frequency bias?

Sums of codon usage counts
------------------------------------------------------------------------------
Phe F TTT      65 | Ser S TCT      29 | Tyr Y TAT      12 | Cys C TGT      22
      TTC      68 |       TCC      44 |       TAC      32 |       TGC       6
Leu L TTA       1 |       TCA       4 | *** * TAA       0 | *** * TGA       0
      TTG      15 |       TCG       0 |       TAG       0 | Trp W TGG      43
------------------------------------------------------------------------------
Leu L CTT      13 | Pro P CCT      40 | His H CAT      48 | Arg R CGT       2
      CTC      54 |       CCC      24 |       CAC      78 |       CGC      11
      CTA       6 |       CCA       9 | Gln Q CAA       7 |       CGA       0
      CTG     203 |       CCG       4 |       CAG      65 |       CGG       1
------------------------------------------------------------------------------
Ile I ATT      16 | Thr T ACT      45 | Asn N AAT      49 | Ser S AGT      31
      ATC      28 |       ACC      46 |       AAC      70 |       AGC      19
      ATA       3 |       ACA       4 | Lys K AAA      36 | Arg R AGA       2
Met M ATG      23 |       ACG       1 |       AAG     163 |       AGG      43
------------------------------------------------------------------------------
Val V GTT      62 | Ala A GCT     113 | Asp D GAT      57 | Gly G GGT      49
      GTC      51 |       GCC     132 |       GAC      65 |       GGC     109
      GTA       4 |       GCA      12 | Glu E GAA      40 |       GGA      17
      GTG     147 |       GCG       7 |       GAG      81 |       GGG      17
------------------------------------------------------------------------------

2. Which position displays the most bias? Why?

Codon position x base (3x4) table, overall

position  1:    T:0.13930    C:0.23080    A:0.23652    G:0.39338
position  2:    T:0.31005    C:0.20997    A:0.32802    G:0.15196
position  3:    T:0.26675    C:0.34191    A:0.05923    G:0.33211
Average         T:0.23870    C:0.26089    A:0.20792    G:0.29248

3. What is the ML estimate of the transition-transversion ratio κ?

fixed branch lengths re-estimated branch lengths
2.16189 2.07049

4. What is the ML estimate of the -ratio?

fixed branch lengths re-estimated branch lengths
0.20964 0.23685

5. How do you interpret these ML estimates?

The estimates obtained performing a branch length optimisation are tuned on the codon substitution model selected. This procedure optimises the parameters values together with the topology. However, this does not indicate the correctness of the values inferred since this procedure does not perform tree topology rearrangements. One can also avoid this procedure disabling the optimisation of the branch lengths. Then, the codon substitution model parameters are optimised at best on the topology provided.


2. Branch models

GOAL: In this exercise we estimate the parameter following several different strategies:

a. allowing it to change at each branch on the tree topology,
b. fixing it on one single branch (two-categories)
c. fixing it on two branches (three-categories).

Execution

For all the following tasks we are going to use the following settings:

seqfile = /path/to/lysozymeSmall.nuc     * sequence data filename
treefile = /path/to/lysozymeSmall.trees  * tree structure file name

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs

fix_blength = 1  * 0: ignore, -1: random, 1: initial, 2: fixed

a. free-ratio model

In the codeml.ctl file we set the following parameters:

model = 1        * models for codons:
                 *  0:one, 1:b, 2:2 or more dN/dS ratios for branches

b. two dN/dS ratios

In the codeml.ctl file we set the following parameters:

model = 2        * models for codons:
                 *  0:one, 1:b, 2:2 or more dN/dS ratios for branches

In the tree file lysozymeSmall.trees we must specify the branch where we think the parameter differs from the rest of the tree.

PaML (and codeML) accepts only the following label format at branch
level: '#1' where 1 is an integer greater than 0

((Hsa_Human, Hla_gibbon) ,((Cgu/Can_colobus, Pne_langur) '#1', Mmu_rhesus), (Ssc_squirrelM, Cja_marmoset));

In this case we labeled the branch leading to the clade [colobus, langur]. We are therefore assuming that only this branch will have a value for that differs from the rest of the tree.

c. three dN/dS ratios

In the codeml.ctl file we set the following parameters:

model = 2        * models for codons:
                 *  0:one, 1:b, 2:2 or more dN/dS ratios for branches

In the tree file lysozymeSmall.trees we must specify the branches where we think the parameter differs from the rest of the tree.

PaML (and codeML) accepts only the following label format at branch
level: '#1' where 1 is an integer greater than 0

((Hsa_Human, Hla_gibbon) '#2',((Cgu/Can_colobus, Pne_langur) '#1', Mmu_rhesus), (Ssc_squirrelM, Cja_marmoset));

In this case we labeled the branch leading to the clade [colobus, langur] and the branch leading to the human clade [Hsa_Human, Hla_gibbon]. We are therefore assuming that only those branches will
have values for that differ from the rest of the tree.

Questions

1. Are results consistent with those presented in ref. (2)?

2. Based on LRTs, what model fits your data the best (among 2-ratio, 3-ratio and free-ratio models)?

3. What are the degrees of freedoms for each comparison?

4. What can you tell about the evolution of your gene from the ML estimates under this best model?


3. Site-models

GOAL: In this exercise we apply different codon models to a protein-coding dataset in order to evaluate which one fits better our dataset w.r.t. natural selection detection. We evaluate alternative evolutionary hypotheses using nested codon models.

Execution

For all the following tasks we are going to use the following settings:

seqfile = /path/to/bglobin.nuc         * sequence data filename
treefile = /path/to/bglobin.trees      * tree structure file name

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs

fix_blength = 1  * 0: ignore, -1: random, 1: initial, 2: fixed

Fitting model M1 to M8 in one single run

model = 0
NSsites = 1 2 3 7 8

fix_omega = 0  * 1: omega or omega_1 fixed, 0: estimate
    omega = .4 * initial or fixed omega, for codons or codon-based AAs

Fitting model M8a

  NSsites = 8
  fix_omega = 1  * 1: omega or omega_1 fixed, 0: estimate
  omega = 1      * initial or fixed omega, for codons or codon-based AAs
Questions

1. Which models are nested?

We compute the LRT statistic between nested hypothesis using the
following relation:

If the hypothesis is correct, then it is asymptotically distributed as the distribution with degrees of freedom ( is equal to the difference in the number of parameters in and . On the other hand, I will discard the simplest model in favour of the most complex one, if the LRT value is greater of the value of the distribution with degrees of freedom for the p-value I selected (i.e. p-value = 0.05).

model degrees of freedom LRT Significant (p-value = 0.01)
M3 vs M0 4 256.9 yes, I reject M0
M2 vs M1 2 7.190 no, I accept M1
M8 vs M7 2 22.178 yes, I reject M7
M8 vs M8a 1 12.261 yes, I reject M8a

2. How many degrees of freedom do you use each time to test for significance of the LRT statistic?

3. Do your tests suggest positive selection?

We tested for positive selection every time we compare the likelihood of two nested models (i.e. M2 vs M1) where one allows for positive selection (it allows for portion of sites on the MSA), while the other does not ().

M2a vs M1a, M8 vs M7, M8 vs M8a model tests can detect positive selection.

4. Interpret the ML estimates relevant to selective pressure.

Codon models can identify positive selection monitoring the parameter (i.e. on the branches or at each site), therefore we can observe the ML estimates of the parameter when inferred under different codon models:

parameter M0 M1 M2 M3 M7 M8 M8a
0.23685 0.12 0.12 0.02 0.001 0.002 0.004
1.00 1.00 0.30 0.008 0.015 0.022
2.94 1.69 0.03 0.038 0.050
0.07 0.070 0.088
0.13 0.112 0.139
0.21 0.165 0.207
0.32 0.233 0.307
0.45 0.323 0.491
0.63 0.446 1.00
0.85 0.651
2.081
proportion 1.00 0.77 0.76 0.39 0.10 0.09 0.11
0.23 0.20 0.54 0.10 0.09 0.11
0.03 0.08 0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09 0.11
0.10 0.09
0.06

M0: Constant : The ML estimates inferred using the M0 model provide a rough average value distributed over the MSA sites. M0 does not allow for distinct classes, therefore we will not retrieve any information regards positive selection.

M1: NearlyNeutral – (2 categories): This model allows for two different value classes ( ). We observe that the 77% of sites shows an , while 23% of sites shows an . If this model is selected by LRT, one can then discuss about a nearly neutral selection that acted on the dataset.

M2: PositiveSelection – (3 categories): This model allows for three different classes of (negative selection, neutral selection, positive selection). We observe the following that the majority (76%) of sites are under negative selection, 20% of sites is neutrally selected and only the 3% of sites is positively selected (with an value = 2.94). The LRT test comparing M2 and M1 selected M1 in favour of M2 with a significance cutoff of 1%.

M3: discrete – (3 categories): This model estimates the value without forcing it to belong to the conventional classes (i.e. negative, neutral, positive). The ML estimation estimates 3 distinct averaged values for the \omega parameter from 3 distinct site proportions. As matter of fact, here we retrieved 2/3 of values which indicate negative selection (). The centre of mass of the sites is located in the negative selection region of the value distribution.

M7: beta – (10 categories): This model infers the parameter values using a Beta distribution with parameters p and q. The ML routine optimises these parameters and cuts the distribution in 10 equal classes. We can observe that all the sites of our MSA are equally distributed across the inferred values, without bias. This model does not account for a specific class which can indicate positive selection.

M8: beta – (11 categories): M8 model is a direct extension of M7 model since it accounts for an additional class for positive selected sites. We observe an homogeneous distribution of the sites among the classes describing neutral or negative selection, while a small proportion is shown to be under strict positive selection. From LRT comparison, we select the model M8 over its alternative M7, confirming the presence of a portion of sites under positive selection.

M8a: beta&w=1 – (11 categories): This variation of the M8 model fixes the upper boundary of the parameter to assume (at maximum) a neutral value . Despite we observe an equal distribution of the sites among the classes, the hypothesis assumed by this model is discarded in favour of the M8 model allowing for positive selection.

5. If LRTs suggest positive selection, which sites are inferred by the Bayesian approach to be under positive selection (models M2 and M8)?

M2 – discarded

Naive Empirical Bayes (NEB) analysis
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)

        Pr(w>1)     post mean +- SE for w

 7 S      0.913         2.772
50 D      0.889         2.725
67 G      0.835         2.620
123 P      0.872         2.693

Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)

        Pr(w>1)     post mean +- SE for w

 7 S      0.922         2.929 +- 0.956
48 T      0.542         2.037 +- 1.072
50 D      0.905         2.904 +- 0.991
67 G      0.860         2.798 +- 1.039
123 P      0.883         2.818 +- 0.988

M8 – accepted

Naive Empirical Bayes (NEB) analysis
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)

        Pr(w>1)     post mean +- SE for w

 7 S      0.995**       2.074
42 S      0.845         1.856
48 T      0.942         1.997
50 D      0.990**       2.067
54 G      0.781         1.762
67 G      0.988*        2.064
85 T      0.866         1.887
123 P      0.995**       2.075


Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol. Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: human)

        Pr(w>1)     post mean +- SE for w

 7 S      0.979*        2.442 +- 0.534
42 S      0.539         1.625 +- 0.846
48 T      0.833         2.176 +- 0.753
50 D      0.970*        2.430 +- 0.554
54 G      0.520         1.594 +- 0.870
67 G      0.961*        2.413 +- 0.573
85 T      0.616         1.774 +- 0.860
123 P      0.976*        2.435 +- 0.540

6. Do NEB and BEB agree on the sites inferred?

Both the methods agree on the sites inferred, however they compute different significance values.

7. Compare results from the LRT comparing M7 vs M8 and the LRT comparing M8a vs M8. Are they both significant (or both non-significant)? If they are both significant, does the Bayesian approach predict the same sites?

The LRT statistics selects M8 over M7 and M8 over M8a.


4. Branch-site models

GOAL: We want to test if positive selection is acting on sites and these events happened in specific times in the past on the lineages (episodic selection).

Execution

1. We test the first hypothesis H0 ()

We prepare the codeml.ctl file in the following way:

 seqfile  = lysozymeSmall.nuc        * sequence data filename
 treefile = lysozymeSmall.trees      * tree structure file name
 outfile  = output_omega_fixed.txt   * main result file name

(snip...)

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2  * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table

(snip...)

model = 2
                    * models for codons:
                    * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
(snip...)

NSsites = 2  * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;

(snip...)

fix_omega = 1  * 1: omega or omega_1 fixed, 0: estimate 
omega = 1 * initial or fixed omega, for codons or codon-based AAs

2. We test the second hypothesis H1 ( estimated from the data)

We prepare the codeml.ctl file in the following way:

 seqfile  = lysozymeSmall.nuc        * sequence data filename
 treefile = lysozymeSmall.trees      * tree structure file name
 outfile  = output_omega_estimated.txt   * main result file name

(snip...)

seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs
CodonFreq = 2  * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table

(snip...)

model = 2
                    * models for codons:
                    * 0:one, 1:b, 2:2 or more dN/dS ratios for branches
(snip...)

NSsites = 2  * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;

(snip...)

fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate 
Questions
  1. How many LRTs are significant?
  2. How many LRTs remain significant after the Bonferroni correction for multiple testing?
  3. What can you tell about the evolution of your gene from the ML estimates obtained after the multiple LRT procedure?

5. Effect of the tree topology (coming soon)

GOAL:

Execution
Questions

6. Feeling adventurous? (coming soon)

GOAL:

Execution
Questions

References

  1. Z. Yang, R. Nielsen, N. Goldman, A. Krabbe Pedersen, Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites, GENETICS May 1, 2000 vol. 155 no. 1 431-449
  2. Z Yang, Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution, Molecular Biology and Evolution, Oxford University Press, 1998
  3. Jianzhi Zhang, Rasmus Nielsen, and Ziheng Yang, Evaluation of an Improved Branch-Site Likelihood Method for Detecting Positive Selection at the Molecular Level, Mol Biol Evol (December 2005) 22(12): 2472-2479 first published online August 17, 2005
    doi:10.1093/molbev/msi237
  4. PaML documentation

Document last updated on 03.02.2016


© 2016 Lorenzo Gatti – Applied Computational Genomic Team (ACGT) @ Institute of Applied Simulations (ZHAW) | Wädenswil | Zürich