Exercise 3 - Site models

We will use the codeml program from the PAML package by Ziheng Yang. Use the command-line mode for the tasks below. First, work out which control file options you need. Then try to reproduce the published analyses with codeml.
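As a starting point for the control file, here is a minimal Python sketch that renders one codeml control file per site model. The option names (seqfile, treefile, NSsites, fix_omega and so on) are standard codeml control file variables, and M8a is obtained as NSsites = 8 with fix_omega = 1 and omega = 1. The helper names, the output file names, the starting values, and the choices CodonFreq = 2, ncatG = 3 for M3 and ncatG = 10 for M7, M8 and M8a are assumptions for illustration; check them against the PAML documentation and the original article before running anything.

```python
# Sketch: render one codeml control file per site model.
# Helper names, output names and starting values are illustrative assumptions.

CTL_TEMPLATE = """\
      seqfile = {seqfile}
     treefile = {treefile}
      outfile = {outfile}

        noisy = 3
      verbose = 1
      runmode = 0   * user tree; branch lengths are estimated by ML
      seqtype = 1   * codon sequences
    CodonFreq = 2   * F3x4 (assumption, not prescribed by the exercise)
        model = 0   * same omega classes on all branches
      NSsites = {nssites}
        ncatG = {ncatg}   * site classes (M3) or beta categories (M7, M8)
        icode = 0   * universal genetic code
    fix_kappa = 0   * estimate kappa
        kappa = 2   * starting value (assumption)
    fix_omega = {fix_omega}
        omega = {omega}
"""

# One entry per site model requested in the exercise.
# omega is a starting value, except for M8a where it is fixed to 1.
SITE_MODELS = {
    "M1":  {"nssites": 1, "ncatg": 3,  "fix_omega": 0, "omega": 0.4},
    "M2":  {"nssites": 2, "ncatg": 3,  "fix_omega": 0, "omega": 0.4},
    "M3":  {"nssites": 3, "ncatg": 3,  "fix_omega": 0, "omega": 0.4},
    "M7":  {"nssites": 7, "ncatg": 10, "fix_omega": 0, "omega": 0.4},
    "M8":  {"nssites": 8, "ncatg": 10, "fix_omega": 0, "omega": 0.4},
    "M8a": {"nssites": 8, "ncatg": 10, "fix_omega": 1, "omega": 1},
}


def render_ctl(model, seqfile, treefile):
    """Return the control file text for one site model."""
    opts = SITE_MODELS[model]
    return CTL_TEMPLATE.format(
        seqfile=seqfile,
        treefile=treefile,
        outfile=f"{model}.mlc",  # hypothetical output file name
        **opts,
    )


if __name__ == "__main__":
    # Show the control file for M8a on the beta-globin dataset.
    print(render_ctl("M8a", "bglobin.nuc", "bglobin.tree"))
```

Each rendered text can be saved to a file (for example M8a.ctl, a hypothetical name) and passed to the program as `codeml M8a.ctl`.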

You will need a dataset of homologous protein-coding DNA sequences (starting at the first codon position and ending at the third). We will use data from published articles and reproduce the published results:

Site-models: Yang, Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431-449.
Data 1: bglobin.nuc Tree 1: bglobin.tree
Data 2: HIVenvSweden.nuc Tree 2: HIVenvSweden.trees
Data 3: adh.nuc Tree 3: adh.trees

Site-models.

Choose one of the datasets from the publication above and fit the following site models to your data:
M1, M2, M3, M7, M8a, and M8 (always estimate branch lengths by ML).
Note 1: Model M0 was already fitted in Exercise 1 (make sure you have the output file).
Note 2: M8a is model M8 with ω for the additional discrete site class fixed to 1.
Which models are nested?
Perform likelihood ratio tests (LRTs) of nested hypotheses. How many degrees of freedom do you use each time to test for significance of the LRT statistic? Do your tests suggest positive selection?
Interpret the ML estimates relevant to selective pressure.
If LRTs suggest positive selection, which sites are inferred by the Bayesian approach to be under positive selection (models M2 and M8)?
Do the naive empirical Bayes (NEB) and Bayes empirical Bayes (BEB) methods agree on the inferred sites?
Compare results from the LRT comparing M7 vs M8 and the LRT comparing M8a vs M8. Are they both significant (or both non-significant)? If they are both significant, does the Bayesian approach predict the same sites?
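
The likelihood ratio tests asked for above can be sketched as follows: the statistic is twice the log-likelihood difference between the alternative and the null model, and it is compared against a chi-square distribution. This is a minimal sketch using only the Python standard library. The function names and the log-likelihood values in the example are made up, and the degrees of freedom are left as an argument for you to work out for each pair of models.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 2 (even case) and df = 1 (odd case) start the
    # recurrence sf(df + 2) = sf(df) + half**(df/2) * exp(-half) / Gamma(df/2 + 1).
    if df % 2 == 0:
        k, sf = 2, math.exp(-half)
    else:
        k, sf = 1, math.erfc(math.sqrt(half))
    while k < df:
        log_term = (k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0)
        sf += math.exp(log_term)
        k += 2
    return sf


def lrt(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for a likelihood ratio test of nested models."""
    # Small negative values can arise from incomplete optimisation; clamp to 0.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return stat, chi2_sf(stat, df)


if __name__ == "__main__":
    # Made-up log-likelihoods; replace them with the lnL values from your
    # own codeml output files, and set df for the model pair being compared.
    stat, p = lrt(lnl_null=-1050.0, lnl_alt=-1040.0, df=2)
    print(f"2*delta(lnL) = {stat:.3f}, p = {p:.3g}")
```

The plain chi-square comparison shown here is the standard case; some model pairs place the null value on the boundary of the parameter space, which is worth keeping in mind when choosing the reference distribution.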

Please refer to the PAML/codeml documentation.
