Exercise 1 - Simple codon model

We will use the codeml program from the PAML package by Ziheng Yang. Use the command-line mode for the tasks below. First, work out which control file options you need. Then try to reproduce the analyses with codeml.

You will need a dataset of homologous protein-coding DNA sequences (starting with the 1st codon position and ending with the 3rd). We will use data from a published article and regenerate the published results:

Site-models: Yang, Z., R. Nielsen, N. Goldman, A.-M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431-449.
Data 1: bglobin.nuc Tree 1: bglobin.tree
Data 2: HIVenvSweden.nuc Tree 2: HIVenvSweden.trees
Data 3: adh.nuc Tree 3: adh.trees

The simple codon model with constant ω.

Choose a dataset from the publication above and fit model M0, the simplest codon model, with constant ω over time and across sites. Run model M0 twice: once with branch lengths fixed to those in the tree file, and once with branch lengths estimated by ML.
Compare the optimised log-likelihoods for the two runs. In which case is it higher? Why?
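
The two runs differ only in how codeml treats the branch lengths in the tree file, so one way to set them up is to generate both control files from a shared set of options. The sketch below is a starting point, not the official solution: the file names `m0_fixed.ctl`, `m0_free.ctl` and the `.mlc` output names are hypothetical, the bglobin dataset is just one of the three choices, and `CodonFreq = 2` (F3x4) and the initial values of κ and ω are assumptions to check against the paper and the PAML manual. It assumes `fix_blength = 2` keeps the tree-file branch lengths fixed and `fix_blength = 1` uses them only as starting values, which requires a tree file that contains branch lengths.

```python
from pathlib import Path

# Options shared by both runs. Values marked "assumption" are illustrative
# choices, not settings taken from the exercise text.
M0_OPTIONS = {
    "seqfile": "bglobin.nuc",
    "treefile": "bglobin.tree",
    "noisy": 3,
    "verbose": 1,
    "runmode": 0,     # 0: evaluate the user tree in treefile
    "seqtype": 1,     # 1: codon sequences
    "CodonFreq": 2,   # 2: F3x4 codon frequencies (assumption)
    "model": 0,       # 0: one omega for all branches
    "NSsites": 0,     # 0: one omega for all sites, i.e. model M0
    "icode": 0,       # 0: universal genetic code
    "fix_kappa": 0,   # 0: estimate kappa
    "kappa": 2,       # initial value (assumption)
    "fix_omega": 0,   # 0: estimate omega
    "omega": 0.4,     # initial value (assumption)
    "clock": 0,       # 0: no molecular clock
    "cleandata": 0,   # 0: keep sites with gaps or ambiguity codes
}


def build_ctl(outfile, fix_blength):
    """Return the control-file text for one M0 run."""
    options = dict(M0_OPTIONS, outfile=outfile, fix_blength=fix_blength)
    return "".join(f"{key:>12} = {value}\n" for key, value in options.items())


# Hypothetical file names: one control file per run.
RUNS = {
    "m0_fixed.ctl": build_ctl("m0_fixed.mlc", fix_blength=2),  # lengths fixed
    "m0_free.ctl": build_ctl("m0_free.mlc", fix_blength=1),    # lengths estimated
}

if __name__ == "__main__":
    for name, text in RUNS.items():
        Path(name).write_text(text)
```

After writing the files, each analysis can be started from the command line by passing the control file to the program, for example `codeml m0_fixed.ctl`.
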
Next, study the output file for the run with estimated branch lengths. Do you observe codon frequency bias?
Study the statistics of nucleotide usage at the different codon positions. Which position displays the most bias? Why? What is the ML estimate of the transition/transversion rate ratio κ? What is the ML estimate of the ω ratio? How do you interpret these ML estimates?
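
If you prefer not to read the numbers off the output file by eye, a short script can pull out the log-likelihood, the number of parameters, κ and ω. This is a minimal sketch that assumes the main output file contains lines of the form `lnL(ntime: ...  np: ...): ...`, `kappa (ts/tv) = ...` and `omega (dN/dS) = ...`, as codeml writes for M0; the function name `parse_m0` is hypothetical, and the numbers in `SAMPLE` are invented purely to exercise the parser and are not results for any of the datasets.

```python
import re

# Patterns for three lines of a codeml M0 output file (assumed layout).
LNL_RE = re.compile(r"lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)")
KAPPA_RE = re.compile(r"kappa \(ts/tv\)\s*=\s*(\d+\.\d+)")
OMEGA_RE = re.compile(r"omega \(dN/dS\)\s*=\s*(\d+\.\d+)")


def parse_m0(text):
    """Extract lnL, parameter count, kappa and omega from M0 output text."""
    lnl = LNL_RE.search(text)
    kappa = KAPPA_RE.search(text)
    omega = OMEGA_RE.search(text)
    if not (lnl and kappa and omega):
        raise ValueError("expected lnL, kappa and omega lines were not found")
    return {
        "ntime": int(lnl.group(1)),   # number of branch-length parameters
        "np": int(lnl.group(2)),      # total number of free parameters
        "lnL": float(lnl.group(3)),
        "kappa": float(kappa.group(1)),
        "omega": float(omega.group(1)),
    }


# Invented numbers, used only to show the expected line layout.
SAMPLE = """
lnL(ntime: 17  np: 19):  -1000.250000      +0.000000
kappa (ts/tv) =  2.50000
omega (dN/dS) =  0.20000
"""

result = parse_m0(SAMPLE)
print(result)
```

To use it on a real run, read the output file into a string first, for example `parse_m0(open("m0_free.mlc").read())` with the hypothetical file name from your own control file. Applying it to both runs gives the two lnL values and parameter counts side by side for the comparison above.
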

Please refer to the PAML/codeml documentation for details of the control file options and the output format.
