Exercise 7

We will use the codeml program from PAML by Ziheng Yang. Use the command-line mode for the tasks below. First, you need to understand which control file options to use. Next, try to reproduce the analyses with codeml.
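To make the control file options concrete, here is a minimal Python sketch that builds two control files for the branch-site test used later in this exercise: one where ω is estimated and one where ω is fixed at 1. The option names are standard codeml control-file variables; the output file names, the starting value of omega, and the other values shown are assumptions for illustration, so check them against the PAML manual before running.

```python
# Settings shared by both fits. The keys are codeml control-file options;
# the values are assumptions chosen for this exercise.
COMMON = {
    "seqfile": "lysozymeSmall.nuc",
    "treefile": "lysozymeSmall.trees",  # assumed: a copy with one branch labelled #1
    "noisy": 3,
    "verbose": 1,
    "runmode": 0,      # use the user-supplied tree
    "seqtype": 1,      # codon sequences
    "CodonFreq": 2,    # F3x4 codon frequencies
    "model": 2,        # branch labels in the tree define the foreground branch
    "NSsites": 2,      # together with model = 2: branch-site model A
    "icode": 0,        # universal genetic code
    "fix_kappa": 0,    # estimate kappa
    "kappa": 2,        # starting value for kappa
    "cleandata": 0,
}


def control_file(outfile, fix_omega, omega):
    """Return the text of a codeml control file as a single string."""
    options = dict(COMMON, outfile=outfile, fix_omega=fix_omega, omega=omega)
    width = max(len(name) for name in options)
    lines = [f"{name:>{width}} = {value}" for name, value in options.items()]
    return "\n".join(lines) + "\n"


# Alternative: omega for the foreground class is estimated (1.5 is only a start value).
alt_ctl = control_file("mlc_MA_alt", fix_omega=0, omega=1.5)
# Null: omega for the foreground class is fixed at 1.
null_ctl = control_file("mlc_MA_null", fix_omega=1, omega=1)

print(alt_ctl)
```

Saving each string to its own file (for example MA_alt.ctl and MA_null.ctl, both hypothetical names) and running `codeml MA_alt.ctl` and `codeml MA_null.ctl` would produce the two log-likelihoods needed for the LRT.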

You will need a dataset of homologous protein-coding DNA sequences (starting with the 1st codon position and ending with the 3rd). We will use data from published articles and reproduce the published results:

Branch models: Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568-573.
Data 1: lysozymeSmall.nuc
Tree 1: lysozymeSmall.trees

Branch-site models.

Use the small lysozyme example to fit branch-site models:

For each branch of the tree (one at a time, taken as the foreground branch), perform the LRT comparing branch-site model A (MA) with ω estimated against MA with ω fixed at 1.
How many LRTs are significant?
How many LRTs remain significant after the Bonferroni correction for multiple testing?
What can you tell about the evolution of your gene from the ML estimates obtained after the multiple LRT procedure?
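Each LRT above reduces to a small calculation once codeml has reported the two log-likelihoods. The sketch below assumes the test statistic is compared with a chi-square distribution with 1 degree of freedom, which is a common and conservative choice for this comparison; the log-likelihood values are hypothetical placeholders, not results from the lysozyme data.

```python
import math


def lrt_df1(lnl_null, lnl_alt):
    """Likelihood ratio test with 1 degree of freedom.

    Returns the statistic 2 * (lnL_alt - lnL_null) and its p-value.
    For 1 degree of freedom the chi-square upper tail probability
    equals erfc(sqrt(x / 2)), so no external library is needed.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # small negative values can arise from optimisation noise
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for one foreground branch.
stat, p_value = lrt_df1(lnl_null=-906.00, lnl_alt=-904.08)
print(f"2*delta lnL = {stat:.2f}, p = {p_value:.4f}")
```

Repeating this for every branch gives one p-value per test, which is the input needed for the multiple-testing correction.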

About Bonferroni Correction
Any time you reject a null hypothesis because a P value is less than your critical value, it's possible that you're wrong; the null hypothesis might really be true, and your significant result might be due to chance.

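When you perform many tests, the chance that at least one of them is significant purely by chance grows with the number of tests. To control this, instead of setting the critical P level for significance, or alpha, to the usual value (e.g. 0.05) for each test, you use a lower critical value. If the null hypothesis is true for all of the tests, the probability of getting at least one result that is significant at this new, lower critical value is then no more than 0.05.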

The most common way to correct for multiple testing is with the Bonferroni correction. You find the critical value (alpha) for an individual test by dividing the overall significance level (e.g. 0.05) by the number of tests. Thus, if you are doing 100 statistical tests, the critical value for an individual test would be 0.05/100 = 0.0005, and you would only consider individual tests with P < 0.0005 to be significant.
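The correction described above can be sketched in a few lines of Python. The branch names and p-values here are hypothetical and only illustrate the arithmetic; in the exercise you would use one p-value per branch tested.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and the tests that pass it.

    p_values maps a test name to its p-value; alpha is the overall
    significance level to be divided among the tests.
    """
    threshold = alpha / len(p_values)
    significant = {name: p for name, p in p_values.items() if p < threshold}
    return threshold, significant


# Hypothetical p-values from five branch-wise LRTs.
p_values = {
    "branch_a": 0.0004,
    "branch_b": 0.012,
    "branch_c": 0.20,
    "branch_d": 0.049,
    "branch_e": 0.0075,
}

threshold, significant = bonferroni(p_values)
print(f"per-test threshold: {threshold}")
print("significant after correction:", sorted(significant))
```

With five tests the per-test threshold is 0.05/5 = 0.01, so branch_b and branch_d, which would pass an uncorrected 0.05 cutoff, are no longer counted as significant.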
