Inference of positive selection with PAML

PAML (Phylogenetic Analysis with Maximum Likelihood) is a software package written by Ziheng Yang in 1993, and further developed in the 90s and years 20**. We will in this course focus on the program called codeml, dedicated to protein sequence analysis.

Codeml is a command line program, meaning that it has to be run from a terminal. In order to work, it requires 3 files:

A text file containing the sequences, in a dedicated format (extended Phylip)
A tree file, describing the relationships between the sequences, as a text file in the so-called parenthetic (aka New Hampshire or Newick) format
A text file containing several options for tuning the program.

The option file is the most important one. It contains links toward the two others. It contains a series of options in the form tag = value. Everything following a * is ignored by the program. The tree file is either:

A phylogeny, in case sequences are from different species. The tree therefore depicts the history of species.
A genealogy, in case sequences are from different individuals of the same species. The tree then depicts the closeness between all individuals.

In many cases the tree is inferred from the sequences themselves, which codeml can do. However, it is rather inefficient for that task, and it is recomended to use external software for this purpose. We here used a software called PhyML, and we provide the corresponding trees for the two data sets.

Estimating omega = dN/dS with maximum likelihood.

The basic function of codeml is its ability to compute the likelihood of a large set of models. We will demonstrate this using the first dataset and a homogeneous model (all positions share the same dN/dS).

Go to directory Mg1/M0_profile
Edit the codeml.ctl file with a text editor (e.g. Notepad)
Find the line setting the initial value of omega, and set it to 1.0, and save the file
cd to the M0_profile directory, and launch codeml by typing codeml in the command line
Codeml outputs several statistics… The interesting one here is the one stating lnL = - ???? This corresponds to the value of the log likelihood. Note it with its corresponding omega value (for example in Excel).
Repeat steps 3 to 5 with different values for omega. What is the value of omega which maximizes lnL?

Fitting a model with codeml

Finding the value which maximize the likelihood function is in most cases tedious, as several parameters have to be jointly estimated. Luckily codeml can do that for us…

Go to directory Mg1/M0. The codeml.ctl file here has been modified to let codeml find the best parameter values. The modifications are:

fix_kappa = 0
fix_omega = 0
and the line fix_blength = -1 is commented (a * was added at the beginning of the line).

Run codeml. In the output stands the value for omega (noted w), together with the estimated value sof dN and dS, among others. Codeml also gather results in a file named mlc.txt, which you can open in a text editor.
Discuss the value of omega obtained
Repeat the procedude for the second data set (Mg3). Compare the value of omega obtained.

More complex models and hypothesis testing

Codeml can fit several types of models. Here we will consider to model which allow each position to chose between distinct selection regime. In the first model, called M1a, each position in the alignment can chose between a value of omega = 1 (neutral evolution) and a value of omega < 1 (purifying selection), which is estimated by the program. This model therefore has 1 additional parameter than the homogeneous one (called M0), namely the proportion of positions with omega < 1. The second model is called M2a, and allows each position to chose between three possible values for omega, namely omega = 1 (neutral evolution), omega < 1 (purifying selection) and omega > 1 (positive selection), the two latter ones being estimated. This model therefore contains two more parameters than M1a, the proportion of positively selected sites and their corresponding omega value. We will fit these two models to our data:

cd to the M1a-M2a directory. The codeml.ctl file has been modified to run the M1a and M2a models instead of M0 (line Nssites = 1 2).
Run codeml
Edit the mlc.txt file. Note the lnL values for both models, and the corresponding parameter estimates.
Repeat steps 1 to 3 for the second data set. In both cases, which model has the better likelihood?
Perform the likelihood ratio test for both data sets. Conclusions?