PAML (Phylogenetic Analysis with Maximum Likelihood) is a software package written by Ziheng Yang in 1993, and further developed in the 90s and years 20**. We will in this course focus on the program called codeml, dedicated to protein sequence analysis.
Codeml is a command line program, meaning that it has to be run from a terminal. In order to work, it requires 3 files:
The option file is the most important one. It contains links toward the two others. It contains a series of options in the form tag = value
. Everything following a *
is ignored by the program. The tree file is either:
In many cases the tree is inferred from the sequences themselves, which codeml can do. However, it is rather inefficient for that task, and it is recomended to use external software for this purpose. We here used a software called PhyML, and we provide the corresponding trees for the two data sets.
The basic function of codeml is its ability to compute the likelihood of a large set of models. We will demonstrate this using the first dataset and a homogeneous model (all positions share the same dN/dS).
Mg1/M0_profile
codeml.ctl
file with a text editor (e.g. Notepad)cd
to the M0_profile
directory, and launch codeml by typing codeml
in the command lineFinding the value which maximize the likelihood function is in most cases tedious, as several parameters have to be jointly estimated. Luckily codeml can do that for us…
Mg1/M0
. The codeml.ctl
file here has been modified to let codeml find the best parameter values. The modifications are:fix_kappa = 0
fix_omega = 0
fix_blength = -1
is commented (a *
was added at the beginning of the line).mlc.txt
, which you can open in a text editor.Codeml can fit several types of models. Here we will consider to model which allow each position to chose between distinct selection regime. In the first model, called M1a, each position in the alignment can chose between a value of omega = 1 (neutral evolution) and a value of omega < 1 (purifying selection), which is estimated by the program. This model therefore has 1 additional parameter than the homogeneous one (called M0), namely the proportion of positions with omega < 1. The second model is called M2a, and allows each position to chose between three possible values for omega, namely omega = 1 (neutral evolution), omega < 1 (purifying selection) and omega > 1 (positive selection), the two latter ones being estimated. This model therefore contains two more parameters than M1a, the proportion of positively selected sites and their corresponding omega value. We will fit these two models to our data:
M1a-M2a
directory. The codeml.ctl file has been modified to run the M1a and M2a models instead of M0 (line Nssites = 1 2
).mlc.txt
file. Note the lnL values for both models, and the corresponding parameter estimates.