02.06.2015

Simulating sequence data

The Seq-Gen program only support nucleotide and protein models of sequence simulation. In order to simulate coding sequences, we need to account for the underlying genetic code. We will therefore use the Evolver program from Ziheng Yang's PAML package.

Preparing Evolver's input file

We generate a genealogy as before:

ms 10 1 -T | tail -n +4 | grep -v // > genealogy.dnd

Evolver reads its parameters from an input file, including the tree. We therefore have to edit MCcodon.dat and copy-paste the tree on line 6.

Using Evolver

We then launch Evolver and chose option 6:

evolver
EVOLVER in paml version 4.8a, August 2014
Results for options 1-4 & 8 go into evolver.out

  (1) Get random UNROOTED trees?
  (2) Get random ROOTED trees?
  (3) List all UNROOTED trees?
  (4) List all ROOTED trees?
  (5) Simulate nucleotide data sets (use MCbase.dat)?
  (6) Simulate codon data sets      (use MCcodon.dat)?
  (7) Simulate amino acid data sets (use MCaa.dat)?
  (8) Calculate identical bi-partitions between trees?
  (9) Calculate clade support values (evolver 9 treefile mastertreefile <pick1tree>)?
  (11) Label clades?
  (0) Quit?

Looking at the output

Simulated sequences are in mc.paml, in the Phylip format. We need to reformat a bit the output (removing blank lines etc.):

sed '/^\s*$/d' -i mc.paml

and then open in Seaview

seaview mc.paml

Note that Seaview has a convenient option to translate to proteins (Props -> View as proteins)

Testing for selection

Then run the BioPerl script to compute Ka/Ks:

perl NeiGojobori.plx mc.paml

Batching…

We simulate a 100 alignments as before:

evolver 6 MCcodonMulti.dat
mv mc.paml simul_codon.paml
perl AllNeiGojobori.plx simul_codon.paml > allkaks.csv

What's going on?

We set all codon frequencies to equal values:

evolver 6 MCcodonMultiEqual.dat
mv mc.paml simul_codon_equal.paml
perl AllNeiGojobori.plx simul_codon_equal.paml > allkaks_equal.csv

Analysis of real datasets

seaview Mg1LysM_coding.fas
perl NeiGojoboriData.plx Mg1LysM_coding.fas
seaview Mg3LysM_coding.fas
perl NeiGojoboriData.plx Mg3LysM_coding.fas

Using sliding windows

Slide a window of 9 codons with a step of 3 codons (for instance):

perl NeiGojoboriDataWindow.plx Mg1LysM_coding.fas 9 3 > Mg1LysM_w9s3.csv

Exercise

  • Compare different window parameters, on both Mg1 and Mg3 data sets. Which biological insights did you gain?
  • Use the window approach on a simulated dataset for comparison.