01.06.2015

Simulating sequence data

The MS program generates a genealogy of the sample and then distribute mutations on this genealogy. In order to generate a sequence alignment, one need to superimpose a model of mutation. This can be achieved using the seq-gen program (for instance).

The creation of simulated sequence alignments therefore proceeds in 2 steps:

  1. Use MS to generate a genealogy
  2. Use Seq-Gen to generate an alignment from the genealogy

In practice

ms 10 1 -T

Notes:

  • the -T switch tells MS to output a genealogy
  • the genealogy is in the Newick format
  • branch lengths are in units of \(4 \cdot Ne\) generations

Visualizing the genealogy

Save the file to a newick file:

ms 10 1 -T | tail -n +4 | grep -v // > genealogy.dnd

Open the tree in Seaview:

seaview genealogy.dnd

Exercise

Generate several genealogies. What are the characteristics of these genealogies?

Generating sequence alignments

Call the Seq-Gen program for 40 sites, with \(\theta = 0.2\) per bp:

seq-gen -mHKY -l 40 -s .2 < genealogy.dnd

The resulting alignment is in the so-called Phylip format. It can be saved to a file:

seq-gen -mHKY -l 40 -s .2 < genealogy.dnd > alignment.phy

and visualized with Seaview

seaview alignment.phy

Exercise

  1. Generate a sequence alignment
  2. Reconstruct the genealogy using seaview ("Trees" menu)
  3. Compare to the input genealogy
  4. Try different values of alignment lengths and \(\theta\)

Computing diversity statistics

We use a simple BioPerl script for computing Watterson's theta, Tajima's D and Fu and Li's D*.

perl Diversity.plx alignment.phy

Exercise

Analyze the results from the Diversity.plx script. Simulate several data sets: conclusions?

More advanced exercise:

Look at the manual of MS in order to generate a data set with an outgroup sequence. Use the simulated data set to compute Fu and Li's D statistics.

Batching…

Let's repeat this procedure… Ms can sample several trees, let's say a 100:

ms 10 100 -T | tail -n +4 | grep -v // > genealogies.dnd

Seq-Gen can simulate data sets for each input tree:

seq-gen -mHKY -l 100 -s .2 < genealogies.dnd > alignments.phy

Finally, we can run a modified verison of the BioPerl script which compute estimates for all alignments and output them as a table:

perl AllDiversity.plx alignments.phy > estimates.csv

Visualizing

dat <- read.csv("estimates.csv")
par(bg = "transparent", mar=c(4,4,0,0))
hist(dat$WTheta, col="cornsilk3", freq = FALSE, ylim = c(0, 10),
     xlab = "Number of segregating sites", main = NA)
lines(density(dat$WTheta), col="steelblue", lwd = 2)
abline(v = 0.2, col = "red", lwd = 2)

Exercise

Look at the distribution of the number of segregating sites, number of singletons, Tajima's D and Fu and Li's D*.

Number of singletons

par(bg = "transparent", mar=c(4,4,0,0))
hist(dat$NbSingletons, col="cornsilk3", freq = FALSE,
     xlab = "Number of singletons", main = NA)
lines(density(dat$NbSingletons), col="steelblue", lwd = 2)
abline(v = 0.2 * 100, col = "red", lwd = 2)

Number of segregating sites

par(bg = "transparent", mar=c(4,4,0,0))
hist(dat$NbSegregatingSites, col="cornsilk3", freq = FALSE,
     xlab = "Number of segregating sites", main = NA)
lines(density(dat$NbSegregatingSites), col="steelblue", lwd = 2)
abline(v = 0.2 * 100 * sum(1/(1:9)), col = "red", lwd = 2)

Tajima's D

par(bg = 'transparent', mar=c(4,4,0,0))
hist(dat$TajimaD, col="cornsilk3", freq = FALSE,
     xlab = "Tajima's D", main = NA)
lines(density(dat$TajimaD), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)

Fu and Li's D*

dat <- read.csv("estimates.csv")
par(bg = 'transparent', mar=c(4,4,0,0))
hist(dat$FuAndLiDStar, col="cornsilk3", freq = FALSE,
     xlab = "Fu & Li's D*", main = NA)
lines(density(dat$FuAndLiDStar), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)

Demography and selection : population growth

Simulate a genealogy with population growth

ms 10 1 -T -G 20 | tail -n +4 | grep -v // > genealogyG20p.dnd
seaview genealogyG20p.dnd

Call the Seq-Gen program for 100 sites, with \(\theta = 0.2\) per bp:

seq-gen -mHKY -l 100 -s .2 < genealogyG20p.dnd > sequencesG20p.phy
seaview sequencesG20p.phy

Compute diversity:

perl Diversity.plx sequencesG20p.phy

Demography affects the shape of genealogies

Effect of population growth on statistics

We generate a 100 trees, with an exponential growth:

ms 10 100 -T -G 20 | tail -n +4 | grep -v // > genealogiesG20p.dnd
seq-gen -mHKY -l 100 -s .2 < genealogiesG20p.dnd > alignmentsG20p.phy
perl AllDiversity.plx alignmentsG20p.phy > estimatesG20p.csv

Exercise

Look at the distribution of Watterson's \(\theta\), number of singletons and segregating sites, Tajima's D and Fu and Li's D* statistics. Conclusion?

Population growth leads to underestimation of diversity

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG20p$WTheta, xlim = c(0, 0.2), col="cornsilk3", freq = FALSE,
     xlab = "Watterson's theta", main = NA)
lines(density(datG20p$WTheta), col="steelblue", lwd=2)
abline(v = 0.2, col = "red", lwd = 2)

Number of singletons

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG20p$NbSingletons, col="cornsilk3", xlim = c(0, 20),
     freq = FALSE, xlab = "Number of singletons", main = NA)
lines(density(datG20p$NbSingletons), col="steelblue", lwd=2)
abline(v = 0.2 * 100, col = "red", lwd = 2)

Number of segregating sites

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG20p$NbSegregatingSites, xlim = c(0, 60), col="cornsilk3",
     freq = FALSE, xlab = "Number of segregating sites", main = NA)
lines(density(datG20p$NbSegregatingSites), col="steelblue", lwd=2)
abline(v = 0.2 * 100 * sum(1/(1:9)), col = "red", lwd = 2)

Population growth creates negative Tajima's D

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG20p$TajimaD, col="cornsilk3", freq = FALSE,
     xlab = "Tajima's D", main = NA)
lines(density(datG20p$TajimaD), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)

Population growth creates negative Fu & Li's D*

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG20p$FuAndLiDStar, col="cornsilk3", freq = FALSE,
     xlab = "Fu & Li's D*", main = NA)
lines(density(datG20p$FuAndLiDStar), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)

Exercise

Look at the effect of population shrinkage on diversity estimators.

Population shrinkage

Effect on diversity estimates

We generate a 100 trees, with an exponential shrinkage:

ms 10 100 -T -G -0.2 | tail -n +4 | grep -v // > genealogiesG0c2n.dnd
seq-gen -mHKY -l 100 -s .2 < genealogiesG0c2n.dnd > alignmentsG0c2n.phy
perl AllDiversity.plx alignmentsG0c2n.phy > estimatesG0c2n.csv

Population shrinkage and diversity

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG02n$WTheta, col="cornsilk3", freq = FALSE,
     xlab = "Watterson's theta", main = NA)
lines(density(datG02n$WTheta), col="steelblue", lwd=2)
abline(v = 0.2, col = "red", lwd = 2)

Number of singletons

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG02n$NbSingletons, col="cornsilk3",
     freq = FALSE, xlab = "Number of singletons", main = NA)
lines(density(datG02n$NbSingletons), col="steelblue", lwd=2)
abline(v = 0.2 * 100, col = "red", lwd = 2)

Number of segregating sites

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG02n$NbSegregatingSites, xlim = c(0, 60), col="cornsilk3",
     freq = FALSE, xlab = "Number of segregating sites", main = NA)
lines(density(datG02n$NbSegregatingSites), col="steelblue", lwd=2)
abline(v = 0.2 * 100 * sum(1/(1:9)), col = "red", lwd = 2)

Population shrinkage creates positive Tajima's D

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG02n$TajimaD, col="cornsilk3", freq = FALSE,
     xlab = "Tajima's D", main = NA)
lines(density(datG02n$TajimaD), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)

Population shrinkage and Fu & Li's D*

par(bg = "transparent", mar=c(4,4,0,0))
hist(datG02n$FuAndLiDStar, col="cornsilk3", freq = FALSE,
     xlab = "Fu & Li's D*", main = NA)
lines(density(datG02n$FuAndLiDStar), col="steelblue", lwd=2)
abline(v = 0, col = "red", lwd = 2)