These and other basic questions about molecular evolution remain unanswered, or only vaguely answered, despite the huge amount of data available and the strength of population genetic theory. One limiting factor is probably the taxonomic bias of current population genomic data sets, especially in animals: most of the evolutionary genomic literature concerns mammals or Drosophila. The PopPhyl project aims to characterize within- and between-species molecular variation in a substantial number of metazoan taxa using next-generation sequencing technology, with the hope of linking genome evolutionary patterns to species biology and ecology.
We want to characterize neutral (synonymous) and selected (non-synonymous) variation within species (ΠS, ΠN, allele frequency spectra) and between closely related species (dN, dS), for large numbers of genes, in more than 30 animal taxa. Such data are indicative of the evolutionary forces at work: mutation biases, the distribution of selection coefficients, recombination, genetic drift, and GC-biased gene conversion. We will measure these parameters in a diversified panel of species, thus providing a broader view of population genomic processes across animals. We will estimate the ratio of adaptive to neutral amino-acid changes in many species, assess the prevalence of GC-biased gene conversion and codon usage optimization in various phyla, and identify the causes of high substitution rates in fast-evolving lineages.
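As an illustration of the within-species statistics mentioned above, here is a minimal Python sketch that computes nucleotide diversity (the average number of pairwise differences per site) and a folded allele frequency spectrum from a small alignment. The function names and the toy alignment are hypothetical and are not part of the PopPhyl pipeline. Separating synonymous from non-synonymous sites (ΠS vs ΠN) would additionally require codon annotation, which is left out here.

```python
from collections import Counter
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site (pi)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned (equal length)")
    n_pairs = n * (n - 1) // 2
    # Count mismatching positions over every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(seqs, 2)
    )
    return total_diffs / (n_pairs * length)


def folded_sfs(seqs):
    """Folded site frequency spectrum.

    Returns a dict mapping k to the number of biallelic sites whose
    minor allele is carried by k sequences (k = 1 .. n // 2).
    """
    n = len(seqs)
    sfs = Counter()
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) == 2:  # keep biallelic sites only
            sfs[min(counts.values())] += 1
    return {k: sfs.get(k, 0) for k in range(1, n // 2 + 1)}


# Toy alignment of four haploid sequences (made-up data).
toy_alignment = [
    "ATGGCCAAGT",
    "ATGGCTAAGT",
    "ATGGCTAAGC",
    "ATGGCCAAGT",
]
pi = nucleotide_diversity(toy_alignment)
sfs = folded_sfs(toy_alignment)
print(pi, sfs)
```

In this toy case two of the ten sites are polymorphic, one with a singleton minor allele and one with the minor allele present in two sequences.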
Molecular evolution is governed by population genetic parameters (mutation, selection, drift, recombination), which in turn presumably depend on species life history and ecology. We want to check whether genomic variation patterns reflect species biology as predicted by population genetic theory. Here are examples of the predictions we want to test:
Trait | Population genetic parameter | Predicted genomic pattern |
---|---|---|
marine (vs terrestrial); small (vs large) | large population size | high within-species diversity; low ΠN/ΠS ratio |
selfer (vs outcrosser) | low recombination rate; small population size | high ΠN/ΠS ratio; slow adaptive rate; high homozygosity; no biased gene conversion |
long (vs short) lifespan | low per-year mutation rate; selection for somatic repair | low dS (per million years); slow mitochondrial evolution |
Most of the sequencing effort in animals so far has been concentrated on a small number of model laboratory organisms: vertebrates, Drosophila, Caenorhabditis. The PopPhyl project will explore the molecular diversity of neglected taxa (molluscs, annelids, nemerteans, cnidarians), emblematic animals (Galapagos tortoise, king penguin, bath sponge), and species showing remarkable life history or ecology (hydrothermal annelids, 400-year-old bivalves, social insects), with the hope of discovering new genes and adaptations, and providing information relevant to conservation biology and environmental sciences.
More than 30 focal species spanning metazoan diversity have been selected based on their biology and ecology. For each of them, 10 living individuals are sampled across the species' geographic range with the help of expert colleagues. Two closely related outgroup species are also identified and sampled (one or two individuals each). See pictures here.
PopPhyl sampling scheme
Total RNA will be extracted from the ~500 individuals sampled in PopPhyl using standard and modified protocols. Individual transcriptomes will then be sequenced using a combination of next-generation technologies. For each focal group, half a 454 run (~500,000 reads of length ~400) will be performed on total RNA from one or several pooled individuals. Such long reads are useful for proper assembly of cDNA contigs. In addition, individual transcriptomes will be sequenced separately using Illumina technology (~4,000,000 reads of length ~100 per individual). SNPs and genotypes will be inferred by mapping the short reads onto the assembled reference transcriptomes.
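To give a rough idea of how a genotype can be inferred from mapped short reads, here is a minimal Python sketch of a binomial genotype caller for a single biallelic site. It is not the PopPhyl calling method, which is to be developed in-house. The function names, the 1% error rate and the minimum depth of 10 reads are illustrative assumptions. The sketch treats reads as independent and assumes both alleles of a heterozygote are equally expressed, which may not hold for transcriptome data.

```python
from math import comb, log


def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Log-likelihood of the observed read counts under each diploid genotype.

    Assumes independent reads, a biallelic site and a symmetric
    sequencing error rate (all simplifying assumptions).
    """
    n = n_ref + n_alt
    # Probability that one read shows the alternative allele, per genotype.
    p_alt = {
        "ref/ref": error_rate,
        "ref/alt": 0.5,
        "alt/alt": 1.0 - error_rate,
    }
    return {
        genotype: log(comb(n, n_alt)) + n_alt * log(p) + n_ref * log(1.0 - p)
        for genotype, p in p_alt.items()
    }


def call_genotype(n_ref, n_alt, error_rate=0.01, min_depth=10):
    """Return the maximum-likelihood genotype, or None if depth is too low."""
    if n_ref + n_alt < min_depth:
        return None
    log_liks = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(log_liks, key=log_liks.get)


# Hypothetical read counts (reference allele, alternative allele) at four sites.
for counts in [(12, 11), (30, 0), (0, 25), (3, 2)]:
    print(counts, call_genotype(*counts))
```

A site is left uncalled when too few reads cover it, since read depth in transcriptome data varies widely with expression level.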
The PopPhyl data analysis pipeline is expected to look like this:
Transcriptome assembly, short-read mapping and coding sequence annotation will be carried out using publicly available software, wrapped in appropriate C++/Perl/bash code. SNP and genotype calling algorithms will be developed in-house. Methods will run on a dedicated cluster hosted by the Montpellier Bioinformatics & Biodiversity platform.
Funded by: European Research Council
Hosted by: CNRS – UMR 5554 – Institut des Sciences de l'Evolution