PopPhyl tools user manual This file provides instructions on how to re-use the data analysis pipeline developed in the context of the PopPhyl project (http://kimura.univ-montp2.fr/PopPhyl), described in Cahais et al (2012), Tsagkogeorga et al (2012) and Gayral et al (2013), and applied in Loire et al (2013), Romiguier et al (2014a) and Romiguier et al (2014b). 1. Software Two linux executable files and one PERL script are distributed here (http://kimura.univ-montp2.fr/PopPhyl/index.php?section=tools): - reads2snp (SNP and genotype caller) - ORF_extractor.pl (ORF management) - piNpiSdNdS (coding sequence population genomic analysis) The following programs/libraries also need to be installed: - ABySS (short read assembler) - Cap3 (contig assembler) - BWA (read mapper) - samtools - BioPerl In the PopPhyl project, we used versions reads2snp 2.0, dNdSpiNpiS 1.0, ABySS 1.3.4, Cap3 10/15/07, and BWA 0.5.8c (r1536). 2. The pipeline The PopPhyl pipeline aims at assembling contigs, calling SNPs and genotypes and calculating basic population genomic statistics based on short sequence reads from a population sample. It was designed for single end, transcriptome data and diploid individuals and focusses on coding sequence variation. Input: one file of short reads per individual (fastq, typically Illumina) Output: - a set of contigs = predicted cDNAs (fasta) - predicted SNPs and genotypes (vcf, fasta) - estimates of synonymous and non-synonymous diversity and Fit per cDNA (csv) - averages across cDNAs and confidence intervals (txt) 3. Step by step user guide There are five steps: A. de novo transcriptome assembly Description: A single, common transcriptome assembly is achieved using method B from Cahais et al (2012), which does one run of ABySS and two runs of Cap3. Input: one file of short reads per individual (fastq, typically Illumina) Output: a set of predicted cDNAs (fasta) Execution: - Run ABySS on the list of fastq files using by default options and k-mer size = 60 (reads of length 100) or 30 (reads of length 50) - Run Cap3 on the output of Abyss using by default options; this generates two fasta files: a contig file and a singlet file - Concatenate the contig and singlet files - Run Cap3 again on the concatenated contig+singlet file; this generates two fasta files: a contig file and a singlet file - Concatenate the contig and singlet files, thus generating the predicted cDNA output file (fasta). Comment: for some species of the PopPhyl project we also have used 454 data. In this case the assembly was achieved following method D from Cahais et al (2012). B. Read mapping Description: short sequence reads are mapped (aligned) to the set of predicted cDNA obtained at step A using BWA. Input: one file of short transcriptome reads per individual (fastq) + one file of predicted cDNAs (fasta) Output: one file of aligned reads per individual (bam) Execution: - Sequentially run BWA for each read file using options : bwa -n 0.04 -o 1 -e -1 -d 16 -i 5 -k 1 -t 4 -M 10 -O 11 -E 4 This generates one file of aligned reads per individual (sam) - Sequentially convert sam files (text) into bam files (binary) using the samtools samtools view -uhSt exple.fasta exple.sam | samtools sort - exple Comment: In the PopPhyl project we removed "PCR duplicates" (identical reads from the same individual). This can be achieved using the samtools (samtools -rmdup exple.bam). C. SNP/genotype calling Description: for each position of each contig and each individual, a diploid genotype is predicted based on aligned read counts using reads2snp according to the method described by Tsagkogeorga et al (2012) and improved by Gayral et al (2013, paralogy check). Input: one file of aligned reads per individual (bam) + one file of predicted cDNAs (fasta) Output: one file of predicted SNPs and genotypes (vcf) + one file of aligned diploid sequences (fasta) Execution: Run the reads2snp program with by default options: reads2snp -bam_list my_list.txt -bam_ref my_contigs.fas -out my_snps This generates a vcf file (my_snps.vcf) and a fasta file (my_snps.fas) containing the predicted SNPs and genotypes. Comment 1: In the PopPhyl project we only considered contigs of length >200 bp Comment 2: The two output files essentially carry the same information in distinct formats. The fasta file contains two (pseudo)sequences per diploid individual. Please note that SNPs are not phased in this pipeline, so that the two sequences should not be considered as haplotypes. The association of alleles across distinct heterozygous positions in a contig is arbitrary (alphabetic order). D. Open reading frame prediction Description: ORFs are predicted and polymorphism sequence alignments generated at step C are processed in order to only keep coding portions. Input: one file of predicted cDNAs (fasta) + one file of aligned diploid sequences (fasta) Output: one file of aligned coding diploid sequences (fasta) Execution: Apply your favourite ORF predictor to the predicted cDNA file Pass the output ORF file and the file of aligned diploid sequences to ORF_extractor.pl: ORF_extractor.pl -orf orf_file -seq my_snps.fas -out my_coding_snps This generates a file of aligned coding diploid sequences (plus two files including aligned 5' UTR and 3'UTR sequences, respectively) Comment 1: In the PopPhyl project we used the ORF prediction tool of the Trinity package Comment 2: In the PopPhyl project we only considered ORFs of length >200 Comment 3: ORF_extractor supports the getORF and prot4EST ORF file formats E. Coding sequence population genomic analysis Description: Coding sequence polymorphism alignments are cleaned for missing data; basic population genomic statistics are calculated. Input: one file of aligned diploid sequences (fasta) Output: one tabular file providing several statistics per contig (piS, piN, Fit) + one summary file Execution: dNdSpiNpiS -alignment_file=my_coding_snps.fas -ingroup=species_name -gapN_site=10 -out=my_res This generates tabular file my_res.out including one line per validated cDNA (some are removed depending on cleaning options) and a number of descriptors, and summary file my_res.sum with average estimates and confidence intervals. Comment 1: among the many options accepted by dNdSpiNpiS (type: dNdSpiNpiS), the important -gapN_site defines the tolerance to missing data: a codon position will only be kept if at least this number of sequences has a determined character (no gap, no N). Comment 2: dNdSpiNpiS can also handle files including aligned sequences from individuals from two species, in which case divergence statistics will be calculated too (e.g. see Gayral et al 2013). Two-species multi-individual alignments must be produced to achieve this. References Cahais V., Gayral P., Tsagkogeorga G., Melo-Ferreira J., Ballenghien M., Weinert L., Chiari Y. Belkhir K., Ranwez V. & Galtier N. 2012. Reference-free transcriptome assembly in non-model animals from next generation sequencing data. Molecular Ecology Resources 12:834-845. Gayral P., Melo-Ferreira J., Glémin S., Bierne N., Carneiro M., Nabholz B., Lourenço J.M., Alves P.C., Ballenghien M., Faivre N., Belkhir K., Cahais V., Loire E., Bernard A. & Galtier N. 2013. Reference-free population genomics from Next-Generation transcriptome data and the vertebrate-invertebrate gap. PLoS Genetics 9:e10003457. Loire E., Chiari Y., Bernard A., Cahais V., Romiguier J., Nabholz B., Lourenço J., Galtier N. 2013. Population genomics of the endangered giant Galápagos tortoise. Genome Biology 14:R136. Romiguier J., Lourenço J.M., Gayral P., Faivre N., Weinert L.A., Ravel S., Ballenghien M., Cahais V., Bernard A., Loire E., Keller L., Galtier N. 2014a. Population genomics of eusocial insects: the costs of a vertebrate-like effective population size. Journal of Evolutionary Biology 27:1-11. Romiguier J., Gayral P., Ballenghien M., Bernard A., Cahais V., Chenuil A., Chiari Y., Dernat R., Duret L., Faivre N., Loire E., Lourenco J.M., Nabholz B., Roux C., Tsagkogeorga G., Weber A., Weinert L.A., Belkhir K., Bierne N., Glémin S., Galtier N. 2014b. Comparative population genomics in animals uncovers the determinants of genetic diversity (submitted) Tsagkogeorga G., Cahais V. & Galtier N. 2012. The population genomics of a fast evolver: high levels of diversity, functional constraint and molecular adaptation in the tunicate Ciona intestinalis. Genome Biology and Evolution 4:740-749.