Variation Calling Challenge 2013 - Variathon v0.1

The goal of this pilot Variation Calling Challenge is to assess existing and new variant calling pipelines in terms of accuracy and efficiency. Participants run their variant calling pipelines on public datasets, and every team has the same amount of time to run the tools and collect the results. We provide several simulated datasets for the analysis.

The input consists of a reference genome in fasta format and a set of reads in fastq format from an artificial organism whose genome is similar to the reference. The output should consist of a file containing a table of called variants (e.g., vcf format) and a file with the read mappings (e.g., bam format).

The output will be evaluated in terms of read mapping, variant calls, and resource usage. Results will be presented at the Workshop on NGS data and the Variation Calling Challenge, which will be held in May 2013 in Udine, Italy.

Variathon v0.1 2013 and the Workshop on NGS data and the Variation Calling Challenge are organized by COST as part of the SeqAhead project.



Important dates

  • 15.02.2013: Submission system opens
  • 15.03.2013: Submission deadline
  • 29.03.2013: Performance evaluation announced -> postponed to 10.04.2013
  • 21.05.2013: Workshop in Udine, Italy


Datasets

In this section, simulated reads and reference genomes can be downloaded. Datasets are as follows.

Human data

With the human datasets we also provide a file with a set of variants from which a random subset was chosen to create a diploid genome. You can use the variation files to improve read alignment with some methods and, in particular, to normalise your final variation predictions so that they match a subset of the provided variants. Reads were generated with wgsim using a uniform base error rate of 0.02. We also added reads from the mouse genome and filtered out those that aligned to the reference without errors. (A minimal simulation sketch in this spirit is given after the dataset list below.)

  • Artificial Chromosome 20, frequent variations.
    reference genome (Chromosome 20), variations, reads_set_1 and reads_set_2 (20000000 paired reads, read length 70 bp, insert size 500 bp).
    • Alleles in diploid chromosomes were generated independently.
    • Otherwise the dataset tries to mimic a typical set of frequent variations in an individual.
    • The maximal indel size is 617 bp.
  • Artificial Chromosome 20, long deletions.
    reference genome (Chromosome 20), variations, reads_set_1 and reads_set_2 (20000000 paired reads, read length 70 bp, insert size 500 bp).
    • Alleles in diploid chromosomes were generated independently.
    • This dataset is meant for testing the ability to detect challenging frequent long overlapping deletions. It does not represent a realistic scenario as such, but extrapolates a plausible local scenario to genome-scale in order to evaluate the performance of deletion detection separately.
    • The longest possible deletion is roughly 350 bp.
  • Artificial Chromosome 2.
    reference genome (Chromosome 2), variations, reads_set_1 and reads_set_2 (80000000 paired reads, read length 70 bp, insert size 500 bp).
    • Alleles in diploid chromosomes were not generated independently. If a homozygous variation was chosen, it was applied to both chromosomes. Heterozygous variations were applied only to one of the chromosomes.
    • The maximal indel size is 23 bp.
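
The following minimal Python sketch illustrates the kind of procedure described above: a random subset of variants is applied independently to each chromosome copy, and reads are then simulated with wgsim using the stated parameters (uniform base error rate 0.02, 70 bp paired reads, ~500 bp insert size). The file names, the SNV-only variant table, and the exact wgsim options are illustrative assumptions, not the organizers' actual pipeline.

    # Minimal sketch (assumptions: a single-sequence fasta reference and a
    # hypothetical tab-separated SNV table with columns POS (1-based), REF, ALT).
    import random
    import subprocess

    def read_fasta(path):
        """Return the sequence of a single-record fasta file as one string."""
        seq = []
        with open(path) as fh:
            for line in fh:
                if not line.startswith(">"):
                    seq.append(line.strip())
        return "".join(seq)

    def make_haplotype(reference, variants, fraction=0.5, seed=None):
        """Apply a random subset of SNVs (substitutions only) to the reference."""
        rng = random.Random(seed)
        hap = list(reference)
        for pos, ref, alt in variants:
            if rng.random() < fraction and hap[pos - 1].upper() == ref.upper():
                hap[pos - 1] = alt
        return "".join(hap)

    reference = read_fasta("chr20.fa")                 # hypothetical file name
    variants = []
    with open("variants.tsv") as fh:                   # hypothetical SNV table
        for line in fh:
            pos, ref, alt = line.split()
            variants.append((int(pos), ref, alt))

    # The two alleles are drawn independently, as stated for the chromosome 20 datasets.
    for i in (1, 2):
        hap = make_haplotype(reference, variants, seed=i)
        with open(f"hap{i}.fa", "w") as out:
            out.write(f">hap{i}\n{hap}\n")
        # Read simulation in the spirit of the description: wgsim with a uniform
        # base error rate of 0.02, 70 bp paired reads and a ~500 bp outer distance;
        # 10M pairs per haplotype gives 20M pairs in total.
        subprocess.run(["wgsim", "-e", "0.02", "-N", "10000000",
                        "-1", "70", "-2", "70", "-d", "500",
                        f"hap{i}.fa", f"hap{i}_1.fq", f"hap{i}_2.fq"],
                       check=True)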

Bacterial data

Reads simulated from a Wolbachia endosymbiont species with a single chromosome.
reference genome (Wolbachia endosymbiont) and paired_reads (351919 paired reads, read length 100 bp, insert size 300 bp).

  • Data were generated with toy_seq. The maximal indel size is 10 bp.
  • Average sequencing depth is 20 and Minor Variant Frequency is 20%.

Yeast data

Saccharomyces cerevisiae, S288C strain.
reference genome (Saccharomyces cerevisiae, S288C strain) and paired_reads (2891487 paired reads, read length 100 bp, insert size 300 bp).

  • Data were generated with toy_seq. The maximal indel size is 10 bp.
  • Sequencing depth is 20 and Minor Variant Frequency is 20%.


Submission guidelines

For each dataset analyzed, carry out the following steps separately:

  1. Store your read mappings (sam file) and called variants (vcf file) at some web address, compressed into a single file called results.zip (a minimal packaging sketch is given after this list).
    • sam/bam files should include both unique and multiply placed alignments, as well as unaligned reads.
    • Artificial Chromosome 2. For this dataset please provide only the sam file; no variation file is needed.
    • Bacterial and yeast data. Zygosity is taken into account and should be indicated in the Genotype field (GT) of the VCF file output by toy_seq. Please see the vcf file documentation for further details.
      At most two alleles are generated (i.e., the reference allele and one alternative allele): the GT field is 0|0 or 1|1 for homozygous positions and 0|1 or 1|0 for heterozygous positions (the order is not meaningful). A small GT-parsing sketch is also given after this list.
  2. Fill in the submission form with the name of the dataset analyzed and the link to your results.zip.
  3. Multiple submissions for the same dataset are allowed (maximum two submissions per dataset).
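
A minimal Python sketch of the packaging in step 1, assuming your pipeline has already produced files named mappings.sam and variants.vcf (hypothetical names):

    import zipfile

    # Compress the read mappings and the called variants into results.zip.
    with zipfile.ZipFile("results.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write("mappings.sam")    # includes multiply placed alignments and unaligned reads
        zf.write("variants.vcf")    # omit for the Artificial Chromosome 2 dataset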
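
And a small sketch of how zygosity can be read from the GT field of a single-sample VCF line; the FORMAT layout in the example is an assumption for illustration only:

    def zygosity(vcf_line):
        """Return 'homozygous' or 'heterozygous' from the GT field of one VCF record."""
        fields = vcf_line.rstrip("\n").split("\t")
        fmt_keys = fields[8].split(":")            # FORMAT column, e.g. "GT:DP"
        sample = fields[9].split(":")              # first (and only) sample column
        gt = sample[fmt_keys.index("GT")]          # e.g. "0|1"
        a, b = gt.replace("/", "|").split("|")
        return "homozygous" if a == b else "heterozygous"

    print(zygosity("chr1\t100\t.\tA\tG\t60\tPASS\t.\tGT:DP\t0|1:20"))   # heterozygous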


Evaluation

  • Precision / recall measures are computed for the read mapping results as follows.
    • Artificial genomes are generated so that their mapping to the reference is kept.
    • For each read generated from the artificial genome, we keep its original location in the artificial genome and can therefore map it back to the reference.
    • With this information, we can compute false positive (FP) and true positive (TP) counts.
    • We also generate reads from other genomes to enable the computation of false negative (FN) and true negative (TN) counts.
  • Precision / recall measures are computed for variant calling results as follows.
    • We keep the true variants chosen to generate the artificial genome.
    • Predicted variants are compared to the true variants to compute TP, FP, and FN counts.
    • Heterozygous variations contribute to the TP count once per allele. For example, at a position with two alternative alleles, two answers are expected, not a single heterozygous call.
    • Note that comparing variants directly is tricky, as predictions typically differ slightly from the true variants; we plan to use thresholds on variant similarity and plot precision / recall values for several thresholds (a simple matching sketch is given after this list).
  • Resource usage
    • We ask all teams to report the amount in euros spent on their variant calling run. This is computed by:
      • estimating the current purchase price of a system similar to the one on which the software was executed. Say this amount is X euros.
      • estimating the proportion of the system's lifetime used for the variant calling run, assuming a lifetime of 4 years. Say this proportion is Y.

      The variant calling run then cost X*Y euros (a worked example is given after this list).

    • In addition, peak memory consumption, wall-clock time, level of parallelism, and specifications of the system used for the run should be reported.
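
As an illustration of the variant-call evaluation described above, the sketch below matches predicted variants to true variants within a position tolerance and computes precision and recall. The organizers' exact matching rules and thresholds are not specified here, so the tolerance, the (position, allele) representation, and the example numbers are assumptions.

    def precision_recall(true_variants, predicted_variants, tol=10):
        """Match predictions to truth within `tol` bp and return (precision, recall)."""
        unmatched_truth = set(true_variants)
        tp = 0
        for pred_pos, pred_alt in predicted_variants:
            # Only positions are compared here; allele identity is ignored for simplicity.
            match = next((t for t in unmatched_truth if abs(t[0] - pred_pos) <= tol), None)
            if match is not None:
                tp += 1
                unmatched_truth.remove(match)
        fp = len(predicted_variants) - tp
        fn = len(unmatched_truth)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # A heterozygous site with two alternative alleles contributes two truth entries.
    truth = [(1000, "A"), (1000, "G"), (2500, "T")]
    calls = [(1002, "A"), (2500, "T"), (4000, "C")]
    print(precision_recall(truth, calls, tol=10))      # (0.666..., 0.666...)

Sweeping tol over several values yields the family of precision / recall points mentioned above.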
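
And a worked example of the cost estimate X*Y; the price and run time below are purely illustrative:

    # Assumed 4-year lifetime, expressed in hours.
    LIFETIME_HOURS = 4 * 365 * 24                       # 35,040 hours

    def run_cost_euros(system_price_euros, wall_clock_hours):
        """X * Y: system price times the fraction of its lifetime consumed by the run."""
        proportion = wall_clock_hours / LIFETIME_HOURS  # Y
        return system_price_euros * proportion          # X * Y

    print(run_cost_euros(5000, 12))   # a 12-hour run on a 5000-euro system: ~1.7 euros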


Organizers

Veli Mäkinen - University of Helsinki (Finland)
Alberto Policriti - University of Udine (Italy)
Eric Rivals - CNRS and Univ. Montpellier 2 (France)
Vincent Maillol - CNRS and Univ. Montpellier 2 (France)
Simone Scalabrin - IGA Technology Services (Italy)
Annie Chateau - CNRS and Univ. Montpellier 2 (France)
Krista Longi - University of Helsinki (Finland)
Nicola Vitacolonna - University of Udine (Italy)
Francesca Nadalin - University of Udine (Italy)

Contacts

For any question please send an e-mail to variathon@cs.helsinki.fi
