Title: Multithreaded variant calling in elPrep 5 and future developments in genomics analysis

Abstract

We present elPrep 5, the latest release of our software framework for analyzing sequencing data. The main new feature of elPrep 5 is variant calling. This allows elPrep 5 to execute the full pipeline described by the GATK best practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces BAM and VCF outputs identical to those of GATK 4, while parallelizing and merging the computation of the different pipeline steps to significantly reduce the runtime. Concretely, elPrep speeds up the variant calling pipeline by a factor of 8 to 16 compared to GATK on both whole-exome and whole-genome data, without requiring specialized or proprietary accelerator hardware. elPrep 5 is developed as an open-source project on GitHub and is designed for use with community-defined standards and file formats for NGS analysis. While computational performance is a main focus of elPrep, we also strive to improve the user experience. elPrep is distributed as a single stand-alone binary, making it easy to install, and has a simple user interface in which a full variant calling pipeline can be expressed as a single command-line invocation. elPrep has an active user community, mainly at hospitals and research facilities, but also at companies. This community actively supports elPrep by making it available on platforms such as Bioconda (over 15k downloads) and Seven Bridges Genomics, which has independently validated elPrep. In this talk, we present an overview of the elPrep software, as well as future developments for our sequencing software. In particular, we address the challenges we see in further optimization and in privacy preservation for supporting population genomics.
