Whole-Genome Variant Calling

This section explores whole-genome germline small variant calling using the HG002 Genome-in-a-Bottle Sample.

Software Requirements

  • samtools

  • BWA

  • Parabricks version 3.7 or later

Hardware Requirements

These are the minimum hardware requirements for this how-to:

  • 512GB of disk space (preferably on "balanced" or SSD storage)

  • Two Nvidia V100 GPUs; the commands and time estimates below will utilize four Nvidia V100 GPUs

  • 24 CPU cores; the commands below use 32 vCPU cores

  • 48GB of CPU RAM per GPU

The examples below use a Google Cloud Project VM with 32 vCPU cores, 120 GB of RAM, and four NVIDIA V100 GPUs running Ubuntu 20.04.

Introduction

This how-to will run through a full whole-genome germline pipeline for calling SNPs, MNPs, and indels on real 30X short-read human data. Such analyses are common in a variety of settings:

  • Population studies

  • Genome-wide association studies

  • Trio analysis (when combined with downstream filtering)

  • When analyzing data from biobanks (such as reanalysis of 1000 Genomes data)

  • When looking for possible hereditary cancer predisposition mutations (e.g. Lynch Syndrome or mutations in certain BRCA genes)

  • When looking for disease-associated mutations in clinical sequencing

The data was generated from the son in a trio sequenced by the Genome In A Bottle Consortium. This sample, identified as HG002, has been highly characterized across multiple sequencing platforms and variant callers, and a high-quality "truth set" of variants exists that allows you to check the results.

After variant calling, you will annotate the VCF with several databases to determine which variants are common or might be associated with disease. You will filter out common variants (those observed frequently in 1000 Genomes) and then use NVIDIA Parabricks tools for quality control to assess the variant caller results.

The first steps of this workflow (alignment, variant calling, and quality control) are common across many different analyses. Depending on your use case, however, the annotation and filtering steps may differ. This how-to runs through several different filtering scenarios to cover examples of interesting questions to ask of the data.

Example data

The example data for this how-to resides in a public Google Cloud Storage bucket. It can be downloaded using the gsutil tool from the Google Cloud SDK; alternatively, the file can be downloaded with wget using the second set of instructions below. Note that there are two fastq files to download (ending in "R1.fastq.gz" and "R2.fastq.gz").

$ gsutil cp gs://brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz .
$ gsutil cp gs://brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz .

To download the data with wget (if gsutil is not available):

$ wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R1.fastq.gz
$ wget https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/hiseqx/wgs_pcr_free/30x/HG002.hiseqx.pcr-free.30x.R2.fastq.gz

If you have your own data, you'll likely be able to substitute it directly into the commands in this how-to to get accurate, multi-caller variant calls.

Downloading and Indexing a Reference Genome and Known Sites

Next, you will download and index a reference genome to use in alignment and variant calling. You will use GRCh38 without alt contigs. Indexing should take between 30 minutes and one hour.

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

## Unzip and index the reference
$ gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

## Create an FAI index
$ samtools faidx GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

## Create the BWA indices
$ bwa index GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

To generate a BQSR report (used for improving the base qualities within the BAM and downstream HaplotypeCaller calls), we'll also need a VCF file with known variant calls. We'll retrieve these from the Broad HG38 resource bundle:

$ wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
$ wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

Aligning Reads

The fq2bam command runs read alignment--as well as sorting, duplicate marking, and base quality score recalibration (BQSR)--according to GATK best practices, but at a much faster rate than community tools by leveraging up to 8 NVIDIA GPUs.

Note that this how-to adds a custom read group to make the sample easier to identify if it is later merged with others in downstream analysis; this is essential when running alignment for matched tumor-normal or multi-sample analyses. You also have to pass a set of known sites to generate the BQSR report.

$ RGTAG="@RG\tID:HG002\tLB:lib\tPL:Illumina\tSM:HG002\tPU:HG002"
$ pbrun fq2bam \
    --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    --in-fq HG002.hiseqx.pcr-free.30x.R1.fastq.gz HG002.hiseqx.pcr-free.30x.R2.fastq.gz "${RGTAG}" \
    --knownSites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    --out-bam HG002.hiseqx.pcr-free.30x.pb.bam \
    --out-recal-file HG002.hiseqx.pcr-free.30x.pb.BQSR-report.txt

The read alignment, sorting, duplicate marking and indexing process (all included in fq2bam) should take about 50 minutes with four NVIDIA V100 GPUs. The output of fq2bam is a BAM file, a BAI index and a BQSR report.

$ ls HG002.hiseqx.pcr-free.30x.pb*
HG002.hiseqx.pcr-free.30x.pb.bam
HG002.hiseqx.pcr-free.30x.pb.bam.BAI
HG002.hiseqx.pcr-free.30x.pb.BQSR-report.txt

These files will be used as inputs to the multiple SNP/indel callers run in the next step.

Generating a Comprehensive Set of Variant Calls with DeepVariant, HaplotypeCaller, and Strelka

Parabricks currently supports three variant callers (DeepVariant, HaplotypeCaller, and Strelka). Each of these uses slightly different methods for calling variants, with particular trade-offs in sensitivity and specificity for each.

This how-to runs all three callers to generate a comprehensive set of variant calls. If you were looking for variants of clinical interest, you might consider taking the intersection of two or more callers to generate a highly specific callset. On the other hand, if you weren't concerned about false positives but needed a highly-sensitive callset, you could take the union of all three callers. Such vote-based consensus approaches have been employed in numerous large studies to address the shortcomings of individual callers; using the votebasedvcfmerger you can generate consensus callsets directly from the command line, which you'll do in a later section.

First, run DeepVariant, a deep-learning based variant caller originally developed by Google; Parabricks provides an accelerated version of DeepVariant that is 10-15X faster than the community version. DeepVariant should run in under 30 minutes on a 4xV100 system.

$ pbrun deepvariant \
  --in-bam HG002.hiseqx.pcr-free.30x.pb.bam \
  --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --out-variants HG002.hiseqx.pcr-free.30x.pb.deepvariant.vcf

Next, run HaplotypeCaller, long considered the gold standard of germline small variant callers. Parabricks HaplotypeCaller takes roughly 15 minutes on four NVIDIA V100 GPUs, a speed increase of roughly 60X over the community version.

$ pbrun haplotypecaller \
  --in-bam HG002.hiseqx.pcr-free.30x.pb.bam \
  --in-recal-file HG002.hiseqx.pcr-free.30x.pb.BQSR-report.txt \
  --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --out-variants HG002.hiseqx.pcr-free.30x.pb.haplotypecaller.vcf

Finally, run Strelka, a SNP and indel caller from Illumina. Parabricks provides a convenient interface to Strelka and automatically parallelizes the run over available CPUs. On a 32 vCPU node, the below strelka command takes roughly 1 hour to complete.

$ pbrun strelka \
  --in-bams HG002.hiseqx.pcr-free.30x.pb.bam \
  --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --out-prefix HG002.hiseqx.pcr-free.30x.pb.strelka

The results of DeepVariant and HaplotypeCaller are plain-text VCF files, while Strelka produces a compressed and indexed VCF file (named variants.vcf.gz) in the HG002.hiseqx.pcr-free.30x.pb.strelka.strelka_work directory. To ensure that each caller worked as expected, you will apply the vcfqc tool to generate some summary statistics for the calling runs. While it would be permissible to run any of these variant calling stages individually for many studies, the speed with which they run in Parabricks means it makes sense to go ahead and run multiple variant callers per sample.

Intermediate Quality Control with vcfqc

Parabricks contains built-in quality control report tools for validating VCF outputs. These scripts take arguments about which INFO fields contain values to be plotted. When finished, they create a simple HTML report of plots and summary statistics that can help verify whether DeepVariant, HaplotypeCaller, and Strelka produced reasonable results. If you see something abnormal in the quality control plots, you can check whether you ran the alignment and calling stages using the correct files and parameters.

Each variant caller uses slightly different INFO and FORMAT fields to hold the information used in the quality control reports, so for each VCF you must tell the vcfqc tool which fields correspond to which values. You'll run vcfqc once for each of the output VCFs.

$ pbrun vcfqc \
    --in-vcf HG002.hiseqx.pcr-free.30x.pb.deepvariant.vcf \
    --output-dir HG002.hiseqx.pcr-free.30x.pb.deepvariant.qc \
    --map-quality MAPQ \
    --depth DP \
    --allele-depth AD \
    --vaf AF
$ pbrun vcfqc \
    --in-vcf HG002.hiseqx.pcr-free.30x.pb.haplotypecaller.vcf \
    --output-dir HG002.hiseqx.pcr-free.30x.pb.haplotypecaller.qc \
    --map-quality MAPQ \
    --depth DP \
    --allele-depth AD \
    --vaf AF
$ pbrun vcfqc \
    --in-vcf HG002.hiseqx.pcr-free.30x.pb.strelka.variants.vcf.gz \
    --output-dir HG002.hiseqx.pcr-free.30x.pb.strelka.qc \
    --map-quality MAPQ \
    --depth DP \
    --allele-depth AD \
    --vaf AF

You can look at the HaplotypeCaller QC report, which is the most comprehensive of the three, by opening the report in a web browser such as Firefox. Each of the directories given by the --output-dir will have the report (in PNG format) from that run.

Merging Variant Callsets from Multiple Callers into one VCF File

The votebasedvcfmerger tool takes in VCF files from multiple callers and generates union and intersection VCFs. You'll use votebasedvcfmerger to merge the VCFs but won't immediately do any filtering. You'll then filter the merged VCF after annotation to demonstrate common queries for different kinds of studies. votebasedvcfmerger takes a set of VCF files prefixed with the name of the caller they came from and a minimum number of "votes," or supporting callers for a given variant. You'll pass --min-votes 1 in the call to votebasedvcfmerger to keep all variants in the callset.

$ pbrun votebasedvcfmerger \
    --in-vcf strelka:HG002.hiseqx.pcr-free.30x.pb.strelka/variants.vcf.gz \
    --in-vcf DeepVariant:HG002.hiseqx.pcr-free.30x.pb.deepvariant.vcf \
    --in-vcf HaplotypeCaller:HG002.hiseqx.pcr-free.30x.pb.haplotypecaller.vcf \
    --min-votes 1 \
    --out-dir merged_vcfs

Annotating Variant Calls with Information from Databases

On their own, the VCF files produced by DeepVariant, HaplotypeCaller, and Strelka are not particularly useful.

You will need to download all of the databases and indices before you can annotate the VCF with them. This how-to uses the following databases:

  • 1000Genomes: A database of variants called in 2,504 individuals from healthy individuals

  • ClinVar: An NCBI database of variants associated with disease

  • gnomAD: The Genome Aggregation Database, containing population variants from 125,748 exomes and 15,708 whole genomes

  • COSMIC: Coding and non-coding variants from the Catalogue of Somatic Mutations in Cancer

  • dbSNP: SNPs, MNPs and indels for both common and clinical mutations

## Download the 1000Genomes VCF (and .tbi) on hg38
$ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz
$ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20181203_biallelic_SNV/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz.tbi

## Download the gnomad VCF (and .tbi). We'll use v2.1 on GRCh38
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz
$ wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi

## Download the ClinVar VCF and index
$ wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz
$ wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz.tbi


## Download the dbSNP VCF and index. Make sure to use the proper version;
## some versions of dbSNP use alternative naming conventions for contig names
## that will prevent proper annotation.
$ wget https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
$ wget https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz.tbi

Downloading the COSMIC VCF files requires signing in to the COSMIC site; instructions for doing so are not included here, but we have tested the vcfanno tool on both the CosmicCodingMuts.normal.vcf.gz and CosmicNoncodingVariants.normal.vcf.gz files. Make sure to use the normalized variant files; otherwise, variant alleles at indels may not match between the database and query. For this how-to, you will annotate the query VCF(s) with both the COSMIC NonCoding and COSMIC Coding files.

Once you have downloaded all of the VCFs ind TBI indices, use the vcfanno tool in Parabricks to annotate the VCF file.

$ pbrun vcfanno \
    --in-vcf merged_vcfs/filteredVCF.vcf \
    --dbsnp dbSNP:/path/to/00-All.vcf.gz \
    --annotations COSMIC_Coding:/path/to/CosmicCodingMuts.normal.vcf.gz \
    --annotations COSMIC_NonCoding:/path/to/CosmicNoncodingVariants.normal.vcf.gz \
    --annotations gnomAD_v2:/path/to/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz \
    --annotations ClinVar:/path/to/clinvar.vcf.gz \
    --annotations 1000Genomes_Phase3:/path/to/ALL.wgs.shapeit2_integrated_v1a.GRCh38.20181129.sites.vcf.gz \
    --num-threads 16 \
    --out-vcf merged.annotated.vcf

Please note that the vcfanno tool needs the absolute path(s) to the VCF files used for annotation.