Whole-Genome Somatic Small Variant Calling

This how-to explores calling variants using Mutect2 in the SEQC2 Consortium Sample.

Software Prerequisites

  • Samtools and BWA

  • SRA Toolkit

  • Parabricks

  • GATK4 v4.1.0.0

  • bedtools

  • R (vcfR, UpSetR)

Compute Requirements and Configuration

  • At least 512GB of disk space (preferably on “balanced” or SSD storage)

  • At least two NVIDIA Tesla T4 GPUs; the commands and time estimates below utilize four NVIDIA Tesla T4 GPUs.

  • At least 24 CPU cores; the commands below use 48 vCPU cores.

  • At least 48GB of RAM per GPU

  • A functional Parabricks install (version 3.7 or newer)

The examples below use an AWS VM g4dn.12xlarge with 48 vCPU cores, 192GB of RAM, and 4 NVIDIA Tesla T4 GPUs running Ubuntu 18.04.5 LTS.


This how-to runs through a full Whole Genome Sequencing (WGS) somatic variant analysis pipeline for calling SNPs, MNPs, and small indels on real 30X short-read human data. Such analyses are commonly used in cancer genomics studies. For WGS somatic variant analysis, you will use the example data generated by the Somatic Mutation Working Group of the SEQC2 Consortium. The dataset contains multi-platform sequencing of HCC1395, a triple-negative breast cancer cell line, and HCC1395BL, its matched normal cell line. The SEQC2 dataset contains whole-genome (WGS) and whole-exome sequencing performed at six different sequencing centers, with multiple replicates to minimize potential biases from sequencing technologies and assays. Furthermore, the dataset was processed through nine bioinformatics pipelines to evaluate accuracy and reproducibility. The SEQC2 consortium reports artifacts generated by sample and library processing, and evaluates the capabilities and limitations of bioinformatics tools for artifact detection and removal; the consortium describes this work in a series of detailed articles.

The SEQC2 data set is well characterized, and the "Truth Variant Call Set" provided by the consortium allows you to validate the results of the WGS somatic variant analysis pipeline. When you finish variant calling, you will annotate the VCF with several databases to determine which variants are common or might be associated with disease. You will filter out common variants (those observed frequently in 1000 Genomes) and then use NVIDIA Parabricks tools for quality control to assess the variant caller results.

Download Example FASTQ Files

The example dataset used for this how-to resides on NCBI SRA and can be downloaded using wget. Once the SRA files are downloaded, install the SRA Toolkit and use it to convert the SRA files to paired-end FASTQ files following the instructions below. Note that SRR7890827 is a normal sample and SRR7890824 is a tumor sample from the HCC1395 cell line. Paired-end WGS sequencing was performed for each sample on an Illumina HiSeq X.

## Download publicly available SRA files using wget
# Normal sample
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7890827/SRR7890827
# should be 69.84G and will take 30 min. to download

# Tumor sample
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7890824/SRR7890824
# should be 64.83G and takes around 30 min. to download

## Convert SRA to FASTQ files
fastq-dump -I --split-files SRR7890827 --gzip
fastq-dump -I --split-files SRR7890824 --gzip
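After conversion, it's worth confirming that the two mate files of each pair contain the same number of reads before spending GPU time on alignment. The snippet below is a sketch using tiny synthetic files; on real data, point it at the fastq-dump output instead (the `R1.fq.gz`/`R2.fq.gz` names here are placeholders):

```shell
# Build two tiny example FASTQ files (4 lines per read); substitute the
# real fastq-dump output files when checking actual data.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > R1.fq.gz
printf '@r1\nTGCA\n+\nIIII\n@r2\nTCAA\n+\nIIII\n' | gzip > R2.fq.gz

# A FASTQ record is exactly four lines, so reads = lines / 4.
r1=$(( $(zcat R1.fq.gz | wc -l) / 4 ))
r2=$(( $(zcat R2.fq.gz | wc -l) / 4 ))
echo "R1=$r1 R2=$r2"

# Mate files of a proper pair must have identical read counts.
[ "$r1" -eq "$r2" ] && echo "pair counts match"
```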

If you have your own data, you’ll likely be able to substitute it directly into the commands in this how-to to get accurate, multi-caller variant calls.

Downloading and Indexing a Reference Genome and Known Sites

Next, you’ll download and index a reference genome to use in alignment and variant calling. You’ll use GRCh38 for alignment. If a pre-built BWA index for the genome is not already available to you, you can build one from the GRCh38 genome FASTA file using BWA. Indexing should take between 30 and 60 minutes.

Download the GRCh38.d1.vd1 reference fasta sequence: GRCh38.d1.vd1.fa.tar.gz (md5 checksum: 3ffbcfe2d05d43206f57f81ebb251dc9, file size: 875.3 MB)

Index Reference Genome

tar -xvzf GRCh38.d1.vd1.fa.tar.gz
## Note this is baseline bwa.
bwa index GRCh38.d1.vd1.fa
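When `bwa index` finishes, it should leave five companion files next to the FASTA (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`). The sketch below checks for them; the `touch` loop only simulates a completed run so the check is demonstrable here, and should be dropped when you verify a real index:

```shell
REF=GRCh38.d1.vd1.fa

# Simulate a finished indexing run for demonstration purposes only;
# remove this loop when checking a real BWA index.
for ext in amb ann bwt pac sa; do touch "$REF.$ext"; done

# Verify that every expected index file is present.
missing=0
for ext in amb ann bwt pac sa; do
  [ -e "$REF.$ext" ] || { echo "missing $REF.$ext"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "index complete"
```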

Step 1: Aligning the FASTQ Files to the Reference Genome

The pbrun fq2bam command runs read alignment as well as sorting, duplicate marking, and base quality score recalibration (BQSR) according to GATK best practices, but at a much faster rate than community tools by leveraging up to 8 NVIDIA GPUs. The pbrun fq2bam command also generates a BQSR report, which is used to improve the base qualities within the BAM files and is utilized by MuTect2 downstream for variant calling. Note that this how-to adds a custom read group to make the sample easier to identify; distinct read groups are essential for matched tumor/normal or multi-sample analyses, where samples may later be merged downstream. You will also need to pass a set of known sites (dbsnp_146.hg38.vcf.gz, ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz, Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz) to generate the BQSR report.

## Aligning Normal sample FASTQ file
$ pbrun fq2bam --ref GRCh38.d1.vd1.fa \
    --in-fq WGS_FD_N_2_R1.fq.gz WGS_FD_N_2_R2.fq.gz "@RG\tID:sm_normal_rg1\tLB:lib1\tPL:bar\tSM:sm_normal\tPU:sm_normal_rg1" \
    --knownSites Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz \
    --knownSites dbsnp_146.hg38.vcf.gz  \
    --knownSites ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz \
    --out-recal-file SRR7890827-WGS_FD_N_BQSR_REPORT.txt \
    --bwa-options=-Y --out-bam SRR7890827-WGS_FD_N.bam

## Aligning Tumor sample FASTQ file
$ pbrun fq2bam --ref GRCh38.d1.vd1.fa \
    --in-fq WGS_FD_T_2_R1.fq.gz WGS_FD_T_2_R2.fq.gz "@RG\tID:sm_tumor_rg1\tLB:lib1\tPL:bar\tSM:sm_tumor\tPU:sm_tumor_rg1" \
    --knownSites Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz \
    --knownSites dbsnp_146.hg38.vcf.gz  \
    --knownSites ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz \
    --out-recal-file SRR7890824-WGS_FD_T_BQSR_REPORT.txt \
    --bwa-options=-Y --out-bam SRR7890824-WGS_FD_T.bam

The above step performs read alignment, sorting, duplicate marking, and indexing (all included in pbrun fq2bam); it should take about 50 minutes on four NVIDIA Tesla T4 GPUs. The output of fq2bam is a BAM file, a BAI index, and a BQSR report file. These files will be used as inputs to the SNP/indel calling run in the next step.

Step 2: Generating a Set of Small Variant Calls Using Clara Parabricks Mutect2

Parabricks currently supports five variant callers: Mutect2, MuSE, LoFreq, Somatic Sniper, and Strelka. Each uses slightly different methods for calling variants, with tradeoffs in sensitivity and specificity for each. This example will run pbrun mutectcaller (Mutect2) to generate a set of small variant calls.

$ pbrun mutectcaller --ref GRCh38.d1.vd1.fa \
    --in-tumor-bam  SRR7890824-WGS_FD_T.bam \
    --in-normal-bam SRR7890827-WGS_FD_N.bam \
    --in-tumor-recal-file SRR7890824-WGS_FD_T_BQSR_REPORT.txt  \
    --in-normal-recal-file SRR7890827-WGS_FD_N_BQSR_REPORT.txt \
    --out-vcf SRR7890824-SRR7890827-WGS_FD.vcf \
    --tumor-name SRR7890824-WGS_FD_T \
    --normal-name SRR7890827-WGS_FD_N \
    --num-gpus 4 --x3 --tmp-dir $TMP_DIR

Looking for variants of clinical interest, you might consider running the other variant callers and taking the intersection of two or more callers to generate a highly specific callset. On the other hand, if you are not concerned about false positives but need a highly-sensitive callset, you could take the union of all five callers. Such vote-based consensus approaches have been employed in numerous large studies to address the shortcomings of individual callers; using pbrun votebasedvcfmerger in conjunction can generate consensus callsets directly from the command line, which you'll do in a later section.

MuTect2, developed at the Broad Institute, is the most widely used somatic variant caller; Parabricks provides an accelerated version of MuTect2 v4.1.0.0 that is 10-15X faster than the community version. MuTect2 should run in under 30 minutes on a four-T4 GPU system.

The results of MuTect2 are plain-text VCF files. To ensure that the caller worked as expected, you'll apply the vcfqc tool to generate summary statistics for the calling runs. While many studies run only a single variant caller, the speed with which the callers run in Parabricks makes it feasible to run multiple variant callers per sample.

Step 3: Validating the SNV and MNV Variants Against the Truthset

Next, you will process the Mutect2 VCF file to extract non-indel variants using the GATK4 SelectVariants tool, which selects a subset of variants based on various criteria to facilitate certain analyses. This example extracts only the SNV and MNV variants, excluding small indels, for downstream validation of the variant calls against the truthset (see detailed instructions below):

$ gatk SelectVariants \
    -R GRCh38.d1.vd1.fa \
    -V SRR7890824-SRR7890827-WGS_FD.vcf \
    --select-type-to-exclude INDEL \
    -O SRR7890824-SRR7890827-WGS_FD.SNV-MNV.vcf
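The effect of excluding indels can be sanity-checked directly: SNVs and MNVs have REF and ALT alleles of equal length, while indels do not. The sketch below applies that rule with awk to a toy VCF (the toy records are illustrative, not from the real callset):

```shell
# Build a toy VCF: one SNV, one MNV, and one insertion.
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf 'chr1\t200\t.\tAT\tGC\t.\tPASS\t.\n'
  printf 'chr1\t300\t.\tA\tAGG\t.\tPASS\t.\n'
} > toy.vcf

# Keep non-header records where REF and ALT are the same length
# (SNVs and MNVs); the indel at position 300 is dropped.
awk -F'\t' '!/^#/ && length($4) == length($5)' toy.vcf > toy.snv-mnv.vcf
kept=$(wc -l < toy.snv-mnv.vcf)
echo "kept $kept SNV/MNV records"
```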

Use GATK4 FilterMutectCalls to Apply Filters to the Raw Output of Mutect2

Once you have the SNV/MNV .vcf file, use GATK4 FilterMutectCalls to apply filters to the raw output of Mutect2. Parameters contained in M2FiltersArgumentCollection are described here.

The following example uses a high-confidence region BED file to intersect the Mutect2 variant calls with the high-confidence regions, then uses the PASS-filtered variants for validation.

$ gatk FilterMutectCalls \
-V SRR7890824-SRR7890827-WGS_FD.SNV-MNV.vcf \
-O SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.vcf

$ bedtools intersect -header \
-a SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.vcf \
-b Truthset_version1.2/High-Confidence_Regions_v1.2.bed >SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf

$ grep "#" SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf >mutect_header.txt

$ grep -v "#" SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf | awk '{if ($7 == "PASS")print}' >SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PA.vcf

$ cat mutect_header.txt SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PA.vcf  >SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PASS.vcf
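To confirm the PASS-only file contains what you expect, you can tally records per FILTER value before and after the grep/awk step above. The sketch below runs the same FILTER-column logic on a toy VCF (the records are illustrative stand-ins for the real FilterMutectCalls output):

```shell
# Toy VCF with two PASS records and one filtered record.
{
  printf '##fileformat=VCFv4.2\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t.\tweak_evidence\t.\n'
  printf 'chr1\t300\t.\tG\tA\t.\tPASS\t.\n'
} > calls.vcf

# Tally records per FILTER value (column 7).
grep -v '^#' calls.vcf | awk -F'\t' '{n[$7]++} END {for (f in n) print f, n[f]}' | sort

# Count only the PASS records, mirroring the awk filter used above.
pass=$(grep -v '^#' calls.vcf | awk -F'\t' '$7 == "PASS"' | wc -l)
echo "PASS records: $pass"
```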

Finally, once you have the filtered Mutect2 calls, use the R vcfR and UpSetR packages to generate an upset plot and evaluate the performance of the variant calling pipeline. Use the following code to generate the upset plot:

> library(vcfR)
> library(UpSetR)

> ## SEQC2 Somatic SNV upset plot
> ## Load the vcf files from truthset and Mutect2
> Truthset_snv_hc <- read.vcfR("high-confidence_sSNV_in_HC_regions_v1.2.hc.vcf")
> mutect_snv_hc <- read.vcfR("SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PASS.vcf")

> ### extract chromosome and position to compare
> Truthset_snv_hc <-
  as.vector(paste(Truthset_snv_hc@fix[, "CHROM"], Truthset_snv_hc@fix[, "POS"], sep = "_"))
> mutect_snv_hc <-
  as.vector(paste(mutect_snv_hc@fix[, "CHROM"], mutect_snv_hc@fix[, "POS"], sep = "_"))

> ## create the read set
> read_sets <- list(Truthset_SEQC2_SNV = Truthset_snv_hc,
                  Mutect2 = mutect_snv_hc)

> upset(fromList(read_sets), order.by = c("freq"), keep.order = TRUE,
  sets = c("Mutect2", "Truthset_SEQC2_SNV"),
  point.size = 3.5, line.size = 2,
  mainbar.y.label = "Variant Intersections",
  sets.x.label = "Variant Calls Per Subset",
  text.scale = c(2, 2, 1.5, 1.5, 1.5, 1.5))

You can also use som.py from the Illumina hap.py package to compare the two VCF files.
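The CHROM_POS overlap computed in R above can also be sketched in the shell: build a sorted key per site from each VCF and intersect the key lists with comm. The toy VCFs below are illustrative; on real data, substitute the truthset and PASS-filtered Mutect2 files:

```shell
# Toy truthset and callset sharing exactly one site (chr1:100).
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\nchr1\t200\t.\tC\tT\t.\tPASS\t.\n' > truth.vcf
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\nchr2\t50\t.\tG\tA\t.\tPASS\t.\n' > calls.vcf

# Build sorted CHROM_POS keys, matching the paste(..., sep = "_") step in R.
awk -F'\t' '!/^#/ {print $1"_"$2}' truth.vcf | sort > truth.keys
awk -F'\t' '!/^#/ {print $1"_"$2}' calls.vcf | sort > calls.keys

# comm -12 emits lines common to both sorted inputs.
shared=$(comm -12 truth.keys calls.keys | wc -l)
echo "shared sites: $shared"
```

Note that matching on CHROM_POS alone ignores the alleles; som.py performs a stricter, allele-aware comparison.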

Intermediate Quality Control with vcfqc

Parabricks contains built-in quality control report tools for validating VCF outputs. These scripts take arguments indicating which INFO fields contain the values to be plotted. When finished, they create a simple HTML report of plots and summary statistics that can help verify whether MuTect2, MuSE, LoFreq, Somatic Sniper, and Strelka produced reasonable results. If you see something abnormal in your quality control plots, check whether you ran the alignment and calling stages with the correct files and parameters. Each variant caller uses slightly different INFO and FORMAT fields to hold the information used in the quality control reports, so for each VCF you must tell the vcfqc tool which fields correspond to which values. You'll run vcfqc once for each output VCF. The MuTect2 QC report is the most comprehensive; you can open it in a web browser such as Firefox.

Merging Variant Callsets from Multiple Callers into One VCF File

The votebasedvcfmerger tool takes in VCF files from multiple callers and generates union and intersection VCFs. You will use the votebasedvcfmerger tool to merge the VCFs, but won’t immediately do any filtering. You'll filter the merged VCF after annotation to demonstrate common queries for different kinds of studies. The votebasedvcfmerger tool takes a set of VCF files prefixed with the name of the caller they came from and a minimum number of "votes", or supporting callers for a given variant. You’ll pass --min-votes 1 in the call to votebasedvcfmerger to keep all variants in the callset.

Annotating Variant Calls with Information from Databases

On their own, the VCF files produced by the somatic variant callers are of limited use; annotation is what connects each call to known population frequencies and disease associations.

You’ll need to download all of the databases and indices before you can annotate the VCF with them. This how-to uses the following databases:

  • 1000Genomes: a database of variants called in 2,504 healthy individuals.

  • ClinVar: an NCBI database of variants associated with disease.

  • gnomAD: The Genome Aggregation Database contains population variants from 125,748 exomes and 15,708 whole genomes.

  • COSMIC: Coding and non-coding variants from the Catalogue of Somatic Mutations in Cancer

  • dbSNP: SNPs, MNPs and indels for both common and clinical mutations

Downloading the COSMIC VCF files requires signing in to the COSMIC site; instructions for doing so are not included here, but we have tested the vcfanno tool on both the CosmicCodingMuts.normal.vcf.gz and CosmicNoncodingVariants.normal.vcf.gz files. Make sure to use the normalized variant files; otherwise, variant alleles at indels may not match between the database and query. For this how-to, you will annotate the query VCF(s) with both the COSMIC NonCoding and COSMIC Coding files.

Once you have downloaded all of the VCFs and TBI indices, use the vcfanno tool in Parabricks to annotate the VCF file.
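vcfanno is driven by a TOML configuration that lists, for each database, which fields to copy into the query VCF and under what names. A minimal sketch follows; the file paths and output field names are illustrative assumptions, not prescribed values, and should be adapted to the databases you downloaded:

```toml
# Hypothetical vcfanno configuration: one [[annotation]] block per database.
# "fields" are INFO fields read from the database VCF, "names" are the
# INFO fields written into the annotated output.

[[annotation]]
file = "gnomad.genomes.vcf.gz"
fields = ["AF"]
ops = ["self"]
names = ["gnomad_af"]

[[annotation]]
file = "CosmicCodingMuts.normal.vcf.gz"
fields = ["ID"]
ops = ["self"]
names = ["cosmic_coding_id"]
```

Each annotated database VCF must be bgzip-compressed and tabix-indexed for vcfanno to query it by region.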