Whole-Genome Somatic Small Variant Calling

This how-to explores calling variants using Mutect2 in the SEQC2 Consortium Sample.

Software Prerequisites

  • Samtools and BWA

  • SRA Toolkit

  • Parabricks

  • GATK4 v4.2.0.0

  • bedtools

  • R (vcfR, UpSetR)

Compute Requirements and Configuration

  • At least 512GB of disk space (preferably on “balanced” or SSD storage)

  • At least two NVIDIA Tesla T4 GPUs; the commands and time estimates below utilize four NVIDIA Tesla T4 GPUs.

  • At least 24 CPU cores; the commands below use 48 vCPU cores.

  • At least 48GB of RAM per GPU

  • A functional Parabricks install (version 3.7 or newer)

The examples below use an AWS VM g4dn.12xlarge with 48 vCPU cores, 192GB of RAM, and 4 NVIDIA Tesla T4 GPUs running Ubuntu 18.04.5 LTS.

Introduction

This how-to runs through a full Whole Genome Sequencing (WGS) somatic variant analysis pipeline for calling SNPs, MNPs and small indels on real 30X short-read human data. Such analyses are commonly used in cancer genomics studies. For WGS somatic variant analysis, you will utilize the example data generated by "The Somatic Mutation Working Group of the SEQC2 Consortium". The dataset contains multiplatform sequencing from HCC1395, which is a triple negative breast cancer cell line, and HCC1395 BL is the matched normal cell line. The SEQC2 dataset contains whole genome (WGS) and Exome, sequencing performed at six different sequencing centers, and multiple replicates to minimized potential biases from sequencing technologies/assays. Furthermore, the dataset was processed through nine bioinformatics pipelines to evaluate accuracy and reproducibility. The SEQC2 consortium reports artifacts generated due to sample and library processing, along with evaluating the capabilities and limitation of bioinformatics tools for artifact detection and removal. A series of detailed articles are available below:

The SEQC2 data set is well characterized and the "Truth Variant Call set" provided by the consortium allows you to validate the results from the WGS somatic variant analysis pipeline. When you finish variant calling, you will annotate the VCF with several databases to determine which variants are common or might be associated with disease. You will filter out common variants (those observed frequently in 1000 Genomes) and then use NVIDIA Parabricks tools for quality control to assess the variant caller results.

Download Example FASTQ Files

The example dataset used for this how-to resides on NCBI SRA and can be downloaded using wget. Once the SRA files are downloaded, install the SRA Toolkit to convert SRA files to paired end FASTQ files using the instructions given below. Note that SRR7890827 is a normal sample and SRR7890824 is a tumor sample from the HCC1395 cell line. Paired end WGS sequencing was performed for each sample on an Illumina HiSeq X.

## Download publicly available SRA files using wget. Both files are
# about 65 GB in size.
# Normal sample
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7890827/SRR7890827 --output-document=SRR7890827.sra

# Tumor sample
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7890824/SRR7890824 --output-document=SRR7890824.sra

## Convert SRA to FASTQ files
fastq-dump --split-files ./SRR7890827.sra --gzip
fastq-dump --split-files ./SRR7890824.sra --gzip

Once the SRA files have been converted to FASTQ format they are no longer needed and may be deleted.

If you have your own data you’ll likely be able to substitute it directly into the commands in this how-to to get accurate, multi-caller variant calls.

Downloading and Indexing a Reference Genome and Known Sites

Next, download and index a reference genome to use in alignment and variant calling. We’ll use GRCh38 for alignment, which may be downloaded from this page:

From that website please download and extract the contents of:

  • GRCh38.d1.vd1.fa.tar.gz

If you don't want to build your own BWA indices also download and extract the contents of:

  • GRCh38.d1.vd1_BWA.tar.gz and

  • GRCh38.d1.vd1_GATK_indices.tar.gz

Otherwise you'll need to build your indices. The commands for doing so are given below.

We’ll need these additional VCF files and their indices:

along with BED and VCF files for the truth set:

Index Reference Genome

If you use a reference file for which the index is not available you can index it using BWA. Indexing should take 30 to 60 minutes.

tar -xvzf GRCh38.d1.vd1.fa.tar.gz
## Note this is baseline bwa.
bwa index GRCh38.d1.vd1.fa

Step1: Aligning the Fastq Files to the Reference Genome

The pbrun fq2bam command runs read alignment as well as sorting, duplicate marking and base quality score recalibration (BQSR) according to GATK best practices, but at a much faster rate than community tools by leveraging up to 8 NVIDIA GPUs. The pbrun fq2bam command also generates a BQSR report which is used to improve the base qualities within the BAM files and is used by MuTect2 downstream for variant calling. Note that this how-to adds a custom read group to make the sample easier to identify. If the sample is later merged with others in downstream analysis, adding a custom read group will be essential when running alignment for matched tumor/normal or multi-sample analyses. You will also need to pass a set of known sites:

  • GCF_000001405.39.gz

  • ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz

  • Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz

to generate the BQSR report.

## Aligning Normal sample FASTQ file
$ pbrun fq2bam --ref GRCh38.d1.vd1.fa \
    --in-fq SRR7890827_1.fastq.gz SRR7890827_2.fastq.gz "@RG\tID:id_SRR7890827_rg1\tLB:lib1\tPL:bar\tSM:sm_SRR7890827\tPU:pu_SRR7890827_rg1" \
    --knownSites Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz \
    --knownSites GCF_000001405.39.gz \
    --knownSites ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz \
    --out-recal-file SRR7890827-WGS_FD_N_BQSR_REPORT.txt \
    --bwa-options=-Y \
    --out-bam SRR7890827-WGS_FD_N.bam

## Aligning Tumor sample FASTQ file
$ pbrun fq2bam --ref GRCh38.d1.vd1.fa \
    --in-fq SRR7890824_1.fastq.gz SRR7890824_2.fastq.gz "@RG\tID:id_SRR7890824_rg1\tLB:lib1\tPL:bar\tSM:sm_SRR7890824\tPU:pu_SRR7890824_rg1" \
    --knownSites Mills_and_1000G_gold_standard.indels.b38.primary_assembly.vcf.gz \
    --knownSites GCF_000001405.39.gz \
    --knownSites ALL.wgs.1000G_phase3.GRCh38.ncbi_remapper.20150424.shapeit2_indels.vcf.gz \
    --out-recal-file SRR7890824-WGS_FD_T_BQSR_REPORT.txt \
    --bwa-options=-Y \
    --out-bam SRR7890824-WGS_FD_T.bam

The pbrun fq2bam command performs read alignment, sorting, duplicate marking and indexing; it should take about 50 minutes running on four NVIDIA Tesla T4 GPUs. The output of fq2bam is a BAM file, a BAI index and a BQSR report file. These files will be used as inputs to the multiple SNP/indel callers run in the next step.

Step2: Generating a Set of Small Variant Calls Using Clara Parabricks Mutect2

Parabricks currently supports five variant callers: Mutect2, MuSE, LoFreq, Somatic Sniper, and Strelka. Each uses slightly different methods for calling variants with trade-offs in sensitivity and specificity for each. This example will run pbrun mutectcaller (Mutect2) to generate a set of small variant calls.

$ pbrun mutectcaller --ref GRCh38.d1.vd1.fa \
    --in-tumor-bam  SRR7890824-WGS_FD_T.bam \
    --in-normal-bam SRR7890827-WGS_FD_N.bam \
    --in-tumor-recal-file SRR7890824-WGS_FD_T_BQSR_REPORT.txt  \
    --in-normal-recal-file SRR7890827-WGS_FD_N_BQSR_REPORT.txt \
    --out-vcf SRR7890824-SRR7890827-WGS_FD.vcf \
    --tumor-name sm_SRR7890824 \
    --normal-name sm_SRR7890827

This command puts the called variants in SRR7890824-SRR7890827-WGS_FD.vcf and creates the SRR7890824-SRR7890827-WGS_FD.vcf.stats file, which will be used in subsequent steps.

Looking for variants of clinical interest, you might consider running the other variant callers and taking the intersection of two or more callers to generate a highly specific callset. On the other hand if you are less concerned with false positives but need a highly-sensitive callset, you could take the union of all five callers. Such vote-based consensus approaches have been employed in numerous large studies to address the shortcomings of individual callers; running votebasedvcfmerger can generate consensus callsets directly from the command line. We'll do that in a later section.

First run MuTect2, which is the most widely used variant caller developed at Broad. Parabricks provides an accelerated version of MuTect2 v4.2.0.0 that is 10-15X faster than the community version. MuTect2 should run in under 30 minutes on a four-T4 GPU system.

The results of MuTect2 are plain-text VCF files. To ensure that each caller worked as expected apply the vcfqc tool to generate some summary statistics for the calling runs. While it would be permissible to run any of these variant calling stages individually for many studies, the speed with which they run in Parabricks makes it feasible to run multiple variant callers per sample.

Step3: For Validating the SNV and MNV Variants with truth set

Next, process the Mutect2 VCF files to extract non-indel variants using the GATK4 SelectVariants tool which makes it possible to select a subset of variants based on various criteria in order to facilitate certain analyses. This example analysis extracts only the SNV and MNV variants, excluding small indels for further downstream validation of the variant calls in comparison to truthset (see detailed instructions below):

$ java -jar gatk-4.2.0.0/gatk-package-4.2.0.0-local.jar FilterMutectCalls \
    -O SRR7890824-SRR7890827-WGS_FD.FilterMutectCalls.vcf \
    -R GRCh38.d1.vd1.fa \
    -V SRR7890824-SRR7890827-WGS_FD.vcf

Use GATK4 FilterMutectCalls to Apply Filters to the Raw Output of Mutect2

Once you have the SNV/MNV .vcf file, use GATK4 FilterMutectCalls to apply filters to the raw output of Mutect2. Parameters contained in M2FiltersArgumentCollection are described here.

The following example uses a high confidence region bedfile to intersect the Mutect2 variants calls in the high confidence region and use the PASS filter variants for validation.

$ java -jar gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar SelectVariants \
    -R GRCh38.d1.vd1.fa \
    -V SRR7890824-SRR7890827-WGS_FD.FilterMutectCalls.vcf \
    --select-type-to-exclude INDEL \
    -O SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.vcf

$ bedtools intersect \
    -header \
    -a SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.vcf \
    -b High-Confidence_Regions_v1.2.bed > SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf

$ grep "#" SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf > mutect_header.txt

$ grep -v "#" SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.vcf | awk '{if ($7 == "PASS")print}' > mutect_body.txt

$ cat mutect_header.txt mutect_body.txt > SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PASS.vcf

Once you have the filtered Mutect2 calls, use R vcfR and upSetR packages to generate an upset plot to evaluate the performance of the variant calling pipeline. Use the following code to generate an upset plot:

> library(vcfR)
> library(UpSetR)

> ## SEQC2 Somatic SNV upset plot
> ## Load the vcf files from truthset and Mutect2
> Truthset_snv_hc <- read.vcfR("high-confidence_sSNV_in_HC_regions_v1.2.hc.vcf.gz")
> mutect_snv_hc <- read.vcfR("SRR7890824-SRR7890827-WGS_FD.SNV-MNV.FilterMutectCalls.hc.PASS.vcf")

> ### extract chromosome and position to compare
> Truthset_snv_hc <-
  as.vector(paste(Truthset_snv_hc@fix[, "CHROM"], Truthset_snv_hc@fix[, "POS"], sep = "_"))
> mutect_snv_hc <-
  as.vector(paste(mutect_snv_hc@fix[, "CHROM"], mutect_snv_hc@fix[, "POS"], sep = "_"))

> ## create the read set
> read_sets <- list(Truthset_SEQC2_SNV = Truthset_snv_hc,
                  Mutect2 = mutect_snv_hc)

> upset(fromList(read_sets), order.by = c("freq"), keep.order = TRUE, sets = c("Mutect2", "Truthset_SEQC2_SNV"), point.size = 3.5, line.size = 2, mainbar.y.label = "Variant Intersections", text.scale = c(2, 2, 1.5, 1.5, 1.5, 1.5),sets.x.label = "Variants Calls Per Subset")

You should get a plot similar to the following:

../_images/HowTo_SomaticCallingPlot.png

You can also use som.py from the Illumina hap.py package to compare the two VCF files.

Intermediate Quality Control with vcfqc

Parabricks contains built-in quality control report tools for validating VCF outputs. These scripts take arguments about which INFO fields contain values to be plotted. When finished, they create a simple HTML report of plots and summary statistics that can help verify whether MuTect2, MuSE, LoFreq, Somatic Sniper and Strelka produced reasonable results. If you see something abnormal in your quality control plots you can check whether you ran the alignment and calling stages using the correct files and parameters. Each variant caller uses slightly different INFO and FORMAT fields to hold the information used in the quality control reports, so for each VCF you must tell the vcfqc tool which fields correspond to which values. Run vcfqc once for each output VCF. You can look at the MuTect2 QC report, which is the most comprehensive of the three. You can open the report in a web browser such as Firefox.

Merging Variant Callsets from Multiple Callers into One VCF File

The votebasedvcfmerger tool takes in VCF files from multiple callers and generates union and intersection VCFs. The votebasedvcfmerger tool can be used to merge the VCFs but not do any filtering. Filter the merged VCF after annotation to demonstrate common queries for different kinds of studies. The votebasedvcfmerger tool takes a set of VCF files prefixed with the name of the caller they came from and a minimum number of "votes", or supporting callers for a given variant. Pass --min-votes 1 in the call to votebasedvcfmerger to keep all variants in the callset.

Annotating Variant Calls with Information from Databases

On their own the VCF files produced by DeepVariant, HaplotypeCaller, and Strelka are not particularly useful.

You’ll need to download all of the databases and indices before you can annotate the VCF with them. This how-to uses the following databases:

  • 1000Genomes: a database of variants called in 2,504 individuals from healthy individuals.

  • ClinVar: an NCBI database of variants associated with disease.

  • gnomAD: The Genome Aggregation Database contains population variants from 125,748 exomes and 15,708 whole genomes.

  • COSMIC: Coding and non-coding variants from the Catalogue of Somatic Mutations in Cancer

  • dbSNP: SNPs, MNPs and indels for both common and clinical mutations

Downloading the COSMIC VCF files requires signing in to the COSMIC site; instructions for doing so are not included here, but we have tested the vcfanno tool on both the CosmicCodingMuts.normal.vcf.gz and CosmicNoncodingVariants.normal.vcf.gz files. Make sure to use the normalized variant files; otherwise, variant alleles at indels may not match between the database and query. For this how-to, you will annotate the query VCF(s) with both the COSMIC NonCoding and COSMIC Coding files.

Once you have downloaded all of the VCFs ind TBI indices, use the vcfanno tool in Parabricks to annotate the VCF file.