mutectcaller - NVIDIA Docs

This tool is an accelerated version of Mutect2, the GATK somatic variant caller. It takes aligned BAM files, such as those produced by the FQ2BAM tool, and outputs a VCF file. The input can be either a single BAM (“tumor-only”) or a pair of BAMs (“tumor-normal”), in which the normal sample provides a baseline to call somatic variants against.

The figure below shows the high-level functionality of mutectcaller. All dotted boxes indicate optional data, with some constraints.

The names of the tumor sample (for the --tumor-name option) and the normal sample (for the --normal-name option) can be extracted from the headers of their respective BAM files with samtools, which can be installed through apt-get:

    $ sudo apt-get install samtools

Alternatively, you can build it from source by following the instructions in the samtools repository.

Once samtools is installed on your system, you can run this command to print the read group lines, which contain the sample name (SM) field:

    $ samtools view -H NA12878.bam | grep '^@RG'
    @RG ID:HJYFJ.4  SM:NA12878  LB:Pond-492093  PL:illumina PU:HJYFJCCXX160204.4.GCCGCAAC   CN:BI   DT:2016-02-04T00:00:00-0500

The sample name is the value after "SM:" (NA12878 in this example).

If there are multiple read group (@RG) lines and all of them have the same sample name, you may safely use the common sample name. If there are multiple read group lines with different sample names, choose one sample name as the input. All reads with that sample name will be processed by mutectcaller, and all other reads will be ignored. Currently, only one sample name per BAM file is supported.
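To see which sample names a BAM header contains before choosing one, the SM values can be pulled out of the @RG lines. The sketch below is a minimal example, assuming the header text arrives on standard input (for instance from samtools view -H); the function name list_sample_names is hypothetical and is not part of samtools or Parabricks.

```shell
# Hypothetical helper (not part of samtools or Parabricks):
# print each distinct SM value found in the @RG lines of a SAM header read from stdin.
list_sample_names() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SM:/) print substr($i, 4)
    }' | sort -u
}

# Example usage (requires samtools):
#   samtools view -H tumor.bam | list_sample_names
```

If this prints more than one name, pick one of them for --tumor-name or --normal-name; if it prints nothing, the BAM needs read group information added first.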

If there are no read group lines in the BAM header, or there is no sample name in the read group line, you will need to add read group information to the BAM file. This may be done by running this command:

    $ samtools addreplacerg \
        -r "@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample_sm\tPU:sample_rg1" \
        -o updated_file.bam \
        -O BAM \
        original_file.bam

This will update the sample name of all reads in this BAM file to "sample_sm", and you can then pass "sample_sm" as the sample name for this BAM file. Make sure you use updated_file.bam as the input to mutectcaller.

See the mutectcaller Reference section for a detailed listing of all available options.

Quick Start

You can download the mutect sample dataset from here. Extract all files by running:

    $ tar -xvzf mutect_sample.tar.gz
    mutect_sample/
    mutect_sample/germline_resource.vcf.gz.tbi
    mutect_sample/force_call.vcf.gz.tbi
    mutect_sample/germline_resource.vcf.gz
    mutect_sample/tumor.bam.bai
    mutect_sample/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    mutect_sample/force_call.vcf.gz
    mutect_sample/tumor.bam
    mutect_sample/normal.bam.bai
    mutect_sample/normal.bam

Inside the mutect_sample folder you will find the necessary input files, including:

one reference FASTA (GCA_000001405.15_GRCh38_no_alt_analysis_set.fa),
one tumor BAM (tumor.bam),
one normal BAM (normal.bam),
one force-call VCF (force_call.vcf.gz), and
one germline resource VCF (germline_resource.vcf.gz),

with all necessary indexes.
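The Quick Start command refers to its inputs through shell variables. A minimal sketch of setting them for the sample dataset follows, assuming the archive was extracted as shown and that you run the command from inside the mutect_sample directory; the output file name is an arbitrary choice, not something the dataset prescribes.

```shell
# File names taken from the sample archive listing; run from inside mutect_sample/.
export REFERENCE_FILE=GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
export INPUT_TUMOR_BAM=tumor.bam
export INPUT_NORMAL_BAM=normal.bam
# Assumption: any .vcf or .vcf.gz name works here; this one is arbitrary.
export OUTPUT_VCF=mutect_output.vcf
```

The sample names to pass as --tumor-name and --normal-name still have to be read from the BAM headers as described earlier.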

    # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
    docker run --rm --gpus all --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir \
        nvcr.io/nvidia/clara/clara-parabricks:4.5.1-1 \
        pbrun mutectcaller \
        --ref /workdir/${REFERENCE_FILE} \
        --in-tumor-bam /workdir/${INPUT_TUMOR_BAM} \
        --tumor-name tumor_name_inside_bam_file \
        --in-normal-bam /workdir/${INPUT_NORMAL_BAM} \
        --normal-name normal_name_inside_bam_file \
        --out-vcf /outputdir/${OUTPUT_VCF}
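Because the normal BAM and the normal sample name are optional, a tumor-only run uses the same command without those two options. The sketch below only assembles and prints the pbrun argument list so the difference between the two modes is easy to see; mutect_args is a hypothetical helper, and the file and sample names in the usage lines are placeholders.

```shell
# Hypothetical helper: print the pbrun mutectcaller command line for either mode.
# Usage: mutect_args REF OUT_VCF TUMOR_BAM TUMOR_NAME [NORMAL_BAM NORMAL_NAME]
mutect_args() {
    cmd="pbrun mutectcaller --ref $1 --in-tumor-bam $3 --tumor-name $4"
    # The normal sample is optional; add it only when both values are given.
    if [ "$#" -ge 6 ]; then
        cmd="$cmd --in-normal-bam $5 --normal-name $6"
    fi
    echo "$cmd --out-vcf $2"
}

mutect_args ref.fa out.vcf tumor.bam tumor_sm                       # tumor-only
mutect_args ref.fa out.vcf tumor.bam tumor_sm normal.bam normal_sm  # tumor-normal
```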

Compatible GATK4 Command

The command below is the GATK4 counterpart of the Parabricks command above. Its output should match the output of the Parabricks command, apart from the small differences described under Source of Mismatches. See the Output Comparison page for comparing the results.

    $ gatk Mutect2 \
        -R <INPUT_DIR>/${REFERENCE_FILE} \
        --input <INPUT_DIR>/${INPUT_TUMOR_BAM} \
        --tumor-sample tumor_name_inside_bam_file \
        --input <INPUT_DIR>/${INPUT_NORMAL_BAM} \
        --normal-sample normal_name_inside_bam_file \
        --output <OUTPUT_DIR>/${OUTPUT_VCF}
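As a quick sanity check before a full comparison, you can count the variant sites that appear in only one of the two VCFs. This is a rough sketch, not the method from the Output Comparison page: it assumes uncompressed .vcf files, compares only CHROM, POS, REF, and ALT, and uses hypothetical helper names.

```shell
# Hypothetical helpers for a rough site-level comparison of two uncompressed VCFs.

# Print one sorted key per record: CHROM:POS:REF>ALT (header lines are skipped).
vcf_sites() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ">" $5 }' "$1" | LC_ALL=C sort
}

# Print the number of sites present in exactly one of the two files.
count_site_mismatches() {
    sites_a=$(mktemp)
    sites_b=$(mktemp)
    vcf_sites "$1" > "$sites_a"
    vcf_sites "$2" > "$sites_b"
    LC_ALL=C comm -3 "$sites_a" "$sites_b" | wc -l | tr -d ' '
    rm -f "$sites_a" "$sites_b"
}

# Example usage:
#   count_site_mismatches parabricks.vcf gatk.vcf
```

A result of 0 means both files list the same sites; it says nothing about differences in annotations or filter fields.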

Mutect2 with Panel of Normals

Starting with version 3.7.0-1, Parabricks Mutect2 supports using a Panel of Normals (PON) to process variants. There are three steps involved:

Run prepon to generate the PON index.
Run mutectcaller with the index generated by prepon.
Run postpon to update the VCF with PON information.

    # The first command generates the PON index (input.pon for input.vcf.gz); it only needs to be run once per PON VCF.
    # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
    docker run --rm --gpus all --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir \
        nvcr.io/nvidia/clara/clara-parabricks:4.5.1-1 \
        pbrun prepon --in-pon-file /workdir/${INPUT_PON_VCF}

    # Run mutectcaller with the PON index.
    # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
    docker run --rm --gpus all --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir \
        nvcr.io/nvidia/clara/clara-parabricks:4.5.1-1 \
        pbrun mutectcaller \
        --ref /workdir/${REFERENCE_FILE} \
        --in-tumor-bam /workdir/${INPUT_TUMOR_BAM} \
        --tumor-name tumor \
        --in-normal-bam /workdir/${INPUT_NORMAL_BAM} \
        --normal-name normal \
        --pon /workdir/${INPUT_PON_VCF} \
        --out-vcf /outputdir/${OUTPUT_VCF}

    # Add the PON annotation to the VCF generated above.
    # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
    docker run --rm --gpus all --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir \
        nvcr.io/nvidia/clara/clara-parabricks:4.5.1-1 \
        pbrun postpon \
        --in-vcf /workdir/${OUTPUT_VCF} \
        --in-pon-file /workdir/${INPUT_PON_VCF} \
        --out-vcf /outputdir/${OUTPUT_ANNOTATED_VCF}
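The three commands repeat the same docker run prefix, so a small wrapper can keep a PON workflow script readable. This is a sketch under assumptions: pb, PB_IMAGE, and PB_DRY_RUN are hypothetical names introduced here, and the image tag is copied from the commands above.

```shell
# Image tag copied from the commands above; adjust it to the release you use.
PB_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.5.1-1"

# Hypothetical helper: run a pbrun subcommand inside the Parabricks container,
# mounting the current directory as both /workdir and /outputdir.
# Set PB_DRY_RUN=1 to print the docker command instead of executing it.
pb() {
    set -- docker run --rm --gpus all \
        --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir "$PB_IMAGE" pbrun "$@"
    if [ "${PB_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example usage (the same pattern applies to mutectcaller and postpon):
#   pb prepon --in-pon-file /workdir/${INPUT_PON_VCF}
```

The dry-run switch is there so the assembled command can be inspected on a machine without a GPU or Docker.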

Source of Mismatches

While Parabricks mutectcaller does not lose any accuracy compared with GATK Mutect2, a few implementation differences can result in slightly different output files.

Log10 implementation

The log10 operation is used to compute the haplotype penalty score. The Java implementation, java.lang.Math.log10(), gives slightly different results from the log10 function in the C++ cmath library, which leads to small mismatches in the computed scores. As a result, different haplotypes might be selected.

AVX

GATK calls Intel GKL (Genomics Kernel Library), which contains optimized versions of compute kernels (e.g. Smith-Waterman, PairHMM) that run on Intel Architecture (AVX, AVX2, AVX-512, and multicore). However, some SIMD intrinsics such as _mm512_mul_ps can produce slightly different output than the serial operations on which our GPU implementation is based.
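A general reason vectorized and serial code can disagree is that floating-point arithmetic is sensitive to how operations are grouped and rounded. The snippet below is only an illustration of that effect using double-precision arithmetic in awk; it does not reproduce the GKL or Parabricks kernels, and compare_groupings is a hypothetical name.

```shell
# Illustration only: the same three numbers summed with two different groupings.
compare_groupings() {
    awk 'BEGIN {
        left  = (0.1 + 0.2) + 0.3   # grouped left to right, as a serial loop would
        right = 0.1 + (0.2 + 0.3)   # grouped differently, as a reordered reduction might
        printf "%.17g %.17g %s\n", left, right, (left == right) ? "same" : "different"
    }'
}

compare_groupings
```

The two sums differ only in the last bits, which is the scale of discrepancy that can tip a comparison between closely scored candidates.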

HashMap, HashSet iteration

GATK can give non-deterministic outputs because iterating over a Java HashMap or HashSet does not preserve order. Parabricks always gives deterministic output by using a hash table that preserves the insertion order (similar to LinkedHashMap in Java).
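The idea of insertion-ordered iteration can be sketched with a small shell analogy. This is not how Parabricks or GATK store their data; ordered_unique is a hypothetical helper that mimics an insertion-ordered set by emitting each item the first time it is seen.

```shell
# Analogy only: emit each distinct input line once, in first-seen (insertion) order.
ordered_unique() {
    awk '!seen[$0]++'
}

printf 'hapC\nhapA\nhapC\nhapB\n' | ordered_unique
```

The output order depends only on the input order, so repeated runs give the same result; iterating over a plain hash container offers no such guarantee.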

mutectcaller Reference

Run GPU mutect2 to convert BAM/CRAM to VCF.

Type	Name	Required?	Description
I/O	--ref REF	Yes	Path to the reference file.
I/O	--out-vcf OUT_VCF	Yes	Path of the VCF file after Variant Calling. (Allowed: .vcf, .vcf.gz)
I/O	--in-tumor-bam IN_TUMOR_BAM	Yes	Path of the BAM/CRAM file for tumor reads.
I/O	--in-normal-bam IN_NORMAL_BAM	No	Path of the BAM/CRAM file for normal reads.
I/O	--in-tumor-recal-file IN_TUMOR_RECAL_FILE	No	Path of the report file after Base Quality Score Recalibration for tumor sample.
I/O	--in-normal-recal-file IN_NORMAL_RECAL_FILE	No	Path of the report file after Base Quality Score Recalibration for normal sample.
I/O	--interval-file INTERVAL_FILE	No	Path to an interval file in one of these formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times.
I/O	--mutect-bam-output MUTECT_BAM_OUTPUT	No	File to which assembled haplotypes should be written. If passing with --run-partition, multiple BAM files will be written.
I/O	--pon PON	No	Path of the vcf.gz PON file. Make sure you run prepon first and there is a '.pon' file already.
I/O	--mutect-germline-resource MUTECT_GERMLINE_RESOURCE	No	Path of the vcf.gz germline resource file. Population VCF of germline sequencing containing allele fractions.
I/O	--mutect-f1r2-tar-gz MUTECT_F1R2_TAR_GZ	No	Path of the tar.gz of collecting F1R2 counts.
I/O	--mutect-alleles MUTECT_ALLELES	No	Path of the vcf.gz force-call file. The set of alleles to force-call regardless of evidence.
Tool	--max-mnp-distance MAX_MNP_DISTANCE	No	Two or more phased substitutions separated by this distance or less are merged into MNPs. (default: 1)
Tool	--mutectcaller-options MUTECTCALLER_OPTIONS	No	Pass supported mutectcaller options as one string. The following original mutectcaller options are currently supported: -pcr-indel-model and -max-reads-per-alignment-start (e.g. --mutectcaller-options="-pcr-indel-model HOSTILE -max-reads-per-alignment-start 30").
Tool	--initial-tumor-lod INITIAL_TUMOR_LOD	No	Log 10 odds threshold to consider pileup active.
Tool	--tumor-lod-to-emit TUMOR_LOD_TO_EMIT	No	Log 10 odds threshold to emit variant to VCF.
Tool	--pruning-lod-threshold PRUNING_LOD_THRESHOLD	No	Ln likelihood ratio threshold for adaptive pruning algorithm.
Tool	--active-probability-threshold ACTIVE_PROBABILITY_THRESHOLD	No	Minimum probability for a locus to be considered active.
Tool	--no-alt-contigs	No	Ignore commonly known alternate contigs.
Tool	--genotype-germline-sites	No	Call all apparent germline sites even though they will ultimately be filtered.
Tool	--genotype-pon-sites	No	Call sites in the PoN even though they will ultimately be filtered.
Tool	--force-call-filtered-alleles	No	Force-call filtered alleles included in the resource specified by --mutect-alleles.
Tool	--filter-reads-too-long	No	Ignore all input BAM reads with size > 500bp.
Tool	--tumor-name TUMOR_NAME	Yes	Name of the sample for tumor reads. This MUST match the SM tag in the tumor BAM file.
Tool	--normal-name NORMAL_NAME	No	Name of the sample for normal reads. If specified, this MUST match the SM tag in the normal BAM file.
Tool	-L INTERVAL, --interval INTERVAL	No	Interval within which to call the variants from the BAM/CRAM file. All intervals will have a padding of 100 to get read records, and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times (e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000").
Tool	-ip INTERVAL_PADDING, --interval-padding INTERVAL_PADDING	No	Amount of padding (in base pairs) to add to each interval you are including.
Performance	--mutect-low-memory	No	Use low memory mode in mutect caller.
Performance	--run-partition	No	Turn on partition mode; divides genome into multiple partitions and runs 1 process per partition.
Performance	--gpu-num-per-partition GPU_NUM_PER_PARTITION	No	Number of GPUs to use per partition.
Performance	--num-htvc-threads NUM_HTVC_THREADS	No	Number of CPU threads to use. (default: 5)
Runtime	--verbose	No	Enable verbose output.
Runtime	--x3	No	Show full command line arguments.
Runtime	--logfile LOGFILE	No	Path to the log file. If not specified, messages will only be written to the standard error output.
Runtime	--tmp-dir TMP_DIR	No	Full path to the directory where temporary files will be stored. (default: .)
Runtime	--with-petagene-dir WITH_PETAGENE_DIR	No	Full path to the PetaGene installation directory. By default, this should have been installed at /opt/petagene. Use of this option also requires that the PetaLink library has been preloaded by setting the LD_PRELOAD environment variable. Optionally set the PETASUITE_REFPATH and PGCLOUD_CREDPATH environment variables that are used for data and credentials. Optionally set the PetaLinkMode environment variable that is used to further configure PetaLink, notably setting it to "+write" to enable outputting compressed BAM and .fastq files.
Runtime	--keep-tmp	No	Do not delete the directory storing temporary files after completion.
Runtime	--no-seccomp-override	No	Do not override seccomp options for docker.
Runtime	--version	No	View compatible software versions.
Runtime	--num-gpus NUM_GPUS	No	Number of GPUs to use for a run. (default: 1)
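
Because -L can be used multiple times, a list of regions can be expanded into the corresponding options programmatically. A minimal sketch, assuming the regions are passed as separate arguments; interval_args is a hypothetical helper and is not part of pbrun.

```shell
# Hypothetical helper: turn a list of regions into repeated -L options.
interval_args() {
    args=""
    for region in "$@"; do
        args="$args -L $region"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${args# }"
}

interval_args chr1 chr2:10000 chr4:10000-20000
```

The printed string can be appended to a pbrun mutectcaller command line; interval files should still go through --interval-file, as the table notes.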