VARIANT CALLERS

NVIDIA Clara Parabricks Pipelines accelerated variant callers

BCFTOOLS CALL

Accelerated bcftools call.

bcftoolscall calls variants from mpileup output.

QUICK START

$ pbrun bcftoolscall --in-file pileup.bcf \
--out-file output.vcf

COMPATIBLE CPU COMMAND

The command below is the CPU counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

bcftools call pileup.bcf -c -o output.vcf
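One simple way to check that two VCF outputs agree is to compare their records while skipping header lines, which often differ in the recorded command line. The sketch below assumes uncompressed VCF files; same_vcf_records is a hypothetical helper name, and the Output Comparison page may describe a different procedure.

```shell
# Compare two uncompressed VCF files, ignoring header lines (those that
# start with '#'). Returns 0 when the record lines are byte-identical.
same_vcf_records() {
    a_body=$(mktemp) || return 2
    b_body=$(mktemp) || return 2
    # grep exits non-zero when a file has no records; that is not an error here.
    grep -v '^#' "$1" > "$a_body" || true
    grep -v '^#' "$2" > "$b_body" || true
    cmp -s "$a_body" "$b_body"
    status=$?
    rm -f "$a_body" "$b_body"
    return $status
}

# Usage: same_vcf_records gpu_output.vcf cpu_output.vcf && echo "records match"
```

Comparing only record lines keeps the check strict on variant content while tolerating header differences between the two tools.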

OPTIONS

--in-file

Path to the input mpileup file (default: None)

--out-file

Path to the output file. If this option is not given, output is written to standard output (default: None)

--num-threads

Number of threads for worker (default: 1)

--variant-sites

Output variant sites only (default: None)

HAPLOTYPECALLER

GPU accelerated haplotypecaller.

This tool runs GPU accelerated haplotypecaller. Users can optionally provide a BQSR report to recalibrate the BAM, similar to ApplyBQSR. In that case the updated base qualities will be used.

../_images/parabricks-web-graphics-1259949-r2-haplotypecaller.svg

QUICK START

$ pbrun haplotypecaller --ref Ref/Homo_sapiens_assembly38.fasta \
--in-bam mark_dups_gpu.bam \
--in-recal-file recal_gpu.txt \
--out-variants result.vcf

COMPATIBLE GATK4 COMMAND

The commands below are the GATK4 counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

# Run ApplyBQSR Step
$ gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \
-I=mark_dups_cpu.bam --bqsr-recal-file=recal_file.txt -O=cpu_nodups_BQSR.bam

# Run HaplotypeCaller Step
$ gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \
result_cpu.vcf --reference Ref/Homo_sapiens_assembly38.fasta \
--native-pair-hmm-threads 16

OPTIONS

--ref

(required) The reference genome in fasta format.

--in-bam

(required) Path to the input bam file.

--out-variants

(required) Path of .vcf, g.vcf, or gvcf file.

--in-recal-file

Path to the input BQSR report. Only required if ApplyBQSR step is needed.

--haplotypecaller-options

Pass supported haplotype caller options as one string. Current original haplotypecaller supported options: -min-pruning , -standard-min-confidence-threshold-for-calling , -max-reads-per-alignment-start , -min-dangling-branch-length , and -pcr-indel-model .

--static-quantized-quals

Use static quantized quality scores to a given number of levels. Repeat this option multiple times for multiple bins.

--ploidy

Defaults to 2.

Ploidy assumed for the bam file. Currently only haploid (ploidy 1) and diploid (ploidy 2) are supported.

--interval-file

Path to an interval file for BQSR step with possible formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times (default: None)

--interval

(-L) Interval within which to call variants from the input reads. All intervals will have a padding of 100 to get read records and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times.

e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000" (default: None)
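Options that can be used multiple times, such as -L, are passed by repeating the flag once per value. The sketch below builds such an argument list and prints the resulting command instead of running it; repeat_flag is a hypothetical helper, and the values must not contain whitespace.

```shell
# Print FLAG once per value, e.g. "-L chr1 -L chr2:10000".
repeat_flag() {
    flag=$1
    shift
    out=""
    for value in "$@"; do
        # Add a separating space only when out is already non-empty.
        out="${out:+$out }$flag $value"
    done
    printf '%s\n' "$out"
}

interval_args=$(repeat_flag -L chr1 chr2:10000 chr3:20000+ chr4:10000-20000)
# Dry run: show the command that would be executed.
echo "pbrun haplotypecaller --ref Ref/Homo_sapiens_assembly38.fasta --in-bam mark_dups_gpu.bam --out-variants result.vcf $interval_args"
```

The same helper would work for other repeatable options, for example --static-quantized-quals or --disable-read-filter.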

--interval-padding

(-ip) Padding size (in base pairs) to add to each interval you are including (default: None)
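To illustrate what padding does to an interval, the sketch below widens a CONTIG:START-END interval by a given number of bases. It assumes the padding is applied symmetrically and that the start is clamped at position 1; the tool may also clamp at the contig end, which is not modeled here. pad_interval is a hypothetical helper.

```shell
# Pad an interval of the form CONTIG:START-END by PAD bases on each side.
pad_interval() {
    contig=${1%:*}      # everything before the last ':'
    range=${1##*:}      # START-END after the last ':'
    start=${range%-*}
    end=${range#*-}
    pad=$2
    new_start=$((start - pad))
    if [ "$new_start" -lt 1 ]; then
        new_start=1     # assumed clamp: positions are 1-based
    fi
    echo "${contig}:${new_start}-$((end + pad))"
}

pad_interval chr4:10000-20000 100   # prints chr4:9900-20100
```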

--gvcf

Defaults to False.

Generate variant calls in gvcf format. When using this option, the --out-variants file should end with g.vcf or g.vcf.gz. If the --out-variants file ends in gz, the tool will generate gvcf.gz and an index for it.
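Because the output name must carry the right suffix when --gvcf is used, a wrapper script can check the name before launching a long run. The sketch below is one way to do that; is_gvcf_name is a hypothetical helper, and it is slightly stricter than the text above in that it requires a dot before g.vcf.

```shell
# Succeed if NAME ends in .g.vcf or .g.vcf.gz.
is_gvcf_name() {
    case $1 in
        *.g.vcf|*.g.vcf.gz) return 0 ;;
        *) return 1 ;;
    esac
}

out=sample.g.vcf.gz
if is_gvcf_name "$out"; then
    # Dry run: show the command that would be executed.
    echo "pbrun haplotypecaller --ref Ref/Homo_sapiens_assembly38.fasta --in-bam mark_dups_gpu.bam --gvcf --out-variants $out"
else
    echo "refusing to run: $out does not end in g.vcf or g.vcf.gz" >&2
fi
```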

--batch

Given an input list of BAMs, run the variant calling of each BAM using one GPU, and process BAMs in parallel based on how many GPUs the system has.

--disable-read-filter

Disable the read filters for bam entries. Currently supported read filters that can be disabled are: MappingQualityAvailableReadFilter, MappingQualityReadFilter, and NotSecondaryAlignmentReadFilter. This option can be repeated multiple times.

--max-alternate-alleles

Maximum number of alternate alleles to genotype (default: None)

--annotation-group

(-G) Which groups of annotations to add to the output variant calls. Currently supported annotation groups: StandardAnnotation, StandardHCAnnotation, AS_StandardAnnotation (default: None)

--gvcf-gq-bands

(-GQB) Exclusive upper bounds for reference confidence GQ bands. Must be in the range [1, 100] and specified in increasing order (default: None)
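The constraints on GQ bands can be checked before a run. The sketch below validates a list of candidate bands; it assumes "increasing order" means strictly increasing, and valid_gq_bands is a hypothetical helper that does not call pbrun.

```shell
# Succeed only if every band is an integer in [1, 100] and the list is
# strictly increasing.
valid_gq_bands() {
    prev=0
    for band in "$@"; do
        case $band in
            ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric values
        esac
        [ "$band" -ge 1 ] && [ "$band" -le 100 ] && [ "$band" -gt "$prev" ] || return 1
        prev=$band
    done
    return 0
}

if valid_gq_bands 10 20 30 40 60 80; then
    echo "GQ bands look valid"
fi
```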

--tmp-dir

Defaults to . (the current directory).

Full path to the directory where temporary files will be stored.

--num-gpus

Defaults to number of GPUs in the system.

The number of GPUs to be used for this analysis task.

--gpu-devices

Which GPU devices to use for a run. By default, all GPU devices will be used. To set specific GPU devices, enter a comma-separated list of GPU device numbers.

MUTECTCALLER

GPU accelerated mutect2.

mutectcaller supports tumor or tumor-normal variant calling. The figure below shows high level functionality of mutectcaller. All dotted boxes are optional with some constraints.

../_images/parabricks-web-graphics-1259949-r2-mutecaller.svg

QUICK START

$ pbrun mutectcaller --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--tumor-name foobar \
--out-vcf output.vcf

COMPATIBLE GATK4 COMMAND

The command below is the GATK4 counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

gatk Mutect2 -R ref.tar.gz --input tumor.bam --tumor-sample foobar --output result.vcf

OPTIONS

--ref

(required) The reference genome in fasta format. We assume that the indexing required to run bwa has been completed by the user.

--in-tumor-bam

(required) Path of bam file for tumor reads.

--tumor-name

(required) Name of sample for tumor reads.

--out-vcf

(required) Path to the VCF output file.

--in-tumor-recal-file

Path of BQSR report for tumor sample.

--in-normal-bam

Path of bam file for normal reads.

--in-normal-recal-file

Path of BQSR report for normal sample.

--normal-name

Name of sample for normal reads.

--ploidy

Ploidy assumed for the bam file. Currently only haploid (ploidy 1) and diploid (ploidy 2) are supported.

--interval-file

Path to an interval file for BQSR step with possible formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times (default: None)

--interval

(-L) Interval within which to call variants from the input reads. All intervals will have a padding of 100 to get read records and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times.

e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000" (default: None)

--interval-padding

(-ip) Padding size (in base pairs) to add to each interval you are including (default: None)

--mutectcaller-options

Pass supported mutectcaller options as one string. Currently supported original mutectcaller options: -pcr-indel-model <NONE, HOSTILE, AGGRESSIVE, CONSERVATIVE>. e.g. --mutectcaller-options="-pcr-indel-model HOSTILE" (default: None)

--tmp-dir

Defaults to . (the current directory).

Full path to the directory where temporary files will be stored.

--num-gpus

Defaults to number of GPUs in the system.

The number of GPUs to be used for this analysis task.

--gpu-devices

Which GPU devices to use for a run. By default, all GPU devices will be used. To set specific GPU devices, enter a comma-separated list of GPU device numbers.
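The quick start for mutectcaller shows tumor-only calling. A tumor-normal run adds --in-normal-bam and --normal-name from the options listed in this section. The sketch below prints either form of the command without running it; mutect_cmd is a hypothetical helper and foobar_normal is a placeholder sample name.

```shell
# Print (dry run) a pbrun mutectcaller command. The normal BAM and normal
# sample name are appended only when both are supplied.
mutect_cmd() {
    tumor_bam=$1
    tumor_name=$2
    normal_bam=${3:-}
    normal_name=${4:-}
    cmd="pbrun mutectcaller --ref Ref/Homo_sapiens_assembly38.fasta"
    cmd="$cmd --in-tumor-bam $tumor_bam --tumor-name $tumor_name"
    if [ -n "$normal_bam" ] && [ -n "$normal_name" ]; then
        cmd="$cmd --in-normal-bam $normal_bam --normal-name $normal_name"
    fi
    echo "$cmd --out-vcf output.vcf"
}

mutect_cmd tumor.bam foobar                            # tumor only
mutect_cmd tumor.bam foobar normal.bam foobar_normal   # tumor-normal
```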

VARSCAN

Accelerated Varscan.

Varscan supports tumor or tumor-normal variant calling. Parabricks provides varscan as a standalone tool, or you can use the varscan workflow (varscanworkflow) to generate a VCF from BAM/CRAM.

QUICK START

$ pbrun varscan --in-file sample.pileup --out-prefix output

COMPATIBLE CPU COMMAND

The command below is the CPU counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

java -jar VARSCAN_JAR somatic sample.pileup sample.caller -mpileup 1  output

OPTIONS

--in-file

Path to the input mpileup file (default: None)

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)

VARSCAN WORKFLOW

varscan workflow to generate VCF from BAM/CRAM input files.

../_images/varscan.png

QUICK START

$ pbrun varscan_workflow --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--in-normal-bam normal.bam \
--out-prefix output

COMPATIBLE CPU COMMAND

The commands below are the CPU counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

samtools mpileup -B -f reference.fasta -q 1 -d 0 normal.bam tumor.bam | java -jar VARSCAN_JAR somatic - sample.caller -mpileup 1

java -jar VARSCAN_JAR processSomatic sample.caller.snp

java -jar VARSCAN_JAR processSomatic sample.caller.indel

python vs_format_converter.py sample.caller.snp.Somatic.hc > sample.caller.snp.vcf

python vs_format_converter.py sample.caller.indel.Somatic.hc > sample.caller.indel.vcf

OPTIONS

--ref

(required) The reference genome in fasta format. We assume that the indexing required to run bwa has been completed by the user.

--in-tumor-bam

(required) Path of bam file for tumor reads.

--in-normal-bam

Path of bam file for normal reads.

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)

--min-mapq

Filtering reads with mapping quality less than this value (default: 1)

SOMATICSNIPER

Accelerated Somatic Sniper.

Somatic sniper supports tumor-normal variant calling. Parabricks provides somatic sniper as a standalone tool, or you can use the somatic sniper workflow (sniperworkflow) to generate a VCF from BAM/CRAM.

QUICK START

$ pbrun somaticsniper --ref Ref/Homo_sapiens_assembly38.fasta --in-tumor-bam tumor.bam --in-normal-bam normal.bam --out-file output.vcf

COMPATIBLE CPU COMMAND

The command below is the CPU counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

bam-somaticsniper -q 1 -G -L -F vcf -f  Ref/Homo_sapiens_assembly38.fasta  tumor.bam normal.bam output.vcf

OPTIONS

--ref

Path to the reference file (default: None)

--in-tumor-bam

Path of bam/cram file for tumor reads. Path can be a Google Cloud Storage object (default: None)

--in-normal-bam

Path of bam/cram file for normal reads. Path can be a Google Cloud Storage object (default: None)

--out-file

Path of output file (default: None)

--num-threads

Number of threads for worker (default: 1)

--min-mapq

Filtering reads with mapping quality less than this value (default: 0)

--out-format

Type of output format. Possible values are {classic, vcf} (default: classic)

--correct

Fix baseline bugs. If this option is not passed, the same output will be generated as baseline (default: None)

--no-gain

Do not report Gain of Reference variants as determined by genotypes (default: None)

--no-loh

Do not report LOH variants as determined by genotypes (default: None)
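The bam-somaticsniper command shown earlier passes -q 1, -G, -L, and -F vcf, while the pbrun quick start relies on the defaults listed above (--min-mapq 0, --out-format classic). The sketch below translates those four flags into the pbrun options whose descriptions match them. The mapping is an assumption inferred from the option descriptions, not a documented equivalence, and sniper_to_pbrun is a hypothetical helper.

```shell
# Translate a few bam-somaticsniper flags into pbrun somaticsniper options.
sniper_to_pbrun() {
    out=""
    while [ $# -gt 0 ]; do
        case $1 in
            -q) out="$out --min-mapq $2"; shift ;;    # minimum mapping quality
            -F) out="$out --out-format $2"; shift ;;  # classic or vcf
            -G) out="$out --no-gain" ;;               # skip Gain of Reference
            -L) out="$out --no-loh" ;;                # skip LOH variants
            *)  echo "unmapped flag: $1" >&2; return 1 ;;
        esac
        shift
    done
    echo "${out# }"   # drop the leading space
}

sniper_to_pbrun -q 1 -G -L -F vcf
# prints: --min-mapq 1 --no-gain --no-loh --out-format vcf
```

Passing the translated options explicitly makes the two command lines easier to compare side by side.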

SOMATICSNIPER WORKFLOW

Somatic sniper workflow to generate VCF from BAM/CRAM input files.

../_images/sniper.png

QUICK START

$ pbrun somaticsniper_workflow --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--in-normal-bam normal.bam \
--out-prefix output

COMPATIBLE CPU COMMAND

The commands below are the CPU counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

bam-somaticsniper -q 1 -G -L -F vcf -f  Ref/Homo_sapiens_assembly38.fasta  tumor.bam normal.bam output.vcf

bcftools mpileup -A -B -d 2147483647 -Ou -f Ref/Homo_sapiens_assembly38.fasta tumor.bam | bcftools call -c | vcfutils.pl varFilter -Q 20 | awk 'NR > 55 {print}' > output.indel_pileup_Tum.pileup

perl snpfilter.pl --snp-file output.vcf --indel-file output.indel_pileup_Tum.pileup

perl prepare_for_readcount.pl --snp-file output.vcf.SNPfilter

bam-readcount -b 15 -f  Ref/Homo_sapiens_assembly38.fasta -l output.vcf.SNPfilter.pos tumor.bam > output.readcounts.rc

perl fpfilter.pl -snp-file output.vcf.SNPfilter -readcount-file output.readcounts.rc

perl highconfidence.pl -snp-file output.vcf.SNPfilter.fp_pass.vcf

OPTIONS

--ref

(required) The reference genome in fasta format. We assume that the indexing required to run bwa has been completed by the user.

--in-tumor-bam

(required) Path of bam file for tumor reads.

--in-normal-bam

Path of bam file for normal reads.

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)

--min-mapq

Filtering reads with mapping quality less than this value (default: 1)

DEEPVARIANT

Run GPU-accelerated deepvariant algorithm.

Parabricks has accelerated Google DeepVariant to make extensive use of GPUs and finish a 30x WGS analysis in 25 minutes. The Parabricks flavor of DeepVariant behaves more like other command-line tools that users are familiar with: it takes the BAM and reference as inputs and produces variants as outputs. In future versions, we will allow users to choose the exact model to use.

QUICK START

$ pbrun deepvariant --ref Ref/Homo_sapiens_assembly38.fasta \
--in-bam mark_dups_gpu.bam \
--out-variants output.vcf

COMPATIBLE GOOGLE DEEPVARIANT COMMANDS

The commands below are the Google DeepVariant counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

# Run make_examples in parallel
seq 0 $((N_SHARDS-1)) | \
parallel --eta --halt 2 --joblog "${LOGDIR}/log" --res "${LOGDIR}" \
sudo docker run \
-v ${HOME}:${HOME} \
gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/make_examples \
--mode calling \
--ref "${REF}" \
--reads "${BAM}" \
--examples "${OUTPUT_DIR}/examples.tfrecord@${N_SHARDS}.gz" \
--task {}

# Run call_variants
sudo docker run \
-v ${HOME}:${HOME} \
gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/call_variants \
--outfile "${CALL_VARIANTS_OUTPUT}" \
--examples "${OUTPUT_DIR}/examples.tfrecord@${N_SHARDS}.gz" \
--checkpoint "${MODEL}"

# Run postprocess_variants
sudo docker run \
-v ${HOME}:${HOME} \
gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/postprocess_variants \
--ref "${REF}" \
--infile "${CALL_VARIANTS_OUTPUT}" \
--outfile "${FINAL_OUTPUT_VCF}"
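In the Google commands above, the examples.tfrecord@${N_SHARDS}.gz spec stands for one file per make_examples task. The sketch below lists the per-shard file names that such a spec expands to, which can help when checking that every task finished. shard_names is a hypothetical helper, and the five-digit zero-padded naming is stated here as an assumption about DeepVariant's sharded file convention.

```shell
# List the file names behind a sharded spec such as examples.tfrecord@4.gz.
shard_names() {
    prefix=$1
    n_shards=$2
    suffix=$3
    i=0
    while [ "$i" -lt "$n_shards" ]; do
        printf '%s-%05d-of-%05d%s\n' "$prefix" "$i" "$n_shards" "$suffix"
        i=$((i + 1))
    done
}

shard_names examples.tfrecord 4 .gz
# first line printed: examples.tfrecord-00000-of-00004.gz
```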

OPTIONS

--ref

(required) The reference genome in fasta format.

--in-bam

(required) Path to the input BAM file.

--out-variants

(required) Name of output vcf file.

--pb-model-file

Path of a non-default parabricks model file for deepvariant.

--interval-file

Path to an interval file for BQSR step with possible formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times (default: None)

--interval

(-L) Interval within which to call variants from the input reads. All intervals will have a padding of 100 to get read records and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times.

e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000" (default: None)

--disable-use-window-selector-model

Change the window selector model from Allele Count Linear to Variant Reads. This option will increase the accuracy and runtime (default: None)

--gvcf

Generate variant calls in gvcf format.

--tmp-dir

Defaults to . (the current directory).

Full path to the directory where temporary files will be stored.

--num-gpus

Defaults to number of GPUs in the system.

The number of GPUs to be used for this analysis task.

--gpu-devices

Which GPU devices to use for a run. By default, all GPU devices will be used. To set specific GPU devices, enter a comma-separated list of GPU device numbers.

CNVKIT

CPU accelerated Copy number variant calling.

Run CNVkit with accelerated coverage calculation from read depths. CNVkit is not available as part of the free COVID-19 program.

QUICK START

$ pbrun cnvkit --ref Ref/Homo_sapiens_assembly38.fasta \
--in-bam mark_dups_gpu.bam \
--out-file output.vcf

OPTIONS

--ref

(required) Path to the reference file.

--in-bam

(required) Path to the bam file.

--out-file

Path to the output vcf file.

--cnvkit-options

Pass supported cnvkit options as one string.

e.g. --cnvkit-options="--count-reads --drop-low-coverage".

--generate-vcf

Export the output cns to vcf after running batch (default: None)

MANTA

Structural variant (SV) and indel caller from mapped paired-end sequencing reads. This tool is not accelerated.

QUICK START

$ pbrun manta --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--in-normal-bam normal.bam \
--out-prefix output

OPTIONS

--ref

Path to the reference file (default: None)

--in-tumor-bam

Path of bam file for tumor reads (default: None)

--in-normal-bam

Path of bam file for normal reads. This option can be used multiple times (default: None)

--bed

Optional bgzip-compressed/tabix-indexed BED file containing the set of regions to call (default: None)

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)

--manta-options

Pass supported manta options as one string. e.g. --manta-options="--rna --unstrandedRNA" (default: None)

STRELKA

SNP and indel caller from mapped paired-end sequencing reads. This tool is not accelerated.

QUICK START

$ pbrun strelka --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--in-normal-bam normal.bam \
--indel-candidates candidates.vcf \
--out-prefix output

OPTIONS

--ref

Path to the reference file (default: None)

--in-tumor-bam

Path of bam file for tumor reads (default: None)

--in-normal-bam

Path of bam file for normal reads. This option can be used multiple times (default: None)

--indel-candidates

Path to a VCF of candidate indel alleles. Must be in vcf/vcf.gz format. This option can be used multiple times (default: None)

--bed

Optional bgzip-compressed/tabix-indexed BED file containing the set of regions to call (default: None)

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)

--strelka-options

Pass supported strelka options as one string. e.g. --strelka-options="--exome" (default: None)

STRELKA WORKFLOW

Strelka workflow to generate VCF from BAM/CRAM input files.

../_images/strelka.png

QUICK START

$ pbrun strelka_workflow --ref Ref/Homo_sapiens_assembly38.fasta \
--in-tumor-bam tumor.bam \
--in-normal-bam normal.bam \
--out-prefix output

COMPATIBLE CPU COMMAND

The commands below are the CPU counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

mkdir -p manta_work

python $MANTA_DIR/bin/configManta.py --referenceFasta Ref/Homo_sapiens_assembly38.fasta \
    --normalBam normal.bam --tumorBam tumor.bam \
    --runDir manta_work

cd manta_work

python ./runWorkflow.py -m local -j ${MAX_NUM_PROCESSORS}

cd ..

mkdir -p strelka_work

python $STRELKA_PATH/configureStrelkaSomaticWorkflow.py \
    --referenceFasta Ref/Homo_sapiens_assembly38.fasta \
    --normalBam normal.bam --tumorBam tumor.bam \
    --indelCandidates ${WORK_PATH}/manta_work/results/variants/candidateSmallIndels.vcf.gz \
    --runDir strelka_work

cd strelka_work

python ./runWorkflow.py -m local -j ${MAX_NUM_PROCESSORS}

cd ..

OPTIONS

--ref

(required) The reference genome in fasta format. We assume that the indexing required to run bwa has been completed by the user.

--in-tumor-bam

(required) Path of bam file for tumor reads.

--in-normal-bam

Path of bam file for normal reads.

--out-prefix

Prefix filename for output data (default: None)

--num-threads

Number of threads for worker (default: 1)