Getting the Best Performance
NVIDIA Parabricks software performs best when it has access to all of the required computing resources and the system meets the Installation Requirements.
Refer to the Hardware Requirements section for minimum hardware requirements.
Refer to the Software Requirements section for minimum software requirements.
The goal of NVIDIA Parabricks software is to deliver the highest performance for bioinformatics and genomic analysis. A few key system options can be tuned to achieve maximum performance.
Use a Fast SSD
Parabricks software operates with two kinds of files:
Input/output files specified by the user
Temporary files created during execution and deleted at the end of the run
The best performance is achieved when both kinds of files are on a fast, local SSD.
If this is not possible, place the input/output files on a fast network storage device and the temporary files on a local SSD using the --tmp-dir option.
Tests have shown that you can use up to 4 GPUs and still get good performance with input/output files on a Lustre network filesystem. If you plan to use more than 4 GPUs, we highly recommend using local SSDs for all files.
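As a minimal sketch of this layout, assuming the reads and reference sit on network storage mounted at /mnt/netapp and a local SSD is mounted at /mnt/nvme (both paths are hypothetical placeholders for your own mounts):

```shell
# Input/output files on network storage; temporary files redirected to a local SSD
pbrun fq2bam \
    --ref /mnt/netapp/Ref.fa \
    --in-fq /mnt/netapp/S1_1.fastq.gz /mnt/netapp/S1_2.fastq.gz \
    --out-bam /mnt/netapp/out.bam \
    --tmp-dir /mnt/nvme/tmp
```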
DGX Users
The DGX comes with an SSD, usually mounted on /raid. Use a directory on this disk as the --tmp-dir. For initial testing, you can even copy the input files to this disk to eliminate variability in performance.
In certain cases, Transparent HugePage Support (THP) has been found to increase performance. Consider enabling THP and testing performance on benchmark cases.
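As a sketch, THP status can be checked and enabled via sysfs (paths are standard on most Linux distributions; changing the setting requires root privileges and does not persist across reboots):

```shell
# Show the current THP mode; the active setting is displayed in brackets,
# e.g. "always [madvise] never"
cat /sys/kernel/mm/transparent_hugepage/enabled

# Enable THP system-wide for benchmarking (requires root)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```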
Specifying which GPUs to use
You can choose the number of GPUs to use with the command-line option --num-gpus N for those tools that use GPUs. With this option, only the first N GPUs listed in the output of nvidia-smi will be used.
To use specific GPUs, set the environment variable NVIDIA_VISIBLE_DEVICES. GPUs are numbered starting with zero. For example, this command will use only the second (GPU #1) and fourth (GPU #3) GPUs:
$ NVIDIA_VISIBLE_DEVICES="1,3" pbrun fq2bam --num-gpus 2 --ref Ref.fa --in-fq S1_1.fastq.gz --in-fq S1_2.fastq.gz
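To see which index corresponds to which physical GPU, list the devices with nvidia-smi; the variable can also be exported once so the restriction applies to every subsequent command in the shell session:

```shell
# List GPUs with their indices (requires the NVIDIA driver; skipped if absent)
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L || true

# Restrict all subsequent pbrun invocations in this shell to GPUs 1 and 3
export NVIDIA_VISIBLE_DEVICES="1,3"
```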
This section details guidelines specific to individual tools.
Best Performance for Germline Pipeline
The Germline Pipeline uses the fq2bam and haplotypecaller tools. For parameter choices, refer to the Best Performance for fq2bam and Best Performance for Haplotypecaller sections. We recommend using the following command for best performance.
On an H100 DGX the germline pipeline typically runs in under ten minutes.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir \
--env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun germline \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--bwa-cpu-thread-pool 16 \
--out-variants /outputdir/out.vcf \
--run-partition \
--read-from-tmp-dir \
--gpusort \
--gpuwrite \
--keep-tmp
Best Performance for Deepvariant Germline Pipeline
The DeepVariant Germline Pipeline uses the fq2bam and deepvariant tools. For parameter choices, refer to the Best Performance for fq2bam and Best Performance for Deepvariant sections. We recommend using the following command for best performance.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir \
--env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun deepvariant_germline \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--bwa-cpu-thread-pool 16 \
--out-variants /outputdir/out.vcf \
--run-partition \
--read-from-tmp-dir \
--gpusort \
--gpuwrite \
--keep-tmp
Best Performance for PacBio Germline Pipeline
The PacBio Germline Pipeline runs minimap2 for alignment and deepvariant for variant calling.
For more information, refer to sections Best Performance for Minimap2 and Best Performance for Deepvariant for parameter choices. We recommend using the following command for best performance.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun pacbio_germline \
--ref /workdir/${REFERENCE_FILE} \
--in-fq /workdir/${INPUT_FASTQ} \
--out-bam /outputdir/${OUTPUT_BAM} \
--out-variants /outputdir/out.vcf \
--max-queue-reads 5000000 \
--max-queue-chunks 10000 \
--run-partition \
--read-from-tmp-dir \
--num-streams-per-gpu 4 \
--gpusort \
--gpuwrite \
--keep-tmp
Best Performance for Pangenome Germline Pipeline
The Pangenome Germline Pipeline runs giraffe for pangenome alignment and pangenome_aware_deepvariant for variant calling.
For more information on tuning each stage, refer to sections Best Performance for Giraffe and Best Performance for Deepvariant for parameter choices.
The index files used below are generated in the file generation section of the pangenome germline documentation.
The key performance parameters are:
Giraffe (alignment): --nstreams, --num-cpu-threads-per-gpu, and --minimizers-gpu. For GPUs with less than 22 GB of device memory, use --low-memory.
Pangenome-aware DeepVariant (variant calling): --run-partition for multi-GPU systems.
We recommend using the following command for best performance on H100:
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir \
--env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun pangenome_germline \
--ref /workdir/hprc-v1.1-mc-grch38.d9.fa \
--gbz-name /workdir/hprc-v1.1-mc-grch38.d9.gbz \
--dist-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.dist \
--minimizer-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.withzip.min \
--zipcodes-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.zipcodes \
--ref-paths /workdir/hprc-v1.1-mc-grch38.d9.paths.sub \
--in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
--out-bam /outputdir/${OUTPUT_BAM} \
--out-variants /outputdir/${OUTPUT_VCF} \
--nstreams 5 \
--num-cpu-threads-per-gpu 24 \
--minimizers-gpu \
--run-partition
Best Performance for fq2bam
Parabricks fq2bam automatically uses an optimal number of streams based on the GPU's device memory specifications (by default --bwa-nstreams auto).
You can experiment further with the --bwa-nstreams and --bwa-cpu-thread-pool parameters to potentially achieve better performance.
To achieve optimal performance during the sorting, duplicate marking, and compression phases, we recommend --gpusort and --gpuwrite: --gpusort uses GPUs to accelerate sorting and duplicate marking, and --gpuwrite uses one GPU to accelerate BAM or CRAM compression.
For CPU-bound workloads, try --cigar-on-gpu to offload CIGAR generation from the CPU (the default) to the GPU, which may improve overall runtime. Consider enabling it on CPU-constrained systems, or when benchmarking with and without the option indicates the CPU is a bottleneck.
Below is our recommendation for best performance.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun fq2bam \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--bwa-cpu-thread-pool 16 \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort \
--gpuwrite
Best Performance for Deepvariant
DeepVariant from Parabricks can use multiple streams per GPU. The number of streams that can be used depends on the available resources. The default number of streams is auto, which selects an optimal configuration, based on prior testing, as a function of the GPU's device memory.
However, the number of streams can be increased up to a maximum of six with the --num-streams-per-gpu parameter to potentially get better performance. Experiment with the number of streams until you find the optimal value for your system.
The --run-partition parameter is used because it splits work more efficiently across multiple GPUs on multi-GPU systems. If you are using fewer than two GPUs, you can omit it.
We recommend using the following command for best performance.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun deepvariant \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-bam /outputdir/fq2bam_output.bam \
--out-variants /outputdir/out.vcf \
--run-partition
Best Performance for Haplotypecaller
Use the --run-partition parameter, as it splits work more efficiently across multiple GPUs on multi-GPU systems. If you are using fewer than two GPUs, you can omit it. The --no-alt-contigs flag ignores all contigs after chrM.
We recommend using the following command for best performance.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun haplotypecaller \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-bam /outputdir/fq2bam_output.bam \
--out-variants /outputdir/out.vcf \
--num-htvc-threads 8 \
--no-alt-contigs \
--run-partition
Best Performance for Minimap2
Parabricks minimap2 automatically sets optimal parameters based on which --preset is selected and how many GPUs are detected on the system.
The --chunk-size parameter can have a large impact on runtime performance and memory usage.
With 8 GPUs, a value of 1000 is most efficient; with fewer than 8 GPUs, 5000 potentially provides better performance.
While setting this value higher may result in better performance,
higher values will use much more host memory. The parameters --max-queue-reads and --max-queue-chunks serve to reduce host memory usage by limiting workloads between different stages of processing. Increasing these values
will lift the restrictions and might provide some additional speedup at the cost of higher host memory usage.
The following command line options provided optimal performance for PacBio data.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun minimap2 \
--ref /workdir/${REFERENCE_FILE} \
--in-fq /workdir/${INPUT_FASTQ} \
--out-bam /outputdir/${OUTPUT_BAM} \
--max-queue-reads 5000000 \
--max-queue-chunks 10000 \
--gpusort \
--gpuwrite
The following command line options provided optimal performance for splice and splice:hq presets.
Decrease --num-threads to reduce CPU memory usage.
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun minimap2 \
--preset splice \
--ref /workdir/${REFERENCE_FILE} \
--in-fq /workdir/${INPUT_FASTQ} \
--out-bam /outputdir/${OUTPUT_BAM} \
--num-threads 148 \
--max-queue-reads 5000000 \
--gpusort \
--gpuwrite
Best Performance for Giraffe
By default, Giraffe uses auto mode (--nstreams auto) to configure streams,
batch size, and GPU acceleration options based on available GPU memory.
For a full description of auto mode and manual tuning options, see
System Requirements and Useful Options for Performance.
While auto mode provides sensible defaults, for best performance we recommend overriding it with GPU-specific configurations. The key parameters for tuning performance are:
--nstreams: Controls the number of CUDA streams per GPU. More streams increase throughput but require more device and host memory.
--num-cpu-threads-per-gpu: Controls the number of CPU worker threads per GPU. Given the heterogeneous nature of Parabricks Giraffe, increasing this value can improve performance when CPU processing is the bottleneck. The default is 16.
--minimizers-gpu: Enables computation of minimizers and seeds on the GPU (single-end reads only). This can improve performance on GPUs with sufficient memory.
Example configurations:
For GPUs with less than 22 GB of device memory, use --low-memory.
A100 (40 GB): --nstreams 3 --num-cpu-threads-per-gpu 24
H100 (80 GB): --nstreams 5 --num-cpu-threads-per-gpu 24 --minimizers-gpu
RTX PRO 6000 Blackwell Server Edition (96 GB): --nstreams 4 --num-cpu-threads-per-gpu 24 --minimizers-gpu
The index files used below are generated in the index generation section of the Giraffe documentation.
The following example uses H100 settings:
$ # This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun giraffe --read-group "sample_rg1" \
--sample "sample-name" --read-group-library "library" \
--read-group-platform "platform" --read-group-pu "pu" \
--gbz-name /workdir/hprc-v1.1-mc-grch38.d9.gbz \
--dist-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.dist \
--minimizer-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.withzip.min \
--zipcodes-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.zipcodes \
--ref-paths /workdir/hprc-v1.1-mc-grch38.d9.paths.sub \
--in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
--out-bam /outputdir/${OUTPUT_BAM} \
--nstreams 5 \
--num-cpu-threads-per-gpu 24 \
--minimizers-gpu
For additional performance improvements and higher final BAM writing bandwidth, use GPUDirect Storage (GDS), part of the CUDA toolkit. Note that the system must be set up to support GDS.
The following are references for setting up and using GDS:
# Using GDS with the convenience docker wrapper.
$ wget https://raw.githubusercontent.com/NVIDIA/MagnumIO/main/gds/docker/gds-run-container
$ chmod +x gds-run-container
$ ./gds-run-container run \
--rm \
--gpus all \
--enable-mofed \
--enable-gds \
--volume INPUT_DIR:/workdir \
--volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
--env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun fq2bam \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort \
--gpuwrite \
--use-gds
# Using GDS without the wrapper.
$ docker run \
--ipc host \
--volume /run/udev:/run/udev:ro \
--device=/dev/nvidia-fs0 \
--device=/dev/nvidia-fs1 \
--device=/dev/nvidia-fs2 \
--device=/dev/nvidia-fs3 \
--device=/dev/nvidia-fs4 \
--device=/dev/nvidia-fs5 \
--device=/dev/nvidia-fs6 \
--device=/dev/nvidia-fs7 \
--device=/dev/nvidia-fs8 \
--device=/dev/nvidia-fs9 \
--device=/dev/nvidia-fs10 \
--device=/dev/nvidia-fs11 \
--device=/dev/nvidia-fs12 \
--device=/dev/nvidia-fs13 \
--device=/dev/nvidia-fs14 \
--device=/dev/nvidia-fs15 \
--rm \
--gpus all \
--enable-mofed \
--volume INPUT_DIR:/workdir \
--volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
--env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun fq2bam \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort \
--gpuwrite \
--use-gds
Parabricks tools are hybrid CPU+GPU accelerated applications. To ensure the best performance and keep the GPUs fully utilized, it is important that the CPU be set up for maximum performance.
Recommendations for NVIDIA Grace CPUs can be found here: https://nvidia.github.io/grace-cpu-benchmarking-guide/platform/.
Below are commands that may be useful for achieving the best performance on your CPU. We have found these commands to be most useful with x86-64 CPUs.
Note: The following commands are intended for dedicated, thermally adequate x86-64 servers where maximum performance is desired and power/thermal headroom is available. If this is a shared system, a system administrator may need to perform these actions and ensure that they persist upon reboot.
Prerequisites
The cpupower and cpufreq-info tools may be required for CPU performance tuning. Check your Linux distribution for the appropriate packages.
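As a sketch, the typical packages for the two most common distribution families are shown below; exact package names vary by distribution and kernel version, so treat these as starting points:

```shell
# Debian/Ubuntu: cpupower ships in the linux-tools packages
sudo apt-get install linux-tools-common "linux-tools-$(uname -r)"

# RHEL/Fedora: cpupower ships in kernel-tools
sudo dnf install kernel-tools
```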
Check Current Settings
Before making changes, check your current CPU configuration:
# Check the current performance bias (Intel CPUs only)
sudo cpupower info
# Example output:
# analyzing CPU 0:
# perf-bias: 6
# The range of valid numbers is 0-15, where 0 is maximum performance and 15 is maximum energy efficiency
# Check available cpupower commands
cpupower help set
# Check current CPU governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
Apply Performance Settings
Apply the following settings to maximize CPU performance:
# 1. Set the CPU governor to performance mode
sudo cpupower frequency-set -g performance
# 2. Set performance bias to maximum performance
# Note: This may not be available on AMD CPUs
sudo cpupower set -b 0
# 3. Set CPU frequency to maximum
# Use only one option; do not run both
# Option 1: Using cpupower
# This locks both minimum and maximum frequency to the highest available
# Define MAXFREQ using cpupower or another tool:
# CPU frequencies for each core can be listed with `cpupower -c all frequency-info -l`
MAXFREQ=$(cpupower frequency-info -l | awk '{print $2}' | tail -n1)
sudo cpupower frequency-set --max ${MAXFREQ}
sudo cpupower frequency-set --min ${MAXFREQ}
# Option 2: Directly writing to sysfs
# This may not be available on all systems and may be overridden by the BIOS
# /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq can also be used to read the maximum frequency
CPUFREQ=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies | tr ' ' '\n' | sort -n | tail -1)
for NODE in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
do
echo $CPUFREQ | sudo tee $NODE
done
for NODE in /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
do
echo $CPUFREQ | sudo tee $NODE
done
Verify Settings
After applying the settings, verify that they are active:
# Verify the governor is set to performance
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq
# Check current CPU frequencies (should show maximum frequency)
grep MHz /proc/cpuinfo
# Verify performance bias setting
sudo cpupower info
Note: These settings are not persistent across reboots. Further work will be needed to make them persistent.
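One way to make the governor persistent, assuming a systemd-based distribution, is a oneshot unit that reapplies the setting at boot. This is an illustrative sketch; the unit name, description, and cpupower binary path are assumptions to adapt for your system:

```shell
# Illustrative systemd unit that sets the performance governor at boot (requires root)
sudo tee /etc/systemd/system/cpu-performance.service <<'EOF'
[Unit]
Description=Set CPU governor to performance at boot

[Service]
Type=oneshot
ExecStart=/usr/bin/cpupower frequency-set -g performance

[Install]
WantedBy=multi-user.target
EOF

# Enable the unit so it runs on every boot
sudo systemctl enable cpu-performance.service
```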