
Getting the Best Performance

NVIDIA Parabricks software performs best when it is given all of the required computing resources, and the system should meet all of the Installation Requirements.

Refer to the Hardware Requirements section for minimum hardware requirements.

Refer to the Software Requirements section for minimum software requirements.

The goal of the NVIDIA Parabricks software is to deliver the highest performance for bioinformatics and genomic analysis. There are a few key system options that you can tune to achieve maximum performance.

Use a Fast SSD

Parabricks software operates with two kinds of files:

  • Input/output files specified by the user

  • Temporary files created during execution and deleted at the end of the run

The best performance is achieved when both kinds of files are on a fast, local SSD. If this is not possible, you can place the input/output files on fast network storage and the temporary files on a local SSD using the --tmp-dir option.

Note

Tests have shown that you can use up to 4 GPUs and still get good performance with input/output files on a Lustre network filesystem. If you plan to use more than 4 GPUs, we highly recommend using local SSDs for all kinds of files.

DGX Users

A DGX system comes with an SSD, usually mounted on /raid. Use this disk, and point --tmp-dir to a directory on it. For initial testing, you can even copy the input files to this disk to eliminate variability in performance.
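
For example, a minimal sketch that points --tmp-dir at the /raid SSD (the directory name and input/output file names here are placeholders, not required values):

# Create a scratch directory on the local SSD and use it for temporary files.
# /raid/parabricks_tmp and the file names below are placeholders.
$ mkdir -p /raid/parabricks_tmp
$ pbrun fq2bam \
    --ref Ref.fa \
    --in-fq S1_1.fastq.gz S1_2.fastq.gz \
    --out-bam S1.bam \
    --tmp-dir /raid/parabricks_tmp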

In certain cases, Transparent HugePage Support (THP) has been found to increase performance. Consider enabling THP and testing performance on benchmark cases.
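
As a sketch, on most Linux distributions you can check and change the THP setting through sysfs (the exact path and accepted values can vary by kernel):

# Show the current THP mode; the value in square brackets is the active one.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
# Enable THP system-wide (requires root); benchmark before and after the change.
$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled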

Specifying Which GPUs to Use

You can choose the number of GPUs to use with the command line option --num-gpus N for tools that use GPUs. With this option, only the first N GPUs listed in the output of nvidia-smi will be used.

To use specific GPUs set the environment variable NVIDIA_VISIBLE_DEVICES. GPUs are numbered starting with zero. For example, this command will use only the second (GPU #1) and fourth (GPU #3) GPUs:

$ NVIDIA_VISIBLE_DEVICES="1,3" pbrun fq2bam --num-gpus 2 --ref Ref.fa --in-fq S1_1.fastq.gz S1_2.fastq.gz

The following sections provide guidelines specific to individual tools.

Best Performance for Germline Pipeline

The germline pipeline uses the fq2bam and haplotypecaller tools. For parameter choices, refer to the Best Performance for fq2bam and Best Performance for Haplotypecaller sections. We recommend the following command for best performance.

On an H100 DGX the germline pipeline typically runs in under ten minutes.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun germline \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --bwa-cpu-thread-pool 16 \
    --out-variants /outputdir/out.vcf \
    --run-partition \
    --read-from-tmp-dir \
    --gpusort \
    --gpuwrite \
    --keep-tmp

Best Performance for Deepvariant Germline Pipeline

The DeepVariant germline pipeline uses the fq2bam and deepvariant tools. For parameter choices, refer to the Best Performance for fq2bam and Best Performance for Deepvariant sections. We recommend the following command for best performance.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun deepvariant_germline \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --bwa-cpu-thread-pool 16 \
    --out-variants /outputdir/out.vcf \
    --run-partition \
    --read-from-tmp-dir \
    --gpusort \
    --gpuwrite \
    --keep-tmp

Best Performance for PacBio Germline Pipeline

The PacBio Germline Pipeline runs minimap2 for alignment and deepvariant for variant calling.

For parameter choices, refer to the Best Performance for Minimap2 and Best Performance for Deepvariant sections. We recommend the following command for best performance.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun pacbio_germline \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --out-variants /outputdir/out.vcf \
    --max-queue-reads 5000000 \
    --max-queue-chunks 10000 \
    --run-partition \
    --read-from-tmp-dir \
    --num-streams-per-gpu 4 \
    --gpusort \
    --gpuwrite \
    --keep-tmp

Best Performance for fq2bam

Parabricks fq2bam automatically uses an optimal number of streams based on the GPU's device memory (by default, --bwa-nstreams auto). You can experiment with the --bwa-nstreams and --bwa-cpu-thread-pool parameters to potentially achieve better performance. For the sorting, duplicate-marking, and compression phases, we recommend --gpusort and --gpuwrite: --gpusort uses GPUs to accelerate sorting and duplicate marking, and --gpuwrite uses one GPU to accelerate BAM or CRAM compression. Below is our recommendation for best performance.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun fq2bam \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --bwa-cpu-thread-pool 16 \
    --out-recal-file recal.txt \
    --knownSites /workdir/hg.known_indels.vcf \
    --gpusort \
    --gpuwrite
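
If the automatic defaults are not optimal on your system, you can try pinning the stream count explicitly. A hedged variant of the command above: the value 4 for --bwa-nstreams is only an illustrative starting point, not a recommendation for every GPU.

# Same as above, but with an explicit number of BWA streams per GPU.
# Adjust --bwa-nstreams (and --bwa-cpu-thread-pool) while benchmarking on your hardware.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun fq2bam \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --bwa-nstreams 4 \
    --bwa-cpu-thread-pool 16 \
    --gpusort \
    --gpuwrite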

Best Performance for Deepvariant

Parabricks DeepVariant can use multiple streams per GPU; how many can be used depends on the available resources. By default the number of streams is set to auto, which picks an optimal configuration, based on prior testing, as a function of the GPU's device memory size. You can increase the number of streams up to a maximum of six with the --num-streams-per-gpu parameter to potentially get better performance; experiment until you find the optimal number for your system. The --run-partition parameter is used because it splits up work more efficiently across multiple GPUs on multi-GPU systems. If you use fewer than two GPUs, you can omit that parameter. We recommend the following command for best performance.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun deepvariant \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-bam /outputdir/fq2bam_output.bam \
    --out-variants /outputdir/out.vcf \
    --run-partition
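
To experiment with the stream count, here is a hedged variant of the command above; the value 2 for --num-streams-per-gpu is only an illustrative starting point (the maximum is six).

# Same as above, but with an explicit number of DeepVariant streams per GPU.
# Increase --num-streams-per-gpu gradually (up to 6) while watching device memory usage.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun deepvariant \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-bam /outputdir/fq2bam_output.bam \
    --out-variants /outputdir/out.vcf \
    --num-streams-per-gpu 2 \
    --run-partition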

Best Performance for Haplotypecaller

Use the --run-partition parameter, as it splits up work more efficiently across multiple GPUs on multi-GPU systems. If you use fewer than two GPUs, you can omit that parameter. We recommend the following command for best performance.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
# The --no-alt-contigs flag will ignore all contigs after chrM.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun haplotypecaller \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-bam /outputdir/fq2bam_output.bam \
    --out-variants /outputdir/out.vcf \
    --num-htvc-threads 8 \
    --no-alt-contigs \
    --run-partition
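
On a single GPU, --run-partition can be omitted, as noted above. A minimal single-GPU sketch of the same run:

# Single-GPU run; --run-partition is omitted because there is nothing to split across GPUs.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun haplotypecaller \
    --num-gpus 1 \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-bam /outputdir/fq2bam_output.bam \
    --out-variants /outputdir/out.vcf \
    --num-htvc-threads 8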

Best Performance for Minimap2

Parabricks minimap2 automatically sets optimal parameters based on the selected --preset and the number of GPUs detected on the system. The --chunk-size parameter can have a big impact on performance and memory usage: with 8 GPUs, a value of 1000 is most efficient; with fewer than 8 GPUs, 5000 potentially provides better performance. While setting this value higher may yield better performance, higher values use much more host memory. The --max-queue-reads and --max-queue-chunks parameters reduce host memory usage by limiting how much work is queued between processing stages. Increasing these values lifts the restrictions and might provide some additional speedup at the cost of higher host memory usage.

The following command line options provided optimal performance for PacBio data.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun minimap2 \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --max-queue-reads 5000000 \
    --max-queue-chunks 10000 \
    --gpusort \
    --gpuwrite
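
On fewer than 8 GPUs, you can also experiment with --chunk-size as described above. A hedged variant of the PacBio command with the 5000 value mentioned earlier (keep in mind that host memory usage grows with larger values):

# Same as above, with an explicit chunk size for a system with fewer than 8 GPUs.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun minimap2 \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --chunk-size 5000 \
    --max-queue-reads 5000000 \
    --max-queue-chunks 10000 \
    --gpusort \
    --gpuwrite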

The following command line options provided optimal performance for the splice and splice:hq presets.

Decrease --num-threads to reduce CPU memory usage.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun minimap2 \
    --preset splice \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --num-threads 148 \
    --max-queue-reads 5000000 \
    --max-queue-chunks 10000 \
    --gpusort \
    --gpuwrite

Best Performance for Giraffe

During runtime, VG Giraffe loads index data into GPU device memory, which can impact available memory for concurrent operations. To optimize device memory usage and performance, consider the following options tailored to your GPU device memory capacity:

  • For 16GB devices (e.g. T4): use the --low-memory option

  • For 16GB-40GB devices (e.g. L4, A10), optimize performance by adjusting:

    • --nstreams: Controls the number of CUDA streams per GPU

    • --batch-size: Adjusts the number of reads processed in a batch

    • For the L4, best performance was obtained using --nstreams 2 --batch-size 8000

  • For >40GB devices: Default parameters are sufficient; however, there is the potential for further optimization by adjusting the number of streams.

  • For >80GB devices, better performance can be achieved by increasing the number of streams and by enabling the computation of minimizers and seeds on GPU: --minimizers-gpu. On an H100 DGX, best performance was obtained using --nstreams 4 --batch-size 10000 --minimizers-gpu

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun giraffe \
    --read-group "sample_rg1" \
    --sample "sample-name" \
    --read-group-library "library" \
    --read-group-platform "platform" \
    --read-group-pu "pu" \
    --dist-name /workdir/hprc-v1.1-mc-grch38.dist \
    --minimizer-name /workdir/hprc-v1.1-mc-grch38.min \
    --gbz-name /workdir/hprc-v1.1-mc-grch38.gbz \
    --ref-paths /workdir/hprc-v1.1-mc-grch38.paths.sub \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --batch-size 10000 \
    --nstreams 4 \
    --minimizers-gpu
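
For an L4-class GPU (16GB-40GB device memory), a hedged variant of the command above using the settings listed earlier (--nstreams 2 --batch-size 8000, no --minimizers-gpu); read-group values and file names are the same placeholders as above:

# L4-class variant: fewer streams and smaller batches than the H100 example above.
$ docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun giraffe \
    --read-group "sample_rg1" \
    --sample "sample-name" \
    --read-group-library "library" \
    --read-group-platform "platform" \
    --read-group-pu "pu" \
    --dist-name /workdir/hprc-v1.1-mc-grch38.dist \
    --minimizer-name /workdir/hprc-v1.1-mc-grch38.min \
    --gbz-name /workdir/hprc-v1.1-mc-grch38.gbz \
    --ref-paths /workdir/hprc-v1.1-mc-grch38.paths.sub \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --batch-size 8000 \
    --nstreams 2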

For additional performance and higher bandwidth when writing the final BAM, use GPUDirect Storage (GDS), which is part of the CUDA toolkit. Note that the system must be set up to support GDS.
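
Before running with --use-gds, you can optionally verify that GDS is functional on the host with the gdscheck utility that ships with the GDS tools; the path below assumes a default CUDA toolkit installation and may differ on your system.

# Check GDS/cuFile support on the host (path assumes a default CUDA install).
$ /usr/local/cuda/gds/tools/gdscheck.py -p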

The following examples show how to set up and use GDS:

# Using GDS with the convenience docker wrapper.
$ wget https://raw.githubusercontent.com/NVIDIA/MagnumIO/main/gds/docker/gds-run-container
$ chmod +x gds-run-container
$ ./gds-run-container run \
    --rm \
    --gpus all \
    --enable-mofed \
    --enable-gds \
    --volume INPUT_DIR:/workdir \
    --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun fq2bam \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --out-recal-file recal.txt \
    --knownSites /workdir/hg.known_indels.vcf \
    --gpusort \
    --gpuwrite \
    --use-gds

# Using GDS without the wrapper.
$ docker run \
    --ipc host \
    --volume /run/udev:/run/udev:ro \
    --device=/dev/nvidia-fs0 \
    --device=/dev/nvidia-fs1 \
    --device=/dev/nvidia-fs2 \
    --device=/dev/nvidia-fs3 \
    --device=/dev/nvidia-fs4 \
    --device=/dev/nvidia-fs5 \
    --device=/dev/nvidia-fs6 \
    --device=/dev/nvidia-fs7 \
    --device=/dev/nvidia-fs8 \
    --device=/dev/nvidia-fs9 \
    --device=/dev/nvidia-fs10 \
    --device=/dev/nvidia-fs11 \
    --device=/dev/nvidia-fs12 \
    --device=/dev/nvidia-fs13 \
    --device=/dev/nvidia-fs14 \
    --device=/dev/nvidia-fs15 \
    --rm \
    --gpus all \
    --volume INPUT_DIR:/workdir \
    --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun fq2bam \
    --ref /workdir/Homo_sapiens_assembly38.fasta \
    --in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
    --out-bam /outputdir/fq2bam_output.bam \
    --tmp-dir /workdir \
    --out-recal-file recal.txt \
    --knownSites /workdir/hg.known_indels.vcf \
    --gpusort \
    --gpuwrite \
    --use-gds
