Best Performance
NVIDIA Parabricks software delivers its best performance when it has all the computing resources it needs. The system should meet all the requirements in the Installation Requirements section. The following examples show how to get the best performance from Parabricks software.
See the Hardware Requirements section for minimum hardware requirements.
See the Software Requirements section for minimum software requirements.
The goal of the NVIDIA Parabricks software is to get the highest performance for bioinformatics and genomic analysis. There are a few key, basic system options that you can tune to achieve maximum performance.
Use a Fast SSD
Parabricks software operates with two kinds of files:
Input/output files specified by the user
Temporary files created during execution and deleted at the end of the run
The best performance is achieved when both kinds of files are on a fast, local SSD.
If this is not possible, you can place the input/output files on a fast network storage device and the temporary files on a local SSD using the --tmp-dir option.
Tests have shown that you can use up to 4 GPUs and still get good performance with the input/output files on a Lustre network file system. If you plan to use more than 4 GPUs, we highly recommend using local SSDs for both kinds of files.
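As a minimal sketch of the placement advice above, the following shell helper picks the first existing, writable directory from a list of candidates so that it can be passed to --tmp-dir. The function name pick_tmp_dir and the candidate paths are hypothetical and not part of Parabricks; the pbrun command is only printed, not run.

```shell
# Hypothetical helper (not part of Parabricks): print the first candidate
# directory that exists and is writable, so it can be passed to --tmp-dir.
pick_tmp_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example candidates (assumptions): prefer a local SSD mount, fall back to /tmp.
TMP_DIR=$(pick_tmp_dir /raid/pbrun_tmp /scratch/pbrun_tmp /tmp) || {
    echo "no writable temporary directory found" >&2
}

# Dry run: show where the chosen directory would be used.
echo "would run: pbrun fq2bam ... --tmp-dir ${TMP_DIR}"
```

The helper only checks that a directory is usable. It does not verify that the directory is actually on a local SSD, so list the candidates in order of preference for your system.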
DGX Users
The DGX comes with an SSD, usually mounted on /raid. Use a directory on this disk as the --tmp-dir. For initial testing, you can even copy the input files to this disk to eliminate variability in performance.
Specifying which GPUs to use
For tools that use GPUs, you can choose the number of GPUs to run on with the command line option --num-gpus N. With this option, only the first N GPUs listed in the output of nvidia-smi will be used.
To use specific GPUs, set the environment variable NVIDIA_VISIBLE_DEVICES. GPUs are numbered starting with zero. For example, this command will use only the second (GPU #1) and fourth (GPU #3) GPUs:
$ NVIDIA_VISIBLE_DEVICES="1,3" pbrun fq2bam --num-gpus 2 --ref Ref.fa --in-fq S1_1.fastq.gz --in-fq S1_2.fastq.gz
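When GPUs are selected this way, the value given to --num-gpus should match the number of IDs in NVIDIA_VISIBLE_DEVICES. The sketch below derives that count from the ID list so the two cannot drift apart. The function name count_gpu_ids is a hypothetical helper, and the command is only printed as a dry run.

```shell
# Hypothetical helper (not part of Parabricks): count the comma-separated
# GPU IDs in a list such as "1,3".
count_gpu_ids() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -c '[^[:space:]]'
}

GPU_IDS="1,3"
NUM_GPUS=$(count_gpu_ids "$GPU_IDS")

# Dry run: print the command instead of executing it.
echo "NVIDIA_VISIBLE_DEVICES=\"$GPU_IDS\" pbrun fq2bam --num-gpus $NUM_GPUS --ref Ref.fa --in-fq S1_1.fastq.gz --in-fq S1_2.fastq.gz"
```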
This section details guidelines specific to individual tools.
Best Performance for Germline Pipeline
For backwards-compatible results, at the cost of some performance, when using --fq2bamfast, set --bwa-options="-K 10000000".
$ # This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
pbrun germline \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--num-cpu-threads 16 \
--bwa-cpu-thread-pool 16 \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--out-variants /outputdir/out.vcf \
--run-partition --no-alt-contigs \
--gpusort \
--gpuwrite
Best Performance for fq2bam
Use the new beta version, fq2bamfast. For backwards-compatible results, at the cost of some performance, set --bwa-options="-K 10000000".
$ # This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
pbrun fq2bamfast \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--num-cpu-threads 16 \
--bwa-cpu-thread-pool 16 \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort
Best Performance for deepvariant
DeepVariant from Parabricks can use multiple streams on a GPU. The number of streams that can be used depends on the available resources. The default is two streams, which can be increased up to a maximum of six for better performance. Finding the optimal number for your system requires experimentation.
$ # This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir --env TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456 \
pbrun deepvariant \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-bam /outputdir/fq2bam_output.bam \
--out-variants /outputdir/out.vcf \
--num-streams-per-gpu 4 \
--run-partition \
--gpu-num-per-partition 1
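One way to carry out the experimentation described above is to sweep the stream count from the default of two up to the maximum of six and time each run. The sketch below only prints one command per stream count as a dry run. The function name print_stream_sweep and the per-run output file names are hypothetical, and the printed commands are meant to run inside the Parabricks container with the same volumes as the example above.

```shell
# Hypothetical helper (not part of Parabricks): print one deepvariant command
# per stream count between the given minimum and maximum (inclusive).
# Commands are printed, not executed; run and time each one to compare.
print_stream_sweep() {
    min_streams=$1
    max_streams=$2
    n=$min_streams
    while [ "$n" -le "$max_streams" ]; do
        echo "pbrun deepvariant --ref /workdir/Homo_sapiens_assembly38.fasta --in-bam /outputdir/fq2bam_output.bam --out-variants /outputdir/out_streams_${n}.vcf --num-streams-per-gpu ${n}"
        n=$((n + 1))
    done
}

# Sweep from the default (2) to the documented maximum (6).
print_stream_sweep 2 6
```

Writing each run to a separate output file keeps the results from overwriting each other, which also makes it possible to confirm that the variant calls do not change with the stream count.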
For additional performance and higher bandwidth when writing the final BAM, use GPUDirect Storage (GDS), which is part of the CUDA toolkit. Note that the system must be set up to support GDS.
The following are references for setting up and using GDS:
# Using GDS with the convenience docker wrapper.
$ wget
$ chmod +x gds-run-container
$ ./gds-run-container run \
--rm \
--gpus all \
--enable-mofed \
--enable-gds \
--volume INPUT_DIR:/workdir \
--volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
pbrun fq2bam \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort \
--gpuwrite
# Using GDS without the wrapper.
$ docker run \
--ipc host \
--volume /run/udev:/run/udev:ro \
--device=/dev/nvidia-fs0 \
--device=/dev/nvidia-fs1 \
--device=/dev/nvidia-fs2 \
--device=/dev/nvidia-fs3 \
--device=/dev/nvidia-fs4 \
--device=/dev/nvidia-fs5 \
--device=/dev/nvidia-fs6 \
--device=/dev/nvidia-fs7 \
--device=/dev/nvidia-fs8 \
--device=/dev/nvidia-fs9 \
--device=/dev/nvidia-fs10 \
--device=/dev/nvidia-fs11 \
--device=/dev/nvidia-fs12 \
--device=/dev/nvidia-fs13 \
--device=/dev/nvidia-fs14 \
--device=/dev/nvidia-fs15 \
--rm \
--gpus all \
-enable-mofed \
--volume INPUT_DIR:/workdir \
--volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
pbrun fq2bam \
--ref /workdir/Homo_sapiens_assembly38.fasta \
--in-fq /workdir/fastq1.gz /workdir/fastq2.gz \
--out-bam /outputdir/fq2bam_output.bam \
--tmp-dir /workdir \
--out-recal-file recal.txt \
--knownSites /workdir/hg.known_indels.vcf \
--gpusort \
--gpuwrite
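The sixteen --device flags in the command without the wrapper follow a regular pattern (/dev/nvidia-fs0 through /dev/nvidia-fs15), so they can be generated instead of typed by hand. The sketch below is one way to do that. The function name gds_device_flags is a hypothetical helper, and the docker command is only printed as a dry run with the remaining arguments elided.

```shell
# Hypothetical helper (not part of Parabricks or Docker): print a
# space-separated list of --device flags for /dev/nvidia-fs0 .. fs(N-1).
# N defaults to 16, matching the example command above.
gds_device_flags() {
    count=${1:-16}
    i=0
    flags=""
    while [ "$i" -lt "$count" ]; do
        flags="$flags --device=/dev/nvidia-fs$i"
        i=$((i + 1))
    done
    # Strip the leading space left by the first concatenation.
    echo "${flags# }"
}

# Dry run: show how the generated flags slot into the docker command.
echo "docker run --ipc host --volume /run/udev:/run/udev:ro $(gds_device_flags) ..."
```

The helper does not check that the device nodes exist on the host, so the count should match the nvidia-fs devices actually present on your system.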