Parabricks FQ2BAM NIM
Important
NVIDIA NIM is currently in limited availability. Sign up here to get notified when the latest NIMs are available to download.
Parabricks FQ2BAM is not an AI model but rather GPU-optimized code for genomic analysis, so there is no need to explicitly pull a model as is common with other NIMs.
After sequencing, the next step in genomic analysis is alignment. This NIM makes it efficient to set up a microservice capable of scheduling multiple jobs to run one after another. This is especially handy for running large quantities of data, as the NIM ensures that when one job finishes, the next one starts, maximizing throughput. In addition, the NIM runs on an open port, so it can be configured so that multiple users submit jobs to the same queue, creating a centralized compute environment. Finally, NIMs can be combined: for example, the Parabricks DeepVariant NIM can be run on the output of this FQ2BAM NIM, enabling end-to-end workflows.
For more information on the Parabricks FQ2BAM workflow, there are extra tutorials in the Parabricks documentation.
Specific Requirements
Operating System
Linux (x86_64/amd64): Linux distributions supported by the NVIDIA Container Toolkit.
Hardware
Hopper, Ampere, or Ada GPUs with a minimum of 24 GB of GPU memory (VRAM)
Software
Minimum Driver version: 535.161.07
If you want to use the Python examples, you will need to install the third-party requests module. This can be done by running pip install requests.
Quickstart Guide
Note
This page assumes Prerequisite Software (Docker, NGC CLI, NGC registry access) is installed and set up.
Pull the NIM container.
docker pull nvcr.io/nvidia/nim/fq2bam:24.03.01
Run NIM.
docker run --rm -d \
    --volume /etc/passwd:/etc/passwd:ro \
    --volume /etc/group:/etc/group:ro \
    --volume /etc/shadow:/etc/shadow:ro \
    --volume $PWD:/workspace \
    --shm-size=2G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --runtime=nvidia \
    --gpus=1 \
    -p 8003:8003 \
    --name fq2bam \
    nvcr.io/nvidia/nim/fq2bam:24.03.01
Download the data.
mkdir test-data
cd test-data
wget https://s3.amazonaws.com/parabricks.sample/nim/References_chr22.fasta.tar
wget https://s3.amazonaws.com/parabricks.sample/nim/test-datachr22.100x.fq_1.fastq.gz
wget https://s3.amazonaws.com/parabricks.sample/nim/test-datachr22.100x.fq_2.fastq.gz
cd ..
mkdir results

Submit an alignment request.

python3 -c '
import requests
import json

invoke_url = "http://localhost:8003/genomics/parabricks/fq2bam/run"

headers = {
    "accept": "application/json",
    "content-type": "application/json",
}

# The /workspace directory referred to here is the /workspace directory found in
# the server startup command.
in_fq = [
    {
        "fq1" : "/workspace/test-data/test-datachr22.100x.fq_1.fastq.gz",
        "fq2" : "/workspace/test-data/test-datachr22.100x.fq_2.fastq.gz",
        "rg" : "@RG\\tID:foo\\tLB:lib1\\tPL:bar\\tSM:HG002\\tPU:foo"
    }
]

payload = {
    "in_ref_tarball" : "/workspace/test-data/References_chr22.fasta.tar",
    "in_fq" : json.dumps(in_fq),
    "out_bam" : ["/workspace/results/test-datachr22.bam"],
    "out_bam_parts_manifest" : "/workspace/results/test-datachr22.bam.parts_manifest.txt",
    "out_bai" : "/workspace/results/test-datachr22.bam.bai",
    "out_chrs" : "/workspace/results/test-datachr22_chrs.txt",
    "out_stderr" : "/workspace/results/test-datachr22.stderr",
    "out_stdout" : "/workspace/results/test-datachr22.stdout",
    "additional_args" : ""
}

response = requests.post(invoke_url, headers=headers, json=payload, stream=True)
print(response.status_code)
print(response.text)
'
Detailed Instructions
In this workflow, we will submit a pair of .fastq.gz files (Chromosome 22 from HG002, sampled at 100x coverage) for alignment. The microservice will use Parabricks fq2bam to perform the alignment and return a .bam file of our aligned sample.

The Genomics NIMs take either local file paths or pre-signed URLs. A local file path is one that is mounted on the computer on which the server is running. This could be locally attached storage, an NFS share, etc., and looks like /path/to/file/name.ext. A pre-signed URL looks like https://s3.us-east-1.amazonaws.com/your.bucket/path/to/file.ext?etcetcetc.
If a local file path is given in the job request then it must be a path as seen from inside the server. To specify a local file path correctly you’ll need to know what local directories were made available to the server when it was started and where they were mounted. If the microservice is started as follows:
docker run --rm -it \
    ...
    --volume $PWD:/workspace \
    ...
then the --volume $PWD:/workspace line makes the current directory ($PWD) available as the /workspace directory inside the server. If the BAM file should be saved in $PWD/my_input_files/batch1/filename.bam, then the cURL command (see below for the complete command) would need to be:
curl -X 'POST' \
    'http://localhost:8003/genomics/parabricks/fq2bam/run' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "out_bam": "/workspace/my_input_files/batch1/filename.bam",
        ...
    }'
For remote files it doesn’t matter what directories were mounted on the server. Specify a pre-signed URL like this:
curl -X 'POST' \
    'http://localhost:8003/genomics/parabricks/fq2bam/run' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "out_bam": "https://s3.us-east-1.amazonaws.com/your.bucket/path/to/file.ext?etcetcetc",
        ...
    }'
Requests can mix local and remote file paths at will. For example, you might read all your input files from S3 and write all your output files to a local drive.
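As a sketch of this rule, a small helper (hypothetical, not part of the NIM API) can classify each file reference before the payload is built, so local paths and pre-signed URLs can be mixed deliberately rather than by accident:

```python
# Hypothetical helper: classify a file reference as a pre-signed URL or a
# local path as seen from inside the server container.
def is_presigned_url(ref: str) -> bool:
    """Return True for http(s) URLs, False for local container paths."""
    return ref.startswith("http://") or ref.startswith("https://")


# Mixing styles in one request: remote reference input, local output.
payload_paths = {
    "in_ref_tarball": "https://s3.us-east-1.amazonaws.com/your.bucket/ref.tar?sig=abc",
    "out_bam": "/workspace/results/sample.bam",
}

for key, ref in payload_paths.items():
    kind = "remote" if is_presigned_url(ref) else "local"
    print(f"{key}: {kind}")
```

The dictionary keys and values above are illustrative; the point is only that each field is independently either a container path or a URL.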
Pull Container Image
Container image tags can be retrieved using the following command.
ngc registry image info nvcr.io/nvidia/nim/fq2bam
Image Repository Information
    Name: fq2bam
    Display Name: fq2bam
    Short Description: Please add description
    Built By:
    Publisher:
    Multinode Support: False
    Multi-Arch Support: False
    Logo:
    Labels:
    Public: No
    Access Type:
    Associated Products: []
    Last Updated: Mar 14, 2024
    Latest Image Size: 2.41 GB
    Signed Tag?: False
    Latest Tag: 24.03.01
    Tags:
        24.03.01
Pull the container image with either Docker or the NGC CLI.
docker pull nvcr.io/nvidia/nim/fq2bam:24.03.01
ngc registry image pull nvcr.io/nvidia/nim/fq2bam:24.03.01
Launch the Microservice
Launch the container.
docker run --rm -it \
    --volume /etc/passwd:/etc/passwd:ro \
    --volume /etc/group:/etc/group:ro \
    --volume /etc/shadow:/etc/shadow:ro \
    --volume $PWD:/workspace \
    --shm-size=2G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus=1 \
    -p 8003:8003 \
    --name fq2bam \
    --runtime=nvidia \
    nvcr.io/nvidia/nim/fq2bam:24.03.01
The current directory ($PWD) will be accessible inside the container as the /workspace directory. Subsequent example commands place input files in this directory, and output files will be written here as well. If you wish to place your files elsewhere, change $PWD to the desired path.
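Because every path in a job request must be a path as seen from inside the container, it can help to translate host paths programmatically. Below is a minimal sketch; the helper name is illustrative, and it assumes a single --volume host_root:/workspace mount like the one above:

```python
import os


def to_container_path(host_path: str, host_root: str,
                      container_root: str = "/workspace") -> str:
    """Translate a host path under host_root into the equivalent path inside
    the container, given a `--volume host_root:container_root` mount."""
    host_path = os.path.abspath(host_path)
    host_root = os.path.abspath(host_root)
    if host_path != host_root and not host_path.startswith(host_root + os.sep):
        raise ValueError(f"{host_path} is not under the mounted directory {host_root}")
    rel = os.path.relpath(host_path, host_root)
    return os.path.join(container_root, rel)


# e.g. if the container was started from /home/user/run:
print(to_container_path("/home/user/run/test-data/ref.tar", "/home/user/run"))
# -> /workspace/test-data/ref.tar
```

A path outside the mounted directory raises an error instead of silently producing a container path the server cannot see.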
Let's briefly go through the flags in the command and some possible configurations:

Flag | Description
---|---
--name fq2bam | Sets the name of the container.
--runtime=nvidia | Use the NVIDIA container runtime so the container can access the GPUs.
--gpus=1 | Tells Docker how many GPUs to use. In this case, we only use 1 GPU.
--shm-size=2G | Increases the maximum shared memory size from the default of 64MB to 2GB.
--volume | Mounts directories from the host machine into the container.
--ulimit memlock=-1 | Sets the maximum locked memory. A value of -1 removes the limit. We do not recommend lowering this value.
--ulimit stack=67108864 | Sets the maximum stack size in bytes. We do not recommend lowering this value.
-p 8003:8003 | Forwards the ports from the host machine into Docker. This NIM requires port 8003.
Health and Liveness Checks
The container exposes health and liveness endpoints at /v2/health/ready and /v2/health/live for integration into existing systems such as Kubernetes. These endpoints return an HTTP 200 OK status code only if the service is ready or live, respectively.
Run these in a new terminal.
curl localhost:8003/v2/health/ready
true

curl localhost:8003/v2/health/live
true
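In automated pipelines it is convenient to wait for the readiness endpoint before submitting work. Here is a minimal polling sketch using only the Python standard library; the default host, timeout, and interval values are illustrative:

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str = "http://localhost:8003",
                     timeout_s: float = 120.0,
                     interval_s: float = 2.0) -> bool:
    """Poll /v2/health/ready until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=5) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # the container may still be starting up
        time.sleep(interval_s)
    return False
```

A job submitter can call wait_until_ready() once at startup and fail fast if the microservice never comes up.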
Download Sample Data
Download a small reference file. The FASTA file and all its associated indices must be packaged into a single .tar file. This is the same .tar file downloaded for the DeepVariant example.
mkdir test-data
cd test-data
wget https://s3.amazonaws.com/parabricks.sample/nim/References_chr22.fasta.tar
Download the sample FASTQ files.
wget https://s3.amazonaws.com/parabricks.sample/nim/test-datachr22.100x.fq_1.fastq.gz
wget https://s3.amazonaws.com/parabricks.sample/nim/test-datachr22.100x.fq_2.fastq.gz
cd ..
mkdir results
The current working directory should have at least these files:
tree
.
├── results
└── test-data
    ├── References_chr22.fasta.tar
    ├── test-datachr22.100x.fq_1.fastq.gz
    └── test-datachr22.100x.fq_2.fastq.gz
Run Alignment
You can submit a request in a Python script as follows:
#!/usr/bin/env python3

# Script: fq2bam_request.py
# Usage: python3 fq2bam_request.py

import requests
import json

invoke_url = "http://localhost:8003/genomics/parabricks/fq2bam/run"

headers = {
    "accept": "application/json",
    "content-type": "application/json",
}

# The /workspace directory referred to here is the /workspace directory found in
# the server startup command.
in_fq = [
    {
        "fq1" : "/workspace/test-data/test-datachr22.100x.fq_1.fastq.gz",
        "fq2" : "/workspace/test-data/test-datachr22.100x.fq_2.fastq.gz",
        "rg" : "@RG\\tID:foo\\tLB:lib1\\tPL:bar\\tSM:HG002\\tPU:foo"
    }
]

payload = {
    "in_ref_tarball" : "/workspace/test-data/References_chr22.fasta.tar",
    "in_fq" : json.dumps(in_fq),
    "out_bam" : ["/workspace/results/test-datachr22.bam"],
    "out_bam_parts_manifest" : "/workspace/results/test-datachr22.bam.parts_manifest.txt",
    "out_bai" : "/workspace/results/test-datachr22.bam.bai",
    "out_chrs" : "/workspace/results/test-datachr22_chrs.txt",
    "out_stderr" : "/workspace/results/test-datachr22.stderr",
    "out_stdout" : "/workspace/results/test-datachr22.stdout",
    "additional_args" : ""
}

response = requests.post(invoke_url, headers=headers, json=payload, stream=True)
print(response.status_code)
print(response.text)
Below is a description of what each field in the payload does, and how it can be customized to any workflow:

Flag | Description
---|---
fq1 | First paired-end fastq file.
fq2 | Second paired-end fastq file.
in_ref_tarball | Reference tarball in the same format as the sample data References_chr22.fasta.tar. Untar it to see the associated files.
out_bam | Path for the aligned output bam file.
out_bam_parts_manifest | Parts manifest for the output bam file.
out_bai | Path for the output .bam.bai file of bam indices.
out_chrs | Path for a file with a list of chromosomes.
out_stderr | Path for the stderr output of the job.
out_stdout | Path for the stdout output of the job.
additional_args | Other flags found in the documentation can be passed through here as a string. For example, "--no-markdups --low-memory" passes these flags directly to the Parabricks run command inside the NIM. This does NOT support options that require an input, such as --knownSites.

To run with multiple pairs of fastq files, simply extend the in_fq JSON as follows for as many pairs as necessary (each entry uses the same fq1/fq2/rg keys):

in_fq = [
    {
        "fq1" : "/workspace/test-data/test-datachr22.100x.fq_1.fastq.gz",
        "fq2" : "/workspace/test-data/test-datachr22.100x.fq_2.fastq.gz",
        "rg" : "@RG\\tID:foo\\tLB:lib1\\tPL:bar\\tSM:HG002\\tPU:foo"
    },
    {
        "fq1" : "/workspace/test-data/test-datachr22.100x.fq_3.fastq.gz",
        "fq2" : "/workspace/test-data/test-datachr22.100x.fq_4.fastq.gz",
        "rg" : "@RG\\tID:foo\\tLB:lib1\\tPL:bar\\tSM:HG002\\tPU:foo"
    },
    ...
]
Please note that it takes roughly 30 seconds for the request to process. Once it completes, the results directory should contain the following files:
ls results/
test-datachr22.bam
test-datachr22.bam.bai
test-datachr22.bam.parts_manifest.txt
test-datachr22_chrs.txt
test-datachr22.stderr
test-datachr22.stdout
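A quick sanity check after the request returns is to confirm that the expected output files exist and are non-empty. Below is a minimal sketch; the file names match the example payload, and the helper name is illustrative (note that stdout/stderr logs can legitimately be empty, so they are left out of the checked list):

```python
import os


def missing_outputs(results_dir: str, names) -> list:
    """Return the expected output files that are absent or empty."""
    problems = []
    for name in names:
        path = os.path.join(results_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(name)
    return problems


expected = [
    "test-datachr22.bam",
    "test-datachr22.bam.bai",
    "test-datachr22.bam.parts_manifest.txt",
    "test-datachr22_chrs.txt",
]
print(missing_outputs("results", expected))
```

An empty list means all expected outputs are present; anything else is worth cross-checking against the stderr log.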
Stopping the Container
When you're done testing the endpoint, you can bring down the container by running docker stop fq2bam in a new terminal.