Running Clara Parabricks on GCP
This guide shows how to run Parabricks on a compute instance on Google Cloud Platform (GCP).
Clara Parabricks is an accelerated compute framework for genomics, primarily supporting analytical workflows for DNA, RNA, and somatic mutation detection. With industry-leading compute times, Parabricks rapidly converts a FASTQ file to a VCF using multiple industry-validated variant callers, and can also QC and annotate those variants. Because Parabricks is based on publicly available tools, results are easy to verify and combine with other publicly available datasets.
More information is available on the Clara Parabricks Product Page.
Detailed installation, usage, and tuning information is available in the Parabricks user guide.
In this section, we will show how to start a Compute Instance on GCP.
Begin by navigating to the Google Cloud homepage and selecting Compute Engine from the left sidebar. This will take us to the VM instances page.
At the top of the page, select Create Instance. Here we can configure all the settings for our VM instance. Under Name, let’s call the instance “parabricks” and select a region. For the purpose of this guide, any region that offers NVIDIA T4 GPUs will work.
Under Machine Configuration we will select hardware details for the VM instance. Select GPU at the top and select “NVIDIA T4” for GPU Type and “1” for Number of GPUs. Under Machine Type, select “n1-standard-32”. This machine type meets the minimum CPU and memory requirements for Parabricks, and is all we need for the purpose of this guide. The Machine Configuration section should now look like this:
We will make sure our VM instance has the proper GPU drivers by switching our base image from the default image to a base image that already has drivers installed. Under the Boot disk section, select Change. Under Operating System, select “Deep Learning on Linux” and under Version select “NVIDIA GPU-Optimized VMI”. While we are on this page, let’s also increase the disk size from the default value up to 500 GB. This will ensure we have enough space for the test dataset when we test our Parabricks installation. The Boot disk page will look like this:
Now we have everything we need to launch the instance. At the bottom of the page click Create. After a few minutes, we can see the instance is running and ready to be used.
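The console steps above can also be approximated from the command line. The sketch below assembles a gcloud command that mirrors the same choices (instance name, machine type, GPU type and count, disk size). The zone, image family, and image project are placeholders introduced here for illustration and are not values from this guide; they need to be replaced with a zone that offers T4 GPUs and with the identifiers of the chosen boot image. The script only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: build a gcloud command that mirrors the console settings above.
# ZONE, IMAGE_FAMILY and IMAGE_PROJECT are placeholders (assumptions), not
# values from this guide. Replace them before running the printed command.
ZONE="${ZONE:-us-central1-a}"
IMAGE_FAMILY="${IMAGE_FAMILY:-YOUR_IMAGE_FAMILY}"
IMAGE_PROJECT="${IMAGE_PROJECT:-YOUR_IMAGE_PROJECT}"

build_create_command() {
    # GPU instances cannot live-migrate, so the maintenance policy is TERMINATE.
    printf '%s ' \
        gcloud compute instances create parabricks \
        "--zone=$ZONE" \
        --machine-type=n1-standard-32 \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --maintenance-policy=TERMINATE \
        --boot-disk-size=500GB \
        "--image-family=$IMAGE_FAMILY" \
        "--image-project=$IMAGE_PROJECT"
    printf '\n'
}

# Print the command for review; nothing is created at this point.
build_create_command
```

Printing rather than executing keeps the sketch safe to run anywhere. Once the placeholders are filled in, the printed line can be pasted into a shell where gcloud is installed and authenticated.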
Let’s click on the instance to go to the instance page and then, under SSH in the top right, select a method to connect. Read more about connecting to GCP instances in the GCP documentation.
Once we are connected, the VM will ask if we want to install the NVIDIA Driver. Select yes, and allow it to install the drivers automatically.
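After the installer finishes, one quick way to confirm that the driver can see the GPU is to query it with nvidia-smi. The helper name check_gpu below is hypothetical and only wraps that query with a friendlier message when the tool is missing.

```shell
#!/bin/sh
# Sketch: confirm the driver can see the GPU. check_gpu is a hypothetical helper.
check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver may not be installed yet" >&2
        return 1
    fi
    # Print one CSV line per GPU: model name, total memory, driver version.
    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
}

check_gpu || echo "GPU check failed"
```

On the instance configured in this guide, the output would be expected to contain a single line for the T4.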
Once the driver installation finishes, we need to set up our Docker environment. Docker is already installed at this point, but it requires sudo access to run. We can get around this by running the following commands:
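The commands themselves are not reproduced at this point in the guide. A common pair that matches the description that follows is sketched here as an assumption, not as the guide's exact text: usermod to add the user to the docker group, then newgrp to pick up the new membership in the current session. Both are wrapped in a hypothetical function so that nothing runs until it is called, and a second hypothetical helper checks whether the change is needed at all.

```shell
#!/bin/sh
# Assumed commands (not taken from this guide): the usual way to allow
# running Docker without sudo on Linux.
enable_docker_without_sudo() {
    # 1. Add the current user to the docker group.
    sudo usermod -aG docker "$USER"
    # 2. Start a shell that picks up the new group membership.
    newgrp docker
}

# Return 0 if the space-separated group list in $1 contains "docker".
has_docker_group() {
    case " $1 " in
        *" docker "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_docker_group "$(id -nG)"; then
    echo "current user is already in the docker group"
else
    echo "not in the docker group yet: call enable_docker_without_sudo on the VM"
fi
```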
The first command adds our user to the Docker group, allowing us to run Docker commands without using sudo. The second command refreshes Docker to make sure these changes take effect. We can test that this worked by running docker ps. This command should run without any errors:
Now we are ready to start the Parabricks installation!
We will install Parabricks into our instance that we just created. To do this, we will use the NVIDIA GPU Cloud (NGC) to download the Parabricks Docker image.
Visit the Parabricks page on NGC to get the Docker pull command for the latest version of Parabricks.
Back in our GCP VM instance, let’s run the docker pull command:
$ docker pull nvcr.io/nvidia/clara/clara-parabricks:4.2-1
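To confirm that the pull succeeded, the local image store can be queried for the same tag. This is a minimal sketch; image_is_present is a hypothetical helper name, and the tag is copied from the pull command above.

```shell
#!/bin/sh
# Image tag copied from the pull command above.
IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.2-1"

# image_is_present is a hypothetical helper; $1 is an image reference.
image_is_present() {
    # docker image inspect exits non-zero when the image is not stored locally.
    docker image inspect "$1" >/dev/null 2>&1
}

if image_is_present "$IMAGE"; then
    echo "found $IMAGE"
else
    echo "missing $IMAGE (the pull may have failed, or docker is unavailable)"
fi
```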
Now Parabricks is installed! Let’s run some sample data to test it.
Parabricks provides a small sample dataset as a test for the installation and hardware which can be downloaded using:
$ wget -O parabricks_sample.tar.gz \
    "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
When the download completes, we can untar the data using:
$ tar xzvf parabricks_sample.tar.gz
The parabricks_sample folder should look like this when we’re done:
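The directory listing is not reproduced here. As a rough substitute, the sketch below checks for the reference FASTA, the known-indels VCF, and the two FASTQ files that the germline run in this guide reads. The helper name check_sample_inputs is hypothetical, and the check covers only these four files, not the full contents of the archive.

```shell
#!/bin/sh
# Sketch: check that the inputs the germline run reads are present.
# check_sample_inputs is a hypothetical helper; $1 is the sample directory.
check_sample_inputs() {
    dir="${1:-parabricks_sample}"
    missing=0
    for f in \
        Ref/Homo_sapiens_assembly38.fasta \
        Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
        Data/sample_1.fq.gz \
        Data/sample_2.fq.gz
    do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_sample_inputs parabricks_sample; then
    echo "sample inputs look complete"
else
    echo "some sample inputs are missing"
fi
```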
Finally, we can run any of the Parabricks pipelines on it. Let’s run the germline pipeline using the following command:
$ docker run \
    --rm \
    --gpus all \
    --volume `pwd`:`pwd` \
    --workdir `pwd`/parabricks_sample \
    nvcr.io/nvidia/clara/clara-parabricks:4.2-1 \
    pbrun germline \
    --ref Ref/Homo_sapiens_assembly38.fasta \
    --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
    --knownSites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --out-bam output.bam \
    --out-variants germline.vcf \
    --out-recal-file recal.txt
We can tell that Parabricks started correctly when we see the Parabricks banner and the ProgressMeter begins to populate with values:
This should take ~10 minutes to finish running. When it’s done, we should see the output files in the sample data directory:
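The listing of output files is not reproduced here. A minimal sketch of a check, assuming the run used the three output names from the germline command (output.bam, germline.vcf, recal.txt), is to test that each file exists and then count the variant records in the VCF. Both helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: confirm the three outputs named in the germline command exist,
# then count the variant records in the VCF. Helper names are hypothetical.
count_vcf_records() {
    # VCF header lines start with '#'; every other line is one record.
    grep -vc '^#' "$1"
}

check_outputs() {
    dir="${1:-parabricks_sample}"
    for f in output.bam germline.vcf recal.txt; do
        [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f" >&2; return 1; }
    done
    echo "variant records: $(count_vcf_records "$dir/germline.vcf")"
}

check_outputs parabricks_sample || echo "outputs not found yet"
```

A non-zero record count is only a sanity check that the run produced variants; it says nothing about their correctness.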
Congratulations, we’ve just run our first Parabricks job!
We encourage you to expand on the demo in this guide by using your own data, trying other pipelines, and generally exploring what Parabricks has to offer. Check out the documentation for more information about the different pipelines available. You can also find our online developer community on the Parabricks forum, where you can ask questions and search through answers while you are learning how to use Parabricks.