FQ2BAM TutorialΒΆ

This tutorial will show you how to run our core alignment tool, FQ2BAM, which allows you to align a FASTQ file according to GATK best practices at blazing speeds. This includes the gold-standard alignment tool BWA-MEM with inbuilt co-ordinate sorting of the output file, and optionally application of base-quality-score-recalibration and marking of duplicate reads.

The fq2bam tool aligns, sorts (by coordinate), and marks duplicates in paired-end FASTQ file data. The data files used in this example are taken from the sample data downloaded in the previous section.

If you execute the following command using the Clara Parabricks sample data, you should get the same results as shown here.

Before executing this command, make sure your current directory is where you extracted the sample data; it should have a parabricks_sample sub-directory.

$ docker run \
      --gpus all \
      --rm \
      --volume $(pwd):/workdir \
      --volume $(pwd):/outputdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1 \
    pbrun fq2bam \
      --ref /workdir/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta \
      --in-fq /workdir/parabricks_sample/Data/sample_1.fq.gz /workdir/parabricks_sample/Data/sample_2.fq.gz \
      --out-bam /outputdir/fq2bam_output.bam

[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /workdir/parabricks_sample/Data/sample_1.fq.gz and
/workdir/parabricks_sample/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2022-Sep-02 19:49:27] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:49:27] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Sep-02 19:49:27] ||                           Version 4.0.0-1                                ||
[PB Info 2022-Sep-02 19:49:27] ||                       GPU-BWA mem, Sorting Phase-I                       ||
[PB Info 2022-Sep-02 19:49:27] ------------------------------------------------------------------------------
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[PB Warning 2022-Sep-02 19:50:02][ParaBricks/src/pbOpts.cu:325]

WARNING
The system has 12 threads, however recommended number of threads with 1 GPU is 16.
The run might not finish or might have less than expected performance.

[PB Info 2022-Sep-02 19:50:02] GPU-BWA mem
[PB Info 2022-Sep-02 19:50:02] ProgressMeter    Reads       Base Pairs Aligned
[PB Info 2022-Sep-02 19:50:45] 5043564      580000000
[PB Info 2022-Sep-02 19:51:21] 10087128 1160000000
[PB Info 2022-Sep-02 19:51:59] 15130692 1740000000
[PB Info 2022-Sep-02 19:52:39] 20174256 2320000000
[PB Info 2022-Sep-02 19:53:20] 25217820 2900000000
[PB Info 2022-Sep-02 19:53:58] 30261384 3480000000
[PB Info 2022-Sep-02 19:54:36] 35304948 4060000000
[PB Info 2022-Sep-02 19:55:13] 40348512 4640000000
[PB Info 2022-Sep-02 19:55:53] 45392076 5220000000
[PB Info 2022-Sep-02 19:56:36] 50435640 5800000000
[PB Info 2022-Sep-02 19:57:02]
GPU-BWA Mem time: 420.426442 seconds
[PB Info 2022-Sep-02 19:57:02] GPU-BWA Mem is finished.


[main] CMD: /usr/local/parabricks/binaries//bin/bwa mem -Z ./pbOpts.txt /workdir/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta /workdir/parabricks_sample/Data/sample_1.fq.gz /workdir/parabricks_sample/Data/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[main] Real time: 455.468 sec; CPU: 4766.384 sec
[PB Info 2022-Sep-02 19:57:02] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:02] ||        Program:                      GPU-BWA mem, Sorting Phase-I        ||
[PB Info 2022-Sep-02 19:57:02] ||        Version:                                     4.0.0-1              ||
[PB Info 2022-Sep-02 19:57:02] ||        Start Time:                       Fri Sep  2 19:49:27 2022        ||
[PB Info 2022-Sep-02 19:57:02] ||        End Time:                         Fri Sep  2 19:57:02 2022        ||
[PB Info 2022-Sep-02 19:57:02] ||        Total Time:                           7 minutes 35 seconds        ||
[PB Info 2022-Sep-02 19:57:02] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Sep-02 19:57:03] ||                           Version 4.0.0-1                                ||
[PB Info 2022-Sep-02 19:57:03] ||                             Sorting Phase-II                             ||
[PB Info 2022-Sep-02 19:57:03] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:03] progressMeter - Percentage
[PB Info 2022-Sep-02 19:57:03] 0.0   0.00 GB
[PB Info 2022-Sep-02 19:57:13] 72.8  0.00 GB
[PB Info 2022-Sep-02 19:57:23] Sorting and Marking: 20.001 seconds
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] ||        Program:                                  Sorting Phase-II        ||
[PB Info 2022-Sep-02 19:57:23] ||        Version:                                     4.0.0-1              ||
[PB Info 2022-Sep-02 19:57:23] ||        Start Time:                       Fri Sep  2 19:57:03 2022        ||
[PB Info 2022-Sep-02 19:57:23] ||        End Time:                         Fri Sep  2 19:57:23 2022        ||
[PB Info 2022-Sep-02 19:57:23] ||        Total Time:                                     20 seconds        ||
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:23] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2022-Sep-02 19:57:23] ||                           Version 4.0.0-1                                ||
[PB Info 2022-Sep-02 19:57:23] ||                         Marking Duplicates, BQSR                         ||
[PB Info 2022-Sep-02 19:57:23] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:57:24] progressMeter -  Percentage
[PB Info 2022-Sep-02 19:57:34] 13.6  16.60 GB
[PB Info 2022-Sep-02 19:57:44] 31.1  13.45 GB
[PB Info 2022-Sep-02 19:57:54] 46.8  10.22 GB
[PB Info 2022-Sep-02 19:58:04] 61.1  7.05 GB
[PB Info 2022-Sep-02 19:58:14] 77.3  3.84 GB
[PB Info 2022-Sep-02 19:58:24] 91.4  0.60 GB
[PB Info 2022-Sep-02 19:58:34] 100.0     0.00 GB
[PB Info 2022-Sep-02 19:59:18] BQSR and writing final BAM:  113.592 seconds
[PB Info 2022-Sep-02 19:59:18] ------------------------------------------------------------------------------
[PB Info 2022-Sep-02 19:59:18] ||        Program:                          Marking Duplicates, BQSR        ||
[PB Info 2022-Sep-02 19:59:18] ||        Version:                                     4.0.0-1              ||
[PB Info 2022-Sep-02 19:59:18] ||        Start Time:                       Fri Sep  2 19:57:23 2022        ||
[PB Info 2022-Sep-02 19:59:18] ||        End Time:                         Fri Sep  2 19:59:18 2022        ||
[PB Info 2022-Sep-02 19:59:18] ||        Total Time:                            1 minute 55 seconds        ||
[PB Info 2022-Sep-02 19:59:18] ------------------------------------------------------------------------------
Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

On an AWS g4dn.8xlarge instance (32 vCPUs, one T4 GPU, 128 GB memory), this takes approximately six minutes.

If you get an out-of-memory error make sure your computer has enough RAM, and that large amounts of memory aren't being used by other programs.

This fq2bam command produces three output files:

$ ls -l
total 14330820
-rw-r--r-- 1 root root 4819386804 Sep  2 15:58 fq2bam_output.bam
-rw-r--r-- 1 root root    6882792 Sep  2 15:59 fq2bam_output.bam.bai
-rw-r--r-- 1 root root      87690 Sep  2 15:59 fq2bam_output_chrs.txt

(input files not shown)

The first line of fq2bam_output.bam (as viewed with the samtools view fq2bam_output.bam command) is as follows:

HWI-D00127:570:HK3TJBCX2:1:1202:9643:76055  99      chr1    10027   26      24M5I86M        =       10178   231     ACCCTAACCCTAACCCTAACCCGACCCCGACCCCGACCCAAACCCAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTAACC     DDDDDHGHIIIIIHIIHHIHHHIHIIIIIIHDHHIHHHIHIHIIIIFHIEHHIIHHIIIIEHIIIIHHIHIIICHE@1FHH?1GEFE1111D11<FH11<FD11<<FFE111<11     MD:Z:22T5T0A4T5T41A27   PG:Z:MarkDuplicatesRG:Z:HK3TJBCX2.1     NM:i:11 AS:i:69 XS:i:72
....

Note

If the fq2bam command is run on a system with too little memory, you will see this message after the initial header:

WARNING
The system has 62 GB, however recommended RAM with 1 GPU is 64 GB.
The run might not finish or might have less than expected performance.