FASTQ AND BAM PROCESSING OVERVIEW

NVIDIA Clara Parabricks Pipelines provides tools that process FASTQ files and refine BAM files.

Here are the articles in this section:

FQ2BAM

Generates BAM/CRAM output from one or more pairs of FASTQ files. Optionally generates a BQSR report.

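fq2bam aligns the reads, sorts the alignments by coordinate, marks duplicates, and optionally generates a BQSR report; the CPU equivalent of each step is listed under COMPATIBLE CPU BASED BWA-MEM, GATK4 COMMANDS. Duplicate marking can be turned off. The BQSR step is performed only if both the --knownSites input option and the --out-recal-file output option are provided.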

QUICK START

$ pbrun fq2bam --ref Ref/Homo_sapiens_assembly38.fasta \
--in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
--knownSites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--out-bam mark_dups_gpu.bam \
--out-recal-file recal_gpu.txt \
--tmp-dir /raid/myrun
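
As noted above, the BQSR step runs only when both --knownSites and --out-recal-file are supplied, and duplicate marking can be turned off. The sketch below illustrates that option logic as a dry run that prints the command rather than executing it. The function name fq2bam_cmd, its three arguments, and the output name out.bam are hypothetical. It also assumes --no-markdups is the option that disables duplicate marking, which is inferred from the option name in the list below.

```shell
# Dry-run sketch: print the fq2bam command that would be run.
# fq2bam_cmd and its arguments are hypothetical names for illustration.
fq2bam_cmd() {
  known_sites=$1   # path to a known-sites VCF, or empty
  recal_out=$2     # path for the BQSR report, or empty
  markdups=$3      # "no" to skip duplicate marking

  cmd="pbrun fq2bam --ref Ref/Homo_sapiens_assembly38.fasta"
  cmd="$cmd --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz"
  cmd="$cmd --out-bam out.bam"

  # Request BQSR only when both the input and the output option are set.
  if [ -n "$known_sites" ] && [ -n "$recal_out" ]; then
    cmd="$cmd --knownSites $known_sites --out-recal-file $recal_out"
  fi

  # Assumption: --no-markdups disables duplicate marking.
  if [ "$markdups" = "no" ]; then
    cmd="$cmd --no-markdups"
  fi

  printf '%s\n' "$cmd"
}

# No duplicate marking and no BQSR report.
fq2bam_cmd "" "" no
```

Printing the command first makes it easy to confirm which steps are requested before starting a long GPU run.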

COMPATIBLE CPU BASED BWA-MEM, GATK4 COMMANDS

The commands below are the bwa-0.7.15 and GATK4 counterparts of the Parabricks command above. They produce exactly the same results as the Parabricks command. See the Output Comparison page for how to compare the results.

# Run bwa-mem and pipe output to create sorted bam
$ bwa mem -t 32 -K 10000000 -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \
Ref/Homo_sapiens_assembly38.fasta Data/sample_1.fq.gz Data/sample_2.fq.gz | gatk \
SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM=5000000 -I=/dev/stdin \
-O=cpu.bam --SORT_ORDER=coordinate --TMP_DIR=/raid/myrun

# Mark Duplicates
$ gatk MarkDuplicates --java-options -Xmx30g -I=cpu.bam -O=mark_dups_cpu.bam \
-M=metrics.txt --TMP_DIR=/raid/myrun

# Generate BQSR Report
$ gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \
recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--reference Ref/Homo_sapiens_assembly38.fasta
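
One quick sanity check on the two BQSR reports is a byte-level comparison. This is a rough sketch and not a substitute for the procedure on the Output Comparison page. The helper name same_report is hypothetical, and the sketch assumes the reports contain no run-specific lines, which this page does not confirm.

```shell
# Sketch: byte-level comparison of two report files.
# same_report is a hypothetical helper name, not a Parabricks command.
same_report() {
  # cmp -s is silent and exits 0 only when the files are byte-identical.
  if cmp -s "$1" "$2"; then
    echo "identical: $1 $2"
  else
    echo "different: $1 $2"
    return 1
  fi
}
```

For the runs above, the call would be: same_report recal_gpu.txt recal_cpu.txt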

OPTIONS

--ref
--in-fq
--in-se-fq
--out-bam
--out-recal-file
--out-duplicate-metrics
--knownSites
--interval-file
--interval
--interval-padding
--no-markdups
--bwa-options
--markdups-assume-sortorder-queryname
--optical-duplicate-pixel-distance
--out-qc-metrics-dir
--read-group-sm
--read-group-lb
--read-group-pl
--read-group-id-prefix
--no-warnings

--num-gpus NUM_GPUS
--gpu-devices GPU_DEVICES

--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--license-file LICENSE_FILE
--version

BQSR

bqsr runs Base Quality Score Recalibration (BQSR) as a standalone step.

QUICK START

$ pbrun bqsr --ref Ref/Homo_sapiens_assembly38.fasta \
--in-bam mark_dups_gpu.bam \
--knownSites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--out-recal-file recal_gpu.txt

COMPATIBLE GATK4 COMMAND

The command below is the GATK4 counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command.

$ gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_gpu.bam --output \
recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--reference Ref/Homo_sapiens_assembly38.fasta

OPTIONS

--ref
--in-bam
--knownSites
--interval-file
--interval
--interval-padding
--out-recal-file

--num-gpus NUM_GPUS
--gpu-devices GPU_DEVICES

--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--license-file LICENSE_FILE
--version

APPLYBQSR

applybqsr updates the Base Quality Scores using the BQSR report.

QUICK START

$ pbrun applybqsr --ref Ref/Homo_sapiens_assembly38.fasta \
--in-bam mark_dups_gpu.bam \
--in-recal-file recal_gpu.txt \
--out-bam S1_updated.bam
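
The report written by bqsr (--out-recal-file) is the report read by applybqsr (--in-recal-file), so the two tools can be chained. The sketch below shows one way to do that. The run and recalibrate helpers and the DRY_RUN variable are hypothetical and not part of Parabricks. With DRY_RUN=1 the commands are printed instead of executed.

```shell
# Sketch: generate a BQSR report, then apply it to the same BAM.
# run(), recalibrate() and DRY_RUN are hypothetical helpers for illustration.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # print the command instead of executing it
  else
    "$@"
  fi
}

recalibrate() {
  ref=$1; bam=$2; known_sites=$3; recal=$4; out_bam=$5

  # Step 1: write the BQSR report. Stop if this step fails.
  run pbrun bqsr --ref "$ref" --in-bam "$bam" \
    --knownSites "$known_sites" --out-recal-file "$recal" || return 1

  # Step 2: use that report to update the base quality scores.
  run pbrun applybqsr --ref "$ref" --in-bam "$bam" \
    --in-recal-file "$recal" --out-bam "$out_bam"
}

# Print the two commands for the files used in the quick starts.
DRY_RUN=1 recalibrate Ref/Homo_sapiens_assembly38.fasta mark_dups_gpu.bam \
  Ref/Homo_sapiens_assembly38.known_indels.vcf.gz recal_gpu.txt S1_updated.bam
```

Stopping after a failed first step keeps applybqsr from running against a missing or partial report.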

COMPATIBLE GATK4 COMMAND

The command below is the GATK4 counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command.

$ gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \
-I=mark_dups_gpu.bam --bqsr-recal-file=recal_cpu.txt -O=S1_updated.bam

OPTIONS

--ref
--in-bam
--in-recal-file
--out-bam
--interval-file
--interval
--interval-padding
--num-threads

--num-gpus NUM_GPUS
--gpu-devices GPU_DEVICES

--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--license-file LICENSE_FILE
--version