fq2bam (FQ2BAM + BWA-MEM)

Generates BAM/CRAM output from one or more pairs of FASTQ files. It can also generate a Base Quality Score Recalibration (BQSR) report.

What is BWA-MEM?

BWA-MEM is a fast, accurate algorithm for mapping DNA sequence reads to a reference genome. It performs local alignment and can produce separate alignments for different parts of a query sequence. It is the generally recommended algorithm in the Burrows-Wheeler Aligner (BWA) for reads longer than 70 bp and is designed for high-throughput sequencing technologies such as Illumina and Pacific Biosciences.

Why BWA-MEM?

BWA-MEM is capable of handling longer reads and is less sensitive to errors than other alignment algorithms. It is therefore used for a variety of applications, from routine analysis of sequencing data to more advanced applications such as de novo assembly and variant calling.

Some of the advantages of using BWA-MEM over similar tools include:

It is faster than many other alignment algorithms, making it the ideal choice for high-throughput sequencing.
It has a lower false positive rate than many other alignment algorithms, which means fewer false-positive variants are reported.
It is memory-efficient, allowing it to be used on limited resources.
It is highly accurate, with a reported accuracy of over 99% on Illumina data.

What is fq2bam?

BWA-MEM can be deployed within Clara Parabricks, a software suite for accelerated secondary analysis in genomics. Parabricks brings industry-standard tools and workflows from CPU to GPU and delivers the same results up to 60x faster. fq2bam is the Parabricks wrapper for BWA-MEM: it sorts the output and can also mark duplicates and recalibrate base quality scores in line with GATK best practices. A 30x whole genome can be run through fq2bam in as little as 17 minutes on an NVIDIA DGX system, compared to 4-9 hours on a CPU instance (m5.24xlarge, 96 vCPUs).

How should I use BWA-MEM in fq2bam?

fq2bam uses an accelerated version of BWA-MEM to generate BAM/CRAM output given one or more pairs of FASTQ files. Duplicates are marked unless the user turns this off by adding the --no-markdups option. The BQSR step is performed only if both the --knownSites input option and the --out-recal-file output option are provided; doing so also generates a BQSR report.
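The option logic described above can be sketched as a small dry-run helper. This is only an illustration: the function name fq2bam_args and all file names (ref.fa, r1.fq.gz, r2.fq.gz, out.bam, recal.txt, known.vcf.gz) are hypothetical placeholders, and the helper prints the command instead of running it.

```shell
# Dry-run sketch: print a pbrun fq2bam command line for a given configuration.
# fq2bam_args and every file name below are hypothetical placeholders.
#   $1: path to a known-sites file, or an empty string to skip BQSR
#   $2: "no" to disable duplicate marking, anything else to keep it
fq2bam_args() {
    known_sites=$1
    mark_dups=$2
    # Options that are always present in this sketch.
    set -- fq2bam --ref ref.fa --in-fq r1.fq.gz r2.fq.gz --out-bam out.bam
    # BQSR needs both --knownSites and --out-recal-file, so add them together.
    if [ -n "$known_sites" ]; then
        set -- "$@" --knownSites "$known_sites" --out-recal-file recal.txt
    fi
    # Duplicate marking is switched off only on request.
    if [ "$mark_dups" = "no" ]; then
        set -- "$@" --no-markdups
    fi
    printf '%s\n' "pbrun $*"
}

# Alignment and sorting only: no duplicate marking, no BQSR report.
fq2bam_args "" no
# Duplicate marking plus a BQSR report.
fq2bam_args known.vcf.gz yes
```

Printing the command first makes it easy to review which steps will run before pasting the options into the docker invocation.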

Quick Start

# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.2-1 \
    pbrun fq2bam \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2}  \
    --knownSites /workdir/${KNOWN_SITES_FILE} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --out-recal-file /outputdir/${OUTPUT_RECAL_FILE}

Compatible CPU-based BWA-MEM, GATK4 Commands

The commands below are the bwa-0.7.15 and GATK4 counterparts of the Parabricks command above. Their output will be identical to the output of the Parabricks command. See the Output Comparison page for how to compare the results.

# Run bwa-mem and pipe the output to create a sorted BAM.
$ bwa mem \
    -t 32 \
    -K 10000000 \
    -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \
    <INPUT_DIR>/${REFERENCE_FILE} <INPUT_DIR>/${INPUT_FASTQ_1} <INPUT_DIR>/${INPUT_FASTQ_2} | \
  gatk SortSam \
    --java-options -Xmx30g \
    --MAX_RECORDS_IN_RAM 5000000 \
    -I /dev/stdin \
    -O cpu.bam \
    --SORT_ORDER coordinate

# Mark duplicates.
$ gatk MarkDuplicates \
    --java-options -Xmx30g \
    -I cpu.bam \
    -O mark_dups_cpu.bam \
    -M metrics.txt

# Generate a BQSR report.
$ gatk BaseRecalibrator \
    --java-options -Xmx30g \
    --input mark_dups_cpu.bam \
    --output <OUTPUT_DIR>/${OUTPUT_RECAL_FILE} \
    --known-sites <INPUT_DIR>/${KNOWN_SITES_FILE} \
    --reference <INPUT_DIR>/${REFERENCE_FILE}

fq2bam Reference

Runs GPU-accelerated BWA-MEM, coordinate sorting, duplicate marking, and Base Quality Score Recalibration (BQSR) to convert FASTQ to BAM/CRAM.

Input/Output file options:

--ref REF
--in-fq [IN_FQ [IN_FQ ...]]
--in-se-fq [IN_SE_FQ [IN_SE_FQ ...]]
--in-fq-list IN_FQ_LIST
--knownSites KNOWNSITES
--interval-file INTERVAL_FILE
--out-recal-file OUT_RECAL_FILE
--out-bam OUT_BAM
--out-duplicate-metrics OUT_DUPLICATE_METRICS
--out-qc-metrics-dir OUT_QC_METRICS_DIR

Tool options:

-L INTERVAL, --interval INTERVAL
--bwa-options BWA_OPTIONS
--no-warnings
--gpuwrite
--gpusort
--use-gds
--memory-limit MEMORY_LIMIT
--low-memory
--filter-flag FILTER_FLAG
--skip-multiple-hits
--min-read-length MIN_READ_LENGTH
--align-only
--num-cpu-threads-per-stage NUM_CPU_THREADS_PER_STAGE
--no-markdups
--fix-mate
--markdups-assume-sortorder-queryname
--markdups-picard-version-2182
--optical-duplicate-pixel-distance OPTICAL_DUPLICATE_PIXEL_DISTANCE
--read-group-sm READ_GROUP_SM
--read-group-lb READ_GROUP_LB
--read-group-pl READ_GROUP_PL
--read-group-id-prefix READ_GROUP_ID_PREFIX
-ip INTERVAL_PADDING, --interval-padding INTERVAL_PADDING
--standalone-bqsr

Common options:

--logfile LOGFILE
--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--no-seccomp-override
--version

GPU options:

--num-gpus NUM_GPUS

Note

The --in-fq option takes the names of two FASTQ files, optionally followed by a quoted read group. The FASTQ filenames must not start with a hyphen.
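As a rough illustration of this note, the sketch below prints one --in-fq argument group and refuses file names that begin with a hyphen. The helper name in_fq_group and the FASTQ file names are hypothetical; the read group string is the one used in the CPU bwa mem command earlier on this page. Checking the argument exactly as typed (not its basename) is an assumption about what the note means.

```shell
# Sketch: print an --in-fq argument group (two FASTQ files plus a quoted
# read group). in_fq_group and the file names are hypothetical placeholders.
# Returns non-zero if either FASTQ argument starts with a hyphen.
in_fq_group() {
    for fq in "$1" "$2"; do
        case "$fq" in
            -*)
                printf 'error: FASTQ name starts with a hyphen: %s\n' "$fq" >&2
                return 1
                ;;
        esac
    done
    # %s keeps the literal \t separators in the read group untouched.
    printf '%s %s %s %s\n' --in-fq "$1" "$2" "'$3'"
}

RG='@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1'
in_fq_group sample_1.fq.gz sample_2.fq.gz "$RG"
```

The printed group can be pasted into a pbrun fq2bam command in place of a plain --in-fq pair. Quoting the read group keeps the shell from treating the backslashes specially.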

Note

When using the --in-fq-list option, a read group is required on each line of the input file.
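A minimal sketch of preparing such a list file follows. It assumes each line holds two FASTQ paths followed by a read group, separated by spaces; that layout is an assumption not stated on this page, so check it against the option's help text. The helper names write_fq_list and check_fq_list, the lane numbers, and the file names are all hypothetical.

```shell
# Sketch: build and sanity-check a file for --in-fq-list.
# ASSUMED line layout: <fastq_1> <fastq_2> <read group>, space-separated.
# Function names, lane numbers, and file names are hypothetical placeholders.
write_fq_list() {
    list_file=$1
    : > "$list_file"                     # start from an empty file
    for lane in 1 2; do
        rg="@RG\tID:sample_rg${lane}\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg${lane}"
        printf '%s %s %s\n' \
            "sample_L${lane}_1.fq.gz" \
            "sample_L${lane}_2.fq.gz" \
            "$rg" >> "$list_file"
    done
}

# Fail if any line lacks a third field that looks like a read group.
check_fq_list() {
    awk 'NF < 3 || $3 !~ /^@RG/ {
             bad = 1
             print "line " NR ": missing read group"
         }
         END { exit bad + 0 }' "$1"
}
```

For example, running write_fq_list fq_list.txt and then check_fq_list fq_list.txt produces a two-line file that passes the check, which could then be supplied as --in-fq-list fq_list.txt.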