fq2bamfast (BWA-MEM + GATK)

Generates BAM/CRAM output from one or more pairs of FASTQ files, and can optionally generate a BQSR report.

Note

fq2bam will become an alias for fq2bamfast in the next major release.

What is BWA-MEM?

BWA-MEM is a fast, accurate algorithm for mapping DNA sequence reads to a reference genome. It performs local alignment and can produce separate alignments for different parts of the query sequence. It is the default algorithm in Burrows-Wheeler Aligner (BWA) for reads longer than 70 bp and is designed for high-throughput sequencing technologies such as Illumina and Pacific Biosciences.

Why BWA-MEM?

BWA-MEM is capable of handling longer reads and is less sensitive to errors than other alignment algorithms. It is therefore used for a variety of applications, from routine analysis of sequencing data to more advanced applications such as de novo assembly and variant calling.

Some of the advantages of using BWA-MEM over similar tools include:

It is faster than many other alignment algorithms, making it the ideal choice for high-throughput sequencing.
It has a lower false positive rate than many other alignment algorithms, which means fewer false-positive variants are reported.
It is memory-efficient, allowing it to be used on limited resources.
It is highly accurate, with a reported accuracy of over 99% on Illumina data.

What is fq2bamfast?

The tool fq2bamfast is Parabricks' new version of fq2bam's BWA-MEM implementation, optimized for performance. We have kept the same command-line interface, with small changes to support new performance options. Some BWA options may not yet be supported. Generally, fq2bamfast uses more device memory than fq2bam as a trade-off for better performance. If device memory is less than 50GB, you may need to experiment with the --bwa-nstreams or --low-memory options.
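
As a rough illustration of the device-memory guidance above, the sketch below picks a flag from the GPU's total memory. The function name choose_memory_flag is hypothetical, and two choices are assumptions rather than documented behavior: the 50GB threshold is read as 50 * 1024 MiB, and --low-memory is tried before tuning --bwa-nstreams.

```shell
# Sketch only: choose a memory-related fq2bamfast flag from total GPU memory.
# choose_memory_flag is a hypothetical helper, not part of Parabricks.
# Assumption: "less than 50GB" is interpreted as less than 50 * 1024 MiB.

# choose_memory_flag TOTAL_MIB
# Prints "--low-memory" for GPUs under the threshold, nothing otherwise.
choose_memory_flag() {
    total_mib="$1"
    threshold_mib=$((50 * 1024))
    if [ "$total_mib" -lt "$threshold_mib" ]; then
        echo "--low-memory"
    fi
}

# Example of feeding it the first GPU's memory (nvidia-smi reports MiB):
#   total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
#   pbrun fq2bamfast $(choose_memory_flag "$total") --ref ... --in-fq ... --out-bam ...
```

If --low-memory alone is not enough, or costs too much performance, --bwa-nstreams is the other option the text names; suitable values depend on the GPU and are not given here.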

How should I use BWA-MEM in fq2bamfast?

fq2bamfast uses an accelerated version of BWA-MEM to generate BAM/CRAM output given one or more pairs of FASTQ files. You can turn off duplicate marking by adding the --no-markdups option. The BQSR step is performed only if both the --knownSites input option and the --out-recal-file output option are provided; doing so also generates a BQSR report.
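
The option rules above can be sketched as a small dry-run helper that assembles the command line without running it. The function name build_fq2bamfast_cmd and the NO_MARKDUPS variable are hypothetical, introduced only for this illustration; the flags themselves are the ones described in this section. Paths containing spaces are not handled.

```shell
# Sketch only: print (do not run) a pbrun fq2bamfast command.
# build_fq2bamfast_cmd and NO_MARKDUPS are hypothetical names for this example.

# build_fq2bamfast_cmd REF FQ1 FQ2 OUT_BAM [KNOWN_SITES OUT_RECAL]
build_fq2bamfast_cmd() {
    args="--ref $1 --in-fq $2 $3 --out-bam $4"
    # BQSR runs only when both --knownSites and --out-recal-file are given.
    if [ -n "${5:-}" ] && [ -n "${6:-}" ]; then
        args="$args --knownSites $5 --out-recal-file $6"
    fi
    # Set NO_MARKDUPS=1 to skip duplicate marking.
    if [ "${NO_MARKDUPS:-0}" = "1" ]; then
        args="$args --no-markdups"
    fi
    echo "pbrun fq2bamfast $args"
}
```

Called with four arguments, the helper prints an alignment, sort, and duplicate-marking command; adding a known-sites file and a recalibration output path appends the two BQSR options.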

Quick Start

# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1 \
    pbrun fq2bamfast \
    --ref /workdir/${REFERENCE_FILE} \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2}  \
    --knownSites /workdir/${KNOWN_SITES_FILE} \
    --out-bam /outputdir/${OUTPUT_BAM} \
    --out-recal-file /outputdir/${OUTPUT_RECAL_FILE}

Compatible CPU-based BWA-MEM, GATK4 Commands

The commands below are the bwa-0.7.15 and GATK4 counterparts of the Parabricks command above. The output from these commands will be identical to the output from the above command. See the Output Comparison page for comparing the results.

Note

Set --bwa-options="-K 10000000" to produce compatible paired-end results.

# Run bwa-mem and pipe the output to create a sorted BAM.
$ bwa mem \
    -t 32 \
    -K 10000000 \
    -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \
    <INPUT_DIR>/${REFERENCE_FILE} <INPUT_DIR>/${INPUT_FASTQ_1} <INPUT_DIR>/${INPUT_FASTQ_2} | \
  gatk SortSam \
    --java-options -Xmx30g \
    --MAX_RECORDS_IN_RAM 5000000 \
    -I /dev/stdin \
    -O cpu.bam \
    --SORT_ORDER coordinate

# Mark duplicates.
$ gatk MarkDuplicates \
    --java-options -Xmx30g \
    -I cpu.bam \
    -O mark_dups_cpu.bam \
    -M metrics.txt

# Generate a BQSR report.
$ gatk BaseRecalibrator \
    --java-options -Xmx30g \
    --input mark_dups_cpu.bam \
    --output <OUTPUT_DIR>/${OUTPUT_RECAL_FILE} \
    --known-sites <INPUT_DIR>/${KNOWN_SITES_FILE} \
    --reference <INPUT_DIR>/${REFERENCE_FILE}

fq2bamfast Reference

Run GPU-accelerated BWA-MEM, coordinate sorting, duplicate marking, and Base Quality Score Recalibration (BQSR) to convert FASTQ to BAM/CRAM.

Input/Output file options:

--ref REF
--in-fq [IN_FQ ...]
--in-se-fq [IN_SE_FQ ...]
--in-fq-list IN_FQ_LIST
--knownSites KNOWNSITES
--interval-file INTERVAL_FILE
--out-recal-file OUT_RECAL_FILE
--out-bam OUT_BAM
--out-duplicate-metrics OUT_DUPLICATE_METRICS
--out-qc-metrics-dir OUT_QC_METRICS_DIR

Tool Options:

--max-read-length MAX_READ_LENGTH
--min-read-length MIN_READ_LENGTH
-L INTERVAL, --interval INTERVAL
--bwa-options BWA_OPTIONS
--no-warnings
--filter-flag FILTER_FLAG
--skip-multiple-hits
--align-only
--no-markdups
--fix-mate
--markdups-assume-sortorder-queryname
--markdups-picard-version-2182
--monitor-usage
--optical-duplicate-pixel-distance OPTICAL_DUPLICATE_PIXEL_DISTANCE
--read-group-sm READ_GROUP_SM
--read-group-lb READ_GROUP_LB
--read-group-pl READ_GROUP_PL
--read-group-id-prefix READ_GROUP_ID_PREFIX
-ip INTERVAL_PADDING, --interval-padding INTERVAL_PADDING
--standalone-bqsr

Performance Options:

--bwa-nstreams BWA_NSTREAMS
--bwa-cpu-thread-pool BWA_CPU_THREAD_POOL
--gpuwrite
--gpuwrite-deflate-algo GPUWRITE_DEFLATE_ALGO
--gpusort
--use-gds
--memory-limit MEMORY_LIMIT
--low-memory

Common options:

--logfile LOGFILE
--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--no-seccomp-override
--version

GPU options:

--num-gpus NUM_GPUS

Note

The --in-fq option takes the names of two FASTQ files, optionally followed by a quoted read group. The FASTQ filenames must not start with a hyphen.
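
To make the --in-fq rule concrete, the sketch below formats the argument for one FASTQ pair with a quoted read group and rejects filenames that start with a hyphen. The helper name in_fq_arg is hypothetical. The read group fields mirror the -R string in the bwa mem command earlier on this page, and the LB and PL values are placeholders copied from it.

```shell
# Sketch only: format an --in-fq argument with a quoted read group.
# in_fq_arg is a hypothetical helper, not part of Parabricks.

# in_fq_arg FQ1 FQ2 RG_ID SAMPLE
in_fq_arg() {
    for fq in "$1" "$2"; do
        case "$fq" in
            -*)
                echo "FASTQ filename must not start with a hyphen: $fq" >&2
                return 1
                ;;
        esac
    done
    # \\t in the format prints a literal backslash-t, matching how the
    # read group is written for bwa mem -R above.
    printf '%s %s %s "@RG\\tID:%s\\tLB:lib1\\tPL:bar\\tSM:%s\\tPU:%s"\n' \
        --in-fq "$1" "$2" "$3" "$4" "$3"
}
```

The printed text is meant to be pasted into a pbrun fq2bamfast command line; repeat --in-fq once per FASTQ pair when a sample spans several pairs.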

Note

When using the --in-fq-list option, a read group is required on each line of the input file.
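
A list file for --in-fq-list could be generated along the following lines. The per-line layout used here (two FASTQ paths followed by the read group, separated by spaces) is an assumption and should be checked against the option's help text; the helper name append_fq_list_line is hypothetical.

```shell
# Sketch only: append one entry to a file intended for --in-fq-list.
# append_fq_list_line is a hypothetical helper, not part of Parabricks.
# Assumed line layout: FQ1 FQ2 READ_GROUP, separated by single spaces.

# append_fq_list_line LIST_FILE FQ1 FQ2 READ_GROUP
append_fq_list_line() {
    # Every line needs a read group, so refuse to write one without it.
    if [ -z "${4:-}" ]; then
        echo "a read group is required on every line" >&2
        return 1
    fi
    printf '%s %s %s\n' "$2" "$3" "$4" >> "$1"
}

# Example:
#   append_fq_list_line fq_list.txt a_1.fq.gz a_2.fq.gz '@RG\tID:rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:rg1'
#   pbrun fq2bamfast --ref ... --in-fq-list fq_list.txt --out-bam ...
```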