giraffe (vg giraffe + GATK)
Note that the Parabricks GPU-accelerated Giraffe tool is currently in beta.
Generate BAM output given one or a pair of FASTQ files using the pangenome aligner VG Giraffe [1] [2].
See the giraffe Reference section for a detailed listing of all available options.
VG Giraffe is a short-read mapping tool developed by Dr. Benedict Paten's lab at the University of California, Santa Cruz (UCSC). This innovative tool can align reads to a graph representation of multiple reference genomes, enhancing the quality of downstream analyses. By accurately mapping reads to thousands of genomes simultaneously, VG Giraffe offers a substantial improvement over traditional single-reference aligners.
By utilizing a graph-based approach, VG Giraffe can more effectively handle genetic diversity and structural variations across populations. Here are three key benefits of using VG Giraffe:
- Improved accuracy: VG Giraffe achieves higher precision and recall in read mapping compared to linear genome aligners, especially when dealing with complex genomic regions or populations with significant genetic diversity.
- Reduced reference bias (or mapping bias): By incorporating multiple haplotypes and known variants into its graph structure, VG Giraffe minimizes the reference bias inherent in traditional linear genome aligners. This leads to a more comprehensive and unbiased characterization of genetic variation, especially for samples that diverge significantly from the standard reference genome.
- Faster performance: Despite working with more complex graph structures, VG Giraffe is significantly faster than its predecessor VG Map and comparable in speed to popular linear genome mappers. It can map sequencing reads to thousands of human genomes at a speed similar to methods that map to a single reference genome.
VG Giraffe can be used within Parabricks, a software suite designed for accelerated secondary analysis in genomics. Our wrapper (pbrun giraffe) will run our GPU-accelerated VG Giraffe and sort the output BAM by coordinate.
While users can build custom reference graphs for VG Giraffe using the VG Autoindex tool, pre-built pangenome graphs are also available. Dr. Paten's lab and the Human Pangenome Consortium have made these resources publicly accessible, allowing researchers to leverage high-quality, ready-to-use pangenome graphs for their analyses (HPRC data).
The set of index files used in the test below can be downloaded using the AWS CLI as follows:
aws s3 cp \
s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/filtered . \
--no-sign-request --recursive --exclude "*" --include "hprc-v1.0-mc-grch38-minaf.0.1.*"
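After the download completes, you can confirm that the expected index files are present (a simple check; the extensions listed match the files used in the commands below, though the set on S3 may include additional files):
# Confirm the five index files were downloaded
ls -1 hprc-v1.0-mc-grch38-minaf.0.1.{dist,min,xg,gg,gbwt}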
Then, to generate the required .gbz file, run the following from the directory containing the index files:
docker run --rm \
-v $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.59.0 \
vg gbwt --gbz-format \
-g hprc-v1.0-mc-grch38-minaf.0.1.gbz \
-I hprc-v1.0-mc-grch38-minaf.0.1.gg hprc-v1.0-mc-grch38-minaf.0.1.gbwt
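As an optional sanity check (a sketch; it assumes vg paths accepts GBZ input via -x, as it does in recent vg releases), you can list a few of the paths stored in the newly created file:
# List the first few paths embedded in the new GBZ
docker run --rm \
    -v $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.gbz -L | head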
At runtime, index data is loaded into GPU device memory. Due to the index's size, this can constrain available memory for other operations. The following options impact device memory usage and performance:
- For 16GB devices (e.g., T4): use the --low-memory option.
- For 16GB-40GB devices (e.g., L4, A10): optimize performance by adjusting:
  - --nstreams: controls the number of CUDA streams per GPU.
  - --batch-size: adjusts the number of reads processed in a batch.
  For L4, best performance was obtained using --nstreams 2 --batch-size 10000 (see the sketch after the note below).
- For >40GB devices: default parameters are sufficient; however, there is the potential for further optimization by adjusting the aforementioned parameters.
Note: While a fixed base memory allocation exists per device, the number of streams and batch size are the primary factors affecting total device memory consumption.
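As a quick sketch of where these tuning flags fit (illustrative only; all other options are identical to the full run command below):
# Memory tuning sketch: append the tuning flags to the pbrun giraffe invocation below.
#   16GB GPU (e.g., T4):       pbrun giraffe --low-memory <all other options as below>
#   16GB-40GB GPU (e.g., L4):  pbrun giraffe --nstreams 2 --batch-size 10000 <all other options as below>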
# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1 \
pbrun giraffe --read-group "sample_rg1" \
--sample "sample-name" --read-group-library "library" \
--read-group-platform "platform" --read-group-pu "pu" \
--dist-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.dist \
--minimizer-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.min \
--gbz-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.gbz \
--xg-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.xg \
--graph-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.gg \
--gbwt-name /workdir/hprc-v1.0-mc-grch38-minaf.0.1.gbwt \
--in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
--out-bam /outputdir/${OUTPUT_BAM}
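To confirm that the read group and sample tags were applied, you can inspect the output BAM header (a quick check; it assumes samtools is installed on the host and is run from OUTPUT_DIR):
# The @RG line should carry the ID/SM/LB/PL/PU values passed above
samtools view -H ${OUTPUT_BAM} | grep "^@RG"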
To use Giraffe-aligned BAM files for variant calling, you need to extract the appropriate reference file from the Giraffe index files. Run the following commands from the directory containing the Giraffe index files:
# Extract the list of paths corresponding to GRCh38
docker run --rm \
-v $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.59.0 \
vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -L > hprc-v1.0-mc-grch38-minaf.0.1.paths
# Filter paths list
grep -v _decoy hprc-v1.0-mc-grch38-minaf.0.1.paths \
| grep -v _random \
| grep -v chrUn_ \
| grep -v chrEBV \
| grep -v chrM \
| grep -v chain_ > hprc-v1.0-mc-grch38-minaf.0.1.paths.sub
# Extract the corresponding sequences to a FASTA file
docker run --rm \
-v $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.59.0 \
vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -p hprc-v1.0-mc-grch38-minaf.0.1.paths.sub -F > hprc-v1.0-mc-grch38-minaf.0.1.fa
# Index the fasta file
samtools faidx hprc-v1.0-mc-grch38-minaf.0.1.fa
These commands will generate a FASTA file (hprc-v1.0-mc-grch38-minaf.0.1.fa) and the corresponding index (hprc-v1.0-mc-grch38-minaf.0.1.fa.fai), which can be used as the reference for variant calling. Note that these files can also be used for BQSR (bqsr).
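As a quick verification that the filtering worked (the FASTA should list only primary GRCh38 contigs such as chr1-chr22, chrX, and chrY):
# Inspect the sequence names remaining after filtering
grep "^>" hprc-v1.0-mc-grch38-minaf.0.1.fa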
Once you have the Giraffe-aligned BAM file and the extracted reference FASTA, you can proceed with variant calling using DeepVariant or HaplotypeCaller.
# DeepVariant
# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1 \
pbrun deepvariant \
--ref /workdir/hprc-v1.0-mc-grch38-minaf.0.1.fa \
--in-bam /workdir/${INPUT_BAM} \
--out-variants /outputdir/${OUTPUT_VCF}
# HaplotypeCaller
# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
--workdir /workdir \
nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1 \
pbrun haplotypecaller \
--ref /workdir/hprc-v1.0-mc-grch38-minaf.0.1.fa \
--in-bam /workdir/${INPUT_BAM} \
--in-recal-file /workdir/${INPUT_RECAL_FILE} \
--out-variants /outputdir/${OUTPUT_VCF}
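The ${INPUT_RECAL_FILE} consumed above is produced by BQSR. A minimal sketch using the extracted reference follows; the ${KNOWN_SITES_VCF} name is a placeholder for your own known-sites VCF:
# BQSR: generates the recalibration report used by haplotypecaller above
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.4.0-1 \
    pbrun bqsr \
    --ref /workdir/hprc-v1.0-mc-grch38-minaf.0.1.fa \
    --in-bam /workdir/${INPUT_BAM} \
    --knownSites /workdir/${KNOWN_SITES_VCF} \
    --out-recal-file /outputdir/${INPUT_RECAL_FILE}
# Point haplotypecaller's --in-recal-file at this output (or move it into INPUT_DIR).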
For more detailed instructions on variant calling, please refer to the tool-specific documentation (deepvariant, haplotypecaller).
The commands below are the vg-1.59.0 and GATK4 counterparts of the Parabricks command above. The output from these commands will be identical to the output from the above command, up to the small differences described below. See the Output Comparison page for comparing the results.
# Run giraffe and pipe the output to create a sorted BAM.
$ vg giraffe \
-t 16 \
-d /workdir/hprc-v1.0-mc-grch38-minaf.0.1.dist \
-m /workdir/hprc-v1.0-mc-grch38-minaf.0.1.min \
-x /workdir/hprc-v1.0-mc-grch38-minaf.0.1.xg \
-g /workdir/hprc-v1.0-mc-grch38-minaf.0.1.gg \
-H /workdir/hprc-v1.0-mc-grch38-minaf.0.1.gbwt \
-f /workdir/${INPUT_FASTQ_1} \
-f /workdir/${INPUT_FASTQ_2} \
--output-format bam | \
gatk SortSam \
--java-options -Xmx30g \
--MAX_RECORDS_IN_RAM 5000000 \
-I /dev/stdin \
-O cpu.bam \
--SORT_ORDER coordinate
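For a quick first-pass comparison before consulting the Output Comparison page, the flag statistics of the two BAMs should agree (assumes samtools is installed; cpu.bam is the output of the baseline command above):
# Rough equivalence check between the Parabricks and baseline outputs
samtools flagstat ${OUTPUT_BAM}
samtools flagstat cpu.bam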
When comparing output with the CPU counterpart, the following can be sources of small differences.
Baseline VG Container
When comparing output between baseline giraffe and Parabricks' accelerated version, if you intend to use the baseline vg container (quay.io/vgteam/vg:v1.59.0), you will need to rebuild the container with an Ubuntu 22.04 base. This is because of changes in the C++ standard library for the default gcc version of the underlying OS (#4391). Modify line 6 of their Dockerfile to reference mirror.gcr.io/library/ubuntu:22.04 instead of 20.04, then rebuild the container with the following commands.
git clone https://github.com/vgteam/vg.git
cd vg
git checkout v1.59.0
git submodule update --init --recursive
make version
docker build --no-cache -f Dockerfile --build-arg THREADS=16 --tag \
<YOUR_CONTAINER_NAME> --network host ./
Unmapped reads
Parabricks giraffe sorts unmapped reads slightly differently than baseline GATK SortSam. Unmapped reads can be filtered with samtools by running samtools view -F 4.
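For example (a sketch; file names are placeholders), to keep only mapped reads before diffing the two outputs:
# -F 4 excludes reads with the "unmapped" flag set; -b writes BAM output
samtools view -b -F 4 ${OUTPUT_BAM} -o mapped_only.bam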
Align reads to a pangenome graph.
Input/Output file options
- --in-fq [IN_FQ ...]
  Path to the paired-end FASTQ files. The files must be in fastq or fastq.gz format. Example 1: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz. (default: None)
- --in-se-fq [IN_SE_FQ ...]
  Path to the single-end FASTQ file. The file must be in fastq or fastq.gz format. (default: None)
- -d DIST_NAME, --dist-name DIST_NAME
  Cluster using this distance index. (default: None)
  Option is required.
- -m MINIMIZER_NAME, --minimizer-name MINIMIZER_NAME
  Use this minimizer index. (default: None)
  Option is required.
- -Z GBZ_NAME, --gbz-name GBZ_NAME
  Map to this GBZ graph. (default: None)
  Option is required.
- -x XG_NAME, --xg-name XG_NAME
  XG graph used for BAM output. (default: None)
  Option is required.
- -g GRAPH_NAME, --graph-name GRAPH_NAME
  GBWTGraph used for mapping. (default: None)
  Option is required.
- -H GBWT_NAME, --gbwt-name GBWT_NAME
  GBWT index for mapping. (default: None)
  Option is required.
- --out-bam OUT_BAM
  Path of a BAM file for output. (default: None)
  Option is required.
Tool Options:
- --read-group READ_GROUP
  Read group ID for this run. (default: None)
  Option is required.
- --sample SAMPLE
  Sample (SM) tag for the read group in this run. (default: None)
  Option is required.
- --read-group-library READ_GROUP_LIBRARY
  Library (LB) tag for the read group in this run. (default: None)
- --read-group-platform READ_GROUP_PLATFORM
  Platform (PL) tag for the read group in this run; refers to the platform/technology used to produce the reads. (default: None)
- --read-group-pu READ_GROUP_PU
  Platform unit (PU) tag for the read group in this run. (default: None)
- --prune-low-cplx
  Prune short and low-complexity anchors during linear format realignment. (default: None)
- --max-fragment-length MAX_FRAGMENT_LENGTH
  Assume that fragment lengths should be smaller than INT when estimating the fragment length distribution. (default: None)
- --fragment-mean FRAGMENT_MEAN
  Force the fragment length distribution to have this mean. (default: None)
- --fragment-stdev FRAGMENT_STDEV
  Force the fragment length distribution to have this standard deviation. (default: None)
- --align-only
  Generate the output BAM after vg-giraffe alignment. The output will not be coordinate sorted. (default: None)
- --monitor-usage
  Monitor approximate CPU utilization and host memory usage during execution (available during sort and postsort). (default: None)
- --max-read-length MAX_READ_LENGTH
  Maximum read length/size (i.e., sequence length) used for giraffe and for filtering FASTQ input. (default: 480)
- --min-read-length MIN_READ_LENGTH
  Minimum read length/size (i.e., sequence length) used for giraffe and for filtering FASTQ input. (default: 1)
Performance Options:
- --nstreams NSTREAMS
  Number of streams per GPU to use; note that more streams increases device memory usage. (default: 3)
- --num-primary-cpus-per-gpu NUM_PRIMARY_CPUS_PER_GPU
  Number of primary CPU threads per GPU driving its associated thread pool. (default: 16)
- --cpu-thread-pool CPU_THREAD_POOL
  Number of processing threads per primary CPU thread. (default: 1)
- --batch-size BATCH_SIZE
  Batch size used for processing alignments. (default: 10000)
- --write-threads WRITE_THREADS
  Number of threads used for writing and pre-sorting output. (default: 4)
- --gpuwrite
  Use one GPU to accelerate writing the final BAM/CRAM. (default: None)
- --gpuwrite-deflate-algo GPUWRITE_DEFLATE_ALGO
  Choose the nvCOMP DEFLATE algorithm to use with --gpuwrite. Note that these options do not correspond to CPU DEFLATE options. Valid options are 1, 2, and 4. Option 1 is fastest, while options 2 and 4 have progressively lower throughput but higher compression ratios. The default value is 1 when the user does not provide an input. (default: None)
- --gpusort
  Use GPUs to accelerate sorting and marking. (default: None)
- --use-gds
  Use GPUDirect Storage (GDS) to enable a direct data path for direct memory access (DMA) transfers between GPU memory and storage. Must be used concurrently with --gpuwrite. Please refer to Parabricks Documentation > Best Performance for information on how to set up and use GPUDirect Storage. (default: None)
- --memory-limit MEMORY_LIMIT
  System memory limit in GB during sorting and postsorting. By default, the limit is half of the total system memory. (default: 62)
- --low-memory
  Use low-memory mode; this will lower the number of streams per GPU and decrease the batch size. (default: None)
Common options:
- --logfile LOGFILE
  Path to the log file. If not specified, messages will only be written to the standard error output. (default: None)
- --tmp-dir TMP_DIR
  Full path to the directory where temporary files will be stored.
- --with-petagene-dir WITH_PETAGENE_DIR
  Full path to the PetaGene installation directory. By default, this should have been installed at /opt/petagene. Use of this option also requires that the PetaLink library has been preloaded by setting the LD_PRELOAD environment variable. Optionally set the PETASUITE_REFPATH and PGCLOUD_CREDPATH environment variables that are used for data and credentials. (default: None)
- --keep-tmp
  Do not delete the directory storing temporary files after completion.
- --no-seccomp-override
  Do not override seccomp options for docker. (default: None)
- --version
  View compatible software versions.
GPU options:
- --num-gpus NUM_GPUS
  Number of GPUs to use for a run. GPUs 0..(NUM_GPUS-1) will be used.
[1] Jouni Sirén et al., Pangenomics enables genotyping of known structural variants in 5,202 diverse genomes. Science 374, abg8871 (2021). DOI: 10.1126/science.abg8871
[2] Baseline VG Giraffe: https://github.com/vgteam/vg