pangenome_germline
Run a GPU-accelerated end-to-end germline pipeline from FASTQ to VCF using pangenome alignment (Giraffe) and Pangenome-aware DeepVariant.
See the pangenome_germline Reference section for a detailed listing of all available options.
The pangenome germline pipeline combines two tools into a single end-to-end workflow:
Giraffe (giraffe, based on vg giraffe + GATK) -- aligns short reads to a pangenome graph, producing a coordinate-sorted, duplicate-marked BAM file.
Pangenome-aware DeepVariant (pangenome_aware_deepvariant) -- calls variants from the aligned BAM using a pangenome-aware model, producing a VCF file.
This pipeline is inspired by the vgteam Giraffe-DeepVariant WDL workflow, but uses Pangenome-aware DeepVariant instead of standard DeepVariant for improved variant calling accuracy in complex and highly variable genomic regions.
The pangenome germline pipeline requires the following input files:
For Giraffe (alignment):
.gbz -- pangenome graph
.dist -- distance index
.min -- minimizer index
.zipcodes -- zipcodes index
.paths.sub -- filtered reference paths
For Pangenome-aware DeepVariant (variant calling):
.fa -- reference FASTA extracted from the graph
.fa.fai -- FASTA index
The reference FASTA (--ref) must be derived from the same pangenome graph
used for alignment (--gbz-name) and must match the contigs listed in the
reference paths file (--ref-paths).
The index files can be generated from a GBZ graph using vg autoindex, and
the reference FASTA can be extracted from the graph using vg paths. The
following example uses the HPRC v1.1 Minigraph-Cactus pangenome graph aligned to
GRCh38:
# Download GBZ
# https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.d9.gbz
aws s3 cp \
s3://human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.d9.gbz \
. \
--no-sign-request
# Extract index files from GBZ
docker run --rm --volume $(pwd):/workdir \
--workdir /workdir \
--user $(id -u):$(id -g) \
quay.io/vgteam/vg:v1.70.0 \
vg autoindex \
-p hprc-v1.1-mc-grch38.d9.autoindex.1.70 \
-G hprc-v1.1-mc-grch38.d9.gbz \
-w giraffe
# Extract paths from GBZ
docker run --rm \
--user $(id -u):$(id -g) \
--volume $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.70.0 \
vg paths -x hprc-v1.1-mc-grch38.d9.gbz \
-L --paths-by GRCh38 > hprc-v1.1-mc-grch38.d9.paths
# Filter paths to keep only chromosomes
grep -v _decoy hprc-v1.1-mc-grch38.d9.paths \
| grep -v _random \
| grep -v chrUn_ \
| grep -v chrEBV \
| grep -v chrM \
| grep -v chain_ > hprc-v1.1-mc-grch38.d9.paths.sub
# Extract the sequences corresponding to the list of paths to a FASTA file
docker run --rm \
--user $(id -u):$(id -g) \
--volume $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.70.0 \
vg paths -x hprc-v1.1-mc-grch38.d9.gbz \
-p hprc-v1.1-mc-grch38.d9.paths.sub \
-F > hprc-v1.1-mc-grch38.d9.fa
# Index the FASTA file
docker run --rm \
--user $(id -u):$(id -g) \
--volume $(pwd):/workdir \
--workdir /workdir \
quay.io/biocontainers/samtools:1.17--hd87286a_2 \
samtools faidx hprc-v1.1-mc-grch38.d9.fa
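After generating the files, a quick name check can catch graph/FASTA mismatches before a long pipeline run. The sketch below is illustrative only: demo.paths.sub and demo.fa.fai are hypothetical stand-ins for the real .paths.sub and .fa.fai files, with made-up contents, and the awk one-liner simply confirms that every listed path name appears as a contig in the FASTA index.

```shell
# Demo stand-ins for the real files (hypothetical contents).
printf 'chr1\nchr2\n' > demo.paths.sub
printf 'chr1\t248387328\t6\t60\t61\nchr2\t242696752\t248391429\t60\t61\n' > demo.fa.fai

# Print any path name that has no matching contig in the .fai.
missing=$(awk 'NR==FNR {fai[$1]=1; next} !($1 in fai)' demo.fa.fai demo.paths.sub)

if [ -z "$missing" ]; then
    echo "OK: all ref-paths contigs are present in the FASTA index"
else
    echo "Missing from FASTA index: $missing" >&2
fi
```

Substituting the real hprc-v1.1-mc-grch38.d9.paths.sub and hprc-v1.1-mc-grch38.d9.fa.fai for the demo files gives a fast, name-only pre-check; the pipeline's own --run-ref-verification performs the authoritative comparison.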
The pipeline provides an optional pre-flight verification step enabled by the
--run-ref-verification flag. When enabled, the pipeline checks the
consistency of the reference FASTA against the GBZ graph before alignment begins.
The verification checks that:
Every contig listed in --ref-paths is present in the reference FASTA.
Every contig listed in --ref-paths is present in the GBZ graph.
The sequences and lengths of those contigs match between the FASTA and the graph.
If the verification fails, the pipeline exits with a descriptive error message
indicating which contigs are missing or mismatched, and how to fix the issue.
Users can omit the --run-ref-verification flag to bypass this check,
but the resulting variants may be incorrect if the reference files are inconsistent.
When --run-ref-verification is not set, the pipeline will still suggest
using the flag if an error occurs during alignment or variant calling, as the error
may be caused by inconsistent reference files.
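A lighter-weight length comparison can also be scripted outside the pipeline. The sketch below is a hedged illustration: it assumes a two-column name/length listing of graph paths (such as one produced by vg paths with its length-listing option; the exact flag and output format should be confirmed against your vg version), and uses small inline demo files rather than real data.

```shell
# Hypothetical graph path lengths in "name<TAB>length" form (demo contents).
printf 'chr1\t248387328\nchr2\t242696752\n' > demo.graph.lengths
# Matching demo FASTA index; column 2 of a .fai is the sequence length.
printf 'chr1\t248387328\t6\t60\t61\nchr2\t242696752\t248391429\t60\t61\n' > demo.fa.fai

# Flag any contig present in both files whose lengths disagree.
awk 'NR==FNR {len[$1]=$2; next}
     ($1 in len) && len[$1] != $2 {print $1": graph="len[$1]" fasta="$2; bad=1}
     END {exit bad}' demo.graph.lengths demo.fa.fai \
  && echo "OK: contig lengths agree"
```

This only compares lengths; --run-ref-verification additionally compares the sequences themselves, so it remains the recommended check.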
Before running the pangenome germline pipeline, ensure you have generated the required files. See the file generation section above for instructions.
# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
--workdir /workdir \
nvcr.io/nvidia/clara/clara-parabricks:4.7.0-1 \
pbrun pangenome_germline \
--ref /workdir/hprc-v1.1-mc-grch38.d9.fa \
--gbz-name /workdir/hprc-v1.1-mc-grch38.d9.gbz \
--dist-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.dist \
--minimizer-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.withzip.min \
--zipcodes-name /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.zipcodes \
--ref-paths /workdir/hprc-v1.1-mc-grch38.d9.paths.sub \
--in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
--out-bam /outputdir/${OUTPUT_BAM} \
--out-variants /outputdir/${OUTPUT_VCF} \
--run-ref-verification
The pangenome germline pipeline inherits performance options from both Giraffe and Pangenome-aware DeepVariant. The key tuning parameters are:
Giraffe (alignment step):
--nstreams: Number of CUDA streams per GPU (default: auto).
--num-cpu-threads-per-gpu: Number of CPU worker threads per GPU (default: 16).
--minimizers-gpu: Enable GPU-accelerated minimizer computation (SE only).
--low-memory: For GPUs with less than 22 GB of device memory.
Pangenome-aware DeepVariant (variant calling step):
--num-streams-per-gpu: Number of streams per GPU for variant calling (default: auto).
--run-partition: More efficiently splits work across multiple GPUs.
For detailed performance guidance on the alignment step, see System Requirements and Useful Options for Performance. For the variant calling step, see Best Performance for DeepVariant.
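As a rough starting point for --num-cpu-threads-per-gpu, one simple heuristic (an assumption of this sketch, not guidance from the tool itself) is to split the host cores evenly across GPUs and cap the result at the documented default of 16:

```shell
# Hypothetical sizing helper: divide host cores across GPUs, capped at the
# documented default of 16 threads per GPU.
NUM_GPUS=${NUM_GPUS:-2}
CORES=$(nproc)
THREADS_PER_GPU=$(( CORES / NUM_GPUS ))
if [ "$THREADS_PER_GPU" -gt 16 ]; then THREADS_PER_GPU=16; fi
if [ "$THREADS_PER_GPU" -lt 1 ]; then THREADS_PER_GPU=1; fi
echo "--num-cpu-threads-per-gpu $THREADS_PER_GPU"
```

The printed flag can be appended to the pbrun pangenome_germline command; benchmarking on your own hardware is the only reliable way to pick final values.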
The commands below are the CPU counterpart of the Parabricks pangenome germline pipeline. The output from these commands will be identical to the output from the above command. See the Output Comparison page for comparing the results.
The index files used below are generated in the file generation section.
# Stage 1: Run vg giraffe and pipe the output to create a sorted BAM.
$ vg giraffe \
-t 16 \
-Z /workdir/hprc-v1.1-mc-grch38.d9.gbz \
-d /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.dist \
-m /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.withzip.min \
-z /workdir/hprc-v1.1-mc-grch38.d9.autoindex.1.70.shortread.zipcodes \
--ref-paths /workdir/hprc-v1.1-mc-grch38.d9.paths.sub \
-f /workdir/${INPUT_FASTQ_1} \
-f /workdir/${INPUT_FASTQ_2} \
--output-format bam | \
gatk SortSam \
--java-options -Xmx30g \
--MAX_RECORDS_IN_RAM 5000000 \
-I /dev/stdin \
-O cpu.bam \
--SORT_ORDER coordinate
# Mark duplicates.
$ gatk MarkDuplicates \
-I cpu.bam \
-O cpu.markdup.bam \
-M metrics.txt
# Stage 2: Run Google pangenome-aware DeepVariant.
sudo docker run \
--volume <INPUT_DIR>:/input \
--volume <OUTPUT_DIR>:/output \
--shm-size 12gb \
google/deepvariant:pangenome_aware_deepvariant-1.9.0 \
/opt/deepvariant/bin/run_pangenome_aware_deepvariant \
--model_type WGS \
--ref /input/hprc-v1.1-mc-grch38.d9.fa \
--reads /input/cpu.markdup.bam \
--pangenome /input/hprc-v1.1-mc-grch38.d9.gbz \
--output_vcf /output/${OUTPUT_VCF} \
--num_shards $(nproc) \
--make_examples_extra_args "ws_use_window_selector_model=true"
The pangenome germline pipeline combines Giraffe and Pangenome-aware DeepVariant. Sources of mismatches between Parabricks and CPU-based tools can arise from either stage. For detailed information, refer to:
Giraffe Source of Mismatches -- differences in the alignment stage (baseline container requirements, unmapped read sorting).
Pangenome-aware DeepVariant Source of Mismatches -- differences in the variant calling stage (CNN inference, read sorting, GBZ reader caching).
Run the germline pipeline from FASTQ to VCF using pangenome alignment (Giraffe) and pangenome-aware DeepVariant. The reference FASTA (--ref) must be the same reference used for the graph (GBZ) and for the ref-paths file.
| Type | Name | Required? | Description |
|---|---|---|---|
| I/O | --ref REF | Yes | Path to the reference FASTA (for DeepVariant). Must match the reference used for the graph and ref-paths. |
| I/O | --in-fq [IN_FQ ...] | No | Path to the pair-ended FASTQ files followed by optional read groups with quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). The files must be in fastq or fastq.gz format. All sets of inputs should have a read group; otherwise, none should have a read group, and it will be automatically added by the pipeline. This option can be repeated multiple times. Example 1: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz --in-fq sampleX_2_1.fastq.gz sampleX_2_2.fastq.gz. Example 2: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" --in-fq sampleX_2_1.fastq.gz sampleX_2_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2". For the same sample, Read Groups should have the same sample name (SM) and a different ID and PU. |
| I/O | --in-se-fq [IN_SE_FQ ...] | No | Path to the single-ended FASTQ file followed by optional read group with quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). The file must be in fastq or fastq.gz format. Either all sets of inputs have a read group, or none should have one, and it will be automatically added by the pipeline. This option can be repeated multiple times. Example 1: --in-se-fq sampleX_1.fastq.gz --in-se-fq sampleX_2.fastq.gz. Example 2: --in-se-fq sampleX_1.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" --in-se-fq sampleX_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2". For the same sample, Read Groups should have the same sample name (SM) and a different ID and PU. |
| I/O | --in-fq-list IN_FQ_LIST | No | Path to a file that contains the locations of pair-ended FASTQ files. Each line: |
| I/O | --in-se-fq-list IN_SE_FQ_LIST | No | Path to a file that contains the locations of single-ended FASTQ files. Each line: |
| I/O | -Z GBZ_NAME, --gbz-name GBZ_NAME | Yes | Map to this GBZ graph. |
| I/O | -d DIST_NAME, --dist-name DIST_NAME | Yes | Cluster using this distance index. |
| I/O | -m MINIMIZER_NAME, --minimizer-name MINIMIZER_NAME | Yes | Use this minimizer index. |
| I/O | -z ZIPCODES_NAME, --zipcodes-name ZIPCODES_NAME | Yes | Use this zipcodes file for clustering. |
| I/O | --ref-paths REF_PATHS | Yes | Path to file containing ordered list of paths in the graph (one per line or HTSlib .dict). Must match contigs in --ref. |
| I/O | --out-bam OUT_BAM | Yes | Path of BAM file for output. |
| I/O | --out-variants OUT_VARIANTS | Yes | Path of the vcf/vcf.gz/gvcf/gvcf.gz file after variant calling. |
| I/O | -L INTERVAL, --interval INTERVAL | No | Interval within which to call variants. This option can be used multiple times. |
| I/O | --interval-file INTERVAL_FILE | No | Path to an interval file (BED format). This option can be used multiple times. |
| I/O | --pb-model-file PB_MODEL_FILE | No | Path to a non-default Parabricks model file for Pangenome-aware DeepVariant. |
| I/O | --run-ref-verification | No | Run the pre-flight reference verification step that checks the FASTA (.fai + sequences) against the GBZ graph. |
| Tool | --read-group READ_GROUP | No | Read group ID for this run. |
| Tool | --sample SAMPLE | No | Sample (SM) tag for read group in this run. |
| Tool | --read-group-library READ_GROUP_LIBRARY | No | Library (LB) tag for read group in this run. |
| Tool | --read-group-platform READ_GROUP_PLATFORM | No | Platform (PL) tag for read group in this run; refers to platform/technology used to produce reads. |
| Tool | --read-group-pu READ_GROUP_PU | No | Platform unit (PU) tag for read group in this run. |
| Tool | --prune-low-cplx | No | Prune short and low complexity anchors during linear format realignment. |
| Tool | --max-fragment-length MAX_FRAGMENT_LENGTH | No | Assume that fragment lengths should be smaller than MAX-FRAGMENT-LENGTH when estimating the fragment length distribution. |
| Tool | --fragment-mean FRAGMENT_MEAN | No | Force the fragment length distribution to have this mean. |
| Tool | --fragment-stdev FRAGMENT_STDEV | No | Force the fragment length distribution to have this standard deviation. |
| Tool | --align-only | No | Generate output BAM after vg-giraffe alignment. The output will not be coordinate sorted. |
| Tool | --copy-comment | No | Append FASTQ comment to BAM output via auxiliary tag. |
| Tool | --no-markdups | No | Do not perform the Mark Duplicates step. Return BAM after sorting. |
| Tool | --markdups-single-ended-start-end | No | Mark duplicates on single-ended reads by 5' and 3' end. |
| Tool | --ignore-rg-markdups-single-ended | No | Ignore read group info in marking duplicates on single-ended reads. This option must be used with --markdups-single-ended-start-end. |
| Tool | --markdups-assume-sortorder-queryname | No | Assume the reads are sorted by queryname for marking duplicates. This will mark secondary, supplementary, and unmapped reads as duplicates as well. This flag will not impact variant calling while increasing processing times. |
| Tool | --markdups-picard-version-2182 | No | Assume marking duplicates to be similar to Picard version 2.18.2. |
| Tool | --optical-duplicate-pixel-distance OPTICAL_DUPLICATE_PIXEL_DISTANCE | No | The maximum offset between two duplicate clusters in order to consider them optical duplicates. Ignored if --out-duplicate-metrics is not passed. |
| Tool | --monitor-usage | No | Monitor approximate CPU utilization and host memory usage during execution. |
| Tool | --max-read-length MAX_READ_LENGTH | No | Maximum read length/size (i.e., sequence length) used for giraffe and filtering FASTQ input. (default: 480) |
| Tool | --min-read-length MIN_READ_LENGTH | No | Minimum read length/size (i.e., sequence length) used for giraffe and filtering FASTQ input. (default: 1) |
| Tool | --disable-use-window-selector-model | No | Change the window selector model from Allele Count Linear to Variant Reads. This option will increase the accuracy and runtime. |
| Tool | --norealign-reads | No | Do not locally realign reads before calling variants. Reads longer than 500 bp are never realigned. |
| Tool | --min-mapping-quality MIN_MAPPING_QUALITY | No | By default, reads with any mapping quality are kept. Setting this field to a positive integer i will only keep reads that have a MAPQ >= i. Note this only applies to aligned reads. |
| Tool | --mode MODE | No | Value can be one of [shortread]. By default, it is shortread. (default: shortread) |
| Tool | --pileup-image-width PILEUP_IMAGE_WIDTH | No | Pileup image width. Only change this if you know your model supports this width. |
| Tool | --no-channel-insert-size | No | If True, don't add the insert_size channel into the pileup image. (default: False) |
| Tool | --sample-name-pangenome SAMPLE_NAME_PANGENOME | No | Sample name to use for pangenome data. (default: pangenome) |
| Tool | --ref-name-pangenome REF_NAME_PANGENOME | No | Reference genome name for pangenome data. (default: GRCh38) |
| Performance | --nstreams NSTREAMS | No | Number of streams per GPU to use; use 'auto' to set from GPU and host memory (may enable low-memory, dozeu/minimizers for SE). An integer overrides. More streams increases device and host memory usage. (default: auto) |
| Performance | --num-cpu-threads-per-gpu NUM_CPU_THREADS_PER_GPU | No | Number of primary CPU threads to use per GPU. (default: 16) |
| Performance | --batch-size BATCH_SIZE | No | Batch size used for processing alignments. (default: 10000) |
| Performance | --write-threads WRITE_THREADS | No | Number of threads used for writing and pre-sorting output. (default: 4) |
| Performance | --gpuwrite | No | Use one GPU to accelerate writing the final BAM/CRAM. |
| Performance | --gpuwrite-deflate-algo GPUWRITE_DEFLATE_ALGO | No | Choose the nvCOMP DEFLATE algorithm to use with --gpuwrite. Note these options do not correspond to CPU DEFLATE options. Valid options are 1, 2, and 4. Option 1 is fastest, while options 2 and 4 have progressively lower throughput but higher compression ratios. The default value is 1 when the user does not provide an input (i.e., None). |
| Performance | --gpusort | No | Use GPUs to accelerate sorting and marking. |
| Performance | --use-gds | No | Use GPUDirect Storage (GDS) to enable a direct data path for direct memory access (DMA) transfers between GPU memory and storage. Must be used concurrently with --gpuwrite. Please refer to Parabricks Documentation > Best Performance for information on how to set up and use GPUDirect Storage. |
| Performance | --memory-limit MEMORY_LIMIT | No | System memory limit in GBs during sorting and postsorting. By default, the limit is half of the total system memory. (default: 62) |
| Performance | --low-memory | No | Use low memory mode; will lower the number of streams per GPU and decrease the batch size. |
| Performance | --minimizers-gpu | No | (SE only) Use GPU for minimizers and seeds. (default: False) |
| Performance | --work-queue-capacity WORK_QUEUE_CAPACITY | No | Soft limit for the capacity of the work queues in between stages. (default: 40) |
| Performance | --num-cpu-threads-per-stream NUM_CPU_THREADS_PER_STREAM | No | Number of CPU threads to use per stream. (default: 6) |
| Performance | --num-streams-per-gpu NUM_STREAMS_PER_GPU | No | Number of streams to use per GPU. The default is 'auto', which will try to use an optimal number of streams based on the GPU. (default: auto) |
| Performance | --run-partition | No | Divide the whole genome into multiple partitions and run multiple processes at the same time, each on one partition. |
| Performance | --gpu-num-per-partition GPU_NUM_PER_PARTITION | No | Number of GPUs to use per partition. |
| Runtime | --verbose | No | Enable verbose output. |
| Runtime | --x3 | No | Show full command line arguments. |
| Runtime | --logfile LOGFILE | No | Path to the log file. If not specified, messages will only be written to the standard error output. |
| Runtime | --tmp-dir TMP_DIR | No | Full path to the directory where temporary files will be stored. (default: .) |
| Runtime | --with-petagene-dir WITH_PETAGENE_DIR | No | Full path to the PetaGene installation directory. By default, this should have been installed at /opt/petagene. Use of this option also requires that the PetaLink library has been preloaded by setting the LD_PRELOAD environment variable. Optionally set the PETASUITE_REFPATH and PGCLOUD_CREDPATH environment variables that are used for data and credentials. Optionally set the PetaLinkMode environment variable that is used to further configure PetaLink, notably setting it to "+write" to enable outputting compressed BAM and .fastq files. |
| Runtime | --keep-tmp | No | Do not delete the directory storing temporary files after completion. |
| Runtime | --no-seccomp-override | No | Do not override seccomp options for docker. |
| Runtime | --version | No | View compatible software versions. |
| Runtime | --preserve-file-symlinks | No | Override default behavior to keep file symlinks intact and not resolve the symlink. |
| Runtime | --num-gpus NUM_GPUS | No | Number of GPUs to use for a run. (default: 1) |