human_par - NVIDIA Docs

Given one or more pairs of FASTQ files, you can run the human_par pipeline workflow to generate a BAM file, a BQSR recalibration report, and variants called with the correct ploidy for the pseudoautosomal regions.

The human_par pipeline resembles the GATK4 best practices pipeline. The inputs are BWA-indexed reference files, paired-end FASTQ files, knownSites for BQSR calculation, and options that determine which sex to use for variant calling. Male samples run Haplotype Caller first on non-X/Y regions and the pseudoautosomal region with ploidy 2, then on X/Y regions without the pseudoautosomal region with ploidy 1. Female samples run Haplotype Caller on all regions with ploidy 2. The outputs of this pipeline are as follows:

Aligned, coordinate-sorted, duplicate-marked BAM
BQSR report
Variants in vcf/g.vcf/g.vcf.gz format

Three options determine the sex of the input sample. The --sample-sex option manually sets the sex to male or female, overriding the sex detected from the number of X and Y reads. Two X/Y ratio range options, --range-male and --range-female, let the pipeline detect the sex automatically from the number of X and Y reads: each gives a range of possible X/Y ratio values for that sex. If the X/Y ratio falls in one of the given ranges, that sex is used for Haplotype Caller; if it falls in neither range, the pipeline relies on the --sample-sex option to continue. At least one of these three options must be provided.
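The decision order described above can be sketched in bash. This is one reading of the paragraph, not the Parabricks implementation: the helper names `in_range` and `resolve_sex` are hypothetical, the ratio is assumed to be an integer, and range bounds are assumed to be inclusive.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of Parabricks: they only illustrate the
# decision order between --sample-sex, --range-male, and --range-female.

# in_range VALUE RANGE
# Succeeds if VALUE lies within "LOW-HIGH" (inclusive bounds are an assumption).
in_range() {
    local value=$1 low=${2%-*} high=${2#*-}
    [ "$value" -ge "$low" ] && [ "$value" -le "$high" ]
}

# resolve_sex RATIO SAMPLE_SEX RANGE_MALE RANGE_FEMALE
# Pass an empty string for any option that was not supplied.
resolve_sex() {
    local ratio=$1 sample_sex=$2 range_male=$3 range_female=$4
    if [ -n "$sample_sex" ]; then
        # A manually set sex overrides detection.
        echo "$sample_sex"
    elif [ -n "$range_male" ] && in_range "$ratio" "$range_male"; then
        echo male
    elif [ -n "$range_female" ] && in_range "$ratio" "$range_female"; then
        echo female
    else
        echo "cannot determine sex: ratio $ratio is in no range and --sample-sex was not given" >&2
        return 1
    fi
}

# Ranges mirror the Quick Start values; the ratio 5 is a made-up example.
resolve_sex 5 "" 1-10 150-250   # prints: male
```

With the same ranges, a ratio such as 50 matches neither range, so the sketch fails unless a sample sex is passed as the second argument.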

Quick Start

Run a human_par pipeline:

$ pbrun human_par \
    --ref Ref/Homo_sapiens_assembly38.fasta \
    --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
    --knownSites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --range-male 1-10 \
    --range-female 150-250 \
    --sample-sex male \
    --out-bam output.bam \
    --out-variants output.vcf \
    --out-recal-file report.txt

Compatible CPU-based BWA-MEM, GATK4 Commands

The commands below are the bwa-0.7.12 and GATK4 counterparts of the Parabricks command above. The output from these commands will be identical to the output from the above command. See the Output Comparison page for comparing the results.

# Run bwa-mem and pipe output to create sorted BAM
$ bwa mem -t 32 -K 10000000 -R '@RG\tID:sample_rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:sample_rg1' \
Ref/Homo_sapiens_assembly38.fasta S1_1.fastq.gz S1_2.fastq.gz | gatk \
SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM=5000000 -I=/dev/stdin \
-O=cpu.bam --SORT_ORDER=coordinate --TMP_DIR=/raid/myrun

# Mark Duplicates
$ gatk MarkDuplicates --java-options -Xmx30g -I=cpu.bam -O=mark_dups_cpu.bam \
-M=metrics.txt --TMP_DIR=/raid/myrun

# Generate BQSR Report
$ gatk BaseRecalibrator --java-options -Xmx30g --input mark_dups_cpu.bam --output \
recal_cpu.txt --known-sites Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
--reference Ref/Homo_sapiens_assembly38.fasta

# Run ApplyBQSR Step
$ gatk ApplyBQSR --java-options -Xmx30g -R Ref/Homo_sapiens_assembly38.fasta \
-I=mark_dups_cpu.bam --bqsr-recal-file=recal_cpu.txt -O=cpu_nodups_BQSR.bam

# Run Haplotype Caller on the non-X/Y regions and the pseudoautosomal region
$ gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \
result_cpu_non_xy.vcf --reference Ref/Homo_sapiens_assembly38.fasta \
-L non_xy_regions_with_par.list --native-pair-hmm-threads 16

# Run Haplotype Caller on the X/Y regions without the pseudoautosomal region
# (for male samples, add --sample-ploidy 1 to this command)
$ gatk HaplotypeCaller --java-options -Xmx30g --input cpu_nodups_BQSR.bam --output \
result_cpu_xy.vcf --reference Ref/Homo_sapiens_assembly38.fasta \
-L xy_regions_without_par.list --native-pair-hmm-threads 16

# Merge the variants from both Haplotype Caller runs
$ gatk MergeVcfs -I result_cpu_non_xy.vcf -I result_cpu_xy.vcf -O result_cpu.vcf
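
The sex-dependent part of the CPU commands above can be sketched as a small bash helper. The function name `xy_ploidy_args` is hypothetical, and the script only prints the command (a dry run) instead of running GATK. It assumes GATK4's HaplotypeCaller argument `--sample-ploidy`, whose default is 2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the extra HaplotypeCaller arguments for the
# X/Y-without-PAR run, depending on the sample sex.
xy_ploidy_args() {
    case $1 in
        male)   echo "--sample-ploidy 1" ;;
        female) : ;;   # nothing to add: the GATK default ploidy of 2 applies
        *)      echo "unknown sex: $1" >&2; return 1 ;;
    esac
}

sex=male   # example value

# Dry run: print the command instead of executing it.
# $(...) is left unquoted on purpose so that it expands to zero or two words.
echo gatk HaplotypeCaller --java-options -Xmx30g \
    --input cpu_nodups_BQSR.bam --output result_cpu_xy.vcf \
    --reference Ref/Homo_sapiens_assembly38.fasta \
    -L xy_regions_without_par.list --native-pair-hmm-threads 16 \
    $(xy_ploidy_args "$sex")
```

Setting `sex=female` prints the same command without the ploidy argument, which matches the female case where every region is called with ploidy 2.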

human_par Reference

Run the germline pipeline from FASTQ to VCF with correct ploidy values for human sex chromosome handling.

Input/Output file options

--ref REF
--in-fq IN_FQ [IN_FQ ...]
--in-se-fq [IN_SE_FQ [IN_SE_FQ ...]]
--knownSites KNOWNSITES
--out-recal-file OUT_RECAL_FILE
--out-bam OUT_BAM
--out-variants OUT_VARIANTS
--out-duplicate-metrics OUT_DUPLICATE_METRICS

Options specific to this tool

-L INTERVAL, --interval INTERVAL
--bwa-options BWA_OPTIONS
--no-warnings
--no-markdups
--fix-mate
--markdups-assume-sortorder-queryname
--markdups-picard-version-2182
--optical-duplicate-pixel-distance OPTICAL_DUPLICATE_PIXEL_DISTANCE
--read-group-sm READ_GROUP_SM
--read-group-lb READ_GROUP_LB
--read-group-pl READ_GROUP_PL
--read-group-id-prefix READ_GROUP_ID_PREFIX
-ip INTERVAL_PADDING, --interval-padding INTERVAL_PADDING
--standalone-bqsr
--haplotypecaller-options HAPLOTYPECALLER_OPTIONS
--static-quantized-quals STATIC_QUANTIZED_QUALS
--gvcf
--batch
--disable-read-filter DISABLE_READ_FILTER
--max-alternate-alleles MAX_ALTERNATE_ALLELES
-G ANNOTATION_GROUP, --annotation-group ANNOTATION_GROUP
-GQB GVCF_GQ_BANDS, --gvcf-gq-bands GVCF_GQ_BANDS
--rna
--dont-use-soft-clipped-bases
--sample-sex SAMPLE_SEX
--range-male RANGE_MALE
--range-female RANGE_FEMALE

Common options

--logfile LOGFILE
--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--license-file LICENSE_FILE
--no-seccomp-override
--version

GPU options

--num-gpus NUM_GPUS
--gpu-devices GPU_DEVICES

Note

The --in-fq option takes the names of two FASTQ files, optionally followed by a quoted read group. The FASTQ filenames must not start with a hyphen.
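
As a rough illustration of this note, the bash sketch below checks the filename rule and assembles the --in-fq arguments with a quoted read group. The function `check_fastq_name` is hypothetical, the file names and read group values are placeholders, and the read group string simply follows the @RG form used in the bwa command on this page; whether human_par expects exactly this form is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical check, not a Parabricks feature:
# reject FASTQ names that start with a hyphen.
check_fastq_name() {
    case $1 in
        -*) echo "FASTQ filename must not start with a hyphen: $1" >&2; return 1 ;;
    esac
}

# Placeholder file names and read group values.
r1=Data/sample_1.fq.gz
r2=Data/sample_2.fq.gz
rg='@RG\tID:rg1\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1'

if check_fastq_name "$r1" && check_fastq_name "$r2"; then
    # An array keeps the read group as one argument, as if it were quoted
    # on the command line.
    in_fq_args=(--in-fq "$r1" "$r2" "$rg")
    # Dry run: print one argument per line instead of calling pbrun.
    printf '%s\n' "${in_fq_args[@]}"
fi
```

A file whose name does start with a hyphen can usually be referenced with a leading `./` (for example `./-reads_1.fq.gz`), which is a general shell convention rather than anything specific to this tool.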

Note

In the values provided to --haplotypecaller-options, --output-mode requires two leading hyphens, while all other values take a single hyphen.
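
The hyphen rule in this note can be checked with a short bash sketch. The function `check_hc_options` is hypothetical, and the option values in the example (`-min-pruning 4`, `--output-mode EMIT_ALL_CONFIDENT_SITES`) are illustrative GATK HaplotypeCaller settings; which options your Parabricks version accepts is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks the hyphen convention for the string passed to
# --haplotypecaller-options. It does not validate the option names themselves.
check_hc_options() {
    local tok toks
    read -ra toks <<< "$1"          # split the option string on whitespace
    for tok in "${toks[@]}"; do
        case $tok in
            --output-mode) ;;       # the one option that takes two hyphens
            --*)           echo "use a single hyphen: $tok" >&2; return 1 ;;
            -output-mode)  echo "use two hyphens: $tok" >&2; return 1 ;;
        esac
    done
}

# Illustrative values only.
hc_options='-min-pruning 4 --output-mode EMIT_ALL_CONFIDENT_SITES'

if check_hc_options "$hc_options"; then
    # Dry run: the whole string is passed as a single argument.
    printf '%s\n' --haplotypecaller-options "$hc_options"
fi
```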