RNA PIPELINE

Run GATK best practices for RNAseq short variant discovery (SNPs + Indels)

The RNA GATK pipeline processes the input FASTQ files and writes its output in VCF format.

[Figure: RNA GATK pipeline diagram (../_images/rna_gatk.png)]

QUICK START

CLI

# The command line below runs the RNA GATK pipeline.
$ pbrun rna_gatk --ref Ref/Homo_sapiens_assembly38.fasta \
                --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
                --genome-lib-dir Ref/ \
                --out-variants output.vcf \
                --out-bam tumor.bam

OPTIONS

--ref

Path to the reference file (default: None)

--in-fq

Paths to the paired-end FASTQ files, followed by an optional read group in quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). Files can be in fastq or fastq.gz format, or Google Cloud Storage objects. If no read group is provided, the pipeline adds one automatically. Example 1: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz . Example 2: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" (default: None)
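
The read group is passed as one quoted string whose fields are separated by the two-character sequence `\t`. The following is a minimal sketch of assembling that string in POSIX shell and previewing the resulting command. The tag values and file names are placeholders, the command is only printed rather than executed, and it is an assumption here that the tool converts each `\t` into a tab when it writes the @RG header.

```shell
# Placeholder tag values; replace with your own.
ID=foo LB=lib1 PL=bar SM=sample PU=unit1

# Inside double quotes the shell keeps "\t" as a literal backslash followed by t.
RG="@RG\tID:${ID}\tLB:${LB}\tPL:${PL}\tSM:${SM}\tPU:${PU}"

# Dry run: print the command instead of executing it.
# %s inserts the string verbatim, so the backslashes are preserved.
printf 'pbrun rna_gatk --in-fq %s %s "%s"\n' \
    sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "$RG"
```

Using `printf '%s'` rather than `echo` avoids shells whose `echo` would expand `\t` into a real tab while previewing.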

--in-se-fq

Path to a single-end FASTQ file, followed by an optional read group in quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). The file can be in fastq or fastq.gz format, or a Google Cloud Storage object. Either every input must have a read group or none may; if none is given, the pipeline adds read groups automatically. This option can be repeated multiple times. Example 1: --in-se-fq sampleX_1.fastq.gz --in-se-fq sampleX_2.fastq.gz . Example 2: --in-se-fq sampleX_1.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" --in-se-fq sampleX_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2" . For the same sample, read groups should have the same sample name (SM) and different ID and PU values (default: None)
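
The rule that files from one sample share SM but differ in ID and PU can be sketched as a small loop that builds one --in-se-fq argument group per file. This is an illustrative dry run in POSIX shell: the file names, the LB and PL values, and the ID/PU naming scheme are all placeholders, and nothing is executed.

```shell
SM=sampleX          # shared sample name (placeholder)
set --              # start with an empty argument list
i=1
for fq in sampleX_1.fastq.gz sampleX_2.fastq.gz; do
    # Same SM for every file; ID and PU differ per file.
    rg="@RG\tID:${SM}_${i}\tLB:lib1\tPL:bar\tSM:${SM}\tPU:unit${i}"
    set -- "$@" --in-se-fq "$fq" "$rg"
    i=$((i + 1))
done
NARGS=$#            # three arguments per input file

# Dry run: show one argument per line rather than executing pbrun.
printf '%s\n' pbrun rna_gatk "$@"
```

Accumulating into the positional parameters keeps each read group as a single argument even though it contains characters a naive string join could mangle.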

--genome-lib-dir

Path to a genome resource library directory. The indexing required to run STAR is assumed to have been completed by the user already. (default: None)

--knownSites

Path to a known indels file. Must be in vcf/vcf.gz format. This option can be used multiple times (default: None)

--interval-file

Path to an interval file with possible formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times (default: None)

--output-dir

Path to the directory that will contain all of the generated files (default: None)

--out-recal-file

Path of report file after Base Quality Score Recalibration. Path can be a Google Cloud Storage object or AWS S3 Storage object (default: None)

--out-bam

Path of output BAM file. Path can be a Google Cloud Storage object or AWS S3 Storage object (default: None)

--out-variants

Path of vcf/g.vcf/gvcf.gz file after variant calling. Path can be a Google Cloud Storage object or AWS S3 Storage object. It can also be a local folder in batch mode (default: None)

--num-cpu-threads

Number of CPU threads to traverse separate chromosomes in splitncigar (default: 4)

--no-ignore-mark

Do not ignore marked reads in sorted output (default: None)

--num-threads

Number of running worker threads per GPU (default: 4)

--out-prefix

Prefix filename for output data (default: None)

--read-group-sm

SM tag for read groups in this run (default: None)

--read-group-lb

LB tag for read groups in this run (default: None)

--read-group-pl

PL tag for read groups in this run (default: None)

--read-group-id-prefix

Prefix for the ID and PU tags of read groups in this run. The prefix is used for every pair of FASTQ files in this run. Each ID and PU tag consists of this prefix plus an identifier that is unique to a pair of FASTQ files (default: None)

--two-pass-mode

2-pass mapping mode. The string can be “None” for 1-pass mapping or “Basic” for basic 2-pass mapping with all 1st pass junctions inserted into the genome indices on the fly (default: Basic)

--read-length

Input read length used to determine sjdbOverhang (default: None)
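
If the read length is not known in advance, it can be measured from the first record of a FASTQ file. The sketch below does that in POSIX shell on a tiny made-up FASTQ so it runs as-is. The "read length minus one" relationship is STAR's general guidance for sjdbOverhang; whether this pipeline applies exactly that formula to --read-length is an assumption.

```shell
# Create a tiny illustrative FASTQ (hypothetical data) so the sketch runs as-is.
fq=$(mktemp)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$fq"

# The sequence is on line 2 of each 4-line FASTQ record.
# For a .gz file, replace `cat` with `gzip -dc`.
READ_LEN=$(cat "$fq" | awk 'NR == 2 { print length($0); exit }')

# STAR's guidance: sjdbOverhang = read length - 1.
OVERHANG=$((READ_LEN - 1))

printf 'read length: %s, implied sjdbOverhang: %s\n' "$READ_LEN" "$OVERHANG"
rm -f "$fq"
```

This checks only the first read; for data with trimmed or variable-length reads, the maximum read length is usually the more relevant figure.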

--haplotypecaller-options

Pass supported haplotype caller options as one string. Currently supported original haplotypecaller options: -min-pruning <int>, -standard-min-confidence-threshold-for-calling <int>, -max-reads-per-alignment-start <int>, -min-dangling-branch-length <int>, -pcr-indel-model <NONE, HOSTILE, AGGRESSIVE, CONSERVATIVE>. e.g. --haplotypecaller-options="-min-pruning 4 -standard-min-confidence-threshold-for-calling 30" (default: None)

--static-quantized-quals

Use static quantized quality scores to a given number of levels. Repeat this option multiple times for multiple bins (default: None)

--gvcf

Generate variant calls in gVCF format (default: None)

--batch

Given an input list of BAMs, run the variant calling of each BAM using one GPU, and process BAMs in parallel based on how many GPUs the system has (default: None)

--disable-read-filter

Disable the read filters for bam entries. Currently supported read filters that can be disabled: MappingQualityAvailableReadFilter, MappingQualityReadFilter, NotSecondaryAlignmentReadFilter, WellformedReadFilter (default: None)

--max-alternate-alleles

Maximum number of alternate alleles to genotype (default: None)

-G, --annotation-group

Which groups of annotations to add to the output variant calls. Currently supported annotation groups: StandardAnnotation, StandardHCAnnotation, AS_StandardAnnotation (default: None)

-GQB, --gvcf-gq-bands

Exclusive upper bounds for reference confidence GQ bands. Must be in the range [1, 100] and specified in increasing order (default: None)
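
The two constraints on GQ bands (each value in [1, 100], values in increasing order) can be checked before launching a run. The helper below is a hypothetical pre-flight check written for illustration, not part of the tool, and it assumes "increasing" means strictly increasing.

```shell
# Return 0 if every band is an integer in [1, 100] and the list is strictly increasing.
validate_gq_bands() {
    prev=0
    for band in "$@"; do
        case $band in
            ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
        esac
        if [ "$band" -lt 1 ] || [ "$band" -gt 100 ]; then
            return 1                   # outside [1, 100]
        fi
        if [ "$band" -le "$prev" ]; then
            return 1                   # not strictly increasing
        fi
        prev=$band
    done
    return 0
}

if validate_gq_bands 10 20 30 60; then echo "bands OK"; fi
if ! validate_gq_bands 20 10; then echo "bands rejected: not increasing"; fi
```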

--rna

Run haplotypecaller optimized for RNA Data (default: None)

--dont-use-soft-clipped-bases

Do not use soft-clipped bases for variant calling (default: None)

--ploidy

Ploidy assumed for the bam file. Currently only haploid (ploidy 1) and diploid (ploidy 2) are supported (default: 2)

-L, --interval

Interval within which to call the variants from the bam/cram file. All intervals will have a padding of 100 to get read records, and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times. e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000" (default: None)
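
To make the effect of the 100-base padding concrete, the sketch below computes the padded range for an interval given in CONTIG:START-END form. It only illustrates the arithmetic: `pad_interval` is a hypothetical helper, and both the symmetric padding and the clamping of the start to position 1 are assumptions rather than documented behavior. It does not clamp the end to the contig length.

```shell
# pad_interval CONTIG:START-END PAD  ->  CONTIG:(START-PAD)-(END+PAD)
pad_interval() {
    contig=${1%:*}        # everything before the last colon
    range=${1##*:}        # START-END after the last colon
    start=${range%%-*}
    end=${range##*-}
    pad=$2
    start=$((start - pad))
    if [ "$start" -lt 1 ]; then
        start=1           # assume 1-based coordinates; clamp at the contig start
    fi
    end=$((end + pad))
    printf '%s:%s-%s\n' "$contig" "$start" "$end"
}

pad_interval chr4:10000-20000 100   # prints chr4:9900-20100
pad_interval chr1:50-500 100        # prints chr1:1-600
```

Two padded ranges on the same contig that overlap would then be treated as one combined region, per the option description.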

-ip, --interval-padding

Amount of padding (in base pairs) to add to each interval you are including (default: None)

--num-gpus NUM_GPUS

Number of GPUs to use for a run. GPUs 0..(NUM_GPUS-1) will be used. If you are using Flexera, please include --gpu-devices too.

--gpu-devices GPU_DEVICES

Which GPU devices to use for a run. By default, all GPU devices will be used. To use specific GPU devices, enter a comma-separated list of GPU device numbers. Possible device numbers can be found by examining the output of the nvidia-smi command. For example, using --gpu-devices 0,1 would only use the first two GPUs.
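
The comma-separated device list can be generated rather than typed by hand. The sketch below joins the first N device indices into that form. `first_n_devices` is a hypothetical helper, and a fixed list of indices stands in for real nvidia-smi output so the example runs on a machine without a GPU.

```shell
# Join the first N device indices (one per line on stdin) into a comma-separated list.
first_n_devices() {
    head -n "$1" | paste -s -d , -
}

# With real hardware the indices could come from:
#   nvidia-smi --query-gpu=index --format=csv,noheader
# Here a fixed list stands in for that output.
DEVICES=$(printf '0\n1\n2\n3\n' | first_n_devices 2)

# Dry run: print the two GPU-related options instead of executing pbrun.
printf -- '--num-gpus 2 --gpu-devices %s\n' "$DEVICES"
```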

--tmp-dir TMP_DIR

Full path to the directory where temporary files will be stored.

--with-petagene-dir WITH_PETAGENE_DIR

Full path to the PetaGene installation directory where bin/ and species/ folders are located.

--keep-tmp

Do not delete the directory storing temporary files after completion.

--license-file LICENSE_FILE

Path to license file license.bin if not in installation directory.

--version

View compatible software versions.