RNA PIPELINE
Run GATK best practices for RNAseq short variant discovery (SNPs + Indels)
The RNA GATK pipeline processes the input FASTQ files and writes its output in VCF format.
CLI
# The command line below runs the RNA GATK pipeline.
$ pbrun rna_gatk --ref Ref/Homo_sapiens_assembly38.fasta \
--in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
--genome-lib-dir Ref/ \
--out-variants output.vcf \
--out-bam tumor.bam
- --ref
  Path to the reference file (default: None)
- --in-fq
  Path to the paired-end FASTQ files, followed by an optional read group in quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). Files can be in fastq or fastq.gz format or a Google Cloud Storage object. If no read group is provided, one will be added automatically by the pipeline. Example 1: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz. Example 2: --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" (default: None)
- --in-se-fq
  Path to the single-end FASTQ file, followed by an optional read group in quotes (Example: "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo"). The file can be in fastq or fastq.gz format or a Google Cloud Storage object. Either all sets of inputs have a read group or none do; if none do, read groups are added automatically by the pipeline. This option can be repeated multiple times. Example 1: --in-se-fq sampleX_1.fastq.gz --in-se-fq sampleX_2.fastq.gz. Example 2: --in-se-fq sampleX_1.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" --in-se-fq sampleX_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2". For the same sample, read groups should have the same sample name (SM) and different ID and PU (default: None)
- --genome-lib-dir
  Path to a genome resource library directory. The indexing required to run STAR is assumed to have been completed by the user. (default: None)
- --knownSites
  Path to a known indels file. Must be in vcf/vcf.gz format. This option can be used multiple times (default: None)
- --interval-file
  Path to an interval file with possible formats: Picard-style (.interval_list or .picard), GATK-style (.list or .intervals), or BED file (.bed). This option can be used multiple times (default: None)
- --output-dir
  Path to the directory that will contain all of the generated files (default: None)
- --out-recal-file
  Path of the report file after Base Quality Score Recalibration. Path can be a Google Cloud Storage object or AWS S3 Storage object (default: None)
- --out-bam
  Path of the output BAM file. Path can be a Google Cloud Storage object or AWS S3 Storage object (default: None)
- --out-variants
  Path of the vcf/g.vcf/gvcf.gz file after variant calling. Path can be a Google Cloud Storage object or AWS S3 Storage object. It can also be a local folder in batch mode (default: None)
- --num-cpu-threads
  Number of CPU threads to traverse separate chromosomes in splitncigar (default: 4)
- --no-ignore-mark
  Do not ignore marked reads in sorted output (default: None)
- --num-threads
  Number of running worker threads per GPU (default: 4)
- --out-prefix
  Prefix filename for output data (default: None)
- --read-group-sm
  SM tag for read groups in this run (default: None)
- --read-group-lb
  LB tag for read groups in this run (default: None)
- --read-group-pl
  PL tag for read groups in this run (default: None)
- --read-group-id-prefix
  Prefix for the ID and PU tags for read groups in this run. This prefix will be used for all pairs of FASTQ files in this run. The ID and PU tags will consist of this prefix and an identifier that is unique for each pair of FASTQ files (default: None)
- --two-pass-mode
  2-pass mapping mode. The string can be "None" for 1-pass mapping or "Basic" for basic 2-pass mapping with all 1st pass junctions inserted into the genome indices on the fly (default: Basic)
- --read-length
  Input read length used to determine sjdbOverhang (default: None)
- --haplotypecaller-options
  Pass supported haplotype caller options as one string. Currently supported original haplotypecaller options: -min-pruning <int>, -standard-min-confidence-threshold-for-calling <int>, -max-reads-per-alignment-start <int>, -min-dangling-branch-length <int>, -pcr-indel-model <NONE, HOSTILE, AGGRESSIVE, CONSERVATIVE>. e.g. --haplotypecaller-options="-min-pruning 4 -standard-min-confidence-threshold-for-calling 30" (default: None)
- --static-quantized-quals
  Use static quantized quality scores to a given number of levels. Repeat this option multiple times for multiple bins (default: None)
- --gvcf
  Generate variant calls in gVCF format (default: None)
- --batch
  Given an input list of BAMs, run the variant calling of each BAM using one GPU, and process BAMs in parallel based on how many GPUs the system has (default: None)
- --disable-read-filter
  Disable the read filters for BAM entries. Currently supported read filters that can be disabled: MappingQualityAvailableReadFilter, MappingQualityReadFilter, NotSecondaryAlignmentReadFilter, WellformedReadFilter (default: None)
- --max-alternate-alleles
  Maximum number of alternate alleles to genotype (default: None)
- -G, --annotation-group
  Which groups of annotations to add to the output variant calls. Currently supported annotation groups: StandardAnnotation, StandardHCAnnotation, AS_StandardAnnotation (default: None)
- -GQB, --gvcf-gq-bands
  Exclusive upper bounds for reference confidence GQ bands. Must be in the range [1, 100] and specified in increasing order (default: None)
- --rna
  Run haplotypecaller optimized for RNA data (default: None)
- --dont-use-soft-clipped-bases
  Do not use soft-clipped bases for variant calling (default: None)
- --ploidy
  Ploidy assumed for the BAM file. Currently only haploid (ploidy 1) and diploid (ploidy 2) are supported (default: 2)
- -L, --interval
  Interval within which to call the variants from the BAM/CRAM file. All intervals will have a padding of 100 to get read records, and overlapping intervals will be combined. Interval files should be passed using the --interval-file option. This option can be used multiple times. e.g. "-L chr1 -L chr2:10000 -L chr3:20000+ -L chr4:10000-20000" (default: None)
- -ip, --interval-padding
  Amount of padding (in base pairs) to add to each interval you are including (default: None)
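As an illustration of the read group rules for --in-se-fq, the sketch below builds two read group strings that share a sample name (SM) but differ in ID and PU, then prints the resulting command instead of running it. The helper names make_rg and run, the DRY_RUN variable, and the sample file names are hypothetical and not part of the tool. The sketch also assumes that the read group is passed with literal \t separators (a backslash followed by t), as in the option examples.

```shell
# Build a read group string with literal "\t" separators (backslash + t).
# Arguments: ID LB PL SM PU
make_rg() {
  printf '@RG\\tID:%s\\tLB:%s\\tPL:%s\\tSM:%s\\tPU:%s' "$1" "$2" "$3" "$4" "$5"
}

# Print the command by default; set DRY_RUN=0 to execute it instead.
# Note: the printed form does not preserve quoting, it is for review only.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Same SM for both files, different ID and PU (hypothetical values).
rg1=$(make_rg foo lib1 bar sampleX unit1)
rg2=$(make_rg foo2 lib1 bar sampleX unit2)

run pbrun rna_gatk \
  --ref Ref/Homo_sapiens_assembly38.fasta \
  --in-se-fq sampleX_1.fastq.gz "$rg1" \
  --in-se-fq sampleX_2.fastq.gz "$rg2" \
  --genome-lib-dir Ref/ \
  --out-bam output.bam \
  --out-variants output.vcf
```

Printing first makes it easy to check that every --in-se-fq carries a read group, since the option description requires either all inputs or none to have one.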
- --num-gpus NUM_GPUS
  Number of GPUs to use for a run. GPUs 0..(NUM_GPUS-1) will be used. If you are using Flexera, please include --gpu-devices too.
- --gpu-devices GPU_DEVICES
  Which GPU devices to use for a run. By default, all GPU devices will be used. To use specific GPU devices, enter a comma-separated list of GPU device numbers. Possible device numbers can be found by examining the output of the nvidia-smi command. For example, using --gpu-devices 0,1 would only use the first two GPUs.
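A small sketch of selecting specific GPUs follows. It derives the --num-gpus value from the device list so that the two options stay consistent. Whether the two values must match is an assumption made here, not something the option descriptions state. The helper names count_devices and run and the DEVICES, NUM_GPUS, and DRY_RUN variables are hypothetical.

```shell
# Count the entries in a comma-separated device list, e.g. "0,1" -> 2.
count_devices() {
  old_ifs=$IFS
  IFS=,
  set -- $1          # split the list on commas into positional parameters
  IFS=$old_ifs
  echo "$#"
}

# Print the command by default; set DRY_RUN=0 to execute it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

DEVICES="0,1"        # device numbers as reported by nvidia-smi
NUM_GPUS=$(count_devices "$DEVICES")

run pbrun rna_gatk \
  --ref Ref/Homo_sapiens_assembly38.fasta \
  --in-fq Data/sample_1.fq.gz Data/sample_2.fq.gz \
  --genome-lib-dir Ref/ \
  --out-variants output.vcf \
  --out-bam output.bam \
  --num-gpus "$NUM_GPUS" \
  --gpu-devices "$DEVICES"
```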
- --tmp-dir TMP_DIR
  Full path to the directory where temporary files will be stored.
- --with-petagene-dir WITH_PETAGENE_DIR
  Full path to the PetaGene installation directory where bin/ and species/ folders are located.
- --keep-tmp
  Do not delete the directory storing temporary files after completion.
- --license-file LICENSE_FILE
  Path to license file license.bin if not in installation directory.
- --version
  View compatible software versions.