RNA
Spliced Transcripts Alignment to a Reference
QUICK START
$ pbrun rna_fq2bam --in-fq sample_X_1.fq.gz sample_X_2.fq.gz --genome-lib-dir HG38 --output-dir sample_X/
COMPATIBLE CPU COMMAND
The commands below are the CPU counterpart of the Parabricks command above. They produce exactly the same results as the Parabricks command.
# STAR Alignment
$ ./STAR --genomeDir HG38 --readFilesIn sample_X_1.fq.gz sample_X_2.fq.gz \
--outFileNamePrefix sample_X/ --outSAMtype BAM SortedByCoordinate
# Coordinate Sorting
$ gatk SortSam --java-options -Xmx30g --MAX_RECORDS_IN_RAM=5000000 -I=Aligned.sortedByCoord.out.bam \
-O=cpu.bam --SORT_ORDER=coordinate --TMP_DIR=/raid/myrun
# Mark Duplicates
$ gatk MarkDuplicates --java-options -Xmx30g -I=cpu.bam -O=mark_dups_cpu.bam \
-M=metrics.txt --TMP_DIR=/raid/myrun
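Both the Parabricks and the CPU commands above expect HG38 to already contain a STAR index. As a minimal sketch, a pre-flight check could look like the following. The helper name star_index_ready is hypothetical, and the file list is an assumption based on what STAR's genome generation step normally writes; adjust it for your STAR version.

```shell
# Hypothetical helper: succeeds only if DIR looks like a built STAR index.
# Assumption: a usable index contains these four non-empty files.
star_index_ready() {
    idx_dir=$1
    for idx_file in Genome SA SAindex genomeParameters.txt; do
        if [ ! -s "$idx_dir/$idx_file" ]; then
            echo "missing or empty: $idx_dir/$idx_file" >&2
            return 1
        fi
    done
    return 0
}

# Example use, with HG38 as in the quick start (not executed here):
#   star_index_ready HG38 && pbrun rna_fq2bam ...
```

The check only looks at file presence, so it catches an unbuilt or partially copied index but not a version mismatch.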
OPTIONS
- --ref REF
  (required) Path to the reference file
- --genome-lib-dir
  (required) Path to a genome resource library directory. The STAR index must already have been built by the user (default: None)
- --output-dir
  (required) Path to the directory that will contain all of the generated files (default: None)
- --out-bam
  (required) Path of the output BAM file (default: None)
- --in-fq
  Paired-end FASTQ files, in .fastq or .fastq.gz format. Read group information can be provided as an optional third argument.
  Example 1:
  --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz
  Example 2:
  --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1"
  This option can be repeated.
  Example 3:
  --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz --in-fq sampleX_2_1.fastq.gz sampleX_2_2.fastq.gz
  Example 4:
  --in-fq sampleX_1_1.fastq.gz sampleX_1_2.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" \
  --in-fq sampleX_2_1.fastq.gz sampleX_2_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2"
  Either every set of inputs must have a read group or none may have one; if none is given, the pipeline adds read groups automatically. For the same sample, read groups should have the same sample name (SM) and different ID and PU.
- --in-se-fq
  Single-end FASTQ files, in .fastq or .fastq.gz format. Read group information can be provided as an optional second argument.
  Example 1:
  --in-se-fq sampleX.fastq.gz
  Example 2:
  --in-se-fq sampleX.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1"
  This option can be repeated.
  Example 3:
  --in-se-fq sampleX_1.fastq.gz --in-se-fq sampleX_2.fastq.gz
  Example 4:
  --in-se-fq sampleX_1.fastq.gz "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:unit1" \
  --in-se-fq sampleX_2.fastq.gz "@RG\tID:foo2\tLB:lib1\tPL:bar\tSM:sample\tPU:unit2"
  Either every set of inputs must have a read group or none may have one; if none is given, the pipeline adds read groups automatically. For the same sample, read groups should have the same sample name (SM) and different ID and PU.
- --read-group-sm
  SM tag for read groups in this run (default: None)
- --read-group-lb
  LB tag for read groups in this run (default: None)
- --read-group-pl
  PL tag for read groups in this run (default: None)
- --read-group-id-prefix
  Prefix for the ID and PU tags of read groups in this run. This prefix is used for all pairs of FASTQ files in this run. The ID and PU tags consist of this prefix and a unique identifier
- --two-pass-mode
  2-pass mapping mode. The string can be “None” for 1-pass mapping or “Basic” for basic 2-pass mapping with all 1st pass junctions inserted into the genome indices on the fly (default: None)
- --num-sa-bases
  Length (bases) of the SA pre-indexing string. Longer strings use much more memory but allow faster searches. A value between 10 and 15 is recommended. For small genomes, the parameter must be scaled down to min(14, log2(GenomeLength)/2 - 1) (default: 14)
- --max-intron-size
  Maximum intron size in alignments. If this value is 0, the maximum size is determined by (2^winBinNbits)*winAnchorDistNbins (default: 0)
- --min-intron-size
  Minimum intron size in alignments. A genomic gap is considered an intron if its length is greater than or equal to this value; otherwise it is considered a deletion (default: 21)
- --min-match-filter
  Minimum number of matched bases required for alignment output (default: 0)
- --min-match-filter-normalized
  Same as --min-match-filter, but normalized to the read length (sum of mates’ lengths for paired-end reads) (default: 0.66)
- --out-filter-intron-motifs
  Filter alignments by their intron motifs. This string can be “None” for no filtering, “RemoveNoncanonical” for filtering out alignments that contain non-canonical junctions, or “RemoveNoncanonicalUnannotated” for filtering out alignments that contain non-canonical unannotated junctions when using an annotated splice junctions database. Annotated non-canonical junctions are kept (default: None)
- --max-out-filter-mismatch
  Maximum number of mismatches allowed for an alignment to be output (default: 10)
- --max-out-filter-mismatch-ratio
  Maximum ratio of mismatches to mapped length allowed for an alignment to be output (default: 0.3)
- --max-out-filter-multimap
  Maximum number of loci the read is allowed to map to for all alignments to be output. Otherwise, no alignments will be output and the read will be counted as “mapped to too many loci” in the Log.final.out (default: 10)
- --out-reads-unmapped
  Type of output of unmapped and partially mapped (i.e. only one mate of a paired-end read mapped) reads in separate file(s). This string can be “None” for no output or “Fastx” for output in separate fasta/fastq files, Unmapped.out.mate1/2 (default: None)
- --out-sam-unmapped
  Type of output of unmapped reads in the SAM format. The string can be “None” for no output, “Within” to output unmapped reads within the main SAM file, “KeepPairs” for no output and unmapped mates will be recorded for each alignment, or “Within_KeepPairs” to output unmapped reads within the main SAM file and unmapped mates will be recorded for each alignment (default: None)
- --out-sam-attributes
  A string of SAM attributes in the order desired for the output SAM. The string can contain any combination of the following attributes: {NH, HI, AS, nM, NM, MD, jM, jI, XS, MC, ch}. Alternatively, the string can be “None” for no attributes, “Standard” for the attributes {NH, HI, AS, nM}, or “All” for the attributes {NH, HI, AS, nM, NM, MD, jM, jI, MC, ch}.
  e.g. --outSAMattributes NH nM jI XS ch
  (default: Standard)
- --out-sam-strand-field
  Cufflinks-like strand field flag. The string can be “None” for no flag or “intronMotif” for the strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns will be filtered out (default: None)
- --out-sam-mode
  SAM output mode. The string can be “None” for no SAM output, “Full” for full SAM output, or “NoQS” for full SAM output without quality scores (default: Full)
- --out-sam-mapq-unique
  The MAPQ value for unique mappers. Must be in the range [0, 255] (default: 255)
- --read-files-command
  Command line to execute for each of the input files. This command should generate FASTA or FASTQ text and send it to stdout. For example, use zcat to uncompress .gz files or bzcat to uncompress .bz2 files (default: None)
- --min-score-filter
  Minimum score required for alignment output, normalized to the read length (sum of mates’ lengths for paired-end reads) (default: 0.66)
- --min-spliced-mate-length
  Minimum mapped length for a read mate that is spliced, normalized to the mate length. Must be greater than 0 (default: 0.66)
- --max-junction-mismatches
  Maximum number of mismatches for stitching of the splice junctions. A limit must be specified for each of the following: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. To indicate no limit for any of the four options, use -1 (default: [0, -1, 0, 0])
- --max-out-read-size
  Maximum size of the SAM record (bytes) for one read. Recommended value: > 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax. Must be greater than 0 (default: 100000)
- --max-alignments-per-read
  Maximum number of different alignments per read to consider. Must be greater than 0 (default: 10000)
- --score-gap
  Splice junction penalty (independent of intron motif) (default: 0)
- --seed-search-start
  Defines the search start point through the read. The read split pieces will not be longer than this value. Must be greater than 0 (default: 50)
- --max-bam-sort-memory
  Maximum available RAM (bytes) for sorting BAM. If this value is 0, it will be set to the genome index size. Must be greater than or equal to 0 (default: 0)
- --align-ends-type
  Type of read ends alignment. Can be one of two options: “Local” will do a standard local alignment with soft-clipping allowed; “EndToEnd” will force an end-to-end read alignment with no soft-clipping (default: Local)
- --align-insertion-flush
  How to flush ambiguous insertion positions. The string can be “None” to not flush insertions or “Right” to flush insertions to the right (default: None)
- --max-align-mates-gap
  Maximum gap between two mates. If 0, the maximum gap is determined by (2^winBinNbits)*winAnchorDistNbins (default: 0)
- --min-align-spliced-mate-map
  Minimum mapped length for a read mate that is spliced. Must be greater than 0 (default: 0)
- --max-collapsed-junctions
  Maximum number of collapsed junctions. Must be greater than 0 (default: 1000000)
- --min-align-sj-overhang
  Minimum overhang (i.e. block size) for spliced alignments. Must be greater than 0 (default: 5)
- --min-align-sjdb-overhang
  Minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments. Must be greater than 0 (default: 3)
- --sjdb-overhang
  Length of the donor/acceptor sequence on each side of the junctions. Ideally this value should be equal to mate_length - 1. Must be greater than 0 (default: 100)
- --min-chim-overhang
  Minimum overhang for the Chimeric.out.junction file. Must be greater than or equal to 0 (default: 20)
- --min-chim-segment
  Minimum chimeric segment length. If it is set to 0, there will be no chimeric output. Must be greater than or equal to 0 (default: 0)
- --max-chim-multimap
  Maximum number of chimeric multi-alignments. If it is set to 0, the old scheme for chimeric detection, which only considered unique alignments, will be used. Must be greater than or equal to 0 (default: 0)
- --chim-multimap-score-range
  The score range for multi-mapping chimeras below the best chimeric score. This option only works with --max-chim-multimap > 1. Must be greater than or equal to 0 (default: 1)
- --chim-score-non-gtag
  The penalty for a non-GT/AG chimeric junction (default: -1)
- --min-non-chim-score-drop
  To trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value. Must be greater than or equal to 0 (default: 20)
- --out-chim-format
  Formatting type for the Chimeric.out.junction file. Possible types are {0, 1}. If type 0, there will be no comment lines/headers. If type 1, there will be comment lines at the end of the file: command line and Nreads: total, unique, multi (default: 0)
- --out-chim-type
  Type of chimeric output. This string can be “Junctions” for Chimeric.out.junction, “WithinBAM” for main aligned BAM files (Aligned.*.bam), “WithinBAM_HardClip” for hard-clipping in the CIGAR for supplemental chimeric alignments, or “WithinBAM_SoftClip” for soft-clipping in the CIGAR for supplemental chimeric alignments (default: Junctions)
- --out-prefix
  Prefix filename for output data (default: None)
- --num-threads
  Number of running worker threads per GPU (default: 4)
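Several of the options above are defined by a formula rather than a fixed value. The sketch below evaluates three of them: the small-genome limit for --num-sa-bases, the recommended lower bound for --max-out-read-size, and the ideal --sjdb-overhang. The function names are hypothetical, rounding the --num-sa-bases result down to an integer is an assumption, and outFilterMultimapNmax is assumed to correspond to --max-out-filter-multimap.

```shell
# --num-sa-bases: min(14, log2(GenomeLength)/2 - 1).
# Assumption: the fractional result is rounded down to an integer.
# Argument: genome length in bases.
num_sa_bases() {
    awk -v n="$1" 'BEGIN {
        v = int(log(n) / log(2) / 2 - 1)
        if (v > 14) v = 14
        print v
    }'
}

# --max-out-read-size: the chosen value should exceed
# 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax.
# Arguments: mate 1 length, mate 2 length, multimap limit.
min_out_read_size() {
    echo $(( 2 * ($1 + $2 + 100) * $3 ))
}

# --sjdb-overhang: ideally mate_length - 1.
# Reads uncompressed FASTQ on stdin and looks at the first read only,
# so it assumes all reads have the same length.
sjdb_overhang() {
    awk 'NR == 2 { print length($0) - 1; exit }'
}

num_sa_bases 3100000000       # a human-sized genome
num_sa_bases 100000           # a 100 kb genome
min_out_read_size 150 150 10  # 2x150 reads, multimap limit of 10
```

The three calls print 14, 7, and 8000. The last value is far below the default of 100000, so the default only needs raising for much longer reads or a much higher multimap limit.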
- --num-gpus NUM_GPUS
  Number of GPUs to use for a run. GPUs 0..(NUM_GPUS-1) will be used. If you are using Flexera, please include --gpu-devices too.
- --gpu-devices GPU_DEVICES
  Which GPU devices to use for a run. By default, all GPU devices will be used. To use specific GPU devices, enter a comma-separated list of GPU device numbers. Possible device numbers can be found by examining the output of the nvidia-smi command. For example, using --gpu-devices 0,1 would only use the first two GPUs.
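Because --num-gpus N selects GPUs 0..(N-1), the equivalent explicit --gpu-devices value is just that range written as a comma-separated list. A small sketch, using a hypothetical helper name:

```shell
# Hypothetical helper: prints "0,1,...,N-1" for a given GPU count N,
# i.e. the device list that --num-gpus N selects implicitly.
gpu_device_list() {
    gpu_count=$1
    gpu_list=""
    gpu_i=0
    while [ "$gpu_i" -lt "$gpu_count" ]; do
        # Append a comma only when the list is already non-empty.
        gpu_list="${gpu_list:+$gpu_list,}$gpu_i"
        gpu_i=$((gpu_i + 1))
    done
    printf '%s\n' "$gpu_list"
}

# Device numbers on a given machine can be listed with: nvidia-smi -L
gpu_device_list 2
```

Passing the printed value alongside --num-gpus keeps the two options consistent, for example --num-gpus 2 --gpu-devices 0,1.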
- --tmp-dir TMP_DIR
  Full path to the directory where temporary files will be stored.
- --with-petagene-dir WITH_PETAGENE_DIR
  Full path to the PetaGene installation directory where bin/ and species/ folders are located.
- --keep-tmp
  Do not delete the directory storing temporary files after completion.
- --license-file LICENSE_FILE
  Path to license file license.bin if not in installation directory.
- --version
  View compatible software versions.
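The read group rule stated for --in-fq and --in-se-fq (same SM, different ID and PU for inputs from the same sample) is easy to get wrong by hand when a sample spans several lanes. The following sketch generates the arguments for two lanes. The helper name make_rg, the file names, and the tag values are all illustrative assumptions.

```shell
# Hypothetical helper: prints one read-group string in the form used by the
# --in-fq examples. The "\t" separators are emitted as the two literal
# characters backslash and t, as in those examples.
# Arguments: ID, LB, PL, SM, PU.
make_rg() {
    printf '@RG\\tID:%s\\tLB:%s\\tPL:%s\\tSM:%s\\tPU:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Two lanes of the same sample: SM is shared, ID and PU differ per lane.
rg_sample=sampleX
for rg_lane in 1 2; do
    rg=$(make_rg "${rg_sample}.${rg_lane}" lib1 ILLUMINA "$rg_sample" "unit${rg_lane}")
    printf '%s %s %s "%s"\n' "--in-fq" \
        "${rg_sample}_${rg_lane}_1.fastq.gz" \
        "${rg_sample}_${rg_lane}_2.fastq.gz" "$rg"
done
```

Each printed line is one --in-fq group that can be appended to a pbrun rna_fq2bam command line.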
Identifying candidate fusion transcripts
QUICK START
$ pbrun starfusion --chimeric-junction sample_x.out.junction --genome-lib-dir HG38 --output-dir sample_X/
COMPATIBLE CPU COMMAND
The command below is the CPU counterpart of the Parabricks command above. It produces exactly the same results as the Parabricks command.
$ ./STAR-Fusion --chimeric_junction sample_x.out.junction --genome_lib_dir HG38 --output_dir sample_X/
OPTIONS
- --chimeric-junction
  Path to the Chimeric.out.junction file produced by STAR (default: None)
- --genome-lib-dir
  Path to a genome resource library directory. For more information, visit https://github.com/STAR-Fusion/STAR-Fusion/wiki/installing-star-fusion#data_resources_required
- --output-dir
  Path to the directory that will contain all of the generated files (default: None)
- --num-threads
  Number of worker threads (default: 4)
- --out-prefix OUT_PREFIX
  Prefix filename for output data
- --tmp-dir TMP_DIR
  Full path to the directory where temporary files will be stored.
- --with-petagene-dir WITH_PETAGENE_DIR
  Full path to the PetaGene installation directory where bin/ and species/ folders are located.
- --keep-tmp
  Do not delete the directory storing temporary files after completion.
- --license-file LICENSE_FILE
  Path to license file license.bin if not in installation directory.
- --version
  View compatible software versions.
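starfusion consumes the Chimeric.out.junction file, and rna_fq2bam writes no chimeric output while --min-chim-segment is left at its default of 0. A dry-run sketch of chaining the two tools is shown below; it prints the commands rather than running them. It mirrors the two quick-start commands, so any other options your installation requires must be added. The helper name, the file names, the value 12 for --min-chim-segment, and the location of the junction file under the output directory are assumptions, not documented behavior.

```shell
# Dry-run sketch: prints the two commands instead of executing them.
# Assumptions: input file naming, --min-chim-segment 12, and that the
# junction file appears as <output-dir>/Chimeric.out.junction.
# Arguments: sample name, genome library directory, output directory.
print_fusion_pipeline() {
    fp_sample=$1
    fp_genome=$2
    fp_outdir=$3
    echo "pbrun rna_fq2bam --in-fq ${fp_sample}_1.fq.gz ${fp_sample}_2.fq.gz" \
         "--genome-lib-dir $fp_genome --output-dir $fp_outdir" \
         "--min-chim-segment 12 --out-chim-type Junctions"
    echo "pbrun starfusion --chimeric-junction $fp_outdir/Chimeric.out.junction" \
         "--genome-lib-dir $fp_genome --output-dir $fp_outdir"
}

print_fusion_pipeline sample_X HG38 sample_X
```

Check where your run actually writes the junction file before wiring the second command to it, since --out-prefix may change the file name.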