snpswift
snpswift annotates variants in a VCF file using VCF, GTF, or TSV databases.
$ pbrun snpswift \
--input-vcf input_to_be_annotated.vcf \
--anno-vcf prefix1:source_of_annotations_1.vcf.gz \
--anno-vcf prefix2:source_of_annotations_2.vcf.gz \
--output-vcf the_annotated_version.vcf
If the --ensembl option (see below) is used, snpswift requires that the chromosome names be prefixed with 'chr'. If the chromosome names do not already have a 'chr' prefix, it can be added with the following command:
$ awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' < Homo_sapiens.GRCh38.104.gtf > Homo_sapiens.GRCh38.104.withchr.gtf
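If some chromosome names in the GTF might already carry the prefix, the command above would produce names like 'chrchr1'. The variant below is a minimal sketch that skips such lines; `add_chr_prefix` is a hypothetical helper name and the input rows are toy data, not real Ensembl records.

```shell
# Add 'chr' to every non-header line that does not already start with it.
# Reads the files given as arguments, or standard input if there are none.
add_chr_prefix() {
  awk '/^#/ || /^chr/ { print; next } { print "chr" $0 }' "$@"
}

# Toy GTF input: one header line, one unprefixed row, one prefixed row
printf '#!toy-header\n1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\nchr2\tsrc\tgene\t300\t400\t.\t-\t.\tgene_id "G2";\n' | add_chr_prefix
```

Because already-prefixed lines pass through unchanged, running the helper twice on the same file gives the same result as running it once.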
In cases where there are multiple matches for a query variant in a single annotation database, snpswift will annotate with details from the first match found. For example, given this query variant:
chr2 | 15013 | T | G
and these matches in a single database:
chr2 | 15013 | SNP1186 | T | G | . | . | GENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2
chr2 | 15013 | SNP1186 | T | G | . | . | GENE=EXPSNP_ENST00000450;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2
snpswift uses the information from the first match to produce this annotated variant:
chr2 | 15013 | T | G | GENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2
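The first-match rule can be mimicked outside snpswift to preview which database row would win. The sketch below is an illustration only, not snpswift's implementation. It assumes tab-separated rows in VCF column order and that a match is defined by chromosome, position, ref and alt; `first_match` is a hypothetical helper name.

```shell
# Keep only the first row seen for each (CHROM, POS, REF, ALT) key.
# Columns are assumed to be: CHROM POS ID REF ALT QUAL FILTER INFO
first_match() {
  awk -F'\t' '!seen[$1, $2, $4, $5]++' "$@"
}

# The two database rows from the example; printf reuses the format for each argument
printf 'chr2\t15013\tSNP1186\tT\tG\t.\t.\t%s\n' \
  'GENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2' \
  'GENE=EXPSNP_ENST00000450;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2' | first_match
```

Only the row with GENE=EXPSNP is printed, since it appears first for that key.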
If the input VCF file contains multi-allelic variants, we suggest splitting them with bcftools before using snpswift for optimal annotation coverage:
$ bcftools norm --multiallelics - multiallelic_variants.vcf -o input_to_snpswift.vcf
A TSV file passed with --anno-tsv must have column headers, and the first four columns must contain chromosome, position, ref and alt, in that order. The name of the first column in the header line must start with a '#', for example '#Chrom position ref alt'.
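A quick way to catch a malformed header before running the annotation is sketched below. `check_tsv_header` is a hypothetical helper, not part of Parabricks. It only tests that the first line starts with '#' and has at least four tab-separated columns, and it cannot verify that those columns really hold chromosome, position, ref and alt.

```shell
# Succeed (exit 0) if the first line starts with '#' and has >= 4 tab-separated columns.
# Reads the file given as an argument, or standard input if there is none.
check_tsv_header() {
  awk -F'\t' 'NR == 1 { ok = ($1 ~ /^#/ && NF >= 4); exit } END { exit !ok }' "$@"
}

# Toy TSV with one extra annotation column named "score" (a made-up column)
printf '#Chrom\tposition\tref\talt\tscore\nchr1\t100\tA\tG\t0.5\n' | check_tsv_header \
  && echo "header looks valid"
```

An empty file fails the check as well, because no header line is ever seen.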
Annotate a VCF file using VCF/GTF/TSV databases.
Input/Output file options
- --input-vcf INPUT_VCF
  An input VCF to annotate with VCF and GTF database files. (default: None)
  Option is required.
- --anno-vcf ANNO_VCF
  A prefix and VCF in the format <prefix:/absolute/path/anno.vcf.gz>. INFO fields from <anno.vcf.gz> will be added to the input VCF. This option can be used multiple times. Annotation VCFs must be bgzipped and tabix indexed. (default: None)
- --anno-tsv ANNO_TSV
  A prefix and TSV in the format <prefix:/absolute/path/anno.tsv>. This option can be used multiple times. Annotation TSVs must not be gzipped. This option adds the information in each TSV column to the INFO field of the VCF. (default: None)
- --ensembl ENSEMBL
  A GTF file from ENSEMBL; the Gene Name and Gene ID fields will be added to the input VCF. (default: None)
- --output-vcf OUTPUT_VCF
  Path to the output annotated VCF file. (default: None)
  Option is required.
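The --anno-vcf requirements (a prefix, a bgzipped file, a tabix index) can be checked up front with a small script. The sketch below is a rough pre-flight check under stated assumptions: `check_anno_vcf` is a hypothetical helper, the index is assumed to sit next to the file under tabix's default .tbi name, and the magic-byte test cannot tell bgzip from plain gzip.

```shell
# Validate one --anno-vcf argument of the form prefix:path/to/anno.vcf.gz
check_anno_vcf() {
  spec=$1
  case $spec in
    *:*) ;;
    *) echo "missing 'prefix:' in $spec" >&2; return 1 ;;
  esac
  prefix=${spec%%:*}   # text before the first ':'
  path=${spec#*:}      # text after the first ':'
  [ -n "$prefix" ] || { echo "empty prefix in $spec" >&2; return 1; }
  [ -f "$path" ] || { echo "no such file: $path" >&2; return 1; }
  # gzip and bgzip files both start with the bytes 1f 8b
  magic=$(head -c 2 "$path" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b" ] || { echo "not gzip-compressed: $path" >&2; return 1; }
  # Assumes the default tabix index name
  [ -f "$path.tbi" ] || { echo "no .tbi index next to $path" >&2; return 1; }
}
```

A call such as `check_anno_vcf prefix1:source_of_annotations_1.vcf.gz` returns non-zero and prints the reason on standard error when any of the checks fails.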
Tool Options:
- --num-threads NUM_THREADS
  Number of worker threads to run for VCF annotation. (default: 8)
Common options:
- --logfile LOGFILE
  Path to the log file. If not specified, messages will only be written to the standard error output. (default: None)
- --tmp-dir TMP_DIR
  Full path to the directory where temporary files will be stored.
- --with-petagene-dir WITH_PETAGENE_DIR
  Full path to the PetaGene installation directory. By default, this should have been installed at /opt/petagene. Use of this option also requires that the PetaLink library has been preloaded by setting the LD_PRELOAD environment variable. Optionally set the PETASUITE_REFPATH and PGCLOUD_CREDPATH environment variables that are used for data and credentials. (default: None)
- --keep-tmp
  Do not delete the directory storing temporary files after completion.
- --license-file LICENSE_FILE
  Path to license file license.bin if not in the installation directory.
- --no-seccomp-override
  Do not override seccomp options for docker. (default: None)
- --version
  View compatible software versions.