snpswift

snpswift annotates variants in a VCF file using VCF, GTF, or TSV annotation databases.

Quick Start

$ pbrun snpswift \
    --input-vcf input_to_be_annotated.vcf \
    --anno-vcf prefix1:source_of_annotations_1.vcf.gz \
    --anno-vcf prefix2:source_of_annotations_2.vcf.gz \
    --output-vcf the_annotated_version.vcf

If the --ensembl option (see below) is used, snpswift requires that the chromosome names be prefixed with 'chr'. If the chromosome names do not already have a 'chr' prefix it can be added with the following command:

$ awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' < Homo_sapiens.GRCh38.104.gtf > Homo_sapiens.GRCh38.104.withchr.gtf
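Before rewriting a GTF, it can help to check whether its records already use 'chr' names, so the prefix is not added twice. The following is a minimal sketch of that check, not part of snpswift: the scratch directory, the file names, and the two GTF records are made-up sample data.

```shell
# Work in a scratch directory; the two-record GTF below is made-up sample data.
workdir=$(mktemp -d)
gtf="$workdir/sample.gtf"
printf '#!genome-build GRCh38\n' > "$gtf"
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1"; gene_name "N1";\n' >> "$gtf"
printf '2\thavana\tgene\t38814\t46870\t.\t-\t.\tgene_id "G2"; gene_name "N2";\n' >> "$gtf"

# Count non-header records whose first column lacks a 'chr' prefix.
missing=$(awk '!/^#/ && $1 !~ /^chr/ {n++} END {print n+0}' "$gtf")

if [ "$missing" -gt 0 ]; then
    # Same awk rewrite as above: prefix data lines, pass header lines through.
    awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' < "$gtf" > "$workdir/sample.withchr.gtf"
else
    cp "$gtf" "$workdir/sample.withchr.gtf"
fi

# Comma-separated chromosome names of the data records after the rewrite.
fixed_names=$(awk '!/^#/ {printf "%s%s", sep, $1; sep=","} END {print ""}' "$workdir/sample.withchr.gtf")
echo "$fixed_names"
```

Note that the check treats the file as all-or-nothing: if only some records lack the prefix, the rewrite would double-prefix the others, so a mixed file needs to be handled separately.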

In cases where there are multiple matches for a query variant in a single annotation database, snpswift will annotate with details from the first match found. For example, given this query variant:

chr2    15013    T    G

and these matches in a single database:

chr2    15013    SNP1186    T    G    .    .    GENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2
chr2    15013    SNP1186    T    G    .    .    GENE=EXPSNP_ENST00000450;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2

snpswift uses the information from the first match to produce this annotated record:

chr2    15013    T    G    GENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2
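The first-match behavior described above can be modeled with a short awk filter. This is only an illustration of the documented outcome, not snpswift's implementation. It assumes that "first" means file order and that a match is keyed on chromosome, position, ref, and alt. The scratch file name is hypothetical.

```shell
# Two database records for the same variant, in file order (tab-separated,
# VCF column layout: CHROM POS ID REF ALT QUAL FILTER INFO).
workdir=$(mktemp -d)
db="$workdir/db.txt"
printf 'chr2\t15013\tSNP1186\tT\tG\t.\t.\tGENE=EXPSNP;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2\n' > "$db"
printf 'chr2\t15013\tSNP1186\tT\tG\t.\t.\tGENE=EXPSNP_ENST00000450;STRAND=+;LEGACY_ID=EXID2045;SNP;CNT=2\n' >> "$db"

# Keep only the first record seen for each (CHROM, POS, REF, ALT) key,
# then print the INFO column of the surviving record.
first_info=$(awk -F'\t' '!seen[$1 FS $2 FS $4 FS $5]++ {print $8}' "$db")
echo "$first_info"
```

Running the same filter over an annotation database before use is one way to see which records would be shadowed by an earlier duplicate.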

If the input VCF file contains multi-allelic variants we suggest splitting the multi-allelics with bcftools before using snpswift for optimal annotation coverage:

$ bcftools norm --multiallelics - multiallelic_variants.vcf -o input_to_snpswift.vcf
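To decide whether splitting is needed at all, one option is to count the records whose ALT column lists more than one allele. The sketch below uses a tiny made-up VCF in a scratch directory; the file name and records are illustrative only.

```shell
# Made-up four-line VCF: one biallelic record and one multi-allelic record.
workdir=$(mktemp -d)
vcf="$workdir/sample.vcf"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf 'chr2\t15013\t.\tT\tG\t.\t.\t.\n' >> "$vcf"
printf 'chr2\t15020\t.\tA\tC,T\t.\t.\t.\n' >> "$vcf"

# A comma in column 5 (ALT) means the record carries several alternate alleles.
multiallelic=$(awk -F'\t' '!/^#/ && $5 ~ /,/ {n++} END {print n+0}' "$vcf")
echo "multi-allelic records: $multiallelic"
```

A count of zero suggests the file can go to snpswift as is.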

When annotating with a TSV file (see --anno-tsv below), the file must have a header line, and the first four columns must contain chromosome, position, ref and alt, in that order. The name of the first column in the header line must start with a '#', for example '#Chrom position ref alt'.
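The header rule above can be checked before running snpswift. This is a minimal sketch under stated assumptions: the file is tab-separated, and only the leading '#' and the presence of at least four columns are tested, since the column names themselves are not fixed by the text. The 'score' column and its value are made-up sample data.

```shell
# Made-up annotation TSV with one extra annotation column named 'score'.
workdir=$(mktemp -d)
tsv="$workdir/anno.tsv"
printf '#Chrom\tposition\tref\talt\tscore\n' > "$tsv"
printf 'chr2\t15013\tT\tG\t0.93\n' >> "$tsv"

# The header passes if its first column starts with '#' and it has >= 4 columns.
header_ok=$(awk -F'\t' 'NR==1 { ok = (substr($1,1,1)=="#" && NF>=4) ? "yes" : "no"; print ok; exit }' "$tsv")
echo "header ok: $header_ok"
```

The check does not confirm that the four leading columns actually hold chromosome, position, ref and alt; that still needs a look at the data.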

snpswift Reference

Annotate a VCF file using VCF/GTF/TSV databases.

Input/Output file options

--input-vcf INPUT_VCF

An input VCF to annotate with VCF and GTF database files. (default: None)

Option is required.

--anno-vcf ANNO_VCF

A prefix and VCF in the format <prefix:/absolute/path/anno.vcf.gz>. INFO fields from <anno.vcf.gz> will be added to the input VCF. This option can be used multiple times. Annotation VCFs must be bgzipped and tabix indexed. (default: None)
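A small pre-flight check can catch a malformed --anno-vcf argument or a missing index before a run starts. The function check_anno_vcf below is a hypothetical helper, not part of pbrun. It assumes the index sits next to the VCF under the default tabix name (.tbi) and only inspects file names, so it cannot tell bgzip from plain gzip. The placeholder files in the demo are empty and exist only to exercise the checks.

```shell
# check_anno_vcf: hypothetical helper that validates one 'prefix:path' argument.
check_anno_vcf() {
    spec=$1
    case $spec in *:*) ;; *) echo "missing prefix: $spec"; return 1 ;; esac
    prefix=${spec%%:*}   # text before the first ':'
    path=${spec#*:}      # text after the first ':'
    case $path in *.vcf.gz) ;; *) echo "not a .vcf.gz file: $path"; return 1 ;; esac
    [ -f "$path" ] || { echo "file not found: $path"; return 1; }
    [ -f "$path.tbi" ] || { echo "index not found: $path.tbi"; return 1; }
    echo "ok: prefix=$prefix path=$path"
}

# Demo with empty placeholder files (not real bgzipped VCFs).
workdir=$(mktemp -d)
: > "$workdir/anno.vcf.gz"
: > "$workdir/anno.vcf.gz.tbi"
result_ok=$(check_anno_vcf "prefix1:$workdir/anno.vcf.gz")
result_bad=$(check_anno_vcf "prefix2:$workdir/other.vcf.gz" || true)
echo "$result_ok"
echo "$result_bad"
```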

--anno-tsv ANNO_TSV

A prefix and TSV in the format <prefix:/absolute/path/anno.tsv>. The values in each TSV column will be added to the INFO field of the output VCF. This option can be used multiple times. Annotation TSVs must not be gzipped. (default: None)

--ensembl ENSEMBL

A GTF file from ENSEMBL; the Gene Name and Gene ID fields will be added to the input VCF. (default: None)

--output-vcf OUTPUT_VCF

Path to the output annotated VCF file. (default: None)

Option is required.

Tool Options:

--num-threads NUM_THREADS

Number of worker threads to run for VCF annotation. (default: 8)

Common options:

--logfile LOGFILE

Path to the log file. If not specified, messages will only be written to the standard error output. (default: None)

--tmp-dir TMP_DIR

Full path to the directory where temporary files will be stored.

--with-petagene-dir WITH_PETAGENE_DIR

Full path to the PetaGene installation directory. By default, this should have been installed at /opt/petagene. Use of this option also requires that the PetaLink library has been preloaded by setting the LD_PRELOAD environment variable. Optionally, set the PETASUITE_REFPATH and PGCLOUD_CREDPATH environment variables, which are used for data and credentials. (default: None)

--keep-tmp

Do not delete the directory storing temporary files after completion.

--license-file LICENSE_FILE

Path to license file license.bin if not in the installation directory.

--no-seccomp-override

Do not override seccomp options for docker (default: None).

--version

View compatible software versions.