9.19. Clara Genomics Analysis

9.19.1. Overview

Clara Genomics Analysis is a GPU-accelerated library for biological sequence analysis. This section gives a brief overview of its components and of the third-party tools packaged alongside it.

Source code is at https://github.com/clara-genomics

9.19.1.1. Components in the toolset:

9.19.1.1.1. cudamapper

The cudamapper package provides minimizer-based GPU-accelerated approximate mapping. cudamapper outputs mappings in the PAF format and is currently optimised for all-vs-all long read (ONT, Pacific Biosciences) sequences.
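To illustrate what "minimizer-based" means, the following is a minimal CPU sketch of minimizer selection: for every window of w consecutive k-mers, the smallest k-mer is kept as a representative. It is not cudamapper's implementation. The function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions made for illustration; production mappers typically order k-mers by a hash and also consider the reverse complement.

```python
def minimizers(seq, k, w):
    """Return the sorted, de-duplicated (position, k-mer) minimizers of seq.

    Illustrative only: k-mers are compared lexicographically and only the
    forward strand is considered.
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")

    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]

    picked = set()
    # Slide a window of w consecutive k-mers across the sequence.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimum, so ties go to the leftmost k-mer.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))

    return sorted(picked)


if __name__ == "__main__":
    for position, kmer in minimizers("GATTACA", k=3, w=3):
        print(position, kmer)
```

Two reads that share a long enough exact substring select the same minimizers inside it, which is why matching minimizers can serve as anchors for approximate mapping without comparing every k-mer.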

Detailed documentation is at https://github.com/clara-genomics/ClaraGenomicsAnalysis

9.19.1.1.2. racon

Racon can be used as a polishing tool after assembly, with either Illumina data or data produced by third-generation sequencing technologies. The type of input data is detected automatically.

Racon takes three input files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format, and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. The output is a set of polished contigs in FASTA format, printed to stdout. All input files can be compressed with gzip, although this increases parsing time.
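PAF, one of the overlap formats listed above, is a tab-separated text format whose first 12 columns are mandatory. The sketch below parses one PAF line into a named record. The class name `PafRecord`, the function name `parse_paf_line`, and the sample read and contig names are illustrative and not part of any tool's API; optional tag columns after the twelfth are ignored.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns, in file order."""
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    residue_matches: int
    block_length: int
    mapping_quality: int


def parse_paf_line(line):
    """Parse one PAF line; optional tags after column 12 are dropped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    f = fields[:12]
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


if __name__ == "__main__":
    # Hypothetical overlap between a read and a contig.
    sample = "read1\t1000\t10\t900\t+\tcontig1\t5000\t100\t990\t800\t890\t60\ttp:A:P"
    record = parse_paf_line(sample)
    print(record.query_name, record.target_name, record.strand)
```

Dividing `residue_matches` by `block_length` gives a rough identity estimate for an overlap, which can be handy for inspecting overlaps before polishing.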

Racon can also be used as a read error-correction tool. In this scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps between reads including dual overlaps.

Detailed documentation is at https://github.com/clara-genomics/racon-gpu

9.19.1.2. 3rd party tools:

9.19.1.2.1. minimap2

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include:

  1. mapping PacBio or Oxford Nanopore genomic reads to the human genome;

  2. finding overlaps between long reads with error rate up to ~15%;

  3. splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome;

  4. aligning Illumina single- or paired-end reads;

  5. assembly-to-assembly alignment;

  6. full-genome alignment between two closely related species with divergence below ~15%.

Detailed documentation is at https://github.com/lh3/minimap2

9.19.1.2.2. miniasm

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences, so the per-base error rate is similar to that of the raw input reads.
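Because the assembly is written as a GFA graph rather than as FASTA, the unitig sequences usually have to be extracted before they can be passed to a polishing step. In GFA, each segment is an `S` line holding a name and a sequence. The sketch below converts such lines to FASTA; the function name `gfa_segments_to_fasta` and the sample segment name are hypothetical, and segments whose sequence is recorded as `*` (not stored) are skipped.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Convert the S (segment) lines of a GFA file to FASTA text."""
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # S lines have the layout: S <name> <sequence> [optional tags...]
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            records.append(f">{fields[1]}\n{fields[2]}\n")
    return "".join(records)


if __name__ == "__main__":
    # Hypothetical GFA content; only the S line carries sequence.
    example = [
        "H\tVN:Z:1.0\n",
        "S\tutg000001l\tACGTACGT\tLN:i:8\n",
        "a\tutg000001l\t0\tread1:1-8\t+\t8\n",
    ]
    print(gfa_segments_to_fasta(example), end="")
```

The function accepts any iterable of lines, so an open file handle can be passed directly instead of a list.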

Detailed documentation is at https://github.com/lh3/miniasm

9.19.2. Directory Structure

This sample includes the following folders and files:

  • Dockerfile

    • Creates the docker image encapsulating all the tools mentioned above.