11.15. De Novo Sequence Assembly - Clara Genomics Analysis

This is a reference pipeline that uses Clara Genomics Analysis tools to assemble a genome with the Clara Deploy SDK.

These tools use GPUs to accelerate genome sequence analysis. In this pipeline, the mapper and racon operators each request one GPU.

11.15.1. Pipeline Definition

api-version: 0.3.0
name: denovo-gpu
parameters:
  DOCKER_IMAGE: claraomics/cga_cuda10
  DOCKER_TAG: 0.4.3

  MAPPER_THREADS: 15
  MAPPER_KMER_SIZE: 15
  MAPPER_WINDOW_SIZE: 5
  MAPPER_INDEX_SIZE: 8000
  MAPPER_ADDITIONAL_PARAMS: ''

  RACON_LOOPS: 5
  RACON_THREADS: 5
  RACON_POLISH_BATCH_SIZE: 4
  RACON_ADDITIONAL_PARAMS: ''
  JOB_ID: '.'

operators:
- name: mapper
  description: CUDA mapper
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "mapperWrapper.sh ${{MAPPER_ADDITIONAL_PARAMS}} -i /input -d /mapperOutput/${{JOB_ID}} -o /mapperOutput/${{JOB_ID}}/overlaps.paf -t ${{MAPPER_THREADS}} -k ${{MAPPER_KMER_SIZE}} -w ${{MAPPER_WINDOW_SIZE}} -s ${{MAPPER_INDEX_SIZE}}"]
  requests:
    gpu: 1
  input:
  - path: /input/
  output:
  - path: /mapperOutput

- name: miniasm
  description: Miniasm
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "miniasmWrapper.sh -f /mapperOutput/${{JOB_ID}}/sample.fasta -l /mapperOutput/${{JOB_ID}}/overlaps.paf -o /asmOutput/${{JOB_ID}}/reads.gfa"]
  input:
  - path: /input/
  - from: mapper
    path: /mapperOutput
  output:
  - path: /asmOutput

- name: racon
  description: Polish Assembly using racon
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "raconWrapper.sh ${{RACON_ADDITIONAL_PARAMS}} -r /mapperOutput/${{JOB_ID}}/sample.fasta -t ${{RACON_THREADS}} -l ${{RACON_LOOPS}} -p ${{RACON_POLISH_BATCH_SIZE}} -f /asmOutput/${{JOB_ID}}/reads.gfa -o /raconOutput/${{JOB_ID}} -a /raconOutput/${{JOB_ID}}/polished_assembly.fa"]
  requests:
    gpu: 1
  input:
  - path: /input/
  - from: miniasm
    path: /asmOutput
  - from: mapper
    path: /mapperOutput
  output:
  - path: /raconOutput/
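
The operator commands above are templates: each ${{NAME}} placeholder refers to an entry in the parameters block. The following Python sketch illustrates that substitution on a shortened command fragment. It is not the Clara Deploy implementation; the function name expand and the error handling are assumptions made for illustration only.

```python
import re

# Matches placeholders of the form ${{NAME}}, as written in the pipeline definition.
PLACEHOLDER = re.compile(r"\$\{\{(\w+)\}\}")


def expand(template, parameters):
    """Replace each ${{NAME}} with str(parameters[NAME]).

    Hypothetical helper: raises KeyError when a placeholder has no value.
    """
    def substitute(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"undefined pipeline parameter: {name}")
        return str(parameters[name])

    return PLACEHOLDER.sub(substitute, template)


# Default values copied from the parameters block of the pipeline definition.
defaults = {
    "MAPPER_THREADS": 15,
    "MAPPER_KMER_SIZE": 15,
    "MAPPER_WINDOW_SIZE": 5,
    "MAPPER_INDEX_SIZE": 8000,
    "JOB_ID": ".",
}

# Shortened, illustrative fragment of the mapper command (not the full command).
template = "-o /mapperOutput/${{JOB_ID}}/overlaps.paf -t ${{MAPPER_THREADS}} -k ${{MAPPER_KMER_SIZE}}"

print(expand(template, defaults))
```

With the default JOB_ID of '.', the output path resolves to a location directly under /mapperOutput. Setting JOB_ID to another value places each job's files in its own subdirectory.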

11.15.2. Executing the Pipeline

Refer to "Run Reference Pipelines using Local Input Files" in the "How to run a Reference Pipeline" section to learn how to register the pipeline and execute it using local input files.

11.15.3. Data Input

The input is a folder containing the following files:

sample.fasta - Input FASTA sample file for all-to-all mapping

jobConfig (optional) - A file containing parameters and their values in shell-script style. A sample (sample_job_config.sh) is provided. The following is the content of a jobConfig file with default values.

KMER_SIZE=15         # length of kmer to use for minimizers
WINDOW_SIZE=5        # length of window to use for minimizers
INDEX_SIZE=10000     # length of batch size used for query

RACON_LOOPS=5        # Number of polishing loops
RACON_THREADS=15     # number of threads
POLISH_BATCH_SIZE=6  # number of batches for CUDA accelerated polishing
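
To inspect or sanity-check a jobConfig file before submitting a job, its KEY=VALUE lines can be read into a dictionary. The sketch below is a minimal Python parser for the simple form shown above. How the wrapper scripts actually consume jobConfig is not described here, so the function name parse_job_config and its behavior are assumptions for illustration; it does not handle quoted values that contain '#'.

```python
def parse_job_config(text):
    """Parse shell-style KEY=VALUE lines into a dict of strings.

    Hypothetical helper: everything after '#' is treated as a comment,
    and blank lines are skipped.
    """
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a KEY=VALUE line: {raw_line!r}")
        config[key.strip()] = value.strip()
    return config


# The default jobConfig content shown above.
sample = """\
KMER_SIZE=15         # length of kmer to use for minimizers
WINDOW_SIZE=5        # length of window to use for minimizers
INDEX_SIZE=10000     # length of batch size used for query

RACON_LOOPS=5        # Number of polishing loops
RACON_THREADS=15     # number of threads
POLISH_BATCH_SIZE=6  # number of batches for CUDA accelerated polishing
"""

config = parse_job_config(sample)
for key, value in config.items():
    print(f"{key} = {value}")
```

Values are kept as strings, as they would be in a shell script. Convert them with int() where a numeric comparison is needed.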

11.15.4. Data Output

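The output is the assembled and polished sequence. The racon command in the pipeline definition names the file /raconOutput/${{JOB_ID}}/polished_assembly.fa.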
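
A quick way to check an assembly result is to count its sequences and their lengths. The Python sketch below does this for FASTA text, assuming the output is a standard FASTA file such as the polished_assembly.fa path named in the racon command. The function name fasta_lengths and the example records are hypothetical and are not real pipeline output.

```python
def fasta_lengths(lines):
    """Return {record_id: sequence_length} for FASTA text given as an iterable of lines.

    Hypothetical helper: the record ID is the first word after '>'.
    """
    lengths = {}
    current = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)  # sequences may span several lines
    return lengths


# Made-up records for illustration only.
example = """\
>contig_1 example record
ACGTAC
GTTA
>contig_2
GGGCC
"""

summary = fasta_lengths(example.splitlines())
print(f"{len(summary)} sequences, {sum(summary.values())} bases in total")
```

To run the same check on a file, pass an open file handle instead of the in-memory lines, for example fasta_lengths(open(path)) with path pointing at the output file.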