11.15. De Novo Sequence Assembly - Clara Genomics Analysis

This is a reference pipeline that uses Clara Genomics Analysis tools to assemble a genome with the Clara Deploy SDK.

These tools use the GPU to accelerate genomic sequence analysis. In this pipeline, the mapper and racon operators each request one GPU.

11.15.1. Pipeline Definition

api-version: 0.3.0
name: denovo-gpu
parameters:
  DOCKER_IMAGE: claraomics/cga_cuda10
  DOCKER_TAG: 0.4.3

  MAPPER_THREADS: 15
  MAPPER_KMER_SIZE: 15
  MAPPER_WINDOW_SIZE: 5
  MAPPER_INDEX_SIZE: 8000
  MAPPER_ADDITIONAL_PARAMS: ''

  RACON_LOOPS: 5
  RACON_THREADS: 5
  RACON_POLISH_BATCH_SIZE: 4
  RACON_ADDITIONAL_PARAMS: ''
  JOB_ID: '.'

operators:
- name: mapper
  description: CUDA mapper
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "mapperWrapper.sh ${{MAPPER_ADDITIONAL_PARAMS}} -i /input -d /mapperOutput/${{JOB_ID}} -o /mapperOutput/${{JOB_ID}}/overlaps.paf -t ${{MAPPER_THREADS}} -k ${{MAPPER_KMER_SIZE}} -w ${{MAPPER_WINDOW_SIZE}} -s ${{MAPPER_INDEX_SIZE}}"]
  requests:
    gpu: 1
  input:
  - path: /input/
  output:
  - path: /mapperOutput

- name: miniasm
  description: Miniasm
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "miniasmWrapper.sh -f /mapperOutput/${{JOB_ID}}/sample.fasta -l /mapperOutput/${{JOB_ID}}/overlaps.paf -o /asmOutput/${{JOB_ID}}/reads.gfa"]
  input:
  - path: /input/
  - from: mapper
    path: /mapperOutput
  output:
  - path: /asmOutput

- name: racon
  description: Polish Assembly using racon
  container:
    image: ${{DOCKER_IMAGE}}
    tag: ${{DOCKER_TAG}}
    command: ["/bin/sh", "-c",
              "raconWrapper.sh ${{RACON_ADDITIONAL_PARAMS}} -r /mapperOutput/${{JOB_ID}}/sample.fasta -t ${{RACON_THREADS}} -l ${{RACON_LOOPS}} -p ${{RACON_POLISH_BATCH_SIZE}} -f /asmOutput/${{JOB_ID}}/reads.gfa -o /raconOutput/${{JOB_ID}} -a /raconOutput/${{JOB_ID}}/polished_assembly.fa"]
  requests:
    gpu: 1
  input:
  - path: /input/
  - from: miniasm
    path: /asmOutput
  - from: mapper
    path: /mapperOutput
  output:
  - path: /raconOutput/
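
The data flow between the three operators can be summarized with a short sketch. It only restates the container paths from the definition above; `artifact_paths` is a hypothetical helper, not part of Clara Deploy or the wrapper scripts. Note that miniasm and racon read `sample.fasta` from the mapper output folder, which suggests (as an assumption) that the mapper wrapper places a copy there.

```shell
#!/bin/sh
# Illustrative only: prints which files each operator reads and writes,
# using the container paths from the pipeline definition.
# artifact_paths is a hypothetical helper, not part of Clara Deploy.
artifact_paths() {
    job="${1:-.}"   # JOB_ID defaults to '.' in the pipeline parameters
    echo "mapper  reads  /input"
    echo "mapper  writes /mapperOutput/${job}/overlaps.paf"
    echo "miniasm reads  /mapperOutput/${job}/sample.fasta /mapperOutput/${job}/overlaps.paf"
    echo "miniasm writes /asmOutput/${job}/reads.gfa"
    echo "racon   reads  /mapperOutput/${job}/sample.fasta /asmOutput/${job}/reads.gfa"
    echo "racon   writes /raconOutput/${job}/polished_assembly.fa"
}

artifact_paths .
```

With the default JOB_ID of '.', every artifact lands directly in its operator's output folder. Setting JOB_ID to another value places the artifacts in a subfolder of that name.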

11.15.2. Executing the Pipeline

Refer to "Run Reference Pipelines using Local Input Files" in the "How to run a Reference Pipeline" section to learn how to register the pipeline and execute it using local input files.

11.15.3. Data Input

The input is a folder containing the following files:

  • sample.fasta - Input FASTA file of reads for all-to-all mapping

  • jobConfig (optional) - A file of parameter assignments in shell-script style. A sample (sample_job_config.sh) is provided. The following is the content of a jobConfig file with default values.

    KMER_SIZE=15         # Length of k-mer to use for minimizers
    WINDOW_SIZE=5        # Length of window to use for minimizers
    INDEX_SIZE=10000     # Length of batch size used for query
    
    RACON_LOOPS=5        # Number of polishing loops
    RACON_THREADS=15     # Number of threads
    POLISH_BATCH_SIZE=6  # Number of batches for CUDA-accelerated polishing
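
As a minimal sketch, an input folder could be prepared as follows. The folder location (INPUT_DIR), the file name `jobConfig`, and the one-record placeholder FASTA are assumptions for illustration; the parameter values are copied from the listing above.

```shell
#!/bin/sh
# Sketch: build an input folder holding sample.fasta and an optional jobConfig.
# INPUT_DIR is a hypothetical location chosen for this example.
INPUT_DIR="${INPUT_DIR:-./denovo_input}"
mkdir -p "$INPUT_DIR"

# Placeholder read so the example is self-contained; use real reads in practice.
[ -f "$INPUT_DIR/sample.fasta" ] || printf '>read1\nACGTACGTACGT\n' > "$INPUT_DIR/sample.fasta"

# Write the jobConfig with the default values listed in this section.
cat > "$INPUT_DIR/jobConfig" <<'EOF'
KMER_SIZE=15
WINDOW_SIZE=5
INDEX_SIZE=10000

RACON_LOOPS=5
RACON_THREADS=15
POLISH_BATCH_SIZE=6
EOF

# The file is shell-style, so sourcing it in a subshell is a quick syntax check.
( . "$INPUT_DIR/jobConfig" && echo "KMER_SIZE=$KMER_SIZE RACON_LOOPS=$RACON_LOOPS" )
```

Sourcing in a subshell keeps the assignments from leaking into the calling shell while still catching malformed lines.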
    

11.15.4. Data Output

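The output is the assembled and polished sequence. In the pipeline definition above, the racon operator writes it to /raconOutput/${{JOB_ID}}/polished_assembly.fa.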
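
A quick sanity check on the result is to count the FASTA records in the polished assembly. The sketch below assumes the output file is an ordinary FASTA file whose records begin with '>'. `count_contigs` is a hypothetical helper, and the demo file stands in for the real output.

```shell
#!/bin/sh
# count_contigs FILE: print the number of FASTA records in FILE,
# counted as header lines that start with '>'.
count_contigs() {
    grep -c '^>' "$1"
}

# Demo on a throwaway two-record file; point this at the real
# polished assembly once a job has finished.
demo=$(mktemp)
printf '>contig1\nACGT\n>contig2\nGGCC\n' > "$demo"
count_contigs "$demo"   # prints 2
rm -f "$demo"
```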