DeepVariant training using Parabricks

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing (NGS) data. Although DeepVariant is highly accurate for many types of NGS data, some users may want to train custom deep learning models tailored to their own, more specialized data.

The DeepVariant training pipeline has three major steps:

Run make_examples in “training” mode on the training and validation data sets,
Shuffle each set of examples and generate a data configuration file for each, and
Run model_train and model_eval.

Parabricks currently provides GPU-accelerated versions of the first two steps.

Run make_examples in training mode

The make_examples step processes the input data into examples that the subsequent steps consume. In training mode, each output example includes a label field.

Beginning with version 1.4.0, DeepVariant added an insert-size channel to its WGS configuration, enabled through the --channels "insert_size" flag.

Depending on your data, you may want to adjust the make_examples flags, which can change the format of the output examples. See the DeepVariant documentation for details on these options.

make_examples Quick Start

This command runs the make_examples step, combining the reference, BAM, VCF, and BED files into a format suitable for the shuffle, model_train, and model_eval steps.

# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/deepvariant_train:4.2.0-1 \
    pbrun make_examples \
    --ref /workdir/${REFERENCE_FILE} \
    --reads /workdir/${INPUT_BAM} \
    --truth-variants /workdir/${TRUTH_VCF} \
    --confident-regions /workdir/${TRUTH_BED} \
    --examples /outputdir/${TFRECORD_FILE} \
    --disable-use-window-selector-model \
    --channel-insert-size
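
Before launching the container, it can help to confirm that every file the command references is present in INPUT_DIR, since a missing file otherwise surfaces only after the image has started. The following is a minimal sketch and not part of Parabricks: the function name check_inputs is hypothetical, and the file names passed to it stand in for REFERENCE_FILE, INPUT_BAM, TRUTH_VCF, and TRUTH_BED.

```shell
# Hypothetical helper (not a Parabricks command): verify that each named
# file exists and is non-empty inside the given input directory.
check_inputs() {
    local dir="$1"
    shift
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "${missing}" -eq 0 ]
}
```

A typical call would be check_inputs "${INPUT_DIR}" "${REFERENCE_FILE}" "${INPUT_BAM}" "${TRUTH_VCF}" "${TRUTH_BED}", followed by the docker run command only if the check succeeds.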

Compatible make_examples Baseline Command

( seq 0 $((N_SHARDS-1)) | \
    parallel --halt 2 --line-buffer \
        sudo docker run --volume "${INPUT_DIR}":/workdir --volume "${OUTPUT_DIR}":/outputdir \
        google/deepvariant:"${BIN_VERSION}" \
        /opt/deepvariant/bin/make_examples \
        --mode training \
        --ref "/workdir/${REF}" \
        --reads "/workdir/${INPUT_BAM}" \
        --examples "/outputdir/validation_set.with_label.tfrecord@${N_SHARDS}.gz" \
        --truth_variants "/workdir/${TRUTH_VCF}" \
        --confident_regions "/workdir/${TRUTH_BED}" \
        --task {} \
        --channels "insert_size" )
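
The @${N_SHARDS} in the --examples path is a sharded file specification: each task writes one shard, named with its zero-padded task index and the total shard count. The sketch below prints the file names this convention produces, which are the names the shuffle step's input pattern has to match. The function name shard_names is hypothetical and used only for illustration.

```shell
# Hypothetical helper: list the shard files implied by a spec such as
# "validation_set.with_label.tfrecord@16.gz".
shard_names() {
    local spec="$1"
    local prefix="${spec%@*}"          # text before '@'
    local rest="${spec##*@}"           # e.g. "16.gz"
    local count="${rest%%.*}"          # shard count, e.g. "16"
    local suffix="${rest#"${count}"}"  # trailing extension, e.g. ".gz"
    local i=0
    while [ "${i}" -lt "${count}" ]; do
        printf '%s-%05d-of-%05d%s\n' "${prefix}" "${i}" "${count}" "${suffix}"
        i=$((i + 1))
    done
}
```

For the spec used above with N_SHARDS=16, the first name printed is validation_set.with_label.tfrecord-00000-of-00016.gz and the last is validation_set.with_label.tfrecord-00015-of-00016.gz.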

make_examples Reference

Run DeepVariant make_examples in training mode to create tensorflow.Examples.

make_examples Input/Output file options

--ref REF
--reads READS
--interval-file INTERVAL_FILE
--confident-regions CONFIDENT_REGIONS
--truth-variants TRUTH_VARIANTS
--examples EXAMPLES
--proposed-variants PROPOSED_VARIANTS

make_examples Tool Options:

--num-cpu-threads-per-stream NUM_CPU_THREADS_PER_STREAM
--num-zipper-threads NUM_ZIPPER_THREADS
--num-streams-per-gpu NUM_STREAMS_PER_GPU
--disable-use-window-selector-model
--gvcf
--norealign-reads
--sort-by-haplotypes
--keep-duplicates
--vsc-min-count-snps VSC_MIN_COUNT_SNPS
--vsc-min-count-indels VSC_MIN_COUNT_INDELS
--vsc-min-fraction-snps VSC_MIN_FRACTION_SNPS
--vsc-min-fraction-indels VSC_MIN_FRACTION_INDELS
--min-mapping-quality MIN_MAPPING_QUALITY
--min-base-quality MIN_BASE_QUALITY
--mode MODE
--alt-aligned-pileup ALT_ALIGNED_PILEUP
--variant-caller VARIANT_CALLER
--add-hp-channel
--parse-sam-aux-fields
--use-wes-model
--run-partition
--gpu-num-per-partition GPU_NUM_PER_PARTITION
--include-med-dp
--normalize-reads
--pileup-image-width PILEUP_IMAGE_WIDTH
--channel-insert-size
--no-channel-insert-size
--max-read-size-512
--prealign-helper-thread
--max-reads-per-partition MAX_READS_PER_PARTITION
--partition-size PARTITION_SIZE
--track-ref-reads
--phase-reads
--dbg-min-base-quality DBG_MIN_BASE_QUALITY
--ws-min-windows-distance WS_MIN_WINDOWS_DISTANCE
--channel-gc-content
--channel-hmer-deletion-quality
--channel-hmer-insertion-quality
--channel-non-hmer-insertion-quality
--skip-bq-channel
--aux-fields-to-keep AUX_FIELDS_TO_KEEP
--vsc-min-fraction-hmer-indels VSC_MIN_FRACTION_HMER_INDELS
--vsc-turn-on-non-hmer-ins-proxy-support
--consider-strand-bias
--p-error P_ERROR
--channel-ins-size
--max-ins-size MAX_INS_SIZE
--disable-group-variants
-L INTERVAL, --interval INTERVAL

Common options:

--logfile LOGFILE
--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--no-seccomp-override
--version

GPU options:

--num-gpus NUM_GPUS

Run shuffle

Shuffling the TensorFlow examples is a crucial stage of model training. In the DeepVariant training process, the examples are shuffled globally as a preprocessing step.

The Parabricks shuffle tool shuffles the TensorFlow records locally and in memory.
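
To see how a global shuffle differs from shuffling each shard separately, consider plain text files standing in for TFRecord shards. The toy sketch below pools the lines of all input files, shuffles them together, and deals them back out round-robin, so a record can land in any output shard. It only illustrates the idea and makes no claim about how pbrun shuffle is implemented; the function name and the output naming are hypothetical.

```shell
# Toy illustration: globally shuffle lines from several text "shards".
# Usage: global_shuffle OUT_PREFIX NUM_OUT IN_FILE...
global_shuffle() {
    local out_prefix="$1"
    local num_out="$2"
    shift 2
    cat "$@" |
        # Tag each line with a random integer key, sort by it, drop the key.
        awk 'BEGIN { srand() } { printf "%d\t%s\n", int(rand() * 1000000000), $0 }' |
        sort -k1,1n |
        cut -f2- |
        # Deal the shuffled lines round-robin into NUM_OUT output files.
        awk -v p="${out_prefix}" -v n="${num_out}" \
            '{ f = sprintf("%s-%05d-of-%05d", p, (NR - 1) % n, n); print > f }'
}
```

Because the lines are pooled before sorting, the output shards together contain exactly the input records, but the shard a record ends up in no longer depends on the shard it came from.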

shuffle Quick Start

# This command assumes all the inputs are in INPUT_DIR and all the outputs go to OUTPUT_DIR.
docker run --rm --gpus all --volume INPUT_DIR:/workdir --volume OUTPUT_DIR:/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/deepvariant_train:4.2.0-1 \
    pbrun shuffle \
    --input-pattern-list "/workdir/validation_set.with_label.tfrecord-?????-of-00016.gz" \
    --output-pattern-prefix /outputdir/validation_set.with_label.shuffled \
    --output-dataset-config-pbtxt /outputdir/validation_set.dataset_config.pbtxt \
    --output-dataset-name HG001 \
    --direct-num-workers 16
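
The input pattern above matches 16 shard files, so a quick count before shuffling can catch an incomplete make_examples run. This is a minimal sketch: count_shards is a hypothetical helper, and the directory and pattern are whatever holds your make_examples output.

```shell
# Hypothetical helper: count regular files directly inside DIR whose
# names match the glob PATTERN.
count_shards() {
    local dir="$1"
    local pattern="$2"
    find "${dir}" -maxdepth 1 -type f -name "${pattern}" | wc -l | tr -d ' '
}
```

For example, count_shards "${INPUT_DIR}" 'validation_set.with_label.tfrecord-?????-of-00016.gz' should print 16 when all shards are present. Quote the pattern so the shell passes it to find unexpanded.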

Compatible shuffle Baseline Command

python3 shuffle_tfrecords_lowmem.py \
    --input_pattern_list="${INPUT_DIR}/validation_set.with_label.tfrecord-?????-of-00016.gz" \
    --output_pattern_prefix="${OUTPUT_DIR}/validation_set.with_label.shuffled" \
    --output_dataset_config="${OUTPUT_DIR}/validation_set.dataset_config.pbtxt" \
    --output_dataset_name="HG001" \
    --direct_num_workers=16 \
    --step=1
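
After shuffling, the dataset configuration file is the piece that later training steps read, so it is worth a quick look. The sketch below assumes the file uses the simple "key: value" text layout shown in the upstream DeepVariant training documentation, with name, tfrecord_path, and num_examples fields; that layout is an assumption here, and config_field is a hypothetical helper rather than part of either tool.

```shell
# Hypothetical helper: print the value of a top-level field from a dataset
# config written as "key: value" lines (assumed format), without quotes.
config_field() {
    local file="$1"
    local key="$2"
    awk -v k="${key}:" '
        $1 == k {
            $1 = ""            # drop the key
            sub(/^ +/, "")     # trim the leading separator
            gsub(/"/, "")      # strip surrounding quotes
            print
            exit
        }' "${file}"
}
```

For example, config_field "${OUTPUT_DIR}/validation_set.dataset_config.pbtxt" num_examples prints the recorded example count, and an empty result indicates that the field is absent.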

shuffle Reference

Shuffle examples globally.

Shuffle Input/Output file options

--output-dataset-config-pbtxt OUTPUT_DATASET_CONFIG_PBTXT
--input-pattern-list INPUT_PATTERN_LIST [INPUT_PATTERN_LIST ...]

Shuffle Tool Options:

--output-pattern-prefix OUTPUT_PATTERN_PREFIX
--output-dataset-name OUTPUT_DATASET_NAME (Option is required.)
--direct-num-workers DIRECT_NUM_WORKERS

Common options:

--logfile LOGFILE
--tmp-dir TMP_DIR
--with-petagene-dir WITH_PETAGENE_DIR
--keep-tmp
--no-seccomp-override
--version

GPU options:

--num-gpus NUM_GPUS

Run model_train and model_eval

We provide a Jupyter Notebook with a more detailed example of re-training DeepVariant 1.5 using Parabricks and additional instructions on the model_train and model_eval steps.

See also the DeepVariant training documentation and the original shuffle program.