Abstract

The Transfer Learning Toolkit Getting Started Guide provides instructions on using an end-to-end workflow for accelerating Deep Learning training and inference for Medical Imaging use cases.

1. Overview

NVIDIA’s Clara Train SDK: Transfer Learning Toolkit is a Python-based SDK that allows developers looking for faster implementation of industry-specific Deep Learning solutions to leverage optimized, ready-to-use, pre-trained models built in-house by NVIDIA. These pre-trained models accelerate the developer’s deep learning training process and reduce the high costs associated with large-scale data collection, labeling, and training models from scratch.

This toolkit offers an end-to-end workflow for accelerating Deep Learning training and inference for Medical Imaging use cases. The models provided are fully trained for Medical Imaging specific reference use cases such as organ and tumor segmentation and classification.

The following pre-trained models are available to download for specific classification and segmentation use cases. Complete details and accuracy metrics are available in the Appendix section of this guide.

  • Brain Tumor segmentation
  • Liver and Tumor segmentation
  • Hippocampus segmentation
  • Lung Tumor segmentation
  • Prostate segmentation
  • Left Atrium segmentation
  • Pancreas and Tumor segmentation
  • Colon Tumor segmentation
  • Hepatic Vessel segmentation
  • Spleen segmentation
  • Chest X-ray classification

Supervised training

Transfer learning uses an algorithm for supervised training to find the best model based on training and validation datasets.

The training dataset contains pairs of data items that are used for minimizing the loss. The validation dataset contains pairs of data items that are used for validation during training.

A single pass through a full dataset is referred to as an epoch. Since a full dataset cannot typically be processed in a single iteration, it is divided into batches of data items. For each batch, an optimizer minimizes a loss function and adjusts the weights of the model accordingly. Training metrics are collected and logged during this process.

Once all iterations are completed for the epoch, validation is performed if needed, by running the validation dataset through the current model. Validation metrics, which measure the quality of the current model from several aspects, are then computed. One important metric is the stopping metric (also called the key metric), which is used to determine whether the current model is the best so far.

Validation is usually run every N epochs, where N is configurable. The result of the validation determines the best model. The algorithm keeps the current best key metric, which is initialized to a large negative number. Each time validation is done, the computed key metric is compared with the current best. If it is better, the current best is set to the new metric value, and the current model is written to disk in the model.ckpt file. The model.ckpt file therefore always represents the best model so far.

The more often validation is performed, the more likely you are to find the best model. However, validating after every iteration can take a long time, because each validation pass goes through the whole validation dataset. In practice, you should validate every few epochs by setting the num_training_epoch_per_valid parameter.

When the training is complete, model_final.ckpt is written to disk; it can be used for further fine-tuning. This general algorithm is used for all modes of training: train, fine-tune, multi-gpu train, and multi-gpu fine-tune.
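
The following is a minimal sketch of the best-model selection loop described above; train_step(), validate() and save_checkpoint() are hypothetical placeholders, not the toolkit's actual API.

# Minimal sketch of the best-model selection loop described above.
# train_step(), validate() and save_checkpoint() are hypothetical placeholders,
# not the toolkit's actual API.
def fit(model, train_batches, val_data, epochs, num_training_epoch_per_valid):
    best_metric = float("-inf")                      # current best key metric
    for epoch in range(epochs):
        for batch in train_batches:                  # one iteration per batch
            train_step(model, batch)                 # optimizer minimizes the loss
        if (epoch + 1) % num_training_epoch_per_valid == 0:
            key_metric = validate(model, val_data)   # full pass over the validation set
            if key_metric > best_metric:             # better than the current best?
                best_metric = key_metric
                save_checkpoint(model, "model.ckpt")        # best model so far
    save_checkpoint(model, "model_final.ckpt")              # snapshot at the last moment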

Using model.ckpt or model_final.ckpt

model.ckpt is the best model resulting from training. model_final.ckpt is created when the training finishes normally; it is a snapshot of the model at the last moment and is usually not the best model that can be obtained. Both model.ckpt and model_final.ckpt can be used for further training or fine-tuning. Here are two typical use cases:

  1. Continued training: Use model_final.ckpt as the starting point for fine-tuning if you think the model has not converged, for example because the number of epochs was not set high enough.
  2. Transfer learning: Use model.ckpt as the starting point for fine-tuning on a different dataset (for example, your own dataset) to obtain the model that works best for your data. This is also called adaptation.

2. Installation

Using the Transfer Learning Toolkit for Medical Imaging requires the following:

Hardware Requirements

Recommended

  • 1 GPU or more
  • 16 GB GPU memory
  • 8 core CPU
  • 32 GB system RAM
  • 80 GB free disk space

Software Requirements

  • Ubuntu 16.04 LTS
  • NVIDIA GPU driver v410.xx or above
  • nvidia-docker 2.0 installed, instructions: https://github.com/NVIDIA/nvidia-docker.
  • CUDA runtime, Python packages, and TensorFlow: these are required, but there is no need to install them separately because they are included in the docker image.

Access registration

Get an NGC API Key

  • NVIDIA GPU Cloud account and API key - https://ngc.nvidia.com/
    1. Go to NGC and search for Clara Train container in the Catalog tab. This message is displayed: Sign in to access the PULL feature of this repository.
    2. Enter your email address and click Next or click Create an Account.
    3. Click Sign In.
    4. Click the Clara Train SDK tile.

Download the docker container

  • Execute docker login nvcr.io from the command line and enter your username and password.
    • Username: $oauthtoken
    • Password: API_KEY
  • dockerImage=nvcr.io/nvidia/clara-train-sdk:v1.0-py3
  • docker pull ${dockerImage}

Running the container

Once downloaded, run the docker using this command:

docker run -it --rm --ipc=host --net=host --runtime=nvidia --mount type=bind,source=/your/dataset/location,target=/workspace/data $dockerImage /bin/bash
Note: If you are on a network that uses a proxy server to connect to the Internet, you can provide proxy server details when launching the container.
 docker run --runtime=nvidia -it --rm -e HTTPS_PROXY=https_proxy_server_ip:https_proxy_server_port -e HTTP_PROXY=http_proxy_server_ip:http_proxy_server_port $dockerImage /bin/bash

The docker container, by default, starts in the /opt/nvidia folder. To access local directories from within the container, they have to be mounted into the container. To mount a directory, use the -v <source_dir>:<mount_dir> option. For more information, see Bind Mounts. Here is an example:

docker run --runtime=nvidia -it --rm -v /home/<username>/tlt-experiments:/workspace/tlt-experiments $dockerImage /bin/bash

This mounts the /home/<username>/tlt-experiments directory on your disk to /workspace/tlt-experiments in the container.

Downloading the models

  • Use this command to pull the docker:
    docker pull ${dockerImage}
  • Use this command to list the models available in the NGC model registry: ngc registry model list nvidia/med/*
Note: The -v argument is mandatory. Use --list_versions to find all the versions that are available.
API_KEY=yourAPIkey
MODEL_NAME=segmentation_ct_spleen
VERSION=1 

ngc registry model download-version nvidia/med/segmentation_ct_spleen:1 -d /var/tmp

See Segmentation models and Classification models for more details.

3. Medical Model Archive

The MMAR (Medical Model ARchive) defines a standard structure for organizing all artifacts produced during the model development life cycle.

You can experiment with different configurations for the same MMAR. To do so, create a new MMAR by cloning an existing one with the cp -r OS command.

MMAR defines the standard structure for storing artifacts (files) needed and produced by the model development workflow (training, validation, inference, etc.). The MMAR includes all the information about the model, and the work space to perform all model development tasks.

ROOT
	config
		config_train.json
		config_validation.json
		environment.json
	commands
		set_env.sh
		train.sh
		train_finetune.sh
		train_2gpu.sh
		train_2gpu_finetune.sh
		infer.sh
		validate.sh
		export.sh
	resources
		log.config
		...
	docs
		license.txt
		Readme.md
		...
	models (all forms of the model: checkpoint, frozen graphs, saved model, TRTIS manifest)
		model.ckpt.meta, model.ckpt.index, model.ckpt.data
		tensorboard event files
		model.fzn.pb, model.trt.pb, model.trtis.pbtxt

Commands

The provided commands perform model development work based on the configurations in the config folder. The only command you may need to change is the set_env.sh, where you can set the PYTHONPATH to the proper value.

You don’t need to change any other commands for default behavior, but you can and should study them to understand how they are defined.

train.sh

This command is used to do basic single-gpu training from scratch. When finished, you should see the following files in the “models” folder:

  • model.ckpt - the best model obtained
  • model_final.ckpt - the final model when the training is done. It is usually NOT the best model.
  • Event files - these are tensorboard events that you can view with tensorboard.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m medical.tlt2.src.apps.train \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       epochs=1260 \
15       learning_rate=0.0001 \
16       num_training_epoch_per_valid=20 \
17       multi_gpu=false
Note: Line numbers are not part of the command.

Explanation

Line 1:     this is a bash script
Lines 2 to 3:     resolve and set the absolute directory path for MMAR_ROOT
Line 6:     set the training config file
Line 7:     set the environment file that defines commonly used variables such as DATA_ROOT
Lines 8 to 17:     invoke the training program
Lines 9 to 11:     set the arguments required by the training program
Line 12:     the --set directive allows certain training parameters to be overwritten
Line 13:     set the DATASET_JSON to use the dataset_0.json in the “config” folder of the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.
Lines 14 to 17:     overwrite the training variables as defined in config_train.json

train_finetune.sh

This command is used to continue training from a previous checkpoint (model.ckpt). Before running this command, you must have a previously generated checkpoint in the models folder. The output of this command is the same as train.sh.

Example
1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m medical.tlt2.src.apps.train \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       PRETRAIN_WEIGHTS_FILE="" \
15       epochs=1000 \
16       learning_rate=0.0001 \
17       num_training_epoch_per_valid=20 \
18       MMAR_CKPT=$MMAR_ROOT/models/model.ckpt \
19       multi_gpu=false

Explanation

This command is very similar to train.sh. The only differences are:

Line 14: set PRETRAIN_WEIGHTS_FILE to an empty string. When fine-tuning a model, there is no need to download the pretrained weights from the web.

Line 18: set the pre-trained model’s checkpoint location

train_2gpu.sh

This command does horovod-based training with 2 GPUs from scratch. The output of this command is the same as train.sh. You can use this as an example for multi-gpu training. Please see the Horovod documentation for tips. In general, the learning rate should be scaled up based on the number of GPUs.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
9       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 \
10      -mca btl ^openib --allow-run-as-root \
11       python3 -u  -m medical.tlt2.src.apps.train \
12       -m $MMAR_ROOT \
13       -c $CONFIG_FILE \
14       -e $ENVIRONMENT_FILE \
15       --set \
16       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
17       epochs=1250 \
18       learning_rate=0.0003 \
19       num_training_epoch_per_valid=10 \
20       multi_gpu=true

Explanation

This file is very similar to train.sh. The differences are:

Lines 8 to 20: Run 2-GPU training with the mpirun program, a third-party tool that manages cross-process communication.

Lines 8 to 10: set arguments of mpirun for running 2 processes

Line 20: multi_gpu must be set to true

Lines 11 to 20: the training program setup - same as in train.sh. Note that the learning rate is scaled up, as suggested by horovod.

train_2gpu_finetune.sh

This command does horovod-based training with 2 GPUs from a previous checkpoint. The output of this command is the same as train.sh.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT is set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
9       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 \
10      -mca btl ^openib --allow-run-as-root \
11       python3 -u  -m medical.tlt2.src.apps.train \
12       -m $MMAR_ROOT \
13       -c $CONFIG_FILE \
14       -e $ENVIRONMENT_FILE \
15       --set \
16       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
17       PRETRAIN_WEIGHTS_FILE="" \
18       MMAR_CKPT=$MMAR_ROOT/models/model.ckpt \
19       learning_rate=0.0003 \
20       num_training_epoch_per_valid=10 \
21       epochs=1000 \
22       multi_gpu=true

Explanation

This command is basically a combination of train_finetune.sh and train_2gpu.sh:

Lines 8 to 22: run the training program with mpirun;

Lines 11 to 22: run the training program with parameters for 2 GPUs.

export.sh

Export a trained model checkpoint to frozen graphs. Two frozen graphs will be generated in the models folder:

  • model.fzn.pb - the regular frozen graph
  • model.trt.pb - the TRT-optimized frozen graph

Note: Training must have been done before you run this command.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    # Data list containing all data
5    export CKPT_DIR=$MMAR_ROOT/models
6    python3 -u -m medical.tlt2.src.apps.export \
7       --model_file_format CKPT \
8       --model_file_path $CKPT_DIR \
9       --model_name model \
10       --input_node_names "NV_MODEL_INPUT" \
11       --output_node_names NV_MODEL_OUTPUT \
12       --trt_min_seg_size 50

Explanation

Line 5: set the location of the “models” directory. Checkpoint must have been created there.

Lines 6 to 12: invoke the export program.

Line 7: set the source model format: it is a checkpoint format (CKPT)

Line 8: set the path to the checkpoint

Line 9: set the base name of the model’s checkpoint file

Lines 10 and 11: set the input and output node names

Line 12: set the minimum segment size for TensorRT optimization (the minimum number of nodes required for a subgraph to be converted to TensorRT)
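
As a rough illustration of how the exported model.fzn.pb can be consumed, the following TF1-style sketch loads the frozen graph and runs the exported input and output nodes; the "<node_name>:0" tensor naming and the example input shape are assumptions, not values taken from a specific MMAR.

# TF1-style sketch for running an exported frozen graph.
# Assumes the exported node names NV_MODEL_INPUT / NV_MODEL_OUTPUT and the usual
# "<node_name>:0" tensor naming; the input shape below is an example only.
import numpy as np
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("models/model.fzn.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    dummy_input = np.zeros((1, 1, 96, 96, 96), dtype=np.float32)  # example shape only
    prediction = sess.run("NV_MODEL_OUTPUT:0",
                          feed_dict={"NV_MODEL_INPUT:0": dummy_input})
    print(prediction.shape)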

infer.sh

Perform inference against the model, based on the configuration of config_validation.json in the config folder. Inference output is saved in the eval folder.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_validation.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m medical.tlt2.src.apps.evaluate \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       output_infer_result=true \
15       do_validation=false

Explanation

Line 1: this is a bash script

Lines 2 to 3: resolve and set the absolute directory path for MMAR_ROOT

Line 6: set the validation config file

Line 7: set the environment file that defines commonly used variables such as DATA_ROOT.

Lines 8 to 15: invoke the evaluate program.

Lines 9 to 11: set the arguments required by the program

Line 12: the --set directive allows certain parameters to be overwritten

Line 13: set the DATASET_JSON to use the dataset_0.json in the “config” folder of the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.

Lines 14 to 15: overwrite default values of the evaluation variables

Line 14: instructs the program to generate inference results

Line 15: instructs the program not to do validation

validate.sh

Perform validation against the model, based on the configuration of config_validation.json in the config folder. Validation output is saved in the eval folder.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_validation.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m medical.tlt2.src.apps.evaluate \
9        -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       do_validation=true \
15       output_infer_result=false

Explanation

This command is very similar to infer.sh. The only differences are:

Line 14: instructs the program to do validation

Line 15: instructs the program not to generate inference results

Note: infer.sh and validate.sh use the same evaluate program.

Configuration

The JSON files in the config folder define configurations of workflow tasks (training, inference, and validation).

config_train.json

This file defines components that make up the training workflow. It is used by all four training commands (single-gpu training and finetuning, multi-gpu training and finetuning).

config_validation.json

This file defines the configuration that is used for both validate.sh and infer.sh. The only differences between the two commands are the do_validation and output_infer_result options.

environment.json

This file defines the common parameters for all model work. The most important are DATA_ROOT and DATASET_JSON.

  • DATA_ROOT specifies the directory that contains the training data.
  • DATASET_JSON specifies the config file that contains the default training data split (usually dataset_0.json).
Note: Since MMAR does not contain training data, you must ensure that these two parameters are set to the right value. Do not change any other parameters.

Example

{
   "DATA_ROOT": "/workspace/data/Task09_Spleen_nii",
   "DATASET_JSON": "/workspace/data/Task09_Spleen_nii/dataset_0.json",
   "PROCESSING_TASK": "segmentation",
   "MMAR_EVAL_OUTPUT_PATH": "eval",
   "MMAR_CKPT_DIR": "models",
   "PRETRAIN_WEIGHTS_FILE": "/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5"
}

Variable descriptions:

  • DATA_ROOT: the location of the training data.
  • DATASET_JSON: the data split config file.
  • PROCESSING_TASK: the task type of the training: segmentation or classification.
  • MMAR_EVAL_OUTPUT_PATH: directory for saving evaluation (validate or infer) results. Always the “eval” folder in the MMAR.
  • MMAR_CKPT_DIR: directory for saving training results. Always the “models” folder in the MMAR.
  • PRETRAIN_WEIGHTS_FILE: location of the pre-trained weights file. Note: if the file does not exist and is needed, the training program will download it from a predefined URL on the web.

Data Split Config file

This is a JSON file that defines the data split used for training and validation. For the classification model, this file is usually named “plco.json”; for the other models, it is usually named “dataset_0.json”.

The following is dataset_0.json of the model segmentation_ct_spleen:

{
   "description": "Spleen Segmentation",
   "labels": {
       "0": "background",
       "1": "spleen"
   },
   "licence": "CC-BY-SA 4.0",
   "modality": {
       "0": "CT"
   },
   "name": "Spleen",
   "numTest": 20,
   "numTraining": 41,
   "reference": "Memorial Sloan Kettering Cancer Center",
   "release": "1.0 06/08/2018",
   "tensorImageSize": "3D",
   "training": [
       {
           "image": "imagesTr/spleen_29.nii.gz",
           "label": "labelsTr/spleen_29.nii.gz"
       },
… <<more data here>>…..
       {
           "image": "imagesTr/spleen_49.nii.gz",
           "label": "labelsTr/spleen_49.nii.gz"
       }
   ],
   "validation": [
       {
           "image": "imagesTr/spleen_19.nii.gz",
           "label": "labelsTr/spleen_19.nii.gz"
       },

… <<more data here>>…..

      {
           "image": "imagesTr/spleen_9.nii.gz",
           "label": "labelsTr/spleen_9.nii.gz"
       }
   ]
}

There is a lot of information in this file, but the only sections needed by the training and validation programs are the “training” and “validation” sections, which define sample/label pairs of data items for training and validation respectively.

Cloning MMAR

A MMAR is a self-contained workspace for model development work. If you want to experiment with different configurations for the same MMAR, you should create a new MMAR by cloning from an existing MMAR. We will provide a “mmar-clone” command in the future, but before that you can easily use the “cp -r” OS command to do it yourself.

4. Bring your own model to transfer learning

You can use the predefined models offered by NVIDIA, or choose to use your own model architecture when configuring a training workflow, provided your model follows our model development guidelines.

Components for training workflow

A training workflow typically requires the following common components:

Data pipelines

A data pipeline contains a chain of transforms that are applied to the input image and label data to produce the data in the format required by the model. This release provides predefined transforms that you can use to configure the transformation chains.

Data pipelines produce batched data items during training. Typically, two data pipelines are used: one for producing training data and another for producing validation data.

Model

The model component implements the neural network. It produces a prediction for a given input.

Loss

The loss component implements a loss function, typically based on the prediction from the model and corresponding label data.

Optimizer

The optimizer component implements the training optimization algorithm for finding the minimal loss during training.

Metrics

These components are used to dynamically measure the quality of the model during training on different aspects. Metric values are computed based on values of tensors. There are two kinds of metric components: training metrics, and validation metrics.
  • A training metric is a graph-building component that adds computational operations to the training graph, which produce tensors for metric computation.
  • Validation metrics implement algorithms to compute values for different aspects of the model, based on the values of tensors in the graph.

Structure of training graph

This diagram shows the overall structure of the training graph. It shows how the components are related. The blue ovals represent placeholders.

These components are built in this order:

  1. Training Data Pipeline
  2. Validation Data Pipeline
  3. Placeholders
  4. Model
  5. Loss
  6. Optimizer
  7. Metrics

Model API specification

The model must conform to the following API spec:

from abc import abstractmethod, ABC
import tensorflow as tf


class Model(ABC):
    @abstractmethod
    def get_predictions(self, inputs, training, build_ctx=None):
        pass
    def get_loss(self):
        return 0
    def get_update_ops(self):
        return tf.get_collection(tf.GraphKeys.UPDATE_OPS)

Your model must extend the class Model and implement the required abstract methods.

get_predictions method

This method is required and is called during the construction of the computation graph. It must return a prediction tensor, as shown in the diagram above.

The inputs argument is the model input placeholder of the model.

The build_ctx argument is a dict that holds the data objects that are already built (see the component building order above). You can use them in the construction of your model. Specifically, by the time the get_predictions method is called, data pipelines and placeholders are already built, and the build_ctx contains the following objects:
  • data_property – properties of the input data, such as data format (channels_first, channels_last), number of image channels, number of label channels, etc.
  • model_input – the placeholder for model input
  • label_input – the placeholder for label input
  • learning_rate – the placeholder for the learning rate
  • is_train – the placeholder for the is-training flag

get_loss method

The get_loss method is called during the construction of the computation graph. You can override the default implementation of this method (which returns 0) if you want to return a model-specific loss. This loss is added to the result of the regular loss component.

get_update_ops method

You can also provide model-specific update ops using this method. The update ops will be used as the dependency for the Optimizer’s minimize operation.
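
For context, a typical TF1 pattern for wiring update ops into the optimizer's minimize operation looks like the sketch below. This is a generic example (batch normalization registers the update ops), not the toolkit's internal implementation.

# Generic TF1 pattern: update ops (e.g. batch-norm moving averages) are used as
# dependencies of the optimizer's minimize op. Not the toolkit's internal code.
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 4])
dense = tf.layers.dense(x, 2)
bn = tf.layers.batch_normalization(dense, training=True)    # registers UPDATE_OPS
loss = tf.reduce_mean(tf.square(bn))

optimizer = tf.train.AdamOptimizer(1e-4)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)     # plus model.get_update_ops()
with tf.control_dependencies(update_ops):                   # run the updates with each step
    train_op = optimizer.minimize(loss)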

Model creation

Transfer Learning manages components with a create and use strategy. Components are first configured and created based on the configuration parameters.

The configuration parameters are passed to the component's construction method, __init__, to create the component. Since the parameters are defined at configuration time, they can only be simple static values (vs. dynamically created values such as tensors). Once the components are all created, the workflow engine starts the graph construction process, which invokes each component's graph-building methods.

When creating your own model, you must follow this strategy: the __init__ method of the model class must only expect configuration parameters.
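
For illustration, the create step amounts to roughly the following, assuming a config dict shaped like the sample training config shown later in this chapter; the config loading and class resolution shown here are a sketch, not the toolkit's actual loading code.

# Rough illustration of the "create" step: the args from the training config are
# passed to the model class constructor. The loading shown here is a sketch, not
# the toolkit's actual mechanism.
import importlib
import json

with open("config/config_train.json") as f:
    config = json.load(f)

model_cfg = config["train"]["model"]          # {"path": "...", "args": {...}}
module_name, class_name = model_cfg["path"].rsplit(".", 1)
ModelClass = getattr(importlib.import_module(module_name), class_name)
model = ModelClass(**model_cfg["args"])       # __init__ receives only static config values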

Examples

Extend the model class

To extend the model class, first define your model as a subclass of the Model class:

import tensorflow as tf
from medical.tlt2.src.components.models.model import Model

class CustomNetwork(Model):

Model creation

The model’s constructor must only accept configurable parameters. Keep them in instance variables.

import tensorflow as tf
from medical.tlt2.src.components.models.model import Model


class CustomNetwork(Model):

    def __init__(self, num_classes,
                 factor=32,
                 training=False,
                 data_format='channels_first',
                 final_activation='linear'):
        Model.__init__(self)
        self.model = None
        self.num_classes = num_classes
        self.factor = factor
        self.training = training
        self.data_format = data_format
        self.final_activation = final_activation

        if data_format == 'channels_first':
            self.channel_axis = 1
        elif data_format == 'channels_last':
            self.channel_axis = -1

    def network(self, inputs, training, num_classes, factor, data_format, channel_axis):
        # very shallow Unet Network
        with tf.variable_scope('CustomNetwork'):

            conv1_1 = tf.keras.layers.Conv3D(factor, 3, padding='same', data_format=data_format, activation='relu')(inputs)
            conv1_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv1_1)
            pool1 = tf.keras.layers.MaxPool3D(pool_size=(2, 2, 2), strides=2, data_format=data_format)(conv1_2)

            conv2_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(pool1)
            conv2_2 = tf.keras.layers.Conv3D(factor * 4, 3, padding='same', data_format=data_format, activation='relu')(conv2_1)

            unpool1 = tf.keras.layers.UpSampling3D(size=(2, 2, 2), data_format=data_format)(conv2_2)
            unpool1 = tf.keras.layers.Concatenate(axis=channel_axis)([unpool1, conv1_2])

            conv7_1 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(unpool1)
            conv7_2 = tf.keras.layers.Conv3D(factor * 2, 3, padding='same', data_format=data_format, activation='relu')(conv7_1)

            output = tf.keras.layers.Conv3D(num_classes, 1, padding='same', data_format=data_format)(conv7_2)

            if str.lower(self.final_activation) == 'softmax':
                output = tf.nn.softmax(output, axis=channel_axis, name='softmax')
            elif str.lower(self.final_activation) == 'sigmoid':
                output = tf.nn.sigmoid(output, name='sigmoid')
            elif str.lower(self.final_activation) == 'linear':
                pass
            else:
                raise ValueError(
                    'Unsupported final_activation, it must of one (softmax, sigmoid or linear), but provided:' + self.final_activation)

        return output

    # additional custom loss
    def loss(self):
        return 0

    def get_predictions(self, inputs, training, build_ctx=None):
        self.model = self.network(
            inputs=inputs,
            training=training,
            num_classes=self.num_classes,
            factor=self.factor,
            data_format=self.data_format,
            channel_axis=self.channel_axis
        )
        return self.model

    def get_loss(self):
        return self.loss()

Implement methods

Define the get_predictions method.

def get_predictions(self, inputs, training, build_ctx=None):
    if self.nn_data_format == 'NCDHW':
        if self.plane == 'x':
            inputs = tf.transpose(self.inputs, perm=[0, 1, 4, 3, 2])
        elif self.plane == 'y':
            inputs = tf.transpose(self.inputs, perm=[0, 1, 2, 4, 3])
        elif self.plane == 'z':
            inputs = self.inputs
        else:
            print('Incorrect key value for plane!')
    elif self.nn_data_format == 'NDHWC':
        ...
    return res_final

Optional methods

Optionally, you can define the get_loss method and the get_update_ops method for the model.

Configuration

Once your model is developed following the guidelines, you can use it in the training workflow with the following steps:

  1. Locate the section for model in the training config JSON file.
  2. Specify the path to your model’s class.
  3. Specify all required init parameters in the args section.
  4. Make sure that the specified model class path is in PYTHONPATH.

Here is a sample training config file:

{
   "epochs": 1240,
   "num_training_epoch_per_valid": 20,
   "learning_rate": 1e-4,
   "multi_gpu": false,
   "train":
   {
       "loss":
       {
           "name": "Dice"
       },

       "optimizer":
       {
           "name": "Adam"
       },
    …...

       "model":
       {
       "path": "yourFileName.CustomNetwork",
       "args": {
       "num_classes": 2,
       "factor": 8,
       "final_activation": "softmax"
       }
       },
          …...
}

The model class path specified in "path" must be accessible through PYTHONPATH.

For example, if the path is defined as foo.bar.FancyNet and the class FancyNet is implemented in:

/project/deeplearn/foo/bar.py

then, PYTHONPATH must include

/project/deeplearn

5. Working with classification and segmentation models

This chapter provides instructions on preparing your data, training models, exporting, evaluating, and performing inference on the trained classification and segmentation models with transfer learning.

Working with classification models

Prepare the data

This section describes the format in which the data can be used with transfer learning for 2D classification tasks.

Data format

All input images and labels must be in png format. If you are planning to resample images, e.g., to 256x256, it is best to do that as a pre-processing step, rather than have the TLT toolkit do that on the fly. The png files can be 8- or 16-bit. You must also have ground truth labels available. These are often binary, i.e., {0,1}, or multi-class, i.e., {0,…,C} if there are C classes.
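
For example, a simple offline resampling pass over the png files might look like the following sketch; it assumes the Pillow package is installed, and the 256x256 target size and directory names are example choices, not requirements of the toolkit.

# Offline pre-processing sketch: resample all png images to 256x256.
# Assumes the Pillow package; target size and directories are example choices.
from pathlib import Path
from PIL import Image

SRC_DIR = Path("dataset_root/png_files")
DST_DIR = Path("dataset_root/png_files_256")
DST_DIR.mkdir(parents=True, exist_ok=True)

for png_path in SRC_DIR.glob("*.png"):
    with Image.open(png_path) as img:
        resized = img.resize((256, 256), resample=Image.BILINEAR)
        resized.save(DST_DIR / png_path.name)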

Folder structure

The layout of data files can be arbitrary, but the JSON file describing the data list must contain relative paths to all image files.

|--dataset_root:
     |--datalist.json
     |--png_files
        |--im1.png
        |--im2.png
        |--im3.png

Datalist JSON file

The JSON file describing the data structure must include a label_format key. The corresponding value should be a list of natural numbers specifying the number and type of labels in the dataset. For instance, the PLCO dataset has 15 binary labels, so the value is a list of 15 ones: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1].

The data file should also have training and validation keys. Each of these keys contains a list of dictionaries, where:
  • the value for the image key must be a relative path to the png file, and
  • the value for the label key must be a list of natural numbers corresponding to the ground truth labels.
The labels for each image must match the label_format specified above.
 
{
    "label_format": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    "training":
    [
        {
            "image" : "im1.png"
            "label" : [0,0,1,0,0,0,0,0,0,0,1,0,0,0,0]
        },
...

The validation key is optional and only needs to be specified if the main training config file specifies metrics to compute. If the validation key is provided, it specifies the corresponding images and labels used to compute the validation metrics at the end of each training epoch (or less/more frequently, if so specified in the main training config).
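
As a quick sanity check before training, a small script along the following lines can verify that every entry's label list matches label_format and that the referenced png files exist; the file name datalist.json is taken from the example folder structure above.

# Sanity-check a classification datalist: every label list must have the same
# length as label_format, and every referenced png should exist on disk.
import json
from pathlib import Path

datalist_path = Path("dataset_root/datalist.json")
root = datalist_path.parent

with open(datalist_path) as f:
    datalist = json.load(f)

expected_len = len(datalist["label_format"])
for section in ("training", "validation"):
    for entry in datalist.get(section, []):
        image_path = root / entry["image"]
        if not image_path.exists():
            print("missing image:", image_path)
        if len(entry["label"]) != expected_len:
            print("bad label length for {}: {} != {}".format(
                entry["image"], len(entry["label"]), expected_len))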

Training a classification model

Run train.sh to train the model.

cd path/to/mmar/commands/folder

./train.sh

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run the train_finetune.sh:

cd path/to/mmar/commands/folder

./train_finetune.sh

The resulting checkpoint files are stored in the models folder of the MMAR.

For more details about the MMAR, see Medical Model Archive.

For a detailed example of the training config for a classification model, see Model training and validation configurations.

Multi-GPU training

To run multi-gpu training, run train_2gpu.sh. See Medical Model Archive.

When training or fine-tuning models in a multi-GPU setting with a small amount of training data, it is recommended to adjust the learning rate provided in the configuration files, e.g. multiply the learning rate by the number of GPUs, as recommended in https://arxiv.org/pdf/1706.02677.pdf.

Tensorboard visualization

You can run the following command to use Tensorboard for visualization.

python3 -m tensorboard.main --logdir "${MODEL_DIR}"

Exporting the model to a TensorRT optimized model for inference

After the model has been trained, run export.sh from the "commands" folder in MMAR to export the checkpoint into frozen graphs.

cd path/to/mmar/commands/folder

./export.sh

Two frozen graph files will be produced in the models folder of the MMAR:

  • model.fzn.pb - a regular frozen graph
  • model.trt.pb - TRT-optimized frozen graph

Classification model evaluation with ground truth

Run validate.sh from the MMAR.

cd path/to/mmar/commands/folder

./validate.sh

The validation result files are created in the eval folder of the MMAR.

See Model training and validation configurations for an example of the validation config for a classification model.

Classification model inference

Run infer.sh from the MMAR.

cd path/to/mmar/commands/folder

./infer.sh

The inference result files are created in the eval folder of the MMAR.

Note: Use the same configuration file for both validation and inference. For inference, the metric values specified in the configuration file won't be computed, and no ground truth label is needed.

Working with segmentation models

This section provides instructions on preparing your data, training models, exporting, evaluating and performing inference on the trained segmentation models using transfer learning.

Prepare the data

All input images and labels must be in NIfTI format. Each input image and its corresponding label mask must have the same image dimension. To visualize or save NIfTI images, you can use free viewers such as ITK-SNAP or MITK.

If your native data format is different from NIfTI, or if you want to convert the image and label mask to isotropic resolution, you can use the provided Data Converter, some other software of your choice such as ITK-SNAP, or convert the data directly in Python, for example along the lines of the sketch below.
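
The following is a resampling sketch to 1x1x1mm resolution directly in Python; it assumes the SimpleITK package, and nearest-neighbor interpolation should be selected when resampling label masks.

# Resample a NIfTI volume to 1x1x1mm resolution directly in Python.
# Assumes the SimpleITK package; pass sitk.sitkNearestNeighbor as the
# interpolator when resampling label masks.
import SimpleITK as sitk

def resample_to_isotropic(in_path, out_path, interpolator=sitk.sitkLinear):
    img = sitk.ReadImage(in_path)
    new_spacing = (1.0, 1.0, 1.0)
    old_spacing = img.GetSpacing()
    old_size = img.GetSize()
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(old_size, old_spacing, new_spacing)]
    resampled = sitk.Resample(img, new_size, sitk.Transform(), interpolator,
                              img.GetOrigin(), new_spacing, img.GetDirection(),
                              0, img.GetPixelID())
    sitk.WriteImage(resampled, out_path)

resample_to_isotropic("imagesTr/spleen_29.nii.gz", "spleen_29_1mm.nii.gz")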

Using the data converter

If the data format is DICOM or the resolution is not isotropic, you can use the provided data converter tool to convert the data to isotropic NIfTI format. Many pre-trained models were trained on 1x1x1mm resolution images, so to use those pre-trained models as a starting point, convert the data to 1x1x1mm NIfTI format. (Note: if the dataset is already in NIfTI format but does not have 1x1x1mm spacing, data conversion is still required.)

The tlt-dataconvert command converts all dicom volumes in your/data/directory to NIfTI format and optionally re-samples them to the provided resolution. If the images to be converted are segmentation labels, the -l option must be added so that the resampler uses a nearest-neighbor interpolator (otherwise a linear interpolator is used).

tlt-dataconvert -d your/data/directory -r 1 -s .dcm -e .nii.gz -o your/output/directory

Supported options are:

  • -d: input directory with subdirectories containing dicom images.
  • -r: output image resolution. If not provided, the dicom resolution is preserved. If only a single value is provided, the target resolution is isotropic (e.g. -r 1 for 1x1x1mm resolution).
  • -s: input file format; can be .dcm, .nii, .nii.gz, .mha, .mhd.
  • -e: output file format; can be .nii, .nii.gz, .mha, .mhd.
  • -o: output directory.
  • -f: (optional) force overwriting existing files if the output directory already exists.
  • -l: (optional) flag indicating that the data is LABEL/SEGMENTATION masks; nearest-neighbor interpolation is used for re-sampling.

Note: If you need to convert both 3D volumetric images and their segmentation labels, put them into two different folders, and run the converter once for the images and once for the labels using the -l flag.

Folder structure

The layout of data files can be arbitrary, but the JSON file describing the data list must contain the relative paths to all data files.

|--dataset_root:
     |--datalist.json
     |--train
        |--im1.nii.gz
        |--lb1.nii.gz
        |--im2.nii.gz
        |--lb2.nii.gz
        |--im3.nii.gz
        |--lb3.nii.gz
        |--im4.nii.gz
        |--lb4.nii.gz
    |--val
        |--im1.nii.gz
        |--lb1.nii.gz
        |--im2.nii.gz
        |--lb2.nii.gz

For example, the datalist.json file looks similar to this. Here all paths are relative to datalist.json location.

{
    "training": [
        {
            "image" : "train/im1.nii.gz",
            "label" : "train/lb1.nii.gz"
        },
        {
            "image" : "train/im2.nii.gz",
            "label" : "train/lb2.nii.gz"
        },
        {
            "image" : "train/im3.nii.gz",
            "label" : "train/lb3.nii.gz"
        },
        {
            "image" : "train/im4.nii.gz",
            "label" : "train/lb4.nii.gz"
        }
    ],
    "validation": [
        {
            "image" : "val/im1.nii.gz",
            "label" : "val/lb1.nii.gz"
        },
        {
            "image" : "val/im2.nii.gz",
            "label" : "val/lb2.nii.gz"
        }
    ]
}

The training and validation lists contain the images to be used in the training and validation steps, respectively.

Note: By default, all paths inside the datalist.json are assumed relative to the datalist.json file location. You can optionally specify the ROOT base path of the datasets by specifying it in the main config file (image_base_dir JSON key) or as a command line option (--file_root) to tlt-train command.
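
If your files follow a paired naming convention like the im*/lb* layout shown above, the datalist can be generated with a short script such as this one; the naming pattern and output file name are assumptions based on that example.

# Generate a datalist.json for the example layout above, where images are named
# im<N>.nii.gz and labels lb<N>.nii.gz. The naming pattern and the output file
# name are assumptions based on that example.
import json
from pathlib import Path

root = Path("dataset_root")

def collect_pairs(subdir):
    pairs = []
    for image in sorted((root / subdir).glob("im*.nii.gz")):
        label = image.with_name(image.name.replace("im", "lb", 1))
        if label.exists():
            pairs.append({"image": "{}/{}".format(subdir, image.name),
                          "label": "{}/{}".format(subdir, label.name)})
    return pairs

datalist = {"training": collect_pairs("train"),
            "validation": collect_pairs("val")}

with open(root / "datalist.json", "w") as f:
    json.dump(datalist, f, indent=4)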

Datalist JSON file

The JSON file describing the data structure must include the training key with a list of items (each containing image and label keys).

The value for the image key can be a string containing the path to a single NIfTI file or a list of strings that are paths to NIfTI files. If there are several channels they are saved as separate files. Here is an example:

 
        {
            "image" : [
                        "train/im1_ch1.nii.gz",
                        "train/im1_ch2.nii.gz",
                        "train/im1_ch3.nii.gz",
                        "train/im1_ch4.nii.gz"
                    ],
            "label" : "train/lb1.nii.gz"
        },
Note: If image includes several files, they will be concatenated as separate channels of the network input. These images must be already spatially aligned.

The value for the label key must be a string containing the path to a single NIfTI file with dense segmentation masks. The label mask defines the segmentation either with integer indices, where each index is a separate class, or as a multi-channel one-hot-encoded image, where each channel represents a separate class.
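
To illustrate the two label representations, the following numpy snippet converts an integer index mask into the equivalent channels-first one-hot encoding; the class count and mask size are example values.

# Convert an integer index mask into a channels-first one-hot encoding.
# The number of classes and the mask size are example values.
import numpy as np

num_classes = 3                                     # e.g. background, organ, tumor
index_mask = np.random.randint(0, num_classes, size=(64, 64, 64))

one_hot = np.stack([(index_mask == c).astype(np.uint8)
                    for c in range(num_classes)], axis=0)   # shape (3, 64, 64, 64)

assert one_hot.shape == (num_classes, 64, 64, 64)
assert np.array_equal(one_hot.argmax(axis=0), index_mask)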

The validation key is optional. If provided, the corresponding images/labels will be used to compute the validation metrics at the end of each specified training epoch (or less/more frequently, if so specified in the main training config). The validation section does not need to include the label keys if datalist.json is used for inference to compute the output segmentation masks.

Training a segmentation model

Segmentation training

Use train.sh to train the model:

cd path/to/mmar/commands/folder

./train.sh

Fine Tuning

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run the train_finetune.sh:

cd path/to/mmar/commands/folder

./train_finetune.sh

The resultant checkpoint files are stored in the models folder of the MMAR.

For more details see the Medical Model Archive.

For a detailed example of the training config for a segmentation model, see Model training and validation configurations.

Multi-GPU training

To run 2-gpu training, run train_2gpu.sh.

cd path/to/mmar/commands/folder

./train_2gpu.sh

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run the train_2gpu_finetune.sh:

cd path/to/mmar/commands/folder

./train_2gpu_finetune.sh

The resulting checkpoint files are stored in the models folder of the MMAR.

See Medical Model Archive for more details.

When training or fine-tuning the models using the multi-GPU setting on a relatively small training dataset, it is recommended to adjust the learning rate provided in the configuration files, e.g. multiply the learning rate by the number of GPUs as is recommended in https://arxiv.org/pdf/1706.02677.pdf. You can create your own train_Ngpu.sh based on train_2gpu.sh. Make sure to adjust the learning rate accordingly.

Tensorboard visualization

You can run the following command to use Tensorboard for visualization.

python3 -m tensorboard.main --logdir "${MODEL_DIR}"

Exporting the model to a TensorRT optimized model for inference

After the model has been trained, run export.sh from the "commands" folder in MMAR to export the checkpoint into frozen graphs.

cd path/to/mmar/commands/folder

./export.sh

Two frozen graph files will be produced in the models folder of the MMAR:

  • model.fzn.pb - a regular frozen graph
  • model.trt.pb - TRT-optimized frozen graph

Segmentation model evaluation with ground truth

Run the validate.sh from the MMAR.

cd path/to/mmar/commands/folder

./validate.sh

The validation result files are created in the eval folder of the MMAR.

See Model training and validation configurations for an example of the validation config for a segmentation model.

Segmentation model inference

Use infer.sh to run inference on the model from the Medical Model Archive.

cd path/to/mmar/commands/folder

./infer.sh

The inference result files are created in the eval folder of the MMAR.

See Model training and validation configurations for an example of the validation config for a segmentation model.

Note: We use the same configuration file for both validation and inference. For inference, the metric values specified in the configuration file won't be computed, and no ground truth label is needed.

Appendix

Segmentation models

Here is a list of the segmentation models. All the models are trained using 1x1x1mm resolution data.

Brain tumor segmentation  
  • segmentation_mri_brain_tumors_br16_full

A pre-trained model for volumetric (3D) segmentation of brain tumors from multi-modal MRIs based on BraTS 2018 data.

https://www.med.upenn.edu/sbia/brats2018/data.html

The model is trained to segment 3 nested subregions of primary (gliomas) brain tumors: the "enhancing tumor" (ET), the "tumor core" (TC), and the "whole tumor" (WT), based on 4 input MRI scans ( T1c, T1, T2, FLAIR). The ET is described by areas that show hyper-intensity in T1c when compared to T1, but also when compared to "healthy" white matter in T1c. The TC describes the bulk of the tumor, which is what is typically resected. The TC encompasses the ET, as well as the necrotic (fluid-filled) and the non-enhancing (solid) parts of the tumor. The WT describes the complete extent of the disease, as it entails the TC and the peritumoral edema (ED), which is typically depicted by hyper-intense signal in FLAIR.

The dataset is available at "Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018." The provided labelled data was partitioned, based on our own split, into training (243 studies) and validation (42 studies) datasets.

For more detailed description of tumor regions, please see the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018 data page at: https://www.med.upenn.edu/sbia/brats2018/data.html.

This model utilized a similar approach described in 3D MRI brain tumor segmentation using autoencoder regularization, which was a winning method in BraTS2018 [1].

The provided training configuration required 16GB GPU memory.

Model Input Shape: 224 x 224 x 128

Training Script: train.sh

Model input and output:

  • Input: 4 channel 3D MRIs (T1c, T1, T2, FLAIR)
  • Output: 3 channels of tumor subregion 3D masks

The model was trained with 285 cases using our own split, as shown in the datalist json file in the config folder. The mean Dice scores achieved on the validation data are:

  • Tumor core (TC): 0.8624
  • Whole tumor (WT): 0.9020
  • Enhancing tumor (ET): 0.7770
  • segmentation_mri_brain_tumors_br16_t1c2tc

A pre-trained model for volumetric (3D) brain tumor segmentation (only TC from T1c images). The model is trained to segment "tumor core" (TC) based on 1 input MRI scan (T1c).

The dataset is available at "Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018." The provided labelled data was partitioned, based on our own split, into training (243 studies) and validation (42 studies) datasets, as shown in config/seg_brats18_datalist_t1c.json.

For more detailed description of tumor regions, please see the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2018 data page at:

https://www.med.upenn.edu/sbia/brats2018/data.html

This model utilized a similar approach described in 3D MRI brain tumor segmentation using autoencoder regularization, which was a winning method in BraTS2018 [1].

The provided training configuration required 16GB GPU memory.

Model Input Shape: 224 x 224 x 128

Training Script: train.sh

Model input and output:

  • Input: 1 channel 3D MRI (T1c)
  • Output: 1 channel of tumor core 3D masks

The achieved mean Dice score on the validation data is: Tumor core (TC): 0.839

Liver and Tumor segmentation  
  • segmentation_ct_liver_and_tumor

A pre-trained model for volumetric (3D) segmentation of the liver and lesion in portal venous phase CT image.

This model is trained using the runner-up [2] awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" using the AHnet architecture [3].

This model was trained with the Liver dataset, as part of the "Medical Segmentation Decathlon Challenge 2018". It consists of 131 labelled and 70 unlabelled cases. The labelled data was partitioned, based on our own split, into 104 training images and 27 validation images for this training task, as shown in config/dataset_0.json.

For a more detailed description of the "Medical Segmentation Decathlon Challenge 2018," see:

http://medicaldecathlon.com/.

The training dataset is Task03_Liver.tar from the link above.

The data must be converted to 1mm resolution before training:

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii -o ${DESTINATION_IMAGE_ROOT}
Note: To match the default setting, we suggest that ${DESTINATION_IMAGE_ROOT} match DATA_ROOT as defined in environment.json in this MMAR's config folder.

The provided training configuration required 12GB GPU memory.

Data Conversion: convert to resolution 1mm x 1mm x 1mm

Model input shape: dynamic

Training Script: train.sh

Model input and output:

  • Input: 1 channel CT image
  • Output: 3 channels:
    • Label 1: liver
    • Label 2: tumor
    • Label 0: everything else

The Dice scores on the validation data achieved by this model are:

  • Liver: 0.932
  • Tumor: 0.495
Hippocampus segmentation  
  • segmentation_mri_hippocampus

A pre-trained model for volumetric (3D) segmentation of the hippocampus head and body from mono-modal MRI image.

This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 208 training images and 52 validation images.

Training Data Source: Task04_Hippocampus.tar from http://medicaldecathlon.com/

The data was converted to resolution 1mm x 1mm x 1mm for training, using the following command:

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}

The training was performed with command train.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 96

Model input and output:

  • Input: 1 channel MRI image
  • Output: 2 channels:
    • Label 1: hippocampus
    • Label 0: everything else

This model achieves the following Dice score on the validation data (our own split from the training dataset):

  • Hippocampus: 0.872 (mean_dice1: 0.882, mean_dice2: 0.862)
Lung Tumor segmentation  
  • segmentation_ct_lung_tumor

A pre-trained model for volumetric (3D) segmentation of the lung tumor from CT image. This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 50 training images and 13 validation images.

Training Data Source:

Task06_Lung.tar from http://medicaldecathlon.com/

The data was converted to resolution 1mm x 1mm x 1mm for training, using the following command:

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}

The training was performed with command train_2gpu.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 96

Model input and output:

  • Input: 1 channel CT image
  • Output: 2 channels:
    • Label 1: lung tumor
    • Label 0: everything else

The Dice score achieved by this model on the validation data (our own split) is:

  • lung: 0.417
Prostate segmentation  
  • segmentation_mri_prostate_cg_and_pz

A pre-trained model for volumetric (3D) segmentation of the prostate central gland and peripheral zone from the multimodal MR (T2, ADC). This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 25 training image pairs and 7 validation images.

Training Data Source: Task05_Prostate.tar from http://medicaldecathlon.com/. The data was converted to resolution 1mm x 1mm x 1mm for training, using the following command:

 tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}

The training was performed with command train_4gpu.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 32

Model input and output:

  • Input: 2 channel MRI image
  • Output: 2 channels:
    • Label 1: prostate peripheral zone
    • Label 0: everything else

This model achieves the following Dice score on the validation data (our own split from the training dataset):

  • Prostate: 0.724 (mean_dice1: 0.485 mean_dice2: 0.871)
Left atrium segmentation  
  • segmentation_mri_left_atrium

A pre-trained model for volumetric (3D) segmentation of the left atrium from MRI image.

This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 16 training images and 4 validation images.

Training Data Source: Task02_Heart.tar from http://medicaldecathlon.com/. The data was converted to resolution 1mm x 1mm x 1mm for training, using the following command:

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}

The training was performed with command train_2gpu.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 96

Model input and output:

  • Input: 1 channel MRI image
  • Output: 2 channels:
    • Label 1: heart
    • Label 0: everything else

The Dice score achieved by this model on the validation data (our own split) is:

  • Heart: 0.9158
Pancreas and tumor segmentation  
  • segmentation_ct_pancreas_and_tumor

A pre-trained model for volumetric (3D) segmentation of the pancreas and tumor from portal venous phase CT.

This model is trained using the runner-up [2] awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" using the AHnet architecture [3].

This model is trained with the Pancreas dataset, as part of the "Medical Segmentation Decathlon Challenge 2018". It consists of 281 labelled and 139 unlabelled cases. The labelled data was partitioned, based on our own split, into 224 training images and 57 validation images for this training task, as shown in config/dataset_0.json.

For more detailed description of "Medical Segmentation Decathlon Challenge 2018," see http://medicaldecathlon.com/.

The training dataset is Task07_Pancreas.tar from the link above.

The data must be converted to 1mm resolution before training:

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii -o ${DESTINATION_IMAGE_ROOT}
Note: To match the default setting, we suggest that ${DESTINATION_IMAGE_ROOT} match DATA_ROOT as defined in environment.json in this MMAR's config folder.

The provided training configuration required 12GB GPU memory.

Data Conversion: convert to resolution 1mm x 1mm x 1mm

Model Input Shape: dynamic

Training Script: train.sh

Model input and output:

  • Input: 1 channel CT image
  • Output: 3 channels:
    • Label 1: pancreas
    • Label 2: tumor
    • Label 0: everything else

This model achieves the following Dice score on the validation data (our own split from the training dataset):

  • Pancreas: 0.739
  • Tumor: 0.348
Colon tumor segmentation  
  • segmentation_ct_colon_tumor

A pre-trained model for volumetric (3D) segmentation of the colon tumor from CT image.

This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 100 training images and 26 validation images.

Training Data Source: Task10_Colon.tar from http://medicaldecathlon.com/. The data was converted to resolution 1mm x 1mm x 1mm for training, using the following command.

tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}

The training was performed with command train_2gpu.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 96

Model input and output:

  • Input: 1 channel CT image
  • Output: 2 channels:
    • Label 1: colon tumor
    • Label 0: everything else

The Dice score achieved by this model on the validation data (our own split) is:

  • colon cancer: 0.367
Hepatic vessel and tumor segmentation  
  • segmentation_ct_hepatic_vessel_and_tumor

A pre-trained model for volumetric (3D) segmentation of the hepatic vessel and tumor from CT images.

This model is trained using the runner-up [2] awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018", using the AHnet architecture [3].

This model was trained with the Hepatic Vessel dataset, as part of the "Medical Segmentation Decathlon Challenge 2018". It consists of 303 labelled and 140 unlabelled images. The labelled data was partitioned, based on our own split, into 242 training images and 61 validation images for this training task, as shown in config/dataset_0.json.

For a more detailed description of the "Medical Segmentation Decathlon Challenge 2018," see http://medicaldecathlon.com/.

The training dataset is Task08_HepaticVessel.tar from the link above.

The data must be converted to 1mm resolution before training:

 tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii -o ${DESTINATION_IMAGE_ROOT}
Note: to match up with the default setting, we suggest that ${DESTINATION_IMAGE_ROOT} match DATA_ROOT as defined in environment.json in this MMAR's config folder.

The provided training configuration required 12GB GPU memory.

Data Conversion: convert to resolution 1mm x 1mm x 1mm.

Model Input Shape: dynamic

Training Script: train.sh

Model input and output:

  • Input: 1 channel CT image
  • Output: 3 channels:
    • Label 1: hepatic vessel
    • Label 2: liver tumor
    • Label 0: everything else

The Dice scores on the validation data achieved by this model are:

  • Hepatic vessel: 0.523
  • Liver tumor: 0.422

Spleen segmentation  
  • segmentation_ct_spleen

A pre-trained model for volumetric (3D) segmentation of the spleen from CT images.

This model is trained using the runner-up awarded pipeline of the "Medical Segmentation Decathlon Challenge 2018" with 32 training images and 9 validation images.

The training dataset is Task09_Spleen.tar from http://medicaldecathlon.com/.

The data must be converted to 1mm resolution before training:

 tlt-dataconvert -d ${SOURCE_IMAGE_ROOT} -r 1 -s .nii.gz -e .nii.gz -o ${DESTINATION_IMAGE_ROOT}
Note: To match up with the default setting, we suggest that ${DESTINATION_IMAGE_ROOT} match DATA_ROOT as defined in environment.json in this MMAR's config folder.

The training was performed with command train_2gpu.sh, which required 12GB-memory GPUs.

Training Graph Input Shape: dynamic

Actual Model Input: 96 x 96 x 96

Model input and output:

  • Input: 1 channel CT image
  • Output: 2 channels:
    • Label 1: spleen
    • Label 0: everything else

This model achieves the following Dice score on the validation data (our own split from the training dataset):

  • Spleen: 0.951

For details of the model architecture, see [3] (Liu et al.).

[1] Myronenko, Andriy. "3D MRI brain tumor segmentation using autoencoder regularization." International MICCAI Brainlesion Workshop. Springer, Cham, 2018. https://arxiv.org/abs/1810.11654.

[2] Xia, Yingda, et al. "3D Semi-Supervised Learning with Uncertainty-Aware Multi-View Co-Training." arXiv preprint arXiv:1811.12506 (2018). https://arxiv.org/abs/1811.12506.

[3] Liu, Siqi, et al. "3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2018. https://arxiv.org/abs/1711.08580.

Classification models

Chest X-ray Classification

classification_chestxray

A pre-trained densenet121 model for disease pattern detection in chest x-rays.

This model is trained using PLCO training data and evaluated on the PLCO validation data.

You can apply for access to the dataset at: https://biometry.nci.nih.gov/cdas/learn/plco/images/

The training was performed with command train.sh. The provided training configuration required 12GB-memory GPUs.

Training Graph Input Shape: 256 x 256

Input: 16-bit CXR png

Output: 15 binary labels, each corresponding to the prediction of one of 'Nodule', 'Mass', 'Distortion of Pulmonary Architecture', 'Pleural Based Mass', 'Granuloma', 'Fluid in Pleural Space', 'Right Hilar Abnormality', 'Left Hilar Abnormality', 'Major Atelectasis', 'Infiltrate', 'Scarring', 'Pleural Fibrosis', 'Bone/Soft Tissue Lesion', 'Cardiac Abnormality', or 'COPD'

Please refer to "medical/segmentation/examples/brats/tutorial_brats.ipynb" inside the docker and the files in the same folder for details.

This model achieves the following AUC score on the validation data:

  • Averaged AUC over all disease categories: 0.8680

Model folder inside the docker:

  • /opt/nvidia/medical/classification/examples/PLCO

This model achieves an averaged AUC of 0.8587 over all disease categories on the validation data.

Data transforms and augmentations

Here is a list of built-in data transformation functions. If you need additional transformation functions, please contact us at the TLT user forum: http://devtalk.nvidia.com.
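
In the training and validation configurations described later in this guide, each transform is listed under "pre_transforms" (or "post_transforms"/"label_transforms") by giving its class "name" and its init arguments under "args". For example, the LoadNifty transform appears in the spleen training configuration as:

{
    "name": "LoadNifty",
    "args": {
        "fields": ["image", "label"]
    }
}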

Transforms Description

LoadNifty

Load NIfTI data. The value of each key (specified by fields) in input "dict" can be a string (a path to a single NIfTI file) or a list of strings (several paths to multiple NIfTI files, if there are several channels saved as separate files).

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

  • Returns:

    - Each field of "dict" is substituted by a 4D numpy array.

VolumeTo4dArray

Transforms the value of each key (specified by fields) in input "dict" from 3D to 4D numpy array by expanding one channel, if needed.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

  • Returns:

    - Each field of "dict" is substituted by a 4D numpy array.

ScaleIntensityRange

Scale the intensity range of the numpy array, with optional clipping.

  • init_args:

    - field: string

    one key_value to apply, e.g. "image".

    - a_min: float

    - a_max: float

    Range of the original image

    - b_min: float

    - b_max: float

    Target range of the image.

    - clip: bool

    Flag that controls whether to clip intensities that fall outside the target range. Default is false.
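
Presumably, ScaleIntensityRange corresponds to the linear mapping output = (input - a_min) / (a_max - a_min) * (b_max - b_min) + b_min, with the result clipped to [b_min, b_max] when clip is true. For example, with a_min = -57, a_max = 164, b_min = 0.0 and b_max = 1.0 (the CT windowing used in the spleen configuration later in this guide), an input intensity of 53.5 maps to 0.5, and any intensity below -57 is clipped to 0.0.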

ScaleIntensityOscillation

Randomly shift the intensity scale level of the image.

  • init_args:

    - field: string

    one key_value to apply, e.g. "image".

    - magnitude: float

    the scale shift is a random value between 0 and magnitude.

  • Returns:

    - Data with an offset applied to the intensity scale.

CropSubVolumeBatchPosNegRatio

Randomly crop the foreground and background ROIs from both the image and mask for training. The sampling ratio between positive and negative samples is adjusted with the epoch number.

  • init_args:

    - image_field: string

    one key_value to apply, e.g. "image".

    - label_field: string

    one key_value to apply, e.g. "label".

    - size: list of ints

    cropped ROI size, e.g., [96, 96, 96].

    - pos/neg: float

    Positive numbers to determine the ratio between positive and negative samples.

    - batch_size: int

    A positive integer to determine how many patches are cropped from the single volume.

    - data_format: string

    "channels_first" (by default) or "channels_last".

    - fast_crop: binary

    True or False to determine whether to enable the fast cropping algorithm.

  • Returns:

    - Updated dictionary with cropped ROI image and mask
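
As a concrete example, this transform is configured as follows in the spleen training configuration shown later in this guide; pos and neg of 1 give a 1:1 positive/negative sampling ratio, and batch_size of 3 crops three patches from each volume:

{
    "name": "CropSubVolumeBatchPosNegRatio",
    "args": {
        "size": [96, 96, 96],
        "image_field": "image",
        "label_field": "label",
        "pos": 1,
        "neg": 1,
        "batch_size": 3,
        "fast_crop": true
    }
}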

NPResize3D

Resize 3D volume with a given shape

  • init_args:

    - applied_keys: list of strings

    key_values to apply, e.g. ["image", "label"].

    - output_shape: list of ints

    output size, e.g., [96, 96, 96].

    - nearest: binary

    True for nearest interpolation (for segmentation labels, etc.), and False for linear interpolation (for images, etc.).

  • Returns:

    - 3D volume with the given shape

TransformVolumeCropROIFastPosNegRatio

Fast 3D data augmentation method (CPU based) by combining 3D morphological transforms (rotation, elastic deformation, and scaling) and ROI cropping. The sampling ratio is specified by pos/neg.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - size: list of int

    cropped ROI size, e.g., [96, 96, 96].

    - deform: boolean

    whether to apply 3D deformation.

    - rotation: boolean

    whether to apply 3D rotation.

    - rotation_degree: float

    the degree of rotation, e.g., 15 means randomly rotate the image/label in a range [-15, +15].

    - scale: boolean

    whether to apply 3D scaling.

    - scale_factor: float

    the percentage of scaling, e.g., 0.1 means randomly scaling the image/label in a range [-0.1, +0.1].

    - pos: float

    the factor controlling the ratio of positive ROI sampling.

    - neg: float

    the factor controlling the ratio of negative ROI sampling.

  • Returns:

    - Updated dictionary with cropped ROI image and mask after data augmentation.

AdjustContrast

Randomly adjust the contrast of the field in input "dict".

  • init_args:

    - field: string

    one key_value to apply, e.g. "image".

AddGaussianNoise

Randomly add Gaussian noise to the field in input "dict".

  • init_args:

    - field: string

    one key_value to apply, e.g. "image".

LoadPng

Load png image and the label. The value of "image" must be a string (a path to a single png file) while the value of the "label" must be a list of labels.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

  • Returns:

    - "image" of "dict" is substituted by a 3D numpy array while the "label" of "dict" is substituted by a numpy list

CropRandomSubImageInRange

Randomly crop 2D image. The crop size is randomly selected between lower_size and image size.

  • init_args:

    - lower_size: int or float

    lower limit of crop size, if float, then must be fraction <1

    - max_displacement: float

    max displacement from center to crop

    - keep_aspect: boolean

    if true, then original aspect ratio is kept

  • Returns:

    - The "image" field of input "dict" is substituted by cropped ROI image.

NPResizeImage

Resize the 2D numpy array (channel x rows x height) as an image.

  • init_args:

    - applied_keys: string

    one key_value to apply, e.g. "image".

    - output_shape: list of int with length 2

    e.g., [256,256].

    - data_format: string

    'channels_first', 'channels_last', or 'grayscale'.

NP2DRotate

Rotate a 2D numpy array, or a channelled 2D array. If random is set to true, the rotation angle is chosen randomly within the range [-angle, angle].

  • init_args:

    - applied_keys: string

    one key_value to apply, e.g. "image".

    - angle: float

    e.g. 7.

    - random: boolean

    default is false.

NPExpandDims

Add a singleton dimension to the selected axis of the numpy array.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - expand_axis: int

    axis to expand, default is 0

NPRepChannels

Repeat a numpy array along specified axis, e.g., turn a grayscale image into a 3-channel image.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - channel_axis: int

    the axis along which to repeat values.

    - repeat: int

    the number of repetitions for each element.

CenterData

Center numpy array's value by subtracting a subtrahend and dividing by a divisor.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - subtrahend: float

    subtrahend. If None, it is computed as the mean of dict[key_value].

    - divisor: float

    divisor. If None, it is computed as the std of dict[key_value]

NPRandomFlip3D

Flip the 3D numpy array along random axes with the provided probability.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - probability: float

    probability to apply the flip, value between 0 and 1.0.

NPRandomZoom3D

Apply a random zooming to the 3D numpy array.

  • init_args:

    - applied_keys: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - lower_limits: list of float

    lower limit of the zoom along each dimension.

    - upper_limits: list of float

    upper limit of the zoom along each dimension.

    - data_format: string

    'channels_first' or "channels_last".

    - use_gpu: boolean

    whether to use cupy for GPU acceleration. Default is false.

    - keep_size: boolean

    default is false which means this function will change the size of the data array after zooming. Setting keep_size to True will result in an output of the same size as the input.

CropForegroundObject

Crop the 4D numpy array and resize the foreground. The numpy array must have foreground voxels.

  • init_args:

    - size: list of int

    resized size.

    - image_field: string

    "image".

    - label_field: string

    "label".

    - pad: int

    number of voxels for adding a margin around the object.

    - foreground_only: boolean

    whether to treat all foreground labels as one binary label (default) or whether to select a foreground label at random.

    - keep_classes: boolean

    if true, keep original label indices in the label image (no thresholding), useful for multi-class tasks.

    - pert: int

    random perturbation in each dimension added to the padding (in voxels).

NPRandomRot90_XY

Rotate the 4D numpy array along random axes on XY plane (axis = (1, 2)).

  • init_args:

    - applied_keys: string

    one key_value to apply, e.g. "image".

    - probability: float

    probability to utilize the transform, between 0 and 1.0.

AddExtremePointsChannel

Add an additional channel to the 4D numpy array where the extreme points of the foreground labels are modeled as Gaussians.

  • init_args:

    - image_field: string

    "image".

    - label_field: string

    "label".

    - sigma: float

    size of Gaussian.

    - pert: boolean

    random perturbation added to the extreme points.

NormalizeNonzeroIntensities

Normalize 4D numpy array to zero mean and unit std, based on non-zero elements only for each input channel individually.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

SplitAcrossChannels

Splits the 4D numpy array across channels to create new dict entries. New key_values shall be applied_key+channel number, e.g. "image1".

  • init_args:

    - applied_key: string

    one key_value to apply, e.g. "image".

LoadResolutionFromNifty

Get the image resolution from a NIfTI image.

  • init_args:

    - applied_key: string

    one key_value to apply, e.g. "image".

  • Returns:

    - "dict" has a new key-value pair: dict[applied_key+"_resolution"]: resolution of the NIfTI image

Load3DShapeFromNumpy

Get the image shape from a NIfTI image.

  • init_args:

    - applied_key: string

    one key_value to apply, e.g. "image".

  • Returns:

    - "dict" has a new key-value pair: dict[applied_key+"_shape"]: shape of the NIfTI image

ResampleVolume

Resample the 4D numpy array from current resolution to a specific resolution

  • init_args:

    - applied_key: string

    one key_value to apply, e.g. "image".

    - resolution: list of float

    input image resolution.

    - target_resolution: list of float

    target resolution.

BratsConvertLabels

Brats data specific. Convert input labels format (indices 1,2,4) into proper format.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

CropSubVolumeRandomWithinBounds

Crops a random subvolume from within the bounds of 4D numpy array.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - size: list of int

    the size of the crop region e.g. [224,224,128].

FlipAxisRandom

Flip the numpy array along its dimensions randomly.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - axis : list of ints

    which axes to attempt to flip (e.g. [0,1,2] - for all 3 dimensions) - the axis indices must be provided only for spatial dimensions.

CropSubVolumeCenter

Crops a center subvolume from within the bounds of 4D numpy array.

  • init_args:

    - fields: string or list of strings

    key_values to apply, e.g. ["image", "label"].

    - size: list of int

    the size of the crop region e.g. [224,224,128] (similar to CropSubVolumeRandomWithinBounds, but crops the center)

Model training and validation configurations

Transfer Learning workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.

Transfer Learning offers two kinds of workflows: training and validation. Workflow configurations are defined in JSON files: config_train.json for training workflow and config_validation.json for validation.

Train configuration

Training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.
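
At a high level, config_train.json is organized as in the following schematic outline (the exact set of components varies by model; the full examples below show concrete configurations):

{
    "epochs": 1250,
    "num_training_epoch_per_valid": 20,
    "learning_rate": 1e-4,
    "multi_gpu": false,
    "train": {
        "loss": {...},
        "optimizer": {...},
        "lr_policy": {...},
        "model": {...},
        "pre_transforms": [...],
        "image_pipeline": {...}
    },
    "validate": {
        "metrics": [...],
        "pre_transforms": [...],
        "image_pipeline": {...},
        "inferer": {...}
    }
}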

You can define global variables in the configuration JSON file. These variables can be overwritten through environment.json or even the command line. Typical global variables include:

{
"epochs": 5000,
"num_training_epoch_per_valid": 20,
"learning_rate": 1e-4, 
"multi_gpu": false,
…
}

By overwriting the values of these variables through the command line, you can experiment with different training settings without having to modify the config file. For example, in commands/train.sh:

python3 -u -m medical.tlt2.src.apps.train \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    --set \
    epochs=1260 \
    learning_rate=0.0001 \
    num_training_epoch_per_valid=20 \
    multi_gpu=false

The “train” section defines the components for the training process, including "loss", “optimizer”, “lr_policy”, “model”, “pre_transforms” and “image_pipeline”. Each component is constructed by providing the component’s class “name” and the init arguments “args”.

Similarly, the “validate” section defines the components for validation process, including “metrics”, “pre_transforms”, “image_pipeline” and “inferer”. Each component is constructed the same way by providing the component class “name” and the corresponding init arguments “args”.

If you want to use an externally-implemented component class, you can do so by specifying the class “path” to replace the “name”.
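
For example, a hypothetical custom loss class implemented in your own Python module could be plugged in like this (the module path, class name, and argument below are placeholders, not part of the SDK):

"loss": {
    "path": "my_components.losses.MyCustomLoss",
    "args": {
        "smooth": 1e-5
    }
}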

Segmentation model example

Here is an example of config_train.json of the segmentation_ct_spleen model:

{

 "epochs": 1250,

 "num_training_epoch_per_valid": 20,

 "learning_rate": 1e-4,

 "multi_gpu": false,

 "train": {

   "loss": {

     "name": "Dice"

   },

   "optimizer": {

     "name": "Adam"

   },

   "lr_policy": {

     "name": "DecayLRonStep",

     "args": {

       "decay_ratio": 0.1,

       "decay_freq": 50000

     }

   },

   "model": {

     "name": "SegmAhnet3D",

     "args": {

       "num_classes": 2,

       "if_from_scratch": false,

       "if_use_psp": false,

       "pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}",

       "plane": "z",

       "final_activation": "softmax"

     }

   },

   "pre_transforms": [

     {

       "name": "LoadNifty",

       "args": {

         "fields": [

           "image",

           "label"

         ]

       }

     },

     {

       "name": "VolumeTo4DArray",

       "args": {

         "fields": [

           "image",

           "label"

         ]

       }

     },

     {

       "name": "ScaleIntensityRange",

       "args": {

         "field": "image",

         "a_min": -57,

         "a_max": 164,

         "b_min": 0.0,

         "b_max": 1.0,

         "clip": true

       }

     },

     {

       "name": "CropSubVolumeBatchPosNegRatio",

       "args": {

         "size": [

           96,

           96,

           96

         ],

         "image_field": "image",

         "label_field": "label",

         "pos": 1,

         "neg": 1,

         "batch_size": 3,

         "fast_crop": true

       }

     },

     {

       "name": "NPRandomFlip3D",

       "args": {

         "applied_keys": [

           "image",

           "label"

         ],

         "probability": 0.0

       }

     },

     {

       "name": "NPRandomRot90XY",

       "args": {

         "applied_keys": [

           "image",

           "label"

         ],

         "probability": 0.0

       }

     },

     {

       "name": "ScaleIntensityOscillation",

       "args": {

         "field": "image",

         "magnitude": 0.10

       }

     }

   ],

   "image_pipeline": {

     "name": "ImagePipeline",

     "args": {

       "task": "segmentation",

       "data_list_file_path": "{DATASET_JSON}",

       "data_file_base_dir": "{DATA_ROOT}",

       "data_list_key": "training",

       "crop_size": [

         -1,

         -1,

         -1

       ],

       "data_format": "channels_first",

       "batch_size": 0,

       "num_channels": 1,

       "num_workers": 4,

       "prefetch_size": 0

     }

   }

 },

 "validate": {

   "metrics": [

     {

       "name": "MetricAverageFromArrayDice",

       "args": {

         "name": "mean_dice",

         "stopping_metric": true,

         "applied_key": "model",

         "label_key": "label"

       }

     }

   ],

   "pre_transforms": [

     {

       "name": "LoadNifty",

       "args": {

         "fields": [

           "image",

           "label"

         ]

       }

     },

     {

       "name": "VolumeTo4DArray",

       "args": {

         "fields": [

           "image",

           "label"

         ]

       }

     },

     {

       "name": "ScaleIntensityRange",

       "args": {

         "field": "image",

         "a_min": -57,

         "a_max": 164,

         "b_min": 0.0,

         "b_max": 1.0,

         "clip": true

       }

     }

   ],

   "image_pipeline": {

     "name": "ImagePipeline",

     "args": {

       "task": "segmentation",

       "data_list_file_path": "{DATASET_JSON}",

       "data_file_base_dir": "{DATA_ROOT}",

       "data_list_key": "validation",

       "crop_size": [

         -1,

         -1,

         -1

       ],

       "data_format": "channels_first",

       "batch_size": 1,

       "num_channels": 1,

       "num_workers": 4,

       "prefetch_size": 0

     }

   },

   "inferer": {

     "name": "ScanWindowInferer",

     "args": {

       "is_channels_first": true,

       "roi_size": [

         160,

         160,

         160

       ]

     }

   }

 }

}

Explanation of components in the Training workflow:

Section Component Description
Global epochs Number of training epochs
  num_training_epoch_per_valid Validation frequency in number of epochs. If not specified, defaults to 1.
  learning_rate The initial learning rate
  multi_gpu Is the training on multiple GPUs? If not specified, defaults to false.
train loss The loss component. The Dice loss is used here.
  optimizer The optimizer component. The Adam optimizer is used here.
  lr_policy The learning rate policy. The DecayLRonStep policy is used here.
  model The model network. The SegmAhnet3D network is used here.
  pre_transforms List of transforms to be applied to the training data.
  image_pipeline The image pipeline that generates batched training data items. NOTE: The crop_size is [-1, -1, -1] for training with dynamic network input shape. The batch_size is set to 0. This is because the batching is not done by the image pipeline; instead it is done by the special transform CropSubVolumeBatchPosNegRatio for faster cropping and batching.

validate metrics Metrics to be computed during validation.
  pre_transforms Transforms to be applied to the validation data.
  image_pipeline The image pipeline that generates batched validation data items.
  inferer The inferer to be used for performing inference on validation data. Options are ScanWindowInferer and SimpleInferer.

Dynamic Network Input Shape

TensorFlow 1.13 supports dynamic network input shapes. This allows the computation graph to be built with placeholders of dynamic shape [None, None, None], which can accept input data of any size. This makes it possible to dynamically compute the best input size for an image to obtain the best performance during inference.

Because the network input size is dynamic, when performing inference, you must explicitly set the actual input size. In this example, we set the ScanWindowInferer’s roi_size to [160, 160, 160], which is used as the actual input size to the network, where ROI stands for Region Of Interest.

Classification model example

Here is the example for the classification_chestxray:

{

 "epochs": 40,

 "multi_gpu": false,

 "learning_rate": 2e-4,

 "train": {

   "model": {

     "name": "DenseNet121",

     "args": {

       "weight_decay": 1e-5,

       "pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}"

     }

   },

   "loss": {

     "name": "ClassificationLoss"

   },

   "optimizer": {

     "name": "Adam"

   },

   "pre_transforms": [

     {

       "name": "LoadPng",

       "args": {

         "fields": [

           "image"

         ]

       }

     },

     {

       "name": "CropRandomSubImageInRange",

       "args": {

         "lower_size": [

           0.9,

           0.9

         ],

         "data_format": "grayscale",

         "image_field": "image",

         "max_displacement": 200

       }

     },

     {

       "name": "NPResizeImage",

       "args": {

         "applied_keys": [

           "image"

         ],

         "output_shape": [

           256,

           256

         ],

         "data_format": "grayscale"

       }

     },

     {

       "name": "NP2DRotate",

       "args": {

         "applied_keys": [

           "image"

         ],

         "angle": 7,

         "random": true,

         "data_format": "grayscale"

       }

     },

     {

       "name": "NPExpandDims",

       "args": {

         "applied_keys": "image",

         "expand_axis": 2

       }

     },

     {

       "name": "NPRepChannels",

       "args": {

         "applied_keys": "image",

         "channel_axis": 2,

         "repeat": 3

       }

     },

     {

       "name": "CenterData",

       "args": {

         "applied_keys": "image",

         "subtrahend": [

           2876.37,

           2876.37,

           2876.37

         ],

         "divisor": [

           883,

           883,

           883

         ]

       }

     }

   ],

   "metrics": [

     {

       "name": "AccuracyComputer",

       "args": {

         "tags": "accuracy",

         "use_sigmoid": true

       }

     },

     {

       "name": "ClassificationMetric",

       "args": {

         "binary_preds_name": "binary_preds",

         "binary_labels_name": "binary_labels"

       },

       "do_summary": false,

       "do_print": false

     }

   ],

   "image_pipeline": {

     "name": "ImagePipeline",

     "args": {

       "task": "classification",

       "data_list_file_path": "{DATASET_JSON}",

       "data_file_base_dir": "{DATA_ROOT}",

       "data_list_key": "training",

       "crop_size": [

         256,

         256

       ],

       "data_format": "channels_last",

       "batch_size": 20,

       "num_channels": 3,

       "num_workers": 8,

       "prefetch_size": 21

     }

   }

 },

 "validate": {

   "pre_transforms": [

     {

       "name": "LoadPng",

       "args": {

         "fields": [

           "image"

         ]

       }

     },

     {

       "name": "NPResizeImage",

       "args": {

         "applied_keys": [

           "image"

         ],

         "output_shape": [

           256,

           256

         ],

         "data_format": "grayscale"

       }

     },

     {

       "name": "NPExpandDims",

       "args": {

         "applied_keys": "image",

         "expand_axis": 2

       }

     },

     {

       "name": "NPRepChannels",

       "args": {

         "applied_keys": "image",

         "channel_axis": 2,

         "repeat": 3

       }

     },

     {

       "name": "CenterData",

       "args": {

         "applied_keys": "image",

         "subtrahend": [

           2876.37,

           2876.37,

           2876.37

         ],

         "divisor": [

           883,

           883,

           883

         ]

       }

     }

   ],

   "metrics": [

     {

       "name": "MetricAverage",

       "args": {

         "name": "mean_accuracy",

         "applied_key": "val_accuracy"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Average_AUC",

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels",

         "auc_average": "macro",

         "stopping_metric": true

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Nodule",

         "class_index": 0,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Mass",

         "class_index": 1,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Distortion_pulmonary_architecture",

         "class_index": 2,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Pleural_based_mass",

         "class_index": 3,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Granuloma",

         "class_index": 4,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Fluid_in_pleural_space",

         "class_index": 5,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Right_hilar_abnormality",

         "class_index": 6,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Left_hilar_abnormality",

         "class_index": 7,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Major_atelectasis",

         "class_index": 8,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Infiltrate",

         "class_index": 9,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Scarring",

         "class_index": 10,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Pleural_fibrosis",

         "class_index": 11,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Bone_soft_tissue_lesion",

         "class_index": 12,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "Cardiac_abnormality",

         "class_index": 13,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     },

     {

       "name": "MetricAUC",

       "args": {

         "name": "COPD",

         "class_index": 14,

         "applied_key": "val_binary_preds",

         "label_key": "val_binary_labels"

       }

     }

   ],

   "image_pipeline": {

     "name": "ImagePipeline",

     "args": {

       "task": "classification",

       "data_list_file_path": "{DATASET_JSON}",

       "data_file_base_dir": "{DATA_ROOT}",

       "data_list_key": "validation",

       "crop_size": [

         256,

         256

       ],

       "data_format": "channels_last",

       "batch_size": 20,

       "num_channels": 3,

       "num_workers": 8,

       "prefetch_size": 21

     }

   },

   "inferer": {

     "name": "SimpleInferer"

   }

 }

}

Section Component Description
global epochs Number of training epochs
  learning_rate Initial learning rate
  multi_gpu Is the training on multiple GPUs? If not specified, defaults to false.
train model The model network component.
  loss The loss component
  optimizer The optimizer component
  pre_transforms List of transforms to be applied to the training data
  metrics Metrics to be computed during training
  image_pipeline The image pipeline that produces batched training data
validate pre_transforms List of transforms to be applied to the validation data
  metrics Metrics to be computed during validation
  image_pipeline The image pipeline that produces batched validation data
  inferer The component that does inference on validation data.

Validation configuration

Validation config file config_validation.json defines the configuration of the validation workflow.

Segmentation model example

Here is the validation config of segmentation_ct_spleen:

{

   "batch_size": 1,


   "pre_transforms":

   [

       {

           "name": "LoadResolutionFromNifty",

           "args": {

             "applied_key": "image",

             "new_key": "image_resolution"

           }

       },

       {

           "name": "LoadNifty",

           "args": {

             "fields": "image"

           }

       },

       {

           "name": "Load3DShapeFromNumpy",

           "args": {

             "applied_key": "image",

             "new_key": "image_shape"

           }

       },

       {

           "name": "ResampleVolume",

           "args": {

             "applied_key": "image",

             "resolution": "image_resolution",

             "target_resolution": [

               1.0,

               1.0,

               1.0

             ]

           }

       },

       {

           "name": "VolumeTo4DArray",

           "args": {

             "fields": "image"

           }

       },

       {

           "name": "ScaleIntensityRange",

           "args": {

             "field": "image",

             "a_min": -57,

             "a_max": 164,

             "b_min": 0.0,

             "b_max": 1.0,

             "clip": true

           }

       }

   ],


   "post_transforms":

   [

       {

           "name": "ArgmaxAcrossChannels",

           "args": {

             "applied_key": "model"

           }

       },

       {

           "name": "NPResize3D",

           "args": {

             "applied_keys": "model",

             "output_shape_key": "image_shape",

             "nearest": true

           }

       },

       {

           "name": "SplitBasedOnLabel",

           "args": {

             "applied_key": "model",

             "channel_names": [

               "pred_class0",

               "pred_class1"

             ]

           }

       }

   ],


   "writers":

   [

     {

       "name": "NiftyWriter",

       "args": {

         "applied_key": "model",

         "dtype": "uint8",

         "write_path": "{MMAR_EVAL_OUTPUT_PATH}"

       }

     },

     {

       "name": "NiftyWriter",

       "args":

       {

           "applied_key": "pred_class0",

           "dtype": "uint8",

           "write_path": "{MMAR_EVAL_OUTPUT_PATH}"

       }

     },

     {

       "name": "NiftyWriter",

       "args":

       {

         "applied_key": "pred_class1",

         "dtype": "uint8",

         "write_path": "{MMAR_EVAL_OUTPUT_PATH}"

       }

     }

   ],


   "label_transforms":

   [

       {

           "name": "LoadNifty",

           "args": {

             "fields": "label"

           }

       },

       {

           "name": "SplitBasedOnLabel",

           "args": {

             "applied_key": "label",

             "channel_names": [

               "label_class0",

               "label_class1"

             ]

           }

       }

   ],


   "val_metrics":

   [

       {

           "name": "MetricAverageFromArrayDice",

           "args": {

             "name": "mean_dice",

             "applied_key": "pred_class1",

             "label_key": "label_class1",

             "report_path": "{MMAR_EVAL_OUTPUT_PATH}"

           }

       }

   ],


   "inferer":

   {

       "name": "ScanWindowInferer",

       "args": {

         "is_channels_first": true,

         "roi_size": [160, 160, 160],

         "batch_size": 3

       }

   },


   "model_loader":

   {

       "name": "FrozenGraphModelLoader",

       "args": {

         "model_file_path": "{MMAR_CKPT_DIR}/model.trt.pb"

       }

   }

}

Components Description
batch_size Size of validation data batch
pre_transforms Transforms to be applied to validation sample images before prediction computation
post_transforms Transforms to be applied to validation sample images after prediction computation
label_transforms Transforms to be applied to validation label images
val_metrics Validation metrics to be computed and reported
writers Writers that write prediction results to files
inferer The component that does prediction computation
model_loader The component that loads the pre-trained model

Converting from the old format

This release of the Transfer Learning Toolkit strictly follows a component-oriented approach. Each component is completely configured by its set of init parameters or “args”.

The configuration format of the first release, EA (Early Access), is mostly component-oriented, but not strictly so. Some parameters are defined outside of the components. For example, num_classes is a parameter of the model, but it is defined outside of the model component definition.

Another difference compared to the previous release's configuration is the lack of separation between the class name and the init parameters: the “name” parameter is used as the class name of the component, and all other parameters are treated as the init args of the component. We have found that some components (e.g. metrics) also use “name” as one of their init args. The EA release required a workaround: “tag” is used in place of the “name” init arg, and then, when processing the component, special code is needed to change “tag” back to “name” before initializing the component.

This release separates class name from init args: just as in the Early Access release, “name” is used as the class name of the component; but all init args are placed within the “args” attribute to avoid potential conflict of parameter names.
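
As an illustrative sketch of the difference (the EA snippet below is a reconstruction based on the description above, not an exact excerpt from an EA configuration), a validation metric that itself takes a "name" init arg moves from the flat EA style:

{
    "name": "MetricAverageFromArrayDice",
    "tag": "mean_dice",
    "applied_key": "model",
    "label_key": "label"
}

to the current style, where all init args live under "args":

{
    "name": "MetricAverageFromArrayDice",
    "args": {
        "name": "mean_dice",
        "applied_key": "model",
        "label_key": "label"
    }
}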

Components

This section lists all currently available components implemented in Clara Train SDK. These components can be used in workflow configuration, as shown in section 6.4.

Type Name Description
Model SegmAhnet3D

A 3D segmentation model.

Args:

num_classes,

if_use_psp,

plane,

pretrain_weight_name,

final_activation='softmax', data_format='channels_first'

  SegResnet

A 3D segmentation model.

Args:

num_classes,

blocks_down='1,2,2,4',

blocks_up='1,1,1',

init_filters=8,

use_batch_norm=False,

use_group_norm=True,

use_group_normG=8,

reg_weight=0.0,

dropout_prob=0.0,

final_activation='softmax',

use_vae=False,

data_format='channels_first'

  DenseNet121

A 2D classification model.

Args:

weight_decay,

pretrain_weight_name

Loss Dice

A loss function with dice algorithm.

Args:

data_format='channels_first',

skip_background=False,

squared_pred=False,

jaccard=False,

smooth=1e-5,

top_smooth=0.0,

is_onehot_targets=False

  ClassificationLoss A loss function for classification models
Optimizer Adam This is a wrapper of tf.train.AdamOptimizer

  SGD This is a wrapper of tf.train.MomentumOptimizer
Data Pipeline ImagePipeline

Produce batched data for training and validation.

Args:

task,

data_list_file_path,

data_file_base_dir,

data_list_key,

crop_size,

transforms,

data_format='channels_first',

num_data_dims=3,

num_channels=1,

num_label_channels=1,

batch_size=10,

num_workers=4,

prefetch_size=20,

shuffle=True,

repeat=True

Train Metric AccuracyComputer

Compute accuracy based on prediction and label.

Args:

tags: name of the accuracy tensor

use_sigmoid=False

  DiceMaskedOutput

Compute dice mask based on prediction and label.

Args:

tags,

data_format='channels_first',

skip_background=False,

is_onehot_targets=False,

is_independent_predictions=False,

jaccard=False,

threshold=0.5

  DiceMetric

Compute dice value based on prediction and label.

Args:

data_format = 'channels_first',

skip_background=False,

is_onehot_targets=False,

is_independent_predictions=False,

jaccard=False,

threshold = 0.5

Validation Metric MetricAverageFromArrayDice

Computes dice score metric from full size np array and collects average.

Args:

applied_key,

name,

label_key='label',

do_print=True,

do_summary=True, stopping_metric=False, report_path=None

  MetricAUC

Computes AUC. Usually used for classification model validation.

Args:

applied_key,

name,

label_key,

do_print=True,

do_summary=True,

stopping_metric=False, report_path=None, auc_average='macro',

class_index=None

  MetricAverage

Generic class for tracking averages of metrics. Expects that the applied_key is a scalar value that will be averaged.

Args:

applied_key,

name,

do_print=True,

do_summary=True, stopping_metric=False, report_path=None

Model Loader CheckpointLoader

Load a model in checkpoint format.

Args:

checkpoint_dir,

input_node_names=None,

output_node_names=None,

checkpoint_file_prefix='model.ckpt'

  FrozenGraphModelLoader

Load a model in frozen graph format.

Args:

model_file_path,

input_node_names=None,

output_node_names=None

Writer NiftyWriter

Write inference result as NIFTY image.

Args:

applied_key,

write_path,

compressed=True,

dtype="float32",

use_identity=False

  ClassificationResultWriter

Write classification results.

Args:

applied_key,

write_path,

overwrite=True

Learning Rate Policy DecayLRonStep

Class for decaying the learning rate based on the number of steps. This policy decays the learning rate by `decay_ratio` every `decay_freq` steps.

Args:

decay_ratio,

decay_freq

  ReduceLRonPlateau

Class for reducing the learning rate on plateau. This policy reduces the learning rate after a plateau has been reached a certain number of times.

Args:

plateau_count_trigger, reduction_rate

  ReduceLRPoly

Class for reducing learning rate based on the epoch progress: lr = lr_init * (1 - e / total_epoch) ** poly_power.

Args:

poly_power

  ReduceLRCosine

Class for reducing learning rate based on the epoch progress

lr = lr_init * cos(0.5*pi* e / total_epoch).

Args:

poly_power

Inferer SimpleInferer Do inference by simply feeding the whole image to the network
  ScanWindowInferer

Scan the image into slices of specified ROI size, and then do inference on the slices.

Args:

roi_size,

is_channels_first=True, batch_size=1

Training with multiple GPUs

How It Works

TLT’s multi-GPU training is based on Horovod (https://github.com/horovod/horovod). It works as follows:

  • To train with N GPUs, N processes running exactly the same code are used. Each process is pinned to a GPU.
  • The optimizer is wrapped into a distributed optimizer (by calling a Horovod function).
  • Horovod synchronizes the gradients across all processes at each training step. For this to work, all processes must have an identical number of training steps.

Transfer Learning training uses two datasets: a training dataset for minimizing loss, and a validation dataset for validating the model to obtain the best model. In multi-GPU training, both datasets are sharded so that each process only takes a portion of the load.

  1. Training dataset sharding. The training dataset is divided among the number of GPUs. This is the main reason for reduced total training time - the number of training steps for each process/GPU is only 1/N of the total, where N is the number of GPUs. Since Horovod synchronizes the training processes at each step, the sharding algorithm makes sure that all shards have the same size: if the dataset size is not divisible by N, it adds the first element of the dataset to the short shards (see the worked example after this list). At the beginning of each epoch, the content of each shard is shuffled globally so that each process gets to see the whole picture of the training dataset over time.
  2. Validation dataset sharding. The same algorithm is applied to the validation dataset, except that the shards do not need to be of equal size.
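
As a worked illustration of the sharding rule described above: with 10 training items and 4 GPUs, each training shard must hold 3 items, so the two shards that would otherwise hold only 2 items are padded with the first element of the dataset; every process then runs the same number of steps per epoch. The validation shards in the same scenario would simply hold 3, 3, 2, and 2 items.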

When computing validation metrics, results by individual processes are aggregated using MPI’s gather function.

Training Parameters

It can be difficult to set up the training parameters properly with multi-GPU training.

  • Batch Size - The value of batch size is constrained by the GPU memory. If your GPUs don't have the same amount of available memory, you have to choose a batch size that is acceptable to all GPUs.
  • Learning Rate - The value of the learning rate is closely related to the number of GPUs and the batch size. According to Horovod, as a rule of thumb, you should scale up the learning rate by the number of GPUs. For example, if your LR for a single GPU is 0.0001, you could start with an LR of 0.0002 when training with 2 GPUs. It requires some experimentation to obtain the best LR.
Note: You can create your own train_ngpu.sh based on train_2gpu.sh. Make sure you adjust the learning rate accordingly.

Improving inference performance

If your model is trained with dynamic input shape (the Readme.md included in the model’s MMAR specifies whether the model is trained with dynamic input shape), you may be able to obtain significantly improved inference performance in both accuracy and speed.

TensorFlow 1.13 supports dynamic network input shapes. This allows the model’s computation graph to be built with placeholders of dynamic shape [None, None, None], which can accept input data of any size. Our POC shows that inference performance varies greatly for different input sizes. In general, inference tends to have better performance with bigger network input sizes.

Note: This is only true within certain size ranges. When the size goes beyond the range, the overall speed drops considerably, even though the total number of scanning windows is smaller. As of now, it is not clear how to accurately determine the upper bound. Only the SegmAhnet3D model has been modified to support dynamic input shape.

Training with dynamic shape

Transfer Learning has been modified to take advantage of training with dynamic shape. To use dynamic network input, you must modify config_train.json of your model, as shown here:

ImagePipeline

Set the crop_size of the two ImagePipeline components to [-1, -1, -1]. This sets the placeholder input shape to [None, None, None]. Here's an example:

"image_pipeline": {
   "name": "ImagePipeline",
   "args": {
       "task": "segmentation",
       "data_list_file_path": "{DATASET_JSON}",
       "data_file_base_dir": "{DATA_ROOT}",
       "data_list_key": "training",
       "crop_size": [-1, -1, -1],
       "data_format": "channels_first",
       "batch_size": 0,
       "num_channels": 1,
       "num_workers": 4,
       "prefetch_size": 0
   }
}

ScanWindowInferer

ScanWindowInferer can do inference for large images that cannot be fed to the model directly due to their size. It is implemented with a sophisticated algorithm:

The ScanWindowInferer is first configured with a roi_size (roi = region of interest). Based on the roi_size, it “scans” the image into a set of overlapping patches called slices. It then computes prediction for each slice. It finally computes the overall prediction by aggregating the results from slice predictions.

If your model uses scanning-window based inference during validation, you now must explicitly set its “roi_size” (see example below). Set it to a size that makes the most sense for your model: one that produces good accuracy without going over the bound. In general, this size should be no less than the size of the training crops. For SegmAhnet3D based models, the size must be divisible by 32. Here's an example:

"Inferer": {
   "name": "ScanWindowInferer",
   "args": {
     	"is_channels_first": true,
	"roi_size": [160, 160, 160]
   }
}
Note: Do NOT change crop size of any transforms for training. They decide the actual input size of the crops into the network for training.

Inference models trained with dynamic shape

To validate or inference with a model trained with dynamic input shape, you must also modify the ScanWindowInferer configuration in config_validation.json, by explicitly specifying its roi_size. You can be a little more generous here since you have more GPU memory to work with during validation and inference. To obtain optimal performance (higher accuracy with faster speed), you should experiment with different ROI sizes. Here's an example:

"Inferer": {
   "name": "ScanWindowInferer",
   "args": {
     	"is_channels_first": true,
	"roi_size": [224, 224, 224],
   }
}

The ScanWindowInferer offers another technique for improving inference speed: batch_size. The basic algorithm computes prediction for each slice one by one. This might not be able to fully utilize the GPU’s computing power. When specifying a batch_size > 1, you compute the predictions of multiple slices in one shot, hence potentially increasing the overall speed:

"Inferer": {
   "name": "ScanWindowInferer",
   "args": {
     	"is_channels_first": true,
	"roi_size": [224, 224, 224],
	“batch_size”: 2
   }
}
Note:

It is not always true that the bigger the size, the faster the overall inference. It takes some experimentation to determine the best roi_size.

roi_size can change both inference accuracy and speed, whereas batch_size only changes inference speed (i.e. inference should produce exactly the same accuracy for the same roi_size, regardless of batch_size).

Objectivity of trained models

The accuracy of the model is determined by the validation performed during training. With fixed network input shape, both training and validation (which runs inference against the graph) use the same network input shape. The accuracy of the model is therefore also fixed. However, with dynamic input shapes, training and validation no longer have to use the same input size. For example, you can use [96, 96, 96] as the crop size for training, whereas [160, 160, 160] as the ROI size of the ScanWindowInferer for validation. Using different ROI sizes for validation could produce different accuracy of the trained model.

So the questions are: how important is the accuracy value produced by the training process, and does the quality of the trained model depend on the ROI size of the ScanWindowInferer used by validation?

To find answers to these questions, we ran multiple rounds of training with different ROI sizes with deterministic training enabled. All these runs produced the best model at exactly the same epoch with different “best mean dice” values.

Based on the results of these experiments, we conclude:

  • It appears that the quality of the trained model does not depend on the ROI size for validation, even though the accuracy values do vary for different ROI sizes. This means that the trained model is objective.
  • The accuracy value determined by the training process is still important, but only in a relative sense. You can probably still compare two models and judge which is better, but you should do so with the same ROI size for validation.

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, DGX Station, GRID, Jetson, Kepler, NVIDIA GPU Cloud, Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, Tesla and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.