NVIDIA Clara Train 4.0

Getting started with Clara

Clara uses a supervised training algorithm to find the best model based on training and validation datasets.

The training dataset contains pairs of data items (inputs and their ground-truth labels) used to minimize the loss. The validation dataset contains similar pairs used to validate the model during training.

A single pass through a full dataset is referred to as an epoch. Since a full dataset cannot typically be processed in a single iteration, it is divided into batches of data items. For each batch, an optimizer minimizes a loss function and adjusts the weights of the model accordingly. Training metrics are collected and logged during this process.

Once all iterations of an epoch are completed, validation is performed if scheduled. Validation runs the validation dataset through the current model and computes metrics that measure the quality of that model. Among these is the key metric (also called the stopping metric), which is used to determine the quality of the model.

Validation is typically run every N epochs, where N is configurable. The validation result determines the best model: the algorithm keeps track of the current best key metric, which is initialized to a large negative number. Each time validation runs, the computed key metric is compared with the current best; if it is better, the current best is set to the new metric value and the current model is written to disk as model.pt. The model.pt file always represents the best model found so far.

The more often validation is performed, the more likely you are to find the best model, but validating after every iteration can take a long time because each validation pass goes through the whole validation dataset. In practice, validation should be performed every N epochs, with N configured via the num_interval_per_valid parameter.

When training completes, final_model.pt is written to disk; it can be used for fine-tuning. This general algorithm is used for all modes of training: train, fine-tune, multi-GPU train, and multi-GPU fine-tune.
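To make the algorithm concrete, the following Python sketch mirrors this loop. It is illustrative only: compute_loss and validate are hypothetical helpers, not SDK functions, and only num_interval_per_valid corresponds to the configuration parameter described above.

import torch

def fit(model, optimizer, train_loader, val_loader, num_epochs, num_interval_per_valid):
    best_key_metric = float("-inf")  # current best key metric, initialized to a large negative number
    for epoch in range(num_epochs):
        for batch in train_loader:  # one pass over all batches is one epoch
            loss = compute_loss(model, batch)  # hypothetical helper computing the loss for a batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # the optimizer adjusts the model weights to minimize the loss
        if (epoch + 1) % num_interval_per_valid == 0:  # validate every N epochs
            key_metric = validate(model, val_loader)  # hypothetical helper returning the key metric
            if key_metric > best_key_metric:
                best_key_metric = key_metric
                torch.save(model.state_dict(), "model.pt")  # best model so far
    torch.save(model.state_dict(), "final_model.pt")  # final model, usable for fine-tuning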

If the data format is DICOM or the resolution is not isotropic, you can use the provided data converter tool to convert the data to isotropic NIfTI format. Furthermore, many pre-trained models were trained on images with 1x1x1 mm resolution, so to use those pre-trained models as a starting point, convert the data to 1x1x1 mm NIfTI format. (Note: even if the dataset is already in NIfTI format, data conversion is still required if the spacing is not 1x1x1 mm.)

The medl-dataconvert command converts all DICOM volumes in your/data/directory to NIfTI format and optionally re-samples them to the provided resolution. If the images to be converted are segmentation labels, add the -l flag so the resampler uses nearest-neighbor interpolation (otherwise linear interpolation is used).


medl-dataconvert -d your/data/directory -r 1 -s .dcm -e .nii.gz -o your/output/directory

Note

If you need to convert both 3D volumetric images and their segmentation labels, put them into two different folders, and run the converter once for the images and once for the labels using the -l flag.

Supported options are:

Option  Description

-d      Input directory with subdirectories containing DICOM images.

-r      Output image resolution. If not provided, the DICOM resolution is preserved. If only a single value is provided, the target resolution is isotropic (e.g. -r 1 for 1x1x1 mm resolution).

-s      Input file format: .dcm, .nii, .nii.gz, .mha, or .mhd.

-e      Output file format: .nii, .nii.gz, .mha, or .mhd.

-o      Output directory.

-f      (Optional) Force overwriting existing files if the output directory already exists.

-l      (Optional) Flag indicating that the data is label/segmentation masks, so nearest-neighbor interpolation is used for re-sampling.
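For example, assuming images and labels are kept in two hypothetical folders, the converter would be run once for each, adding -l for the labels:

medl-dataconvert -d your/images/directory -r 1 -s .dcm -e .nii.gz -o your/output/images
medl-dataconvert -d your/labels/directory -r 1 -s .dcm -e .nii.gz -o your/output/labels -l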

Classification models: Prepare the data

This section describes the data format required to use transfer learning for 2D classification tasks.

Classification models: Data format

All input images and labels must be in PNG format. If you plan to resample images, e.g. to 256x256, it is best to do that as a pre-processing step rather than have the Clara Train SDK do it on the fly. The PNG files can be 8- or 16-bit. You must also have ground-truth labels available. These are often binary, i.e. {0,1}, or multi-class, i.e. {0,…,C-1} if there are C classes.
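For instance, here is a minimal pre-processing sketch using the Pillow library (an assumption; any image tool works), with hypothetical folder names:

from pathlib import Path
from PIL import Image

src = Path("png_files")       # hypothetical folder of original PNGs
dst = Path("png_files_256")   # hypothetical output folder
dst.mkdir(exist_ok=True)

for path in src.glob("*.png"):
    # Resample each image to 256x256 once, up front, instead of on the fly.
    Image.open(path).resize((256, 256), Image.BILINEAR).save(dst / path.name)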

Classification models: Folder structure

The layout of data files can be arbitrary, but the JSON file describing the data list must contain relative paths to all image files:


|--dataset_root:
    |--datalist.json
    |--png_files
        |--im1.png
        |--im2.png
        |--im3.png


Classification models: Datalist JSON file

The JSON file describing the data structure must include a label_format key. The corresponding value should be a list of natural numbers specifying the number and type of labels in the dataset. For instance, the PLCO dataset has 15 binary labels, so the value should be a list of 15 ones: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1].

The datalist file should also have training and validation keys. The value of each is a list of dictionaries, where:

  • the value for the image key must be a relative path to the PNG file;

  • the value for the label key must be a list of natural numbers corresponding to the ground-truth labels.

The labels for each image must match the label_format specified above:


{ "label_format": [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1], "training": [ { "image" : "im1.png" "label" : [0,0,1,0,0,0,0,0,0,0,1,0,0,0,0] }, ...

The validation key is optional and only needs to be specified if the main training config file specifies metrics to compute. If provided, it specifies the images and labels used to compute the validation metrics at the end of each training epoch (or more or less frequently, if so specified in the main training config).
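A minimal sketch of writing such a datalist from Python (the entries and label values are hypothetical):

import json

datalist = {
    "label_format": [1] * 15,  # 15 binary labels, as in the PLCO example above
    "training": [
        # Each label list must match label_format in length.
        {"image": "png_files/im1.png", "label": [0, 0, 1] + [0] * 12},
        {"image": "png_files/im2.png", "label": [1] + [0] * 14},
    ],
    "validation": [
        {"image": "png_files/im3.png", "label": [0] * 15},
    ],
}

# Image paths are relative to the location of datalist.json.
with open("dataset_root/datalist.json", "w") as f:
    json.dump(datalist, f, indent=4)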

Training a classification model

Run train.sh to train the model.


cd path/to/mmar/commands/folder
./train.sh

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run train_finetune.sh:


cd path/to/mmar/commands/folder
./train_finetune.sh

The resulting checkpoint files are stored in the models folder of the MMAR.

For more details about MMAR, see Medical Model Archive (MMAR).

Classification models: Multi-GPU training

To run multi-GPU training, run train_2gpu.sh. See Medical Model Archive (MMAR).

When training or fine-tuning models in a multi-GPU setting on a small training dataset, it is recommended to adjust the learning rate provided in the configuration files, e.g. multiply the learning rate by the number of GPUs, as recommended in https://arxiv.org/pdf/1706.02677.pdf.
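As a minimal illustration of that linear scaling rule (the base value is hypothetical):

# Linear scaling rule (https://arxiv.org/pdf/1706.02677.pdf):
# multiply the single-GPU learning rate by the number of GPUs.
base_lr = 1e-4                  # hypothetical learning rate from the single-GPU config
num_gpus = 2
scaled_lr = base_lr * num_gpus  # value to set in the multi-GPU config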

Classification models: TensorBoard visualization

You can run the following command to use TensorBoard for visualization:


python3 -m tensorboard.main --logdir "${MODEL_DIR}"


Classification models: Exporting the model for TensorRT-optimized inference

After the model has been trained, run export.sh from the “commands” folder in MMAR to export the checkpoint into frozen graphs.


cd path/to/mmar/commands/folder
./export.sh

Two frozen graph files will be produced in the models folder of the MMAR:

  • model.fzn.pb - a regular frozen graph

  • model.trt.pb - a TensorRT-optimized frozen graph

Classification model evaluation with ground truth

Run validate.sh from the MMAR.


cd path/to/mmar/commands/folder
./validate.sh

The validation result files are created in the eval folder of the MMAR.

Classification model inference

Run infer.sh from the MMAR.


cd path/to/mmar/commands/folder
./infer.sh

The inference result files are created in the eval folder of the MMAR.

Note

Use the same configuration file for both validation and inference. For inference, the metric values specified in the configuration file won’t be computed, and no ground truth label is needed.


This section provides instructions on preparing your data, and on training, exporting, evaluating, and running inference on segmentation models using transfer learning.

Segmentation models: Prepare the data

All input images and labels must be in NIfTI format. Each input image and its corresponding label mask must have the same image dimension. To visualize or save NIfTI images, you can use free viewers such as ITK-SNAP or MITK.

If your native data format is not NIfTI, or if you want to convert the image and label mask to isotropic resolution, you can use the provided data converter, another tool of your choice such as ITK-SNAP, or do the conversion directly in Python.
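For the Python route, here is a minimal sketch using the SimpleITK package (an assumption; any resampling library would do). The file paths and the resample_to_isotropic name are hypothetical:

import SimpleITK as sitk

def resample_to_isotropic(in_path, out_path, spacing_mm=1.0, is_label=False):
    img = sitk.ReadImage(in_path)
    # Compute the output size so the physical extent of the volume is preserved.
    new_size = [
        int(round(sz * sp / spacing_mm))
        for sz, sp in zip(img.GetSize(), img.GetSpacing())
    ]
    # Nearest neighbor for label masks, linear for images (as with medl-dataconvert -l).
    interpolator = sitk.sitkNearestNeighbor if is_label else sitk.sitkLinear
    resampled = sitk.Resample(
        img, new_size, sitk.Transform(), interpolator,
        img.GetOrigin(), (spacing_mm,) * 3, img.GetDirection(),
        0, img.GetPixelID(),
    )
    sitk.WriteImage(resampled, out_path)

resample_to_isotropic("im1.nii.gz", "im1_iso.nii.gz")
resample_to_isotropic("lb1.nii.gz", "lb1_iso.nii.gz", is_label=True)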

Segmentation models: Folder structure

The layout of data files can be arbitrary, but the JSON file describing the data list must contain the relative paths to all data files:


|--dataset_root:
    |--datalist.json
    |--train
        |--im1.nii.gz
        |--lb1.nii.gz
        |--im2.nii.gz
        |--lb2.nii.gz
        |--im3.nii.gz
        |--lb3.nii.gz
        |--im4.nii.gz
        |--lb4.nii.gz
    |--val
        |--im1.nii.gz
        |--lb1.nii.gz
        |--im2.nii.gz
        |--lb2.nii.gz

For example, the datalist.json file looks similar to this, where all paths are relative to the datalist.json location:


{ "training": [ { "image" : "train/im1.nii.gz", "label" : "train/lb1.nii.gz" }, { "image" : "train/im2.nii.gz", "label" : "train/lb2.nii.gz" }, { "image" : "train/im3.nii.gz", "label" : "train/lb3.nii.gz" }, { "image" : "train/im4.nii.gz", "label" : "train/lb4.nii.gz" }, ], "validation": [ { "image" : "val/im1.nii.gz", "label" : "val/lb1.nii.gz" }, { "image" : "val/im2.nii.gz", "label" : "val/lb2.nii.gz" }, ] }

The training and validation lists contain the images to be used in the training and validation steps, respectively.

Note

By default, all paths inside the datalist.json are assumed relative to the datalist.json file location. You can optionally specify the ROOT base path of the datasets in the main config file (image_base_dir JSON key) or as a command line option (--file_root) to the train command.


Segmentation models: Datalist JSON file

The JSON file describing the data structure must include the training key with a list of items (each containing image and label keys).

The value for the image key can be a string containing the path to a single NIfTI file, or a list of strings that are paths to several NIfTI files. If the image has several channels, they can be stored as separate files. Here is an example:


{ "image" : [ "train/im1_ch1.nii.gz", "train/im1_ch2.nii.gz", "train/im1_ch3.nii.gz", "train/im1_ch4.nii.gz" ] "label" : "train/lb1.nii.gz" },

Note

If image includes several files, they will be concatenated as separate channels of the network input. These images must be already spatially aligned.

The value for the label key must be a string containing the path to a single NIfTI file with dense segmentation masks. The label mask can define the segmentation either with indices, where each integer index is a separate class, or as a multi-channel one-hot-encoded image, where each channel represents a separate class.
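To illustrate the relationship between the two encodings, here is a minimal numpy sketch converting an index mask to its one-hot form (the array shapes and values are hypothetical):

import numpy as np

num_classes = 3
# Hypothetical index-encoded mask: one integer class per voxel.
index_mask = np.random.randint(0, num_classes, size=(64, 64, 64))

# Equivalent one-hot encoding: one binary channel per class.
one_hot = (index_mask[None, ...] == np.arange(num_classes)[:, None, None, None])
one_hot = one_hot.astype(np.uint8)  # shape: (num_classes, 64, 64, 64)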

The validation key is optional. If provided, the corresponding images and labels will be used to compute the validation metrics at the end of each training epoch (or more or less frequently, if so specified in the main training config). If the datalist.json is used for inference to compute output segmentation masks, the validation section does not need to include the label keys.

Training a segmentation model

Segmentation training

Use train.sh to train the model:


cd path/to/mmar/commands/folder
./train.sh


Segmentation models: Fine tuning

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run train_finetune.sh:


cd path/to/mmar/commands/folder
./train_finetune.sh

The resulting checkpoint files are stored in the models folder of the MMAR.

For more details see Medical Model Archive (MMAR).

Segmentation models: Multi-GPU training

To run 2-GPU training, run train_2gpu.sh:


cd path/to/mmar/commands/folder
./train_2gpu.sh

To fine-tune based on the pre-trained model included in the MMAR, first change the DATA_ROOT and DATASET_JSON to point to your dataset and data split configuration. Then run train_2gpu_finetune.sh:


cd path/to/mmar/commands/folder
./train_2gpu_finetune.sh

The resulting checkpoint files are stored in the models folder of the MMAR.

See Medical Model Archive (MMAR) for more details.

When training or fine-tuning models in a multi-GPU setting on a relatively small training dataset, it is recommended to adjust the learning rate provided in the configuration files, e.g. multiply the learning rate by the number of GPUs, as recommended in https://arxiv.org/pdf/1706.02677.pdf. You can create your own train_Ngpu.sh based on train_2gpu.sh; make sure to adjust the learning rate accordingly.

Segmentation model evaluation with ground truth

Run validate.sh from the MMAR.


cd path/to/mmar/commands/folder
./validate.sh

The validation result files are created in the eval folder of the MMAR.

See Model training and validation configurations for an example validation config for a classification model.

Segmentation model inference

Use infer.sh to run inference on the model from the Medical Model Archive.


cd path/to/mmar/commands/folder
./infer.sh

The inference result files are created in the eval folder of the MMAR.

Note

The same configuration file is used for both validation and inference. For inference, the metric values specified in the configuration file won’t be computed, and no ground truth label is needed.


© Copyright 2020, NVIDIA. Last updated on Feb 2, 2023.