NVIDIA Clara Train 3.1

Model training and validation configurations

Clara workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.

Clara offers two kinds of workflows: training and validation. Workflow configurations are defined in JSON files: config_train.json for training workflows and config_validation.json for validation workflows.

The training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.

You can define global variables in the configuration JSON file. These variables can be overridden through environment.json or on the command line. Typical global variables include:

{
  "epochs": 5000,
  "num_training_epoch_per_valid": 20,
  "learning_rate": 1e-4,
  "multi_gpu": false,
  …
}

By overriding the values of these variables on the command line, you can experiment with different training settings without modifying the config file. For example, in commands/train.sh:

python3 -u -m nvmidl.apps.train \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    --set \
    epochs=1260 \
    learning_rate=0.0001 \
    num_training_epoch_per_valid=20 \
    multi_gpu=false
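The effect of the --set overrides can be pictured as a merge of key=value pairs into the global variables. The following is a minimal sketch of that idea only; the helper names parse_override and apply_overrides are hypothetical and are not part of Clara Train, and the actual parsing rules used by nvmidl.apps.train may differ.

```python
import json


def parse_override(pair):
    """Split 'key=value' and decode the value as JSON when possible."""
    key, _, raw = pair.partition("=")
    try:
        value = json.loads(raw)  # handles numbers and true/false
    except json.JSONDecodeError:
        value = raw  # anything else is kept as a plain string
    return key, value


def apply_overrides(config, pairs):
    """Return a copy of config with each 'key=value' pair applied at the top level."""
    merged = dict(config)
    for pair in pairs:
        key, value = parse_override(pair)
        merged[key] = value
    return merged


# Global variables as they might appear in config_train.json.
config = {
    "epochs": 5000,
    "num_training_epoch_per_valid": 20,
    "learning_rate": 1e-4,
    "multi_gpu": False,
}
overrides = ["epochs=1260", "learning_rate=0.0001", "multi_gpu=false"]
result = apply_overrides(config, overrides)
```

In this sketch the original config dictionary is left untouched, which mirrors the point made above: the file on disk stays the same while each run can use different settings.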

The “train” section defines the components for the training process, including “loss”, “optimizer”, “lr_policy”, “model”, “pre_transforms” and “image_pipeline”. Each component is constructed by providing the component’s class “name” and the init arguments “args”.

Similarly, the “validate” section defines the components for the validation process, including “metrics”, “pre_transforms”, “image_pipeline” and “inferer”. Each component is constructed the same way, by providing the component class “name” and the corresponding init arguments “args”.

If you want to use an externally-implemented component class, you can do so by specifying the class “path” to replace the “name”. For more details see Bring your own components (BYOC).
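The “name”/“args” and “path”/“args” conventions described above can be sketched as a small factory. This is an illustration under stated assumptions, not Clara's implementation: REGISTRY, the stand-in Adam class, and build_component are hypothetical, and collections.Counter is used only as a convenient importable class to demonstrate “path”.

```python
import importlib


class Adam:
    """Stand-in optimizer class used only for this illustration."""

    def __init__(self, beta1=0.9):
        self.beta1 = beta1


# Hypothetical registry mapping built-in component names to classes.
REGISTRY = {"Adam": Adam}


def build_component(spec):
    """Instantiate a component from a spec with "name" or "path", plus optional "args"."""
    args = spec.get("args", {})
    if "path" in spec:
        # "path" is treated here as a dotted module path ending in the class name.
        module_name, _, class_name = spec["path"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
    else:
        cls = REGISTRY[spec["name"]]
    return cls(**args)


optimizer = build_component({"name": "Adam", "args": {"beta1": 0.8}})
counter = build_component({"path": "collections.Counter", "args": {"a": 2}})
```

The point of the design is that the JSON file fully determines which class is built and with which init arguments, so swapping a component means editing configuration rather than code.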

Explanation of components in the Training workflow:

| Section | Component | Description |
| --- | --- | --- |
| global | epochs | Number of training epochs |
| global | num_training_epoch_per_valid | Validation frequency in number of epochs. If not specified, defaults to 1. |
| global | learning_rate | The initial learning rate |
| global | multi_gpu | Is the training on multiple GPUs? If not specified, defaults to false. |
| global | determinism | Optional section with seeds for deterministic training. See determinism for details. |
| global | use_amp | Whether or not to use Automatic Mixed Precision. Defaults to false. |
| global | dynamic_input_shape | Whether or not to construct graph with dynamic input shape, allowing for inference with images of different shape. Defaults to false. |
| train | loss | The loss component |
| train | optimizer | The optimizer component |
| train | lr_policy | The learning rate policy |
| train | model | The model network component |
| train | pre_transforms | List of transforms to be applied to the training data. Typically, “fields” corresponds to the fields in the Datalist JSON file that operations will be applied on. |
| train | batch_transforms | Component to configure using MergeBatchDims. See ImagePipeline for details. |
| train | image_pipeline | The image pipeline that generates batched training data items. See ImagePipeline for details. |
| validate | metrics | Metrics to be computed during validation |
| validate | pre_transforms | Transforms to be applied to the validation data |
| validate | image_pipeline | The image pipeline that generates batched validation data items. See ImagePipeline for details. |
| validate | inferer | The inferer to be used for performing inference on validation data. Options are TFScanWindowInferer and TFSimpleInferer. |

Segmentation model example

Here is an example of config_train.json of the segmentation_ct_spleen model:

{
  "epochs": 1250,
  "num_training_epoch_per_valid": 20,
  "learning_rate": 1e-4,
  "multi_gpu": false,
  "determinism": {
    "python_seed": "20191201",
    "random_seed": 123456,
    "numpy_seed": 654321,
    "tf_seed": 11111
  },
  "use_amp": false,
  "dynamic_input_shape": true,
  "train": {
    "loss": { "name": "Dice" },
    "optimizer": { "name": "Adam" },
    "lr_policy": {
      "name": "DecayOnStep",
      "args": { "decay_ratio": 0.1, "decay_freq": 50000 }
    },
    "model": {
      "name": "SegAhnet",
      "args": {
        "num_classes": 2,
        "if_use_psp": false,
        "pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}",
        "plane": "z",
        "final_activation": "softmax",
        "n_spatial_dim": 3
      }
    },
    "pre_transforms": [
      { "name": "LoadNifti", "args": { "fields": ["image", "label"] } },
      { "name": "ConvertToChannelsFirst", "args": { "fields": ["image", "label"] } },
      {
        "name": "ScaleIntensityRange",
        "args": { "fields": "image", "a_min": -57, "a_max": 164, "b_min": 0.0, "b_max": 1.0, "clip": true }
      },
      {
        "name": "FastCropByPosNegRatio",
        "args": { "size": [96, 96, 96], "fields": "image", "label_field": "label", "pos": 1, "neg": 1, "batch_size": 3 }
      },
      { "name": "RandomAxisFlip", "args": { "fields": ["image", "label"], "probability": 0.0 } },
      { "name": "RandomRotate3D", "args": { "fields": ["image", "label"], "probability": 0.0 } },
      { "name": "ScaleIntensityOscillation", "args": { "fields": "image", "magnitude": 0.10 } }
    ],
    "image_pipeline": {
      "name": "SegmentationImagePipelineWithCache",
      "args": {
        "data_list_file_path": "{DATASET_JSON}",
        "data_file_base_dir": "{DATA_ROOT}",
        "data_list_key": "training",
        "output_crop_size": [96, 96, 96],
        "output_batch_size": 3,
        "batched_by_transforms": true,
        "num_workers": 2,
        "prefetch_size": 0,
        "num_cache_objects": 20,
        "replace_percent": 0.25
      }
    }
  },
  "validate": {
    "metrics": [
      {
        "name": "ComputeAverageDice",
        "args": { "name": "mean_dice", "is_key_metric": true, "field": "model", "label_field": "label" }
      }
    ],
    "pre_transforms": [
      { "name": "LoadNifti", "args": { "fields": ["image", "label"] } },
      { "name": "ConvertToChannelsFirst", "args": { "fields": ["image", "label"] } },
      {
        "name": "ScaleIntensityRange",
        "args": { "fields": "image", "a_min": -57, "a_max": 164, "b_min": 0.0, "b_max": 1.0, "clip": true }
      }
    ],
    "image_pipeline": {
      "name": "SegmentationImagePipeline",
      "args": {
        "data_list_file_path": "{DATASET_JSON}",
        "data_file_base_dir": "{DATA_ROOT}",
        "data_list_key": "validation",
        "output_crop_size": [-1, -1, -1],
        "output_batch_size": 1,
        "num_workers": 2,
        "prefetch_size": 0
      }
    },
    "inferer": {
      "name": "TFScanWindowInferer",
      "args": { "roi_size": [160, 160, 160] }
    }
  }
}


Classification model example

Here is the example for the classification_chestxray:

{
  "epochs": 40,
  "multi_gpu": false,
  "learning_rate": 2e-4,
  "train": {
    "model": {
      "name": "DenseNet121",
      "args": {
        "weight_decay": 1e-5,
        "data_format": "channels_last",
        "pretrain_weight_name": "{PRETRAIN_WEIGHTS_FILE}"
      }
    },
    "loss": { "name": "BinaryClassificationLoss" },
    "optimizer": { "name": "Adam" },
    "pre_transforms": [
      { "name": "LoadPng", "args": { "fields": ["image"] } },
      { "name": "CropRandomSizeWithDisplacement", "args": { "lower_size": [0.9, 0.9], "fields": "image", "max_displacement": 200 } },
      { "name": "ScaleToShape", "args": { "fields": ["image"], "target_shape": [256, 256] } },
      { "name": "RandomRotate2D", "args": { "fields": ["image"], "angle": 7, "is_random": true } },
      { "name": "ConvertToChannelsLast", "args": { "fields": "image" } },
      { "name": "RepeatChannel", "args": { "fields": "image", "repeat_times": 3 } },
      { "name": "CenterData", "args": { "fields": "image", "subtrahend": [2876.37, 2876.37, 2876.37], "divisor": [883, 883, 883] } }
    ],
    "aux_ops": [
      { "name": "ComputeAccuracy", "args": { "tag": "accuracy", "use_sigmoid": true } },
      { "name": "ComputeBinaryPreds", "args": { "binary_preds_name": "binary_preds", "binary_labels_name": "binary_labels" } }
    ],
    "image_pipeline": {
      "name": "ClassificationImagePipeline",
      "args": {
        "data_list_file_path": "{DATASET_JSON}",
        "data_file_base_dir": "{DATA_ROOT}",
        "data_list_key": "training",
        "output_crop_size": [256, 256],
        "output_data_format": "channels_last",
        "output_batch_size": 20,
        "output_image_channels": 3,
        "num_workers": 4,
        "prefetch_size": 21
      }
    }
  },
  "validate": {
    "pre_transforms": [
      { "name": "LoadPng", "args": { "fields": ["image"] } },
      { "name": "ScaleToShape", "args": { "fields": ["image"], "target_shape": [256, 256] } },
      { "name": "ConvertToChannelsLast", "args": { "fields": "image" } },
      { "name": "RepeatChannel", "args": { "fields": "image", "repeat_times": 3 } },
      { "name": "CenterData", "args": { "fields": "image", "subtrahend": [2876.37, 2876.37, 2876.37], "divisor": [883, 883, 883] } }
    ],
    "metrics": [
      { "name": "ComputeAverage", "args": { "name": "mean_accuracy", "field": "accuracy" } },
      { "name": "ComputeAUC", "args": { "name": "Average_AUC", "field": "binary_preds", "label_field": "binary_labels", "auc_average": "macro", "is_key_metric": true } },
      { "name": "ComputeAUC", "args": { "name": "Nodule", "class_index": 0, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Mass", "class_index": 1, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Distortion_pulmonary_architecture", "class_index": 2, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Pleural_based_mass", "class_index": 3, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Granuloma", "class_index": 4, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Fluid_in_pleural_space", "class_index": 5, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Right_hilar_abnormality", "class_index": 6, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Left_hilar_abnormality", "class_index": 7, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Major_atelectasis", "class_index": 8, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Infiltrate", "class_index": 9, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Scarring", "class_index": 10, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Pleural_fibrosis", "class_index": 11, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Bone_soft_tissue_lesion", "class_index": 12, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "Cardiac_abnormality", "class_index": 13, "field": "binary_preds", "label_field": "binary_labels" } },
      { "name": "ComputeAUC", "args": { "name": "COPD", "class_index": 14, "field": "binary_preds", "label_field": "binary_labels" } }
    ],
    "image_pipeline": {
      "name": "ClassificationImagePipeline",
      "args": {
        "data_list_file_path": "{DATASET_JSON}",
        "data_file_base_dir": "{DATA_ROOT}",
        "data_list_key": "validation",
        "output_crop_size": [256, 256],
        "output_data_format": "channels_last",
        "output_batch_size": 20,
        "output_image_channels": 3,
        "num_workers": 4,
        "prefetch_size": 21
      }
    },
    "inferer": { "name": "TFSimpleInferer" }
  }
}


Component reference in config_train.json

In config_train.json, many transforms defined in the “validate” section are exact copies of those in the “train” section. The component reference mechanism removes the need to duplicate them by hand.

A “ref” entry in the “validate” section points to the transform with the specified name defined in the “train” section.

Note

The component reference mechanism only works for transforms!

Example:

{
  ...
  "train": {
    ...
    "pre_transforms": [
      { "name": "LoadNifti", "args": { "fields": ["image", "label"] } },
      { "name": "ConvertToChannelsFirst", "args": { "fields": ["image", "label"] } },
      {
        "name": "ScaleIntensityRange",
        "args": { "fields": "image", "a_min": -57, "a_max": 164, "b_min": 0.0, "b_max": 1.0, "clip": true }
      }
      ...
    ],
    ...
  },
  "validate": {
    ...
    "pre_transforms": [
      { "ref": "LoadNifti" },
      { "ref": "ConvertToChannelsFirst" },
      { "ref": "ScaleIntensityRange" }
    ],
    ...
  }
}

You can also assign an alias to a transform in the “train” section and then use that alias in “ref” in the “validate” section. This is useful when the same transform type is used more than once in the “train” section: assign a different alias to each use and reference each one by its alias in the “validate” section.

In the following example, the “LoadNifti” transform is assigned an alias “loadImage”, which is referenced in the “validate” section:

{
  ...
  "train": {
    ...
    "pre_transforms": [
      { "name": "LoadNifti#loadImage", "args": { "fields": ["image", "label"] } },
      ...
    ],
    ...
  },
  "validate": {
    ...
    "pre_transforms": [
      { "ref": "loadImage" },
      ...
    ],
    ...
  }
}
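The reference mechanism, including the Name#alias form, can be sketched as a lookup built from the “train” transforms. This is only an illustration of the behavior described above: index_transforms and resolve_refs are hypothetical helpers, and the rule that a bare class name resolves to its first use in “train” is an assumption made for the sketch, not documented behavior.

```python
def index_transforms(train_transforms):
    """Map each transform's class name, and alias if present, to its spec."""
    index = {}
    for spec in train_transforms:
        class_name, _, alias = spec["name"].partition("#")
        resolved = dict(spec, name=class_name)  # strip the "#alias" suffix
        # Assumption: a bare class name refers to its first use in "train".
        index.setdefault(class_name, resolved)
        if alias:
            index[alias] = resolved
    return index


def resolve_refs(validate_transforms, train_transforms):
    """Replace every {"ref": ...} entry with the referenced train transform."""
    index = index_transforms(train_transforms)
    return [index[s["ref"]] if "ref" in s else s for s in validate_transforms]


train = [
    {"name": "LoadNifti#loadImage", "args": {"fields": ["image", "label"]}},
    {"name": "ConvertToChannelsFirst", "args": {"fields": ["image", "label"]}},
]
validate = [{"ref": "loadImage"}, {"ref": "ConvertToChannelsFirst"}]
resolved = resolve_refs(validate, train)
```

After resolution, each “validate” entry carries the same class name and args as the transform it references, so a change made once in “train” applies to both sections.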


The validation config file config_validation.json defines the configuration of the validation workflow.

| Components | Description |
| --- | --- |
| batch_size | Size of validation data batch |
| pre_transforms | Transforms to be applied to validation sample images before prediction computation |
| post_transforms | Transforms to be applied to validation sample images after prediction computation |
| label_transforms | Transforms to be applied to validation label images |
| val_metrics | Validation metrics to be computed and reported |
| writers | Writers that write prediction results to files |
| inferer | The component that does prediction computation |
| model_loader | The component that loads the pre-trained model |

Validation configuration: segmentation model example

Here is the validation config of segmentation_ct_spleen:

{
  "batch_size": 1,
  "pre_transforms": [
    { "name": "LoadNifti", "args": { "fields": "image" } },
    { "name": "ConvertToChannelsFirst", "args": { "fields": "image" } },
    {
      "name": "ScaleIntensityRange",
      "args": { "fields": "image", "a_min": -57, "a_max": 164, "b_min": 0.0, "b_max": 1.0, "clip": true }
    }
  ],
  "post_transforms": [
    { "name": "ArgmaxAcrossChannels", "args": { "fields": "model" } },
    { "name": "SplitBasedOnLabel", "args": { "field": "model", "label_names": ["pred_class0", "pred_class1"] } },
    {
      "name": "CopyProperties",
      "args": { "fields": ["pred_class0", "pred_class1", "model"], "from_field": "image", "properties": ["affine"] }
    }
  ],
  "writers": [
    { "name": "WriteNifti", "args": { "field": "model", "dtype": "uint8", "write_path": "{MMAR_EVAL_OUTPUT_PATH}" } },
    { "name": "WriteNifti", "args": { "field": "pred_class0", "dtype": "uint8", "write_path": "{MMAR_EVAL_OUTPUT_PATH}" } },
    { "name": "WriteNifti", "args": { "field": "pred_class1", "dtype": "uint8", "write_path": "{MMAR_EVAL_OUTPUT_PATH}" } }
  ],
  "label_transforms": [
    { "name": "LoadNifti", "args": { "fields": "label" } },
    { "name": "ConvertToChannelsFirst", "args": { "fields": "label" } },
    { "name": "SplitBasedOnLabel", "args": { "field": "label", "label_names": ["label_class0", "label_class1"] } }
  ],
  "val_metrics": [
    {
      "name": "ComputeAverageDice",
      "args": { "name": "mean_dice", "field": "pred_class1", "label_field": "label_class1", "report_path": "{MMAR_EVAL_OUTPUT_PATH}" }
    }
  ],
  "inferer": {
    "name": "TFScanWindowInferer",
    "args": { "roi_size": [160, 160, 160], "batch_size": 3 }
  },
  "model_loader": {
    "name": "FrozenGraphModelLoader",
    "args": { "model_file_path": "{MMAR_CKPT_DIR}/model.trt.pb" }
  }
}

Note

Because the network input size is dynamic, when performing inference, you must explicitly set the actual input size. In this example, TFScanWindowInferer’s roi_size is set to [160, 160, 160], which is used as the actual input size to the network, where ROI stands for Region Of Interest. See Improving inference performance for details.

You can disable a component by adding "disabled": true to that component, as follows:

{
  "name": "RandomSpatialFlip",
  "args": { "fields": ["image", "label"] },
  "disabled": true
}

This option is useful if users would like to temporarily disable a component when trying different combinations of components in experiments.
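One way to picture the “disabled” flag is as a filter applied before components are constructed. The sketch below assumes that reading; active_components is a hypothetical helper, not a Clara Train function.

```python
def active_components(specs):
    """Keep only the component specs that are not marked "disabled": true."""
    return [spec for spec in specs if not spec.get("disabled", False)]


pre_transforms = [
    {"name": "LoadNifti", "args": {"fields": ["image", "label"]}},
    {"name": "RandomSpatialFlip", "args": {"fields": ["image", "label"]}, "disabled": True},
    {"name": "ConvertToChannelsFirst", "args": {"fields": ["image", "label"]}},
]
enabled = active_components(pre_transforms)
enabled_names = [spec["name"] for spec in enabled]
```

A spec without the flag is treated as enabled, so existing configurations keep working unchanged.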

Configuring validation or testing dataset

Just as data_list_key in ImagePipeline specifies which key of the dataset JSON file holds the data to use, DATA_LIST_KEY configures which dataset is used for validation or inference.

By default, DATA_LIST_KEY is an environment variable set to “validation”. To use a different key, for example “my_infer_key”, suppose the file /workspace/data/my_own_infer_dataset/data_list.json contains:

{
  "my_infer_key": [
    { "image": "test/test000.nii.gz" },
    { "image": "test/test001.nii.gz" },
    ...
  ]
}

Then set DATA_LIST_KEY to “my_infer_key” in the environment.json file as follows:

{
  "DATA_ROOT": "/workspace/data/my_own_infer_dataset",
  "DATASET_JSON": "/workspace/data/my_own_infer_dataset/data_list.json",
  "MMAR_CKPT_DIR": "models",
  "MMAR_EVAL_OUTPUT_PATH": "eval",
  "PROCESSING_TASK": "segmentation",
  "DATA_LIST_KEY": "my_infer_key"
}
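The interaction between DATA_LIST_KEY and the dataset JSON file can be sketched as a keyed lookup with “validation” as the fallback. This is an illustrative sketch only: select_items is a hypothetical helper, the elided entries of the data list are left out, and joining DATA_ROOT with each relative image path is an assumption about how the paths are resolved.

```python
import json
import posixpath

# In-memory stand-ins for data_list.json and environment.json.
DATALIST_TEXT = """
{
  "my_infer_key": [
    { "image": "test/test000.nii.gz" },
    { "image": "test/test001.nii.gz" }
  ]
}
"""
ENVIRONMENT_TEXT = """
{
  "DATA_ROOT": "/workspace/data/my_own_infer_dataset",
  "DATA_LIST_KEY": "my_infer_key"
}
"""


def select_items(datalist, environment, default_key="validation"):
    """Return full image paths for the list selected by DATA_LIST_KEY."""
    key = environment.get("DATA_LIST_KEY", default_key)
    if key not in datalist:
        raise KeyError(f"data list key {key!r} not found in dataset JSON")
    root = environment["DATA_ROOT"]
    # Assumption: image paths are relative to DATA_ROOT.
    return [posixpath.join(root, item["image"]) for item in datalist[key]]


datalist = json.loads(DATALIST_TEXT)
environment = json.loads(ENVIRONMENT_TEXT)
items = select_items(datalist, environment)
```

If DATA_LIST_KEY is absent, the sketch falls back to “validation” and raises a KeyError when the dataset JSON has no such key, which makes a mismatched key fail early.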

For details on configuring the dataset JSON file for classification models, see Classification models: Datalist JSON file.

Converting from the old formats

Clara Train API replaces much of what was in Transfer Learning Toolkit for GA (General Access). Many transforms have been renamed, and their parameters may also have changed. Click here for details.

Converting from Transfer Learning Toolkit for EA (Early Access) to GA (General Access)

TLT for GA strictly followed a component-oriented approach with each component completely configured by its set of init parameters or “args”.

In the first release, EA, the configuration format was mostly component oriented but not strictly so. Some parameters were defined outside of the components. For example, num_classes was a parameter of the model, but it was defined outside of the model component definition.

Another difference in the configuration was the lack of separation between class name and init parameters: the “name” parameter was used as the class name of the component, and all other parameters were treated as the init args of the component. However, some components (e.g. metrics) also used “name” as one of their init args. The EA release required a workaround to use “tag” as the “name” parameter for init arg, and then when processing the component, special code was needed to change “tag” back to “name” before initializing the component.

The GA release separated class name from init args: just as in the Early Access release, “name” was used as the class name of the component; but all init args were placed within the “args” attribute to avoid potential conflict of parameter names.
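The difference between the two formats can be illustrated with a small converter. This is a sketch based only on the description above: the exact EA layout is assumed (class “name” plus flat init parameters), ea_to_ga is a hypothetical helper, and renaming “tag” back to “name” is made optional because some GA components, such as ComputeAccuracy in the classification example, take “tag” as a genuine init arg.

```python
def ea_to_ga(component, rename_tag=False):
    """Convert an assumed EA-style component dict to the GA "name" + "args" layout."""
    # In EA, "name" was the class name and every other key was an init arg.
    args = {key: value for key, value in component.items() if key != "name"}
    if rename_tag and "tag" in args:
        # Undo the EA workaround where "tag" stood in for the "name" init arg.
        args["name"] = args.pop("tag")
    converted = {"name": component["name"]}
    if args:
        converted["args"] = args
    return converted


# Hypothetical EA-style metric using the "tag" workaround.
ea_metric = {"name": "ComputeAverageDice", "tag": "mean_dice", "field": "model"}
ga_metric = ea_to_ga(ea_metric, rename_tag=True)

# A component with no init parameters needs no "args" at all.
ga_loss = ea_to_ga({"name": "Dice"})
```

Placing all init parameters under “args” is what lets a GA component accept its own “name” argument without colliding with the class name.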

© Copyright 2020, NVIDIA. Last updated on Feb 2, 2023.