Model training and validation configurations
Clara workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.
Clara offers two kinds of workflows: training and validation. Workflow configurations are defined in JSON files: config_train.json for training workflows and config_validation.json for validation workflows.
The training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.
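At a high level, the file has the following shape. This is a trimmed outline of the full spleen segmentation example shown later in this section; the ellipses stand in for the component configurations:

{
    "epochs": 1260,
    "learning_rate": 2e-4,
    "multi_gpu": false,
    "amp": true,
    "train": {
        "loss": { ... },
        "optimizer": { ... },
        "lr_scheduler": { ... },
        "model": { ... },
        "pre_transforms": [ ... ],
        "dataset": { ... },
        "dataloader": { ... },
        "inferer": { ... },
        "handlers": [ ... ],
        "post_transforms": [ ... ],
        "key_metric": { ... },
        "trainer": { ... }
    },
    "validate": {
        "pre_transforms": [ ... ],
        "dataset": { ... },
        "dataloader": { ... },
        "inferer": { ... },
        "handlers": [ ... ],
        "post_transforms": [ ... ],
        "key_metric": { ... },
        "additional_metrics": [ ... ],
        "evaluator": { ... }
    }
}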
You can define global variables in the configuration JSON file. These variables can be overridden through environment.json or on the command line. By overriding their values on the command line, you can experiment with different training settings without modifying the config file. For example, in commands/train.sh:
python3 -u -m medl.apps.train \
-m $MMAR_ROOT \
-c $CONFIG_FILE \
-e $ENVIRONMENT_FILE \
--write_train_stats \
--set \
print_conf=True \
epochs=1260 \
learning_rate=0.0002 \
num_interval_per_valid=20 \
multi_gpu=False \
cudnn_benchmark=False \
dont_load_ckpt_model=True \
${additional_options}
The “train” section defines the components for the training process. Each component is constructed by providing the component’s class “name” and the init arguments “args”.
Similarly, the “validate” section defines the components for the validation process. Each component is constructed the same way, by providing the component class “name” and the corresponding init arguments “args”.
If you want to use an externally-implemented component class, you can do so by specifying the class “path” to replace the “name”. For more details see Bring your own components (BYOC).
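For example, a built-in loss is configured with “name” and “args” (this one is taken from the example config later in this section), while an externally-implemented loss swaps “name” for “path”. The class path and argument in the second sketch are hypothetical placeholders, not a real component:

"loss": {
    "name": "DiceLoss",
    "args": {
        "to_onehot_y": true,
        "softmax": true
    }
}

"loss": {
    "path": "my_components.losses.MyCustomLoss",
    "args": {
        "smooth": 0.001
    }
}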
Explanation of the components in the training workflow (config_train.json):

| Section | Component | Description |
|---|---|---|
| global | epochs | Number of training epochs. |
| | num_interval_per_valid | Validation frequency, in number of epochs. |
| | learning_rate | The initial learning rate. |
| | multi_gpu | Whether the training runs on multiple GPUs. Defaults to false if not specified. |
| | determinism | Optional section with seeds for deterministic training. |
| | cudnn_benchmark | Whether or not to set torch.backends.cudnn.benchmark. No value is set if this is not in the config. See the performance tuning guide: cuDNN auto-tuner. |
| | amp | Whether or not to use Automatic Mixed Precision. Defaults to false. |
| | dont_load_ckpt_model | Whether or not to skip loading a previous model for fine-tuning. Note that this is a double negative, so for fine-tuning dont_load_ckpt_model should be false. |
| | dont_load_ts_model | Used in conjunction with dont_load_ckpt_model to validate with or without the checkpoint. See the clara_pt_spleen_ct_segmentation MMAR commands for examples. |
| | tf32 | Whether or not to use TF32 on Ampere GPUs, supported from PyTorch 1.7. This sets both torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32. |
| train | loss | The loss component. |
| | optimizer | The optimizer component. |
| | lr_scheduler | The learning rate policy. |
| | model | The model network component. |
| | pre_transforms | List of transforms to be applied to the training data. Typically, “keys” corresponds to the fields in the Datalist JSON file that the operations are applied to. |
| | dataset | Configuration for data items. |
| | dataloader | The dataloader generates batched training data items. |
| | inferer | The inferer component. |
| | handlers | Handlers serve various important functions. Note that the order of some handlers may make a difference. |
| | post_transforms | The transforms to be applied to the model output during training. |
| | key_metric | Key metric to compute on training data; see PyTorch-Ignite metrics. |
| | additional_metrics | Additional metrics to compute; see PyTorch-Ignite metrics. |
| | trainer | The trainer component. |
| validate | pre_transforms | List of transforms to be applied to the validation data. Can use “ref” to reuse the transforms defined in “train”. |
| | dataset | Configuration for data items. |
| | dataloader | The dataloader generates batched validation data items. |
| | inferer | The inferer component. |
| | handlers | Handlers serve various important functions. Note that the order of some handlers may make a difference. |
| | post_transforms | The transforms to be applied to the model output during validation. |
| | key_metric | Key metric to compute on validation data; see PyTorch-Ignite metrics. |
| | additional_metrics | Additional metrics to compute. |
| | evaluator | The evaluator component. |
You can define global variables in the configuration JSON file, but these variables can be overridden by environment.json and, at the highest priority, by the command line. Examples of this can be seen throughout the commands in the sample MMARs. Note that boolean values in the shell scripts must be capitalized (True/False), whereas they are lowercase (true/false) in the JSON configs, as shown below.
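For example, compare the overrides in commands/train.sh above with the corresponding entries in config_train.json:

In commands/train.sh:

multi_gpu=False \
dont_load_ckpt_model=True \

In config_train.json:

"multi_gpu": false,
"dont_load_ckpt_model": true,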
Segmentation model configuration example
Here is an example config_train.json from the clara_pt_spleen_ct_segmentation model:
{
"epochs": 1260,
"num_interval_per_valid": 20,
"learning_rate": 2e-4,
"multi_gpu": false,
"amp": true,
"determinism": {
"random_seed": 0
},
"cudnn_benchmark": false,
"dont_load_ckpt_model": true,
"network_summary": {
"network": "@model",
"input_size": [2, 1, 96, 96, 96]
},
"train": {
"loss": {
"name": "DiceLoss",
"args":{
"to_onehot_y": true,
"softmax": true
}
},
"optimizer": {
"name": "Adam",
"args": {
"params": "#@model.parameters()",
"lr": "{learning_rate}"
}
},
"lr_scheduler": {
"name": "StepLR",
"args": {
"optimizer": "@optimizer",
"step_size": 5000,
"gamma": 0.1
}
},
"model": {
"name": "UNet",
"args": {
"spatial_dims": 3,
"in_channels": "{INPUT_CHANNELS}",
"out_channels": "{OUTPUT_CHANNELS}",
"channels": [16, 32, 64, 128, 256],
"strides": [2, 2, 2, 2],
"num_res_units": 2,
"norm": "batch"
}
},
"pre_transforms": [
{
"name": "LoadImaged",
"args": {
"keys": [
"image",
"label"
]
}
},
{
"name": "EnsureChannelFirstd",
"args": {
"keys": [
"image",
"label"
]
}
},
{
"name": "ScaleIntensityRanged",
"args": {
"keys": "image",
"a_min": -57,
"a_max": 164,
"b_min": 0.0,
"b_max": 1.0,
"clip": true
}
},
{
"name": "CropForegroundd",
"args": {
"keys": [
"image",
"label"
],
"source_key": "image"
}
},
{
"name": "RandCropByPosNegLabeld",
"args": {
"keys": [
"image",
"label"
],
"label_key": "label",
"spatial_size": [
96,
96,
96
],
"pos": 1,
"neg": 1,
"num_samples": 4,
"image_key": "image",
"image_threshold": 0
}
},
{
"name": "RandShiftIntensityd",
"args": {
"keys": "image",
"offsets": 0.1,
"prob": 0.5
}
},
{
"name": "ToTensord",
"args": {
"keys": [
"image",
"label"
]
}
}
],
"dataset": {
"name": "CacheDataset",
"data_list_file_path": "{DATASET_JSON}",
"data_file_base_dir": "{DATA_ROOT}",
"data_list_key": "{TRAIN_DATALIST_KEY}",
"args": {
"transform": "@pre_transforms",
"cache_num": 32,
"cache_rate": 1.0,
"num_workers": 4
}
},
"dataloader": {
"name": "DataLoader",
"args": {
"dataset": "@dataset",
"batch_size": 2,
"shuffle": true,
"num_workers": 4
}
},
"inferer": {
"name": "SimpleInferer"
},
"handlers": [
{
"name": "CheckpointLoader",
"disabled": "{dont_load_ckpt_model}",
"args": {
"load_path": "{MMAR_CKPT}",
"load_dict": {"model": "@model"}
}
},
{
"name": "LrScheduleHandler",
"args": {
"lr_scheduler": "@lr_scheduler",
"print_lr": true
}
},
{
"name": "ValidationHandler",
"args": {
"validator": "@evaluator",
"epoch_level": true,
"interval": "{num_interval_per_valid}"
}
},
{
"name": "CheckpointSaver",
"rank": 0,
"args": {
"save_dir": "{MMAR_CKPT_DIR}",
"save_dict": {
"model": "@model",
"optimizer": "@optimizer",
"lr_scheduler": "@lr_scheduler",
"train_conf": "@conf"
},
"save_final": true,
"save_interval": 400
}
},
{
"name": "StatsHandler",
"rank": 0,
"args": {
"tag_name": "train_loss",
"output_transform": "#monai.handlers.from_engine(['loss'], first=True)"
}
},
{
"name": "TensorBoardStatsHandler",
"rank": 0,
"args": {
"log_dir": "{MMAR_CKPT_DIR}",
"tag_name": "train_loss",
"output_transform": "#monai.handlers.from_engine(['loss'], first=True)"
}
}
],
"post_transforms": [
{
"name": "Activationsd",
"args": {
"keys": "pred",
"softmax": true
}
},
{
"name": "AsDiscreted",
"args": {
"keys": ["pred", "label"],
"argmax": [true, false],
"to_onehot": 2
}
}
],
"key_metric": {
"name": "Accuracy",
"log_label": "train_acc",
"args": {
"output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
}
},
"trainer": {
"name": "SupervisedTrainer",
"args": {
"max_epochs": "{epochs}",
"device": "cuda",
"train_data_loader": "@dataloader",
"network": "@model",
"loss_function": "@loss",
"optimizer": "@optimizer",
"inferer": "@inferer",
"postprocessing": "@post_transforms",
"key_train_metric": "@key_metric",
"train_handlers": "@handlers",
"amp": "{amp}"
}
}
},
"validate": {
"pre_transforms": [
{
"ref": "LoadImaged"
},
{
"ref": "EnsureChannelFirstd"
},
{
"ref": "ScaleIntensityRanged"
},
{
"ref": "CropForegroundd"
},
{
"ref": "ToTensord"
}
],
"dataset": {
"name": "CacheDataset",
"data_list_file_path": "{DATASET_JSON}",
"data_file_base_dir": "{DATA_ROOT}",
"data_list_key": "{VAL_DATALIST_KEY}",
"args": {
"transform": "@pre_transforms",
"cache_num": 9,
"cache_rate": 1.0,
"num_workers": 4
}
},
"dataloader": {
"name": "DataLoader",
"args": {
"dataset": "@dataset",
"batch_size": 1,
"shuffle": false,
"num_workers": 4
}
},
"inferer": {
"name": "SlidingWindowInferer",
"args": {
"roi_size": [
160,
160,
160
],
"sw_batch_size": 4,
"overlap": 0.5
}
},
"handlers": [
{
"name": "StatsHandler",
"rank": 0,
"args": {
"output_transform": "lambda x: None"
}
},
{
"name": "TensorBoardStatsHandler",
"rank": 0,
"args": {
"log_dir": "{MMAR_CKPT_DIR}",
"output_transform": "lambda x: None"
}
},
{
"name": "CheckpointSaver",
"rank": 0,
"args": {
"save_dir": "{MMAR_CKPT_DIR}",
"save_dict": {"model": "@model", "train_conf": "@conf"},
"save_key_metric": true
}
}
],
"post_transforms": [
{
"ref": "Activationsd"
},
{
"ref": "AsDiscreted"
}
],
"key_metric": {
"name": "MeanDice",
"log_label": "val_mean_dice",
"args": {
"include_background": false,
"output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
}
},
"additional_metrics": [
{
"name": "Accuracy",
"log_label": "val_acc",
"args": {
"output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
}
}
],
"evaluator": {
"name": "SupervisedEvaluator",
"args": {
"device": "cuda",
"val_data_loader": "@dataloader",
"network": "@model",
"inferer": "@inferer",
"postprocessing": "@post_transforms",
"key_val_metric": "@key_metric",
"additional_metrics": "@additional_metrics",
"val_handlers": "@handlers",
"amp": "{amp}"
}
}
}
}
Component reference in config_train.json
In config_train.json, many transforms defined in the “validate” section are exact copies of those in the “train” section. The component reference mechanism avoids duplicating these transforms and keeps the configs simpler. This can be seen throughout the “pre_transforms” in the “validate” section above.
With this, “ref” in the “validate” section will point to a transform defined in the “train” section with the specified name.
The component reference mechanism only works for transforms!
You can also assign an alias to a transform in the “train” section and then use the alias in “ref” in the “validate” section. This is useful when the same transform type is used more than once in the “train” section: you can assign different aliases to the instances and reference them via those aliases in the “validate” section.
In the following example, the “LoadImaged” transform is assigned an alias “loadImageExample”, which is referenced in the “validate” section:
{
...
"train": {
...
"pre_transforms": [
{
"name": "LoadImaged#loadImageExample",
"args": {
"keys": [
"image",
"label"
]
}
},
...
],
...
"validate": {
...
"pre_transforms": [
{
"ref": "loadImageExample"
},
...
],
...
}
}
The validation config file config_validation.json defines the configuration of the validation workflow.
Validation configuration: segmentation model example
Here is the validation config of clara_pt_spleen_ct_segmentation:
{
"multi_gpu": false,
"amp": true,
"dont_load_ts_model": false,
"dont_load_ckpt_model": true,
"network_summary": {
"network": "@model",
"input_size": [2, 1, 96, 96, 96]
},
"model": [
{
"ts_path": "{MMAR_TORCHSCRIPT}",
"disabled": "{dont_load_ts_model}"
},
{
"ckpt_path": "{MMAR_CKPT}",
"disabled": "{dont_load_ckpt_model}"
}
],
"pre_transforms": [
{
"name": "LoadImaged",
"args": {
"keys": [
"image",
"label"
]
}
},
{
"name": "EnsureChannelFirstd",
"args": {
"keys": [
"image",
"label"
]
}
},
{
"name": "ScaleIntensityRanged",
"args": {
"keys": "image",
"a_min": -57,
"a_max": 164,
"b_min": 0.0,
"b_max": 1.0,
"clip": true
}
},
{
"name": "CropForegroundd",
"args": {
"keys": "image",
"source_key": "image"
}
},
{
"name": "ToTensord",
"args": {
"keys": [
"image",
"label"
]
}
}
],
"dataset": {
"name": "Dataset",
"data_list_file_path": "{DATASET_JSON}",
"data_file_base_dir": "{DATA_ROOT}",
"data_list_key": "{VAL_DATALIST_KEY}",
"args": {
"transform": "@pre_transforms"
}
},
"dataloader": {
"name": "DataLoader",
"args": {
"dataset": "@dataset",
"batch_size": 1,
"shuffle": false,
"num_workers": 4
}
},
"inferer": {
"name": "SlidingWindowInferer",
"args": {
"roi_size": [
160,
160,
160
],
"sw_batch_size": 4,
"overlap": 0.5
}
},
"handlers": [
{
"name": "CheckpointLoader",
"disabled": "{dont_load_ckpt_model}",
"args": {
"load_path": "{MMAR_CKPT}",
"load_dict": {"model": "@model"}
}
},
{
"name": "StatsHandler",
"rank": 0,
"args": {
"output_transform": "lambda x: None"
}
},
{
"name": "MetricsSaver",
"args": {
"save_dir": "{MMAR_EVAL_OUTPUT_PATH}",
"metrics": ["val_mean_dice", "val_acc"],
"metric_details": ["val_mean_dice"],
"batch_transform": "#monai.handlers.from_engine(['image_meta_dict'])",
"summary_ops": "*",
"save_rank": 0
}
}
],
"post_transforms": [
{
"name": "Activationsd",
"args": {
"keys": "pred",
"softmax": true
}
},
{
"name": "Invertd",
"args": {
"keys": "pred",
"transform": "@pre_transforms",
"orig_keys": "image",
"meta_keys": "pred_meta_dict",
"nearest_interp": false,
"to_tensor": true,
"device": "cuda"
}
},
{
"name": "AsDiscreted",
"args": {
"keys": ["pred", "label"],
"argmax": [true, false]
}
},
{
"name": "SaveImaged",
"args": {
"keys": "pred",
"meta_keys": "pred_meta_dict",
"output_dir": "{MMAR_EVAL_OUTPUT_PATH}",
"resample": false,
"squeeze_end_dims": true
}
}
],
"key_metric": {
"name": "MeanDice",
"log_label": "val_mean_dice",
"args": {
"include_background": true,
"output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
}
},
"additional_metrics": [
{
"name": "Accuracy",
"log_label": "val_acc",
"args": {
"output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
}
}
],
"evaluator": {
"name": "SupervisedEvaluator",
"args": {
"device": "cuda",
"val_data_loader": "@dataloader",
"network": "@model",
"inferer": "@inferer",
"postprocessing": "@post_transforms",
"key_val_metric": "@key_metric",
"additional_metrics": "@additional_metrics",
"val_handlers": "@handlers",
"amp": "{amp}"
}
}
}
Users can disable a component by adding "disabled": true to that component, as follows:
{
"name": "RandFlipd",
"args": {
"keys": [
"image",
"label"
],
"prob": 0.5,
"spatial_axis": 0
},
"disabled": true
}
This option is useful if users would like to temporarily disable a component when trying different combinations of components in experiments.
For federated learning, a config_cross_validation.json file is needed in the MMAR config directory to run cross-site validation in NVIDIA FLARE. Its contents are a copy of config_validation.json, except that the model section is replaced with the model configuration from config_train.json in the MMAR, as sketched below.
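A rough sketch, assuming the spleen segmentation configs above: the ts_path/ckpt_path model list from config_validation.json is swapped for the network definition from config_train.json, while the remaining entries (abbreviated with ellipses) stay as they are in config_validation.json:

{
    "multi_gpu": false,
    "amp": true,
    "model": {
        "name": "UNet",
        "args": {
            "spatial_dims": 3,
            "in_channels": "{INPUT_CHANNELS}",
            "out_channels": "{OUTPUT_CHANNELS}",
            "channels": [16, 32, 64, 128, 256],
            "strides": [2, 2, 2, 2],
            "num_res_units": 2,
            "norm": "batch"
        }
    },
    "pre_transforms": [ ... ],
    "dataset": { ... },
    "dataloader": { ... },
    "inferer": { ... },
    "handlers": [ ... ],
    "post_transforms": [ ... ],
    "key_metric": { ... },
    "additional_metrics": [ ... ],
    "evaluator": { ... }
}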
Converting from the old formats
With the introduction of PyTorch and Ignite as the back end and the use of components from MONAI, changes are required to convert MMARs from before Clara Train v4.0 to work with v4.0 and later. Additional changes are required to convert MMARs to work with Clara Train v4.1 due to considerations for future compatibility. See Upgrading from previous versions of Clara Train for details.