NVIDIA Clara Train 4.1

Model training and validation configurations

Clara workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.

Clara offers two kinds of workflows: training and validation. Workflow configurations are defined in JSON files: config_train.json for training workflows and config_validation.json for validation workflows.

The training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.

You can define global variables in the configuration JSON file. These variables can be overridden through environment.json or on the command line.

By overriding these values on the command line, you can experiment with different training settings without modifying the config file. For example, in commands/train.sh:


python3 -u -m medl.apps.train \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    --write_train_stats \
    --set \
    print_conf=True \
    epochs=1260 \
    learning_rate=0.0002 \
    num_interval_per_valid=20 \
    multi_gpu=False \
    cudnn_benchmark=False \
    dont_load_ckpt_model=True \
    ${additional_options}
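As an illustration of how key=value pairs like the ones after --set could replace global variables, here is a minimal Python sketch. It is not Clara's actual parser: the function name parse_overrides and the starting values in config_globals are hypothetical, and it assumes each value is written as a Python literal (which is why True and False are capitalized).

```python
import ast


def parse_overrides(pairs):
    """Turn ['epochs=1260', 'multi_gpu=False'] into a dict of typed values."""
    overrides = {}
    for pair in pairs:
        key, _, raw = pair.partition("=")
        try:
            # Interprets Python literals such as 1260, 0.0002, True, False.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a literal stays a plain string.
            value = raw
        overrides[key] = value
    return overrides


# Hypothetical globals, as they might appear in a config file.
config_globals = {"epochs": 100, "learning_rate": 1e-4, "multi_gpu": True}
cli_pairs = ["epochs=1260", "learning_rate=0.0002", "multi_gpu=False"]

# Command-line values win over the values from the config file.
effective = {**config_globals, **parse_overrides(cli_pairs)}
print(effective)
```

The config file itself is never modified; only the in-memory values change for that run.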

The “train” section defines the components for the training process. Each component is constructed by providing the component’s class “name” and the init arguments “args”.

Similarly, the “validate” section defines the components for the validation process. Each component is constructed the same way, by providing the component class “name” and the corresponding init arguments “args”.

If you want to use an externally-implemented component class, you can do so by specifying the class “path” to replace the “name”. For more details see Bring your own components (BYOC).
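The “name”/“args” and “path”/“args” patterns can be pictured with the following Python sketch. It only illustrates the idea and is not Clara's implementation: BUILTIN_COMPONENTS and build_component are hypothetical names, and standard-library classes stand in for real components so the example is self-contained.

```python
import importlib
from fractions import Fraction

# Hypothetical registry standing in for the built-in component classes.
BUILTIN_COMPONENTS = {"Fraction": Fraction}


def build_component(spec):
    """Instantiate a component from a {'name' or 'path', 'args'} dict."""
    if "path" in spec:
        # "path" is a fully qualified class path: module.ClassName
        module_name, _, class_name = spec["path"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
    else:
        # "name" is looked up among the known component classes.
        cls = BUILTIN_COMPONENTS[spec["name"]]
    # "args" are passed as keyword arguments to the class constructor.
    return cls(**spec.get("args", {}))


by_name = build_component(
    {"name": "Fraction", "args": {"numerator": 1, "denominator": 4}}
)
by_path = build_component({"path": "collections.Counter", "args": {"a": 2}})
print(by_name, by_path)
```

In both cases the “args” dict maps directly onto the constructor's keyword arguments, so a misspelled argument name would surface as an error when the component is built.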

Explanation of components in the Training workflow:

| Section | Component | Description |
|---|---|---|
| global | epochs | Number of training epochs. |
| global | num_interval_per_valid | Validation frequency in number of epochs. |
| global | learning_rate | The initial learning rate. |
| global | multi_gpu | Whether training runs on multiple GPUs. Defaults to false if not specified. |
| global | determinism | Optional section with seeds for deterministic training. |
| global | cudnn_benchmark | Whether to set torch.backends.cudnn.benchmark. No value is set if this is not in the config. See the performance tuning guide: cuDNN auto-tuner. |
| global | amp | Whether to use Automatic Mixed Precision. Defaults to false. |
| global | dont_load_ckpt_model | Whether to skip loading a previous checkpoint model. Note the double negative: to fine-tune from a previous model, dont_load_ckpt_model should be false. |
| global | dont_load_ts_model | Used in conjunction with dont_load_ckpt_model to validate with or without CKPT. See the clara_pt_spleen_ct_segmentation MMAR commands for examples. |
| global | tf32 | Whether to use TF32 on Ampere GPUs, supported from PyTorch 1.7. If included in the config, this sets both torch.backends.cuda.matmul.allow_tf32 and torch.backends.cudnn.allow_tf32 to the specified value. For more details, see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices |
| train | loss | The loss component. |
| train | optimizer | The optimizer component. |
| train | lr_scheduler | The learning rate policy. |
| train | model | The model network component. |
| train | pre_transforms | List of transforms to be applied to the training data. Typically, “keys” corresponds to the fields in the Datalist JSON file that the operations will be applied on. |
| train | dataset | Configuration for data items. |
| train | dataloader | The dataloader generates batched training data items. |
| train | inferer | See https://docs.monai.io/en/latest/inferers.html |
| train | handlers | Handlers serve various important functions. Note that the order of some handlers may make a difference. |
| train | post_transforms | The transforms to be applied to the trained model output. |
| train | key_metric | Key metric to compute on training data; see PyTorch-Ignite metrics. |
| train | additional_metrics | Additional metrics to compute; see PyTorch-Ignite metrics. |
| train | trainer | See https://docs.monai.io/en/latest/engines.html#trainer |
| validate | pre_transforms | List of transforms to be applied to the validation data. Can use “ref” to reuse the same transforms defined in train. |
| validate | dataset | Configuration for data items. |
| validate | dataloader | The dataloader generates batched validation data items. |
| validate | inferer | See https://docs.monai.io/en/latest/inferers.html |
| validate | handlers | Handlers serve various important functions. Note that the order of some handlers may make a difference. |
| validate | post_transforms | The transforms to be applied to the model validation output. |
| validate | key_metric | Key metric to compute on validation data; see PyTorch-Ignite metrics. |
| validate | additional_metrics | Additional metrics to compute. |
| validate | evaluator | See https://docs.monai.io/en/latest/engines.html#evaluator |
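To make the relationship between epochs and num_interval_per_valid concrete, here is a small Python sketch using the values from the train.sh command (epochs=1260, num_interval_per_valid=20). It assumes that validation runs at the end of every num_interval_per_valid-th epoch; the variable validation_epochs is only for illustration.

```python
epochs = 1260
num_interval_per_valid = 20

# Epochs after which a validation pass would run under the stated assumption.
validation_epochs = list(
    range(num_interval_per_valid, epochs + 1, num_interval_per_valid)
)

print(len(validation_epochs))   # number of validation passes
print(validation_epochs[:3])    # first few validation epochs
print(validation_epochs[-1])    # last validation epoch
```

A smaller interval gives a finer-grained view of validation metrics at the cost of more time spent validating.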

Attention

You can define global variables in the configuration JSON file. Values in environment.json override them, and values passed in the shell script or on the command line take the highest priority. Examples can be seen throughout the commands in the sample MMARs. Note that boolean values in shell scripts must be capitalized (True and False), whereas the JSON configs use lowercase (true and false).
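The priority order and the difference in boolean spelling can be sketched in Python as follows. This is an illustration under assumptions, not Clara's code: the three dictionaries and their values (including the DATA_ROOT path) are hypothetical stand-ins for the config file, environment.json, and command-line overrides.

```python
import json

# Lowest priority: global variables from the config JSON file.
config_globals = {"epochs": 1260, "multi_gpu": False, "amp": True}
# Middle priority: values from environment.json (hypothetical content).
environment = {"epochs": 500, "DATA_ROOT": "/hypothetical/data"}
# Highest priority: values given in the shell script or on the command line.
cli_overrides = {"epochs": 2, "multi_gpu": True}

# Later dictionaries win when the same key appears more than once.
effective = {**config_globals, **environment, **cli_overrides}
print(effective["epochs"])

# Python (and the shell scripts) spell booleans True/False,
# while JSON spells them true/false.
print(json.dumps({"multi_gpu": effective["multi_gpu"]}))
print(json.loads("true") is True)
```

Keys that appear in only one layer, such as amp here, pass through unchanged.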


Segmentation model configuration example

Here is an example of config_train.json of the example_clara_pt_spleen_ct_segmentation model:


{
  "epochs": 1260,
  "num_interval_per_valid": 20,
  "learning_rate": 2e-4,
  "multi_gpu": false,
  "amp": true,
  "determinism": {
    "random_seed": 0
  },
  "cudnn_benchmark": false,
  "dont_load_ckpt_model": true,
  "network_summary": {
    "network": "@model",
    "input_size": [2, 1, 96, 96, 96]
  },
  "train": {
    "loss": {
      "name": "DiceLoss",
      "args": {
        "to_onehot_y": true,
        "softmax": true
      }
    },
    "optimizer": {
      "name": "Adam",
      "args": {
        "params": "#@model.parameters()",
        "lr": "{learning_rate}"
      }
    },
    "lr_scheduler": {
      "name": "StepLR",
      "args": {
        "optimizer": "@optimizer",
        "step_size": 5000,
        "gamma": 0.1
      }
    },
    "model": {
      "name": "UNet",
      "args": {
        "spatial_dims": 3,
        "in_channels": "{INPUT_CHANNELS}",
        "out_channels": "{OUTPUT_CHANNELS}",
        "channels": [16, 32, 64, 128, 256],
        "strides": [2, 2, 2, 2],
        "num_res_units": 2,
        "norm": "batch"
      }
    },
    "pre_transforms": [
      {
        "name": "LoadImaged",
        "args": {
          "keys": ["image", "label"]
        }
      },
      {
        "name": "EnsureChannelFirstd",
        "args": {
          "keys": ["image", "label"]
        }
      },
      {
        "name": "ScaleIntensityRanged",
        "args": {
          "keys": "image",
          "a_min": -57,
          "a_max": 164,
          "b_min": 0.0,
          "b_max": 1.0,
          "clip": true
        }
      },
      {
        "name": "CropForegroundd",
        "args": {
          "keys": ["image", "label"],
          "source_key": "image"
        }
      },
      {
        "name": "RandCropByPosNegLabeld",
        "args": {
          "keys": ["image", "label"],
          "label_key": "label",
          "spatial_size": [96, 96, 96],
          "pos": 1,
          "neg": 1,
          "num_samples": 4,
          "image_key": "image",
          "image_threshold": 0
        }
      },
      {
        "name": "RandShiftIntensityd",
        "args": {
          "keys": "image",
          "offsets": 0.1,
          "prob": 0.5
        }
      },
      {
        "name": "ToTensord",
        "args": {
          "keys": ["image", "label"]
        }
      }
    ],
    "dataset": {
      "name": "CacheDataset",
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "{TRAIN_DATALIST_KEY}",
      "args": {
        "transform": "@pre_transforms",
        "cache_num": 32,
        "cache_rate": 1.0,
        "num_workers": 4
      }
    },
    "dataloader": {
      "name": "DataLoader",
      "args": {
        "dataset": "@dataset",
        "batch_size": 2,
        "shuffle": true,
        "num_workers": 4
      }
    },
    "inferer": {
      "name": "SimpleInferer"
    },
    "handlers": [
      {
        "name": "CheckpointLoader",
        "disabled": "{dont_load_ckpt_model}",
        "args": {
          "load_path": "{MMAR_CKPT}",
          "load_dict": {"model": "@model"}
        }
      },
      {
        "name": "LrScheduleHandler",
        "args": {
          "lr_scheduler": "@lr_scheduler",
          "print_lr": true
        }
      },
      {
        "name": "ValidationHandler",
        "args": {
          "validator": "@evaluator",
          "epoch_level": true,
          "interval": "{num_interval_per_valid}"
        }
      },
      {
        "name": "CheckpointSaver",
        "rank": 0,
        "args": {
          "save_dir": "{MMAR_CKPT_DIR}",
          "save_dict": {
            "model": "@model",
            "optimizer": "@optimizer",
            "lr_scheduler": "@lr_scheduler",
            "train_conf": "@conf"
          },
          "save_final": true,
          "save_interval": 400
        }
      },
      {
        "name": "StatsHandler",
        "rank": 0,
        "args": {
          "tag_name": "train_loss",
          "output_transform": "#monai.handlers.from_engine(['loss'], first=True)"
        }
      },
      {
        "name": "TensorBoardStatsHandler",
        "rank": 0,
        "args": {
          "log_dir": "{MMAR_CKPT_DIR}",
          "tag_name": "train_loss",
          "output_transform": "#monai.handlers.from_engine(['loss'], first=True)"
        }
      }
    ],
    "post_transforms": [
      {
        "name": "Activationsd",
        "args": {
          "keys": "pred",
          "softmax": true
        }
      },
      {
        "name": "AsDiscreted",
        "args": {
          "keys": ["pred", "label"],
          "argmax": [true, false],
          "to_onehot": 2
        }
      }
    ],
    "key_metric": {
      "name": "Accuracy",
      "log_label": "train_acc",
      "args": {
        "output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
      }
    },
    "trainer": {
      "name": "SupervisedTrainer",
      "args": {
        "max_epochs": "{epochs}",
        "device": "cuda",
        "train_data_loader": "@dataloader",
        "network": "@model",
        "loss_function": "@loss",
        "optimizer": "@optimizer",
        "inferer": "@inferer",
        "postprocessing": "@post_transforms",
        "key_train_metric": "@key_metric",
        "train_handlers": "@handlers",
        "amp": "{amp}"
      }
    }
  },
  "validate": {
    "pre_transforms": [
      { "ref": "LoadImaged" },
      { "ref": "EnsureChannelFirstd" },
      { "ref": "ScaleIntensityRanged" },
      { "ref": "CropForegroundd" },
      { "ref": "ToTensord" }
    ],
    "dataset": {
      "name": "CacheDataset",
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "{VAL_DATALIST_KEY}",
      "args": {
        "transform": "@pre_transforms",
        "cache_num": 9,
        "cache_rate": 1.0,
        "num_workers": 4
      }
    },
    "dataloader": {
      "name": "DataLoader",
      "args": {
        "dataset": "@dataset",
        "batch_size": 1,
        "shuffle": false,
        "num_workers": 4
      }
    },
    "inferer": {
      "name": "SlidingWindowInferer",
      "args": {
        "roi_size": [160, 160, 160],
        "sw_batch_size": 4,
        "overlap": 0.5
      }
    },
    "handlers": [
      {
        "name": "StatsHandler",
        "rank": 0,
        "args": {
          "output_transform": "lambda x: None"
        }
      },
      {
        "name": "TensorBoardStatsHandler",
        "rank": 0,
        "args": {
          "log_dir": "{MMAR_CKPT_DIR}",
          "output_transform": "lambda x: None"
        }
      },
      {
        "name": "CheckpointSaver",
        "rank": 0,
        "args": {
          "save_dir": "{MMAR_CKPT_DIR}",
          "save_dict": {"model": "@model", "train_conf": "@conf"},
          "save_key_metric": true
        }
      }
    ],
    "post_transforms": [
      { "ref": "Activationsd" },
      { "ref": "AsDiscreted" }
    ],
    "key_metric": {
      "name": "MeanDice",
      "log_label": "val_mean_dice",
      "args": {
        "include_background": false,
        "output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
      }
    },
    "additional_metrics": [
      {
        "name": "Accuracy",
        "log_label": "val_acc",
        "args": {
          "output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
        }
      }
    ],
    "evaluator": {
      "name": "SupervisedEvaluator",
      "args": {
        "device": "cuda",
        "val_data_loader": "@dataloader",
        "network": "@model",
        "inferer": "@inferer",
        "postprocessing": "@post_transforms",
        "key_val_metric": "@key_metric",
        "additional_metrics": "@additional_metrics",
        "val_handlers": "@handlers",
        "amp": "{amp}"
      }
    }
  }
}

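The config above uses quoted placeholders such as "{learning_rate}" and "{epochs}" whose names match global variables. The Python sketch below shows one way such placeholders could be filled in. It is an assumption about the behavior, not Clara's implementation: the substitute function is hypothetical, it only handles strings that consist entirely of one {name} placeholder, and it leaves "@" and "#" strings untouched.

```python
import re

# Matches a string that is exactly one placeholder, e.g. "{learning_rate}".
PLACEHOLDER = re.compile(r"^\{(\w+)\}$")


def substitute(node, variables):
    """Recursively replace "{name}" strings with values from variables."""
    if isinstance(node, dict):
        return {key: substitute(value, variables) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(item, variables) for item in node]
    if isinstance(node, str):
        match = PLACEHOLDER.match(node)
        if match and match.group(1) in variables:
            return variables[match.group(1)]
    # Numbers, booleans, and unknown placeholders are returned unchanged.
    return node


variables = {"learning_rate": 2e-4, "dont_load_ckpt_model": True}
optimizer = {
    "name": "Adam",
    "args": {"params": "#@model.parameters()", "lr": "{learning_rate}"},
}
loader = {"name": "CheckpointLoader", "disabled": "{dont_load_ckpt_model}",
          "args": {"load_path": "{MMAR_CKPT}"}}

resolved_optimizer = substitute(optimizer, variables)
resolved_loader = substitute(loader, variables)
print(resolved_optimizer["args"]["lr"], resolved_loader["disabled"])
```

Under this reading, setting dont_load_ckpt_model to true turns the CheckpointLoader handler's "disabled" field into true, which is consistent with the double-negative note in the component table.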

Component reference in config_train.json

In config_train.json, many transforms defined in the “validate” section are exact copies of those in the “train” section. The component reference mechanism avoids duplicating these transforms by hand and keeps the configs shorter. It is used throughout the “pre_transforms” in the “validate” section above.

A “ref” in the “validate” section points to the transform with the specified name defined in the “train” section.

Note

The component reference mechanism only works for transforms!

You can also assign an alias to a transform in the “train” section, and then use the alias in “ref” in the “validate” section. This is useful when the same transform type is used more than once in the “train” section: you can assign a different alias to each and reference them via the aliases in the “validate” section.

In the following example, the “LoadImaged” transform is assigned an alias “loadImageExample”, which is referenced in the “validate” section:


{
  ...
  "train": {
    ...
    "pre_transforms": [
      {
        "name": "LoadImaged#loadImageExample",
        "args": {
          "keys": ["image", "label"]
        }
      },
      ...
    ],
    ...
  },
  "validate": {
    ...
    "pre_transforms": [
      { "ref": "loadImageExample" },
      ...
    ],
    ...
  }
}
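The reference and alias mechanism can be pictured with the following Python sketch. It is a simplified illustration, not Clara's resolver: index_transforms and resolve_refs are hypothetical names, and the sketch assumes that a transform with an alias is referenced by its alias while one without is referenced by its class name.

```python
def index_transforms(train_transforms):
    """Map each reference key (alias if present, else class name) to its spec."""
    index = {}
    for spec in train_transforms:
        # "LoadImaged#loadImageExample" -> class "LoadImaged", alias "loadImageExample"
        class_name, _, alias = spec["name"].partition("#")
        index[alias or class_name] = {**spec, "name": class_name}
    return index


def resolve_refs(validate_transforms, train_transforms):
    """Replace {"ref": key} entries with the matching train transform spec."""
    index = index_transforms(train_transforms)
    return [
        index[spec["ref"]] if "ref" in spec else spec
        for spec in validate_transforms
    ]


train_pre = [
    {"name": "LoadImaged#loadImageExample", "args": {"keys": ["image", "label"]}},
    {"name": "ToTensord", "args": {"keys": ["image", "label"]}},
]
validate_pre = [{"ref": "loadImageExample"}, {"ref": "ToTensord"}]

resolved = resolve_refs(validate_pre, train_pre)
print([spec["name"] for spec in resolved])
```

A reference to a name that does not exist in the “train” section raises a KeyError in this sketch, which is one reasonable way to surface a typo in a “ref”.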


The validation config file config_validation.json defines the configuration of the validation workflow.

Validation configuration: segmentation model example

Here is the validation config of example_clara_pt_spleen_ct_segmentation:


{
  "multi_gpu": false,
  "amp": true,
  "dont_load_ts_model": false,
  "dont_load_ckpt_model": true,
  "network_summary": {
    "network": "@model",
    "input_size": [2, 1, 96, 96, 96]
  },
  "model": [
    {
      "ts_path": "{MMAR_TORCHSCRIPT}",
      "disabled": "{dont_load_ts_model}"
    },
    {
      "ckpt_path": "{MMAR_CKPT}",
      "disabled": "{dont_load_ckpt_model}"
    }
  ],
  "pre_transforms": [
    {
      "name": "LoadImaged",
      "args": {
        "keys": ["image", "label"]
      }
    },
    {
      "name": "EnsureChannelFirstd",
      "args": {
        "keys": ["image", "label"]
      }
    },
    {
      "name": "ScaleIntensityRanged",
      "args": {
        "keys": "image",
        "a_min": -57,
        "a_max": 164,
        "b_min": 0.0,
        "b_max": 1.0,
        "clip": true
      }
    },
    {
      "name": "CropForegroundd",
      "args": {
        "keys": "image",
        "source_key": "image"
      }
    },
    {
      "name": "ToTensord",
      "args": {
        "keys": ["image", "label"]
      }
    }
  ],
  "dataset": {
    "name": "Dataset",
    "data_list_file_path": "{DATASET_JSON}",
    "data_file_base_dir": "{DATA_ROOT}",
    "data_list_key": "{VAL_DATALIST_KEY}",
    "args": {
      "transform": "@pre_transforms"
    }
  },
  "dataloader": {
    "name": "DataLoader",
    "args": {
      "dataset": "@dataset",
      "batch_size": 1,
      "shuffle": false,
      "num_workers": 4
    }
  },
  "inferer": {
    "name": "SlidingWindowInferer",
    "args": {
      "roi_size": [160, 160, 160],
      "sw_batch_size": 4,
      "overlap": 0.5
    }
  },
  "handlers": [
    {
      "name": "CheckpointLoader",
      "disabled": "{dont_load_ckpt_model}",
      "args": {
        "load_path": "{MMAR_CKPT}",
        "load_dict": {"model": "@model"}
      }
    },
    {
      "name": "StatsHandler",
      "rank": 0,
      "args": {
        "output_transform": "lambda x: None"
      }
    },
    {
      "name": "MetricsSaver",
      "args": {
        "save_dir": "{MMAR_EVAL_OUTPUT_PATH}",
        "metrics": ["val_mean_dice", "val_acc"],
        "metric_details": ["val_mean_dice"],
        "batch_transform": "#monai.handlers.from_engine(['image_meta_dict'])",
        "summary_ops": "*",
        "save_rank": 0
      }
    }
  ],
  "post_transforms": [
    {
      "name": "Activationsd",
      "args": {
        "keys": "pred",
        "softmax": true
      }
    },
    {
      "name": "Invertd",
      "args": {
        "keys": "pred",
        "transform": "@pre_transforms",
        "orig_keys": "image",
        "meta_keys": "pred_meta_dict",
        "nearest_interp": false,
        "to_tensor": true,
        "device": "cuda"
      }
    },
    {
      "name": "AsDiscreted",
      "args": {
        "keys": ["pred", "label"],
        "argmax": [true, false]
      }
    },
    {
      "name": "SaveImaged",
      "args": {
        "keys": "pred",
        "meta_keys": "pred_meta_dict",
        "output_dir": "{MMAR_EVAL_OUTPUT_PATH}",
        "resample": false,
        "squeeze_end_dims": true
      }
    }
  ],
  "key_metric": {
    "name": "MeanDice",
    "log_label": "val_mean_dice",
    "args": {
      "include_background": true,
      "output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
    }
  },
  "additional_metrics": [
    {
      "name": "Accuracy",
      "log_label": "val_acc",
      "args": {
        "output_transform": "#monai.handlers.from_engine(['pred', 'label'])"
      }
    }
  ],
  "evaluator": {
    "name": "SupervisedEvaluator",
    "args": {
      "device": "cuda",
      "val_data_loader": "@dataloader",
      "network": "@model",
      "inferer": "@inferer",
      "postprocessing": "@post_transforms",
      "key_val_metric": "@key_metric",
      "additional_metrics": "@additional_metrics",
      "val_handlers": "@handlers",
      "amp": "{amp}"
    }
  }
}

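In the validation config above, the “model” list has one entry for a TorchScript model and one for a checkpoint, each tied to a dont_load_* flag through its “disabled” field. The Python sketch below shows how those two flags could decide which entry stays active. It is an interpretation of the config, not Clara's loader: enabled_model_entries is a hypothetical helper, and the flag values are taken from the example.

```python
def enabled_model_entries(model_entries, variables):
    """Return the model entries whose "disabled" flag resolves to false."""
    enabled = []
    for entry in model_entries:
        flag = entry.get("disabled", False)
        if isinstance(flag, str):
            # "{dont_load_ts_model}" -> look up the variable dont_load_ts_model
            flag = variables[flag.strip("{}")]
        if not flag:
            enabled.append({k: v for k, v in entry.items() if k != "disabled"})
    return enabled


model_entries = [
    {"ts_path": "{MMAR_TORCHSCRIPT}", "disabled": "{dont_load_ts_model}"},
    {"ckpt_path": "{MMAR_CKPT}", "disabled": "{dont_load_ckpt_model}"},
]
# Flag values from the example validation config.
variables = {"dont_load_ts_model": False, "dont_load_ckpt_model": True}

active = enabled_model_entries(model_entries, variables)
print(active)
```

With the example's values, only the TorchScript entry remains; swapping the two flags would leave only the checkpoint entry.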
Users can disable a component by adding "disabled": true to that component, as follows:


{
  "name": "RandFlipd",
  "args": {
    "keys": ["image", "label"],
    "prob": 0.5,
    "spatial_axis": 0
  },
  "disabled": true
}

This option is useful if users would like to temporarily disable a component when trying different combinations of components in experiments.
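The effect of the "disabled" flag on a list of components can be sketched in Python as follows. This is only an illustration of the idea, with a hypothetical helper named active_components; it assumes that a component is skipped exactly when its "disabled" field is true.

```python
def active_components(components):
    """Keep only the components that are not marked "disabled": true."""
    return [spec for spec in components if spec.get("disabled") is not True]


pre_transforms = [
    {"name": "LoadImaged", "args": {"keys": ["image", "label"]}},
    {
        "name": "RandFlipd",
        "args": {"keys": ["image", "label"], "prob": 0.5, "spatial_axis": 0},
        "disabled": True,
    },
    {"name": "ToTensord", "args": {"keys": ["image", "label"]}, "disabled": False},
]

print([spec["name"] for spec in active_components(pre_transforms)])
```

Because the disabled entry stays in the file, re-enabling it later only requires flipping the flag back to false or removing it.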

For federated learning, a config_cross_validation.json file is needed in the MMAR config directory to run cross-site validation in NVIDIA FLARE. The contents of this file are a copy of config_validation.json, but with the model section replaced by the configuration from config_train.json in the MMAR.
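The relationship between the three files can be sketched in Python as follows. This is an assumption-laden illustration rather than a documented procedure: make_cross_validation_config is a hypothetical helper, the input dictionaries are trimmed stand-ins for the real files, and the sketch assumes that "the configuration from config_train.json" means the "model" component under the "train" section.

```python
import copy


def make_cross_validation_config(validation_config, train_config):
    """Copy the validation config and swap in the model from the train config."""
    cross = copy.deepcopy(validation_config)
    # Assumption: the replacement is the "model" component of the "train" section.
    cross["model"] = copy.deepcopy(train_config["train"]["model"])
    return cross


# Trimmed stand-ins for config_validation.json and config_train.json.
validation_config = {
    "amp": True,
    "model": [
        {"ts_path": "{MMAR_TORCHSCRIPT}", "disabled": "{dont_load_ts_model}"},
        {"ckpt_path": "{MMAR_CKPT}", "disabled": "{dont_load_ckpt_model}"},
    ],
}
train_config = {
    "train": {"model": {"name": "UNet", "args": {"spatial_dims": 3}}},
}

cross_config = make_cross_validation_config(validation_config, train_config)
print(cross_config["model"]["name"])
```

Everything other than the model section is carried over unchanged from the validation config, and the original dictionaries are left untouched.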

Converting from the old formats

With the introduction of PyTorch and Ignite as the back end and the use of components from MONAI, MMARs created before Clara Train v4.0 must be converted to work with v4.0 and later. Additional changes are required to convert MMARs to work with Clara Train v4.1 due to considerations for future compatibility. See Upgrading from previous versions of Clara Train for details.

© Copyright 2021, NVIDIA. Last updated on Feb 2, 2023.