Model training and validation configurations

Clara workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.

Clara offers two kinds of workflows: training and validation. Workflow configurations are defined in JSON files: config_train.json for training workflows and config_validation.json for validation workflows.

Training configuration

The training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.
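As a rough illustration of this layout, the following Python sketch separates a parsed config into its three parts. It assumes, based on the example config later on this page, that every top-level key other than "train" and "validate" is a global variable. The helper name split_config is hypothetical and is not part of Clara.

```python
import json


def split_config(config):
    """Split a parsed config_train.json dict into (globals, train, validate).

    Assumption: every top-level key other than "train" and "validate"
    is treated as a global variable.
    """
    global_vars = {k: v for k, v in config.items() if k not in ("train", "validate")}
    return global_vars, config.get("train", {}), config.get("validate", {})


# A tiny stand-in for config_train.json.
example = json.loads("""
{
  "epochs": 1260,
  "learning_rate": 2e-4,
  "train": {"trainer": {"name": "SupervisedTrainer"}},
  "validate": {"evaluator": {"name": "SupervisedEvaluator"}}
}
""")

global_vars, train, validate = split_config(example)
print(global_vars)  # {'epochs': 1260, 'learning_rate': 0.0002}
```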

You can define global variables in the configuration JSON file. These variables can be overridden through environment.json or on the command line. Typical global variables include epochs, num_interval_per_valid, learning_rate, and multi_gpu; the global rows of the component table below list the rest.

By overriding the values of these variables on the command line, you can experiment with different training settings without modifying the config file. For example, in commands/train.sh:

python3 -u -m medl.apps.train \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    --write_train_stats \
    --set \
    print_conf=True \
    epochs=1260 \
    learning_rate=0.0002 \
    num_interval_per_valid=20 \
    multi_gpu=False \
    cudnn_benchmark=False \
    dont_load_ckpt_model=True \
    ${additional_options}

The “train” section defines the components for the training process. Each component is constructed by providing the component’s class “name” and the init arguments “args”.

Similarly, the “validate” section defines the components for the validation process. Each component is constructed the same way, by providing the component class “name” and the corresponding init arguments “args”.

If you want to use an externally implemented component class, specify the class “path” in place of the “name”. For more details, see Bring your own components (BYOC).
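The construction rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Clara's actual loader: the registry BUILTIN_COMPONENTS, the helper build_component, and the stand-in DiceLoss class are all hypothetical, and “path” is assumed to be a dotted "module.ClassName" string.

```python
import importlib


class DiceLoss:
    """Hypothetical stand-in for a built-in component class."""

    def __init__(self, to_onehot_y=False, softmax=False):
        self.to_onehot_y = to_onehot_y
        self.softmax = softmax


# Hypothetical registry mapping built-in class names to classes.
BUILTIN_COMPONENTS = {"DiceLoss": DiceLoss}


def build_component(spec):
    """Instantiate a component from a spec with "name" or "path", plus "args"."""
    args = spec.get("args", {})
    if "path" in spec:
        # Assumed form: "package.module.ClassName".
        module_name, _, class_name = spec["path"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
    else:
        cls = BUILTIN_COMPONENTS[spec["name"]]
    # The "args" dict becomes the keyword arguments of the constructor.
    return cls(**args)


loss = build_component({"name": "DiceLoss", "args": {"to_onehot_y": True, "softmax": True}})

# Any importable class works for "path"; datetime.timedelta is used only as a demo.
custom = build_component({"path": "datetime.timedelta", "args": {"minutes": 90}})
```

Keeping “name” and “path” as two separate keys lets built-in and external classes share the same “args” handling.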

Explanation of components in the Training workflow:

Section	Component	Description
global	epochs	Number of training epochs
	num_interval_per_valid	Validation frequency in number of epochs
	learning_rate	The initial learning rate
	multi_gpu	Whether to train on multiple GPUs. Defaults to false if not specified.
	determinism	Optional section with seeds for deterministic training.
	cudnn_benchmark	Whether or not to set torch.backends.cudnn.benchmark. Will not set any value if not in config. See performance tuning guide: cuDNN auto-tuner.
	amp	Whether or not to use Automatic Mixed Precision. Defaults to false.
	dont_load_ckpt_model	Whether to skip loading a previous model for fine-tuning. Note the negative phrasing: for fine-tuning, dont_load_ckpt_model should be false.
	dont_load_ts_model	Used in conjunction with the above to validate with or without CKPT. See clara_pt_spleen_ct_segmentation MMAR commands for examples.
	tf32	Whether or not to use TF32 on Ampere GPUs, supported from PyTorch 1.7. This will set both `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32` to the specified value if included in the config. For more details, see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices
train	loss	The loss component
	optimizer	The optimizer component
	lr_scheduler	The learning rate policy
	model	The model network component
	pre_transforms	List of transforms to be applied to the training data. Typically, “keys” corresponds to the fields in the Datalist JSON file that operations will be applied on.
	dataset	Configuration for data items.
	dataloader	The dataloader generates batched training data items.
	inferer	The inferer component. See https://docs.monai.io/en/latest/inferers.html
	handlers	Handlers serve various important functions. Please note that the order of some may make a difference.
	post_transforms	The post_transforms to be applied.
	metrics	The metrics to keep track of. Custom metrics must conform to the signature expected by PyTorch Ignite metrics.
	trainer	The trainer component. See https://docs.monai.io/en/latest/engines.html#trainer
validate	pre_transforms	List of transforms to be applied to the validation data.
	dataset	Configuration for data items.
	dataloader	The dataloader generates batched validation data items.
	inferer	The inferer component. See https://docs.monai.io/en/latest/inferers.html
	handlers	Handlers serve various important functions. Please note that the order of some may make a difference.
	post_transforms	The post_transforms to be applied.
	metrics	The metrics to keep track of.
	evaluator	The evaluator component. See https://docs.monai.io/en/latest/engines.html#evaluator

Attention

Global variables defined in the configuration JSON file can be overridden by environment.json, and both can be overridden by the shell command or command line, which has the highest priority. Examples of this can be seen throughout the commands in the sample MMARs. Note that boolean values in the shell script must be capitalized (True and False), whereas the JSON configs use lowercase (true and false).
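The precedence order can be pictured with a small Python sketch. It is an assumption-laden illustration rather than Clara's real argument handling: parse_override and resolve_globals are hypothetical helpers, and the use of ast.literal_eval to type the command-line values is a choice made for this sketch only.

```python
import ast


def parse_override(text):
    """Parse one "key=value" string of the kind passed after --set."""
    key, _, raw = text.partition("=")
    try:
        # literal_eval understands Python literals such as True, False, 1260, 0.0002.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw  # anything else is kept as a plain string
    return key, value


def resolve_globals(config_vars, env_vars, cli_overrides):
    """Merge variables so that later sources override earlier ones."""
    merged = dict(config_vars)   # lowest priority: the config JSON
    merged.update(env_vars)      # next: environment.json
    merged.update(dict(parse_override(item) for item in cli_overrides))  # highest
    return merged


resolved = resolve_globals(
    {"epochs": 1260, "multi_gpu": False, "learning_rate": 2e-4},
    {"multi_gpu": True},
    ["epochs=10", "multi_gpu=False", "learning_rate=0.001"],
)
print(resolved)  # {'epochs': 10, 'multi_gpu': False, 'learning_rate': 0.001}
```

In this sketch a lowercase true is not a Python literal, so parse_override("multi_gpu=true") would return the string "true" rather than a boolean. Whether Clara treats lowercase values the same way is not stated here.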

Segmentation model configuration example

Here is an example of config_train.json of the clara_pt_spleen_ct_segmentation model:

{
  "epochs": 1260,
  "num_interval_per_valid": 20,
  "learning_rate": 2e-4,
  "multi_gpu": false,
  "amp": true,
  "determinism": {
    "random_seed": 0
  },
  "cudnn_benchmark": false,
  "dont_load_ckpt_model": true,
  "train": {
    "loss": {
      "name": "DiceLoss",
      "args": {
        "to_onehot_y": true,
        "softmax": true
      }
    },
    "optimizer": {
      "name": "Adam",
      "args": {
        "lr": "{learning_rate}"
      }
    },
    "lr_scheduler": {
      "name": "StepLR",
      "args": {
        "step_size": 5000,
        "gamma": 0.1
      }
    },
    "model": {
      "name": "UNet",
      "args": {
        "dimensions": 3,
        "in_channels": 1,
        "out_channels": 2,
        "channels": [16, 32, 64, 128, 256],
        "strides": [2, 2, 2, 2],
        "num_res_units": 2,
        "norm": "batch"
      }
    },
    "pre_transforms": [
      {
        "name": "LoadImaged",
        "args": {
          "keys": [
            "image",
            "label"
          ]
        }
      },
      {
        "name": "EnsureChannelFirstd",
        "args": {
          "keys": [
            "image",
            "label"
          ]
        }
      },
      {
        "name": "ScaleIntensityRanged",
        "args": {
          "keys": "image",
          "a_min": -57,
          "a_max": 164,
          "b_min": 0.0,
          "b_max": 1.0,
          "clip": true
        }
      },
      {
        "name": "CropForegroundd",
        "args": {
          "keys": [
            "image",
            "label"
          ],
          "source_key": "image"
        }
      },
      {
        "name": "RandCropByPosNegLabeld",
        "args": {
          "keys": [
            "image",
            "label"
          ],
          "label_key": "label",
          "spatial_size": [
            96,
            96,
            96
          ],
          "pos": 1,
          "neg": 1,
          "num_samples": 4,
          "image_key": "image",
          "image_threshold": 0
        }
      },
      {
        "name": "RandShiftIntensityd",
        "args": {
          "keys": "image",
          "offsets": 0.1,
          "prob": 0.5
        }
      },
      {
        "name": "ToTensord",
        "args": {
          "keys": [
            "image",
            "label"
          ]
        }
      }
    ],
    "dataset": {
      "name": "CacheDataset",
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "training",
      "args": {
        "cache_num": 32,
        "cache_rate": 1.0,
        "num_workers": 4
      }
    },
    "dataloader": {
      "name": "DataLoader",
      "args": {
        "batch_size": 2,
        "shuffle": true,
        "num_workers": 4
      }
    },
    "inferer": {
      "name": "SimpleInferer"
    },
    "handlers": [
      {
        "name": "CheckpointLoader",
        "disabled": "{dont_load_ckpt_model}",
        "args": {
          "load_path": "{MMAR_CKPT}",
          "load_dict": ["model"]
        }
      },
      {
        "name": "LrScheduleHandler",
        "args": {
          "print_lr": true
        }
      },
      {
        "name": "ValidationHandler",
        "args": {
          "epoch_level": true,
          "interval": "{num_interval_per_valid}"
        }
      },
      {
        "name": "CheckpointSaver",
        "rank": 0,
        "args": {
          "save_dir": "{MMAR_CKPT_DIR}",
          "save_dict": ["model", "optimizer", "lr_scheduler", "train_conf"],
          "save_final": true,
          "save_interval": 400
        }
      },
      {
        "name": "StatsHandler",
        "rank": 0,
        "args": {
          "tag_name": "train_loss",
          "output_transform": "lambda x: x['loss']"
        }
      },
      {
        "name": "TensorBoardStatsHandler",
        "rank": 0,
        "args": {
          "log_dir": "{MMAR_CKPT_DIR}",
          "tag_name": "train_loss",
          "output_transform": "lambda x: x['loss']"
        }
      }
    ],
    "post_transforms": [
      {
        "name": "Activationsd",
        "args": {
          "keys": "pred",
          "softmax": true
        }
      },
      {
        "name": "AsDiscreted",
        "args": {
          "keys": ["pred", "label"],
          "argmax": [true, false],
          "to_onehot": true,
          "n_classes": 2
        }
      }
    ],
    "metrics": [
      {
        "name": "Accuracy",
        "log_label": "train_acc",
        "is_key_metric": true,
        "args": {
          "output_transform": "lambda x: (x['pred'], x['label'])"
        }
      }
    ],
    "trainer": {
      "name": "SupervisedTrainer",
      "args": {
        "max_epochs": "{epochs}"
      }
    }
  },
  "validate": {
    "pre_transforms": [
      {
        "ref": "LoadImaged"
      },
      {
        "ref": "EnsureChannelFirstd"
      },
      {
        "ref": "ScaleIntensityRanged"
      },
      {
        "ref": "CropForegroundd"
      },
      {
        "ref": "ToTensord"
      }
    ],
    "dataset": {
      "name": "CacheDataset",
      "data_list_file_path": "{DATASET_JSON}",
      "data_file_base_dir": "{DATA_ROOT}",
      "data_list_key": "validation",
      "args": {
        "cache_num": 9,
        "cache_rate": 1.0,
        "num_workers": 4
      }
    },
    "dataloader": {
      "name": "DataLoader",
      "args": {
        "batch_size": 1,
        "shuffle": false,
        "num_workers": 4
      }
    },
    "inferer": {
      "name": "SlidingWindowInferer",
      "args": {
        "roi_size": [
          160,
          160,
          160
        ],
        "sw_batch_size": 4,
        "overlap": 0.5
      }
    },
    "handlers": [
      {
        "name": "StatsHandler",
        "rank": 0,
        "args": {
          "output_transform": "lambda x: None"
        }
      },
      {
        "name": "TensorBoardStatsHandler",
        "rank": 0,
        "args": {
          "log_dir": "{MMAR_CKPT_DIR}",
          "output_transform": "lambda x: None"
        }
      },
      {
        "name": "CheckpointSaver",
        "rank": 0,
        "args": {
          "save_dir": "{MMAR_CKPT_DIR}",
          "save_dict": ["model", "train_conf"],
          "save_key_metric": true
        }
      }
    ],
    "post_transforms": [
      {
        "ref": "Activationsd"
      },
      {
        "ref": "AsDiscreted"
      }
    ],
    "metrics": [
      {
        "name": "MeanDice",
        "log_label": "val_mean_dice",
        "is_key_metric": true,
        "args": {
          "include_background": false,
          "output_transform": "lambda x: (x['pred'], x['label'])"
        }
      },
      {
        "name": "Accuracy",
        "log_label": "val_acc",
        "args": {
          "output_transform": "lambda x: (x['pred'], x['label'])"
        }
      }
    ],
    "evaluator": {
      "name": "SupervisedEvaluator"
    }
  }
}
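In the example above, strings such as "{learning_rate}", "{epochs}", and "{dont_load_ckpt_model}" refer to the global variables defined at the top of the file. The following Python sketch shows one way such placeholders could be resolved. It is a guess at the mechanism for illustration only: the substitute helper and the whole-string matching rule are assumptions, not Clara's documented behavior.

```python
import re

# Matches a string that consists solely of one "{variable_name}" placeholder.
PLACEHOLDER = re.compile(r"^\{(\w+)\}$")


def substitute(node, variables):
    """Recursively replace "{name}" strings with the value of variable `name`."""
    if isinstance(node, dict):
        return {k: substitute(v, variables) for k, v in node.items()}
    if isinstance(node, list):
        return [substitute(v, variables) for v in node]
    if isinstance(node, str):
        match = PLACEHOLDER.match(node)
        if match and match.group(1) in variables:
            return variables[match.group(1)]  # keeps the variable's own type
    return node  # unknown placeholders and other values pass through unchanged


variables = {"learning_rate": 2e-4, "epochs": 1260, "dont_load_ckpt_model": True}
section = {
    "optimizer": {"name": "Adam", "args": {"lr": "{learning_rate}"}},
    "handlers": [{"name": "CheckpointLoader", "disabled": "{dont_load_ckpt_model}"}],
    "trainer": {"name": "SupervisedTrainer", "args": {"max_epochs": "{epochs}"}},
}
resolved = substitute(section, variables)
```

Because the replacement keeps the variable's type, "disabled" becomes a real boolean and "max_epochs" a real integer rather than strings.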

Component reference in config_train.json

In config_train.json, many transforms in the “validate” section are exact copies of those in the “train” section. The component reference mechanism avoids duplicating these definitions and keeps the config shorter, as the “pre_transforms” and “post_transforms” in the “validate” section above show.

A “ref” entry in the “validate” section points to the transform with the specified name in the “train” section.

Note

The component reference mechanism only works for transforms!

You can also assign an alias to a transform in the “train” section and then use that alias in “ref” in the “validate” section. This is useful when the same transform type is used more than once in the “train” section: assign a different alias to each instance and reference each one by its alias in the “validate” section.

In the following example, the “LoadImaged” transform is assigned an alias “loadImageExample”, which is referenced in the “validate” section:

{
 ...
 "train": {
   ...
   "pre_transforms": [
     {
       "name": "LoadImaged#loadImageExample",
       "args": {
         "keys": [
           "image",
           "label"
         ]
       }
     },
     ...
   ],
   ...
 },
 "validate": {
   ...
   "pre_transforms": [
     {
       "ref": "loadImageExample"
     },
     ...
   ],
   ...
 }
}
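The lookup behind “ref” and the "Name#alias" syntax can be sketched in Python as follows. This is a minimal illustration, not Clara's implementation: index_transforms and resolve_refs are hypothetical helpers, and the sketch assumes that an aliased transform is looked up by its alias only.

```python
def index_transforms(train_transforms):
    """Map each transform's reference key (alias if given, else class name) to its spec."""
    index = {}
    for spec in train_transforms:
        # "LoadImaged#loadImageExample" -> class name "LoadImaged", alias "loadImageExample"
        class_name, _, alias = spec["name"].partition("#")
        index[alias or class_name] = {**spec, "name": class_name}
    return index


def resolve_refs(validate_transforms, train_transforms):
    """Replace {"ref": key} entries with a copy of the referenced train transform."""
    index = index_transforms(train_transforms)
    return [dict(index[t["ref"]]) if "ref" in t else t for t in validate_transforms]


train_pre = [
    {"name": "LoadImaged#loadImageExample", "args": {"keys": ["image", "label"]}},
    {"name": "ToTensord", "args": {"keys": ["image", "label"]}},
]
validate_pre = [{"ref": "loadImageExample"}, {"ref": "ToTensord"}]

resolved = resolve_refs(validate_pre, train_pre)
```

After resolution, the “validate” list holds full transform definitions, so the rest of the pipeline does not need to know that references were used.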

Validation configuration

The validation config file config_validation.json defines the configuration of the validation workflow.

Validation configuration: segmentation model example

Here is the validation config of clara_pt_spleen_ct_segmentation:

{
  "multi_gpu": false,
  "amp": true,
  "dont_load_ts_model": false,
  "dont_load_ckpt_model": true,
  "model": [
    {
      "ts_path": "{MMAR_TORCHSCRIPT}",
      "disabled": "{dont_load_ts_model}"
    },
    {
      "ckpt_path": "{MMAR_CKPT}",
      "disabled": "{dont_load_ckpt_model}"
    }
  ],
  "pre_transforms": [
    {
      "name": "LoadImaged",
      "args": {
        "keys": [
          "image",
          "label"
        ]
      }
    },
    {
      "name": "EnsureChannelFirstd",
      "args": {
        "keys": [
          "image",
          "label"
        ]
      }
    },
    {
      "name": "ScaleIntensityRanged",
      "args": {
        "keys": "image",
        "a_min": -57,
        "a_max": 164,
        "b_min": 0.0,
        "b_max": 1.0,
        "clip": true
      }
    },
    {
      "name": "CropForegroundd",
      "args": {
        "keys": [
          "image",
          "label"
        ],
        "source_key": "image"
      }
    },
    {
      "name": "ToTensord",
      "args": {
        "keys": [
          "image",
          "label"
        ]
      }
    }
  ],
  "dataset": {
    "name": "Dataset",
    "data_list_file_path": "{DATASET_JSON}",
    "data_file_base_dir": "{DATA_ROOT}",
    "data_list_key": "validation"
  },
  "dataloader": {
    "name": "DataLoader",
    "args": {
      "batch_size": 1,
      "shuffle": false,
      "num_workers": 4
    }
  },
  "inferer": {
    "name": "SlidingWindowInferer",
    "args": {
      "roi_size": [
        160,
        160,
        160
      ],
      "sw_batch_size": 4,
      "overlap": 0.5
    }
  },
  "handlers": [
    {
      "name": "CheckpointLoader",
      "disabled": "{dont_load_ckpt_model}",
      "args": {
        "load_path": "{MMAR_CKPT}",
        "load_dict": ["model"]
      }
    },
    {
      "name": "StatsHandler",
      "rank": 0,
      "args": {
        "output_transform": "lambda x: None"
      }
    },
    {
      "name": "SegmentationSaver",
      "args": {
        "output_dir": "{MMAR_EVAL_OUTPUT_PATH}",
        "batch_transform": "lambda x: x['image_meta_dict']",
        "output_transform": "lambda x: x['pred']"
      }
    },
    {
      "name": "MetricsSaver",
      "args": {
        "save_dir": "{MMAR_EVAL_OUTPUT_PATH}",
        "metrics": ["val_mean_dice", "val_acc"],
        "metric_details": ["val_mean_dice"],
        "batch_transform": "lambda x: x['image_meta_dict']",
        "summary_ops": "*",
        "save_rank": 0
      }
    }
  ],
  "post_transforms": [
    {
      "name": "Activationsd",
      "args": {
        "keys": "pred",
        "softmax": true
      }
    },
    {
      "name": "AsDiscreted",
      "args": {
        "keys": ["pred", "label"],
        "argmax": [true, false],
        "to_onehot": true,
        "n_classes": 2
      }
    }
  ],
  "metrics": [
    {
      "name": "MeanDice",
      "log_label": "val_mean_dice",
      "is_key_metric": true,
      "args": {
        "include_background": false,
        "output_transform": "lambda x: (x['pred'], x['label'])"
      }
    },
    {
      "name": "Accuracy",
      "log_label": "val_acc",
      "args": {
        "output_transform": "lambda x: (x['pred'], x['label'])"
      }
    }
  ],
  "evaluator": {
    "name": "SupervisedEvaluator"
  }
}

Users can disable a component by adding "disabled": true to that component, as follows:

{
  "name": "RandFlipd",
  "args": {
    "keys": [
      "image",
      "label"
    ],
    "prob": 0.5,
    "spatial_axis": 0
  },
  "disabled": true
}

This option is useful if users would like to temporarily disable a component when trying different combinations of components in experiments.
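As a small illustration of the effect, the Python sketch below drops disabled entries from a list of component specs before they would be constructed. The helper enabled_components is hypothetical, and the sketch assumes that any "{variable}" placeholder in "disabled" has already been resolved to a boolean.

```python
def enabled_components(components):
    """Return the specs whose "disabled" flag is missing or false, without the flag."""
    kept = []
    for spec in components:
        # Assumption: "disabled" is already a real boolean at this point.
        if spec.get("disabled", False) is True:
            continue
        kept.append({k: v for k, v in spec.items() if k != "disabled"})
    return kept


pre_transforms = [
    {"name": "LoadImaged", "args": {"keys": ["image", "label"]}},
    {
        "name": "RandFlipd",
        "args": {"keys": ["image", "label"], "prob": 0.5, "spatial_axis": 0},
        "disabled": True,
    },
    {"name": "ToTensord", "args": {"keys": ["image", "label"]}, "disabled": False},
]

active = enabled_components(pre_transforms)
print([spec["name"] for spec in active])  # ['LoadImaged', 'ToTensord']
```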

Converting from the old formats

With the introduction of PyTorch and Ignite as the backend and the use of components from MONAI, previous MMARs must be converted to work with v4.0 and later. See Converting from Clara 3.1 to Clara 4.0 for details.