Clara Train FAQ

Here is a list of frequently asked questions. For more questions, or to ask your own, see the NVIDIA Developer Forums.

1. Why should I use Clara Train?

  • Researchers can get back to being data scientists and focus on the problem rather than on coding.

  • NVIDIA’s Clara Train SDK is optimized, as it is written by experienced software developers.

  • Reproducible science through sharing simple configuration files.

  • Training speed-ups.

  • Simple deployment in a clinical setting: models can easily be exported and imported into Clara Deploy.

2. Does the order of handlers matter?

Yes. In Clara Train 4.0, handlers are a list; they run one by one and may depend on each other. Make sure each “write” handler comes before the “read” handlers that depend on it in your training loop.

For example, ValidationHandler should come before CheckpointSaver; otherwise, the best-validation-metric model cannot be saved. StatsHandler and TensorBoardHandler go at the end because they only read and print. Also note that “CheckpointSaver” should come before “StatsHandler” if you set “save_final=True” to save the model: when an exception happens, ignite triggers only one handler and then exits, so the saver must run first (ignite will add more support in a future version).

3. Is determinism supported?

Determinism can be set with a random seed. However, if you set “mode=trilinear” in AHNet, determinism is not supported. See PyTorch docs for more details: https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.interpolate

Other limitations to determinism may exist. For example, the SegResNet implementation in MONAI defaults to upsample_mode: “nontrainable”, which cannot guarantee the model’s reproducibility: https://github.com/Project-MONAI/MONAI/blob/master/monai/networks/nets/segresnet.py#L45

In this case, setting upsample_mode: “deconv” allows determinism to work.
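
For illustration, here is a minimal sketch in plain MONAI/PyTorch (outside of any MMAR config) of the two points above: fixing the random seed and building SegResNet with the deterministic “deconv” upsampling. The argument values are placeholders.

import torch
from monai.networks.nets import SegResNet
from monai.utils import set_determinism

set_determinism(seed=0)  # fix the random seeds for python, numpy, and torch

net = SegResNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    upsample_mode="deconv",  # the default "nontrainable" cannot guarantee reproducibility
)
out = net(torch.zeros(1, 1, 32, 32, 32))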

4. How can I run the MMAR on a very small part of my dataset to quickly verify it?

You can set the “factor” field of the “scale” arg in the dataset section of the configuration:

  1. If factor < 1.0, a random subset of the datalist is picked for the Dataset, which is useful for quickly testing the program.

  2. If factor > 1.0, the datalist is repeated to enlarge the Dataset.

Example usage in an MMAR config:

"dataset": {
   "name": "CacheDataset",
   "data_list_file_path": "{DATASET_JSON}",
   "data_file_base_dir": "{DATA_ROOT}",
   "data_list_key": "training",
   "scale": {
     "factor": 3.5,
     "random_pick": true,
     "seed": 123
   },
   "args": {
     "cache_num": 32,
     "cache_rate": 0.0,
     "num_workers": 4
   }
 }

5. How can I enable sampling for a classification task to balance the dataset?

Clara 4.0 supports 2 sampling modes: “auto”, which automatically computes weights based on the label counts, and “element”, which reads the weight data from dataset.json directly. Both leverage the PyTorch WeightedRandomSampler to balance the dataset:

"dataset": {
   "name": "Dataset",
   "data_list_file_path": "{DATASET_JSON}",
   "data_file_base_dir": "{DATA_ROOT}",
   "data_list_key": "training",
   "sampling": {
     "mode": "auto",
     "num_samples": 50000,  # optional, default to the length of dataset on every rank
     "replacement": true,  # optional, default to True
     "label_key": "label",  # optional, default to "label"
     "weight_key": "weight"  # optional, default to "weight"
   }
 }

For more details, please check: https://github.com/Project-MONAI/MONAI/blob/master/monai/data/samplers.py#L60
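
For a sense of what “auto” mode does under the hood, here is a conceptual sketch in plain PyTorch (not the MMAR config): each sample is weighted by the inverse frequency of its label, and WeightedRandomSampler then draws a balanced stream. The label list and numbers are toy placeholders.

from collections import Counter

from torch.utils.data import WeightedRandomSampler

labels = [0, 0, 0, 0, 1, 1, 2]                # toy, imbalanced label list
counts = Counter(labels)
weights = [1.0 / counts[l] for l in labels]   # rarer classes get larger per-sample weights
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
# pass `sampler` to the DataLoader (instead of shuffle=True) to balance the classes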

6. How can I adjust the LR with the “ReduceLROnPlateau” scheduler based on validation metrics?

If you want to use validation metrics as the indicator for the ReduceLROnPlateau LR scheduler, put the “LrScheduleHandler” handler in the “validate” section of config_train.json:

{
   "name": "LrScheduleHandler",
   "args": {
     "print_lr": true,
     "epoch_level": true,
     "step_transform": "lambda x: x.state.metrics['val_mean_dice']"
   }
}
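
For reference, the step_transform above supplies the validation metric that ReduceLROnPlateau consumes. Here is a minimal plain-PyTorch sketch of that mechanism (the model and metric value are placeholders):

import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")  # higher Dice is better

val_mean_dice = 0.85           # placeholder value from a validation run
scheduler.step(val_mean_dice)  # the LR is reduced when this metric stops improving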

7. How can I set different LR for different network layers?

There is a utility in MONAI to support setting different learning rates for different network layers: https://github.com/Project-MONAI/MONAI/blob/master/monai/optimizers/utils.py#L21. To use it in the MMAR config, you can refer to the following as an example:

"optimizer": {
  "name": "Adam",
  "layer_lrs": [
    {
      "layer_match": "lambda x: x.model[-1]",
      "match_type": "select",
      "lr": 1e-3
    },
    {
      "layer_match": "lambda x: 'conv.weight' in x",
      "match_type": "filter",
      "lr": 5e-4
    }
  ],
  "include_others": true,
  "args": {
    "lr": 1e-4
  }
},
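
For illustration, here is a rough plain-Python sketch of calling that MONAI utility (generate_param_groups) directly; the network is a toy placeholder, and mapping the config keys above to these arguments is our assumption:

import torch
from monai.optimizers import generate_param_groups

net = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.Conv2d(8, 2, 3))
params = generate_param_groups(
    network=net,
    layer_matches=[lambda x: x[-1]],  # "select": pick the last layer as one parameter group
    match_types=["select"],
    lr_values=[1e-3],
    include_others=True,              # remaining parameters fall back to the optimizer's default lr
)
optimizer = torch.optim.Adam(params, lr=1e-4)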

8. How can I enable TF32 for Ampere GPUs?

TF32 is enabled by default on Ampere GPUs as long as it is not explicitly configured to false. The corresponding variable can be set in the MMAR config: “tf32”: true

Note that with TF32 on, running inference with a checkpoint model may result in minor differences compared to running inference using a torchscript model.
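
For reference, TF32 in PyTorch is controlled by the flags below; that the “tf32” MMAR variable maps to these standard PyTorch switches is our assumption:

import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications on Ampere GPUs
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions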

9. How can I enable auto mixed precision (AMP)?

To enable it, set in the MMAR config: “amp”: true

For more details on setting MMAR configs, see Model training and validation configurations. For more details on automatic mixed precision, see: https://developer.nvidia.com/automatic-mixed-precision
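
For reference, here is a minimal sketch of what AMP looks like in plain PyTorch; with “amp”: true the Clara/MONAI training loop handles this for you, so this is only to show the mechanism (the model and data are toy placeholders, and a CUDA GPU is assumed):

import torch

model = torch.nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(8, 16, device="cuda")
target = torch.zeros(8, dtype=torch.long, device="cuda")

with torch.cuda.amp.autocast():                # run the forward pass in mixed precision
    loss = torch.nn.functional.cross_entropy(model(data), target)
scaler.scale(loss).backward()                  # scale the loss to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()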

10. How does AMP affect performance?

If you set “mode=trilinear” in AHNet, AMP=True will be slower than AMP=False on A100 and V100. For more details, see: https://github.com/Project-MONAI/MONAI/issues/1023

11. What can I do if AMP doesn’t show me any difference in the model memory footprint?

The Docker container comes with the dlprof tool; see details at https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling.

Run your training script under dlprof, for example: dlprof train.sh

You will get a set of files. Open nsys_profile.qdrep with the Nsight Systems GUI, using VNC (outside your docker).

12. How can I save and load checkpoints of optimizer and lr_scheduler?

During training or finetuning, users can specify the checkpoints to save in the CheckpointSaver handler of the trainer (not the evaluator), for example:

{
   "name": "CheckpointSaver",
   "rank": 0,
   "args": {
     "save_dir": "{MMAR_CKPT_DIR}",
     "save_dict": ["model", "optimizer", "lr_scheduler"],
     "save_final": true,
     "save_interval": 400
   }
 }

And during finetuning, users can also load checkpoints for “model”, “optimizer”, or “lr_scheduler”:

{
   "name": "CheckpointLoader",
   "args": {
     "load_path": "{MMAR_CKPT}",
     "load_dict": ["model", "optimizer"]
   }
 }

During evaluation, there is no optimizer or lr_scheduler and only “model” is available, so there is no need to set a load dict in the args of the “CheckpointLoader” handler:

{
   "name": "CheckpointLoader",
   "args": {
     "load_path": "{MMAR_CKPT}"
   }
 }

13. Can I use the LoadImageD transform to load DICOM images or series?

Yes, the LoadImageD transform can load DICOM images or series, but it is an experimental feature. You are welcome to try it and provide feedback.
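
For illustration, a minimal sketch with the plain MONAI transform (the file path is a placeholder, and whether a whole DICOM series directory can be passed directly may depend on the MONAI/Clara version):

from monai.transforms import LoadImaged

loader = LoadImaged(keys=["image"])
sample = loader({"image": "/data/patient01/slice_0001.dcm"})  # placeholder path to a DICOM file
image, meta = sample["image"], sample["image_meta_dict"]      # image array plus DICOM metadata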

14. How can I invert all the spatial transforms on a model prediction and compute metrics or save to NIFTI or PNG files?

InvertibleTransform is a highlighted feature of MONAI v0.5: almost every spatial transform and pad/crop transform now supports inverse operations, including random rotate, random flip, random crop/pad, resize, spacing, orientation, etc.

Users just need to set a “TransformInverter” handler to extract the transform context from the specified input field (image or label) and apply the inverse transforms to the expected data (pred, label, etc.):

{
   "name": "TransformInverter",
   "args": {
     "output_keys": ["pred", "label", "pred_class1", "label_class1"],
     "batch_keys": "label",
     "postfix": "inverted",
     "nearest_interp": true
   }
}

Then specify the inverted data key when saving segmentation results to files or computing metrics:

 {
     "name": "MeanDice",
     "log_label": "val_mean_dice",
     "is_key_metric": true,
     "args": {
       "include_background": false,
       "output_transform": "lambda x: (x['pred_inverted'], x['label_inverted'])"
     }
 }
{
   "name": "SegmentationSaver",
   "args": {
     "output_dir": "{MMAR_EVAL_OUTPUT_PATH}",
     "batch_transform": "lambda x: x['image_meta_dict']",
     "output_transform": "lambda x: x['pred_inverted']",
     "resample": false
   }
}

15. How can I set “torch.backends.cudnn.benchmark=True” to accelerate training?

There is a variable for this feature that can be set in the MMAR config: “cudnn_benchmark”: true.

For more information on this variable, see: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936

For more information on accelerating training, see this PyTorch performance tuning guide.
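
For reference, this is the standard PyTorch flag behind that variable (mapping “cudnn_benchmark” to it is our assumption):

import torch

torch.backends.cudnn.benchmark = True  # let cuDNN benchmark and pick the fastest conv algorithms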

16. How can I set “find_unused_parameters=True” for distributed training if the network has some return values not included in loss computation?

There is a variable for this feature that can be set in the MMAR config: “ddp_find_unused_params”: true.

Please note that this must be set to true for DynUNet, DiNTS, and some other networks in multi-GPU training, but in general it will slow down training for most networks. For more information about distributed data parallel, see: https://pytorch.org/docs/stable/notes/ddp.html
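
For reference, here is a minimal sketch of the underlying PyTorch option (mapping “ddp_find_unused_params” to this argument is our assumption; a single-process group is created here only so the snippet runs standalone):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# normally torchrun / the Clara multi-GPU launcher sets up the process group
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 2)
ddp_model = DDP(model, find_unused_parameters=True)  # allows outputs that are not used in the loss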

17. How can I enable SyncBN for multi-GPU training?

According to the PyTorch tutorial, users can sync batch normalization during training: https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm

Clara users can achieve this by setting the config variable: “sync_batchnorm”: true.
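
For reference, the plain PyTorch call behind that variable (mapping “sync_batchnorm” to it is our assumption):

import torch

model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # replaces BatchNorm layers with SyncBatchNorm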

18. Can I set several models, several losses, and several optimizers in a single MMAR?

Yes, you can set a list of losses, a list of models, and a list of optimizers for debugging purposes, but you can only enable one in each list, so the others should have “disabled=true” set.

19. Where does CheckpointSaver go for saving a final model?

If you want to save a final model, please set “save_final=True” for the CheckpointSaver of the trainer, not the evaluator. The evaluator only runs during validation, which depends on how you configure the validation intervals.

20. How can I save “config_train.json” into the checkpoint during training and load the model config from checkpoint during validation?

To avoid repeating the model config in config_validation.json, users can save the “config_train.json” content into “model.pt” and load the model config from it during validation.

Just put the “train_conf” key in the “CheckpointSaver” of both the train and validate sections of config_train.json:

{
   "name": "CheckpointSaver",
   "rank": 0,
   "args": {
     "save_dir": "{MMAR_CKPT_DIR}",
     "save_dict": ["model", "train_conf"],
     "save_key_metric": true
   }
 }

Then put the path of the checkpoint in the model config of config_validation.json:

"model": [
   {
     "ts_path": "{MMAR_TORCHSCRIPT}",
     "disabled": "{dont_load_ts_model}"
   },
   {
     "ckpt_path": "{MMAR_CKPT}",
     "disabled": "{dont_load_ckpt_model}"
   }
 ]

21. How can I apply “EarlyStop” logic based on the loss value or validation metrics during model training?

Clara 4.0 supports early stopping via the “EarlyStopHandler” handler, which stops training once the result is good enough.

To use the loss value for early stopping, add the handler to the “train” section of config_train.json (the negative sign is added because a smaller loss is better):

{
   "name": "EarlyStopHandler",
   "args": {
     "patience": 100,
     "score_function": "lambda x: -x.state.output['loss']",
     "epoch_level": false
   }
}

To use validation metrics (for example: MeanDice) to execute EarlyStop, add the handler to the “validate” section of config_train.json:

{
   "name": "EarlyStopHandler",
   "args": {
     "patience": 1,
     "score_function": "lambda x: x.state.metrics['val_mean_dice']"
  }
}

22. How can I compute metrics on every class of the model output?

For example, the prostate dataset has 3 output classes. To compute MeanDice on all 3 classes, just add the “SplitChannelsD” transform to the post transform list; refer to the BraTS MMAR for more details.
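
For illustration, a rough sketch of the per-class split with the plain MONAI dictionary transform (in MONAI the transform is named SplitChanneld; the keys, postfixes, and shapes below are placeholders and may differ by version):

import numpy as np
from monai.transforms import SplitChanneld

splitter = SplitChanneld(keys=["pred", "label"], output_postfixes=["class0", "class1", "class2"], channel_dim=0)
data = {
    "pred": np.random.rand(3, 64, 64, 64).astype("float32"),   # 3 output channels
    "label": np.random.rand(3, 64, 64, 64).astype("float32"),  # 3 one-hot label channels
}
out = splitter(data)
# out now also contains pred_class0..pred_class2 and label_class0..label_class2,
# so MeanDice can be configured per class, e.g. on pred_class1 / label_class1.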

23. How can handlers be configured to execute only on one rank in distributed data parallel?

There is a “rank” parameter in MMAR handlers; set it to the expected rank index. Please make sure you understand how the handler works and whether it should run on every rank.

For example:

{
   "name": "TensorBoardStatsHandler",
   "rank": 0,
   "args": {
     "log_dir": "{MMAR_CKPT_DIR}",
     "tag_name": "train_loss",
     "output_transform": "lambda x: x['loss']"
   }
 }

24. How can I register new events and trigger them in my own custom components?

There are 2 args in SupervisedTrainer and SupervisedEvaluator: “event_names” and “event_to_attr”. Users can register their own event names with the “event_names” arg:

"trainer": {
   "name": "SupervisedTrainer",
   "args": {
     "max_epochs": "{epochs}",
     "event_names": ["foo_event", "bar_event"]
   }
}

Then attach new handler logic to these events and fire them at the appropriate time. For example, here we attach the “test1()” and “test2()” functions to the events and fire both custom events on the “EPOCH_COMPLETED” event:

from ignite.engine import Events


class TestHandler:
    def attach(self, engine):
        # attach custom logic to the registered custom events
        engine.add_event_handler("foo_event", self.test1)
        engine.add_event_handler("bar_event", self.test2)
        engine.add_event_handler(Events.EPOCH_COMPLETED, self)

    def __call__(self, engine):
        # fire both custom events at the end of every epoch
        engine.fire_event("foo_event")
        engine.fire_event("bar_event")

    def test1(self, engine):
        pass

    def test2(self, engine):
        pass

The other arg, “event_to_attr”, registers new attributes on “engine.state” associated with the event names. For example, here we register 2 attributes and use them in handlers:

"trainer": {
   "name": "SupervisedTrainer",
   "args": {
     "max_epochs": "{epochs}",
     "event_names": ["foo_event", "gar_event"],
     "event_to_attr": {"foo_event": "foo", "gar_event": "gar"}
   }
}

from ignite.engine import Events


class TestHandler:
    def attach(self, engine):
        engine.add_event_handler("foo_event", self.test1)
        engine.add_event_handler("bar_event", self.test2)
        engine.add_event_handler(Events.EPOCH_COMPLETED, self)

    def __call__(self, engine):
        engine.fire_event("foo_event")
        engine.fire_event("bar_event")

    def test1(self, engine):
        # "foo_event" was mapped to the "foo" attribute of engine.state
        print(engine.state.foo)

    def test2(self, engine):
        # "bar_event" was mapped to the "bar" attribute of engine.state
        print(engine.state.bar)

25. In order to do transfer learning on a dataset with a different number of classes, how can I load the model and finetune?

If the transfer learning task has a different number of input or output classes, we can skip the network layers with incompatible shapes when loading the model checkpoint:

{
   "name": "CheckpointLoader",
   "args": {
     "load_path": "{MMAR_CKPT}",
     "load_dict": ["model"],
     "strict": false,
     "strict_shape": false
   }
}

26. How can I save validation metric details into reports?

During evaluation, users usually save the metrics of every input image and then analyze the bad cases to improve the deep learning pipeline. To save detailed metric information, MONAI provides the “MetricsSaver” handler, which can save the final metric values, the raw metric of every model output channel of every input image, and a metrics summary report with operations such as mean, median, max, min, 90th percentile, std, etc. The MeanDice reports of validation with the prostate dataset are shown below:

[Image: sample_validation_report.png — a sample MeanDice validation report]

Just set the handler in “config_validation.json”:

{
   "name": "MetricsSaver",
   "args": {
     "save_dir": "{MMAR_EVAL_OUTPUT_PATH}",
     "metrics": ["val_mean_dice", "val_acc"],
     "metric_details": ["val_mean_dice"],
     "batch_transform": "lambda x: x['image_meta_dict']",
     "summary_ops": "*",
     "save_rank": 0
   }
}