Clara Train FAQ

Here is a list of frequently asked questions. For more questions, including the ability to ask your own, see the NVIDIA Developer Forums.

Benefits of using Clara Train include:

  • Researchers can go back to being data scientists and focus on the problem rather than on coding

  • NVIDIA’s Clara Train SDK is optimized, having been written by experienced software developers.

  • Reproducible science through sharing simple configuration files

  • Training speed-ups

  • Simple deployment in a clinical setting by exporting models with the provided export tools and importing them into Clara Deploy

Yes, the order matters. Handlers are a list in Clara 4.0+; they run one by one and may have dependencies on each other. Make sure a handler that writes state comes before a handler that reads it, according to your training loop.

For example, ValidationHandler should come before CheckpointSaver; otherwise the best validation-metric model cannot be saved. StatsHandler and the TensorBoard handlers go at the end because they only read and report. Note that CheckpointSaver should come before StatsHandler if you set “save_final=True” to save the final model. When an exception occurs, ignite triggers only one handler and then exits; ignite will add more support for this in a future version.

Determinism can be set with a random seed. However, if you set “mode=trilinear” in AHNet, determinism is not supported. See PyTorch docs for more details: https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.interpolate

Other limitations to determinism may exist; for example, the SegResNet implementation in MONAI defaults to upsample_mode: “nontrainable”, which cannot guarantee the model’s reproducibility: https://github.com/Project-MONAI/MONAI/blob/master/monai/networks/nets/segresnet.py#L45

In this case, setting upsample_mode: “deconv” allows determinism to work.
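For reference, a minimal sketch of what fixing the random seed looks like at the MONAI/PyTorch level in a custom script; Clara exposes this through its own MMAR config variables, so the exact mechanism may differ:

# Hedged sketch (not Clara-specific): fixing the random seed with MONAI/PyTorch.
from monai.utils import set_determinism

set_determinism(seed=0)  # seeds Python, NumPy, and PyTorch RNGs and configures cuDNN for determinism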

You can subsample or repeat the datalist by setting “factor” under the “scale” arg of your dataset in the configuration:

  1. If factor < 1.0, a random subset of the datalist is picked for the Dataset, which is useful for quickly testing the program.

  2. If factor > 1.0, the datalist is repeated to enlarge the Dataset.

Example of the usage in MMAR:

"dataset": { "name": "CacheDataset", "data_list_file_path": "{DATASET_JSON}", "data_file_base_dir": "{DATA_ROOT}", "data_list_key": "training", "scale": { "factor": 3.5, "random_pick": true, "seed": 123 }, "args": { "cache_num": 32, "cache_rate": 1.0, "num_workers": 4 } }

Clara 4.1 supports two sampling modes: “auto”, which automatically computes weights based on the label counts, and “element”, which reads the weight data directly from dataset.json. Both leverage the PyTorch WeightedRandomSampler to balance the dataset:

"dataset": { "name": "Dataset", "data_list_file_path": "{DATASET_JSON}", "data_file_base_dir": "{DATA_ROOT}", "data_list_key": "training", "sampling": { "mode": "auto", "num_samples": 50000, # optional, default to the length of dataset on every rank "replacement": true, # optional, default to True "label_key": "label", # optional, default to "label" "weight_key": "weight" # optional, default to "weight" } }

For more details, please check: https://github.com/Project-MONAI/MONAI/blob/master/monai/data/samplers.py#L60
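For reference, a small standalone sketch of the underlying PyTorch WeightedRandomSampler, with per-sample weights derived from label counts (an approximation of what “auto” mode does; the exact weighting Clara/MONAI computes may differ):

# Hedged sketch: balancing an imbalanced dataset with WeightedRandomSampler.
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 1, 1, 2])                 # imbalanced toy labels
counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / counts[int(l)] for l in labels])  # rarer labels get larger weights

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
loader = DataLoader(TensorDataset(labels), batch_size=2, sampler=sampler)
for batch in loader:
    print(batch)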

If you want to use a validation metric as the indicator for the ReduceLROnPlateau LR scheduler, put the “LrScheduleHandler” in the “validate” section of config_train.json:

{ "name": "LrScheduleHandler", "args": { "optimizer": "@optimizer", "print_lr": true, "epoch_level": true, "step_transform": "lambda x: x.state.metrics['val_mean_dice']" } }

There is a utility in MONAI to support setting different learning rates for different network layers: https://github.com/Project-MONAI/MONAI/blob/master/monai/optimizers/utils.py#L21. To use it in the MMAR config, you can refer to the following as an example:

"params": "#monai.optimizers.generate_param_groups(network=@model, layer_matches=[lambda x: x.class_layers, lambda x: 'conv.weight' in x[0]], match_types=['select', 'filter'], lr_values=[1e-3, 1e-4])"

TF32 is enabled by default on Ampere GPUs unless it is explicitly configured to false. The variable can be set in the MMAR config: “tf32”: true

Note that with TF32 on, running inference with a checkpoint model may result in minor differences compared to running inference using a torchscript model.
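For reference, the MMAR variable corresponds to the standard PyTorch TF32 switches below; setting them manually is only needed in custom scripts:

# PyTorch-level TF32 switches (effective on Ampere and newer GPUs).
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions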

To enable automatic mixed precision, set in the MMAR config: “amp”: true

For more details on setting MMAR configs, see Model training and validation configurations. For more details on automatic mixed precision, see: https://developer.nvidia.com/automatic-mixed-precision

If you set “mode=trilinear” in AHNet, AMP=True will be slower than AMP=False on A100 and V100 GPUs. For more details, see: https://github.com/Project-MONAI/MONAI/issues/1023
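As background, the “amp” variable enables PyTorch automatic mixed precision in the training loop; a minimal standalone sketch (the model and data here are placeholders, and a CUDA device is assumed) looks roughly like this:

# Hedged sketch: one training step with torch.cuda.amp autocast and gradient scaling.
import torch

model = torch.nn.Linear(16, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 16, device="cuda")
y = torch.randint(0, 2, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()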

The Docker container comes with the dlprof tool; see details at https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling.

Run dlprof with your training script, for example: dlprof train.sh

This produces a set of output files. Open nsys_profile.qdrep with the Nsight Systems GUI (outside your Docker container, for example over VNC).

During training or fine-tuning, users can specify which objects to save as checkpoints in the CheckpointSaver handler of the trainer (not the evaluator), for example:

{ "name": "CheckpointSaver", "rank": 0, "args": { "save_dir": "{MMAR_CKPT_DIR}", "save_dict": { "model": "@model", "optimizer: "@optimizer", "lr_scheduler": "@lr_scheduler"], }, "save_final": true, "save_interval": 400 } }

During fine-tuning, users can also load checkpoints for the “model”, “optimizer”, or “lr_scheduler”:

{ "name": "CheckpointLoader", "args": { "load_path": "{MMAR_CKPT}", "load_dict": { "model": "@model", "optimizer": "@optimizer" } } }

During evaluation there is no optimizer or lr_scheduler and only the “model” is available, so when setting the “CheckpointLoader” handler there is no need to specify “load_dict” in the args:

{ "name": "CheckpointLoader", "args": { "load_path": "{MMAR_CKPT}" } }

Yes, the LoadImageD transform can load DICOM images or series, but it is an experimental feature. You are welcome to try it and provide feedback.
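A hedged sketch of loading a DICOM series directly with MONAI’s dictionary-style loader; the directory path is a placeholder and reader selection may vary by MONAI version:

# Hedged sketch: reading a DICOM series with the dictionary-style loader.
from monai.transforms import LoadImaged

# "ITKReader" is one of the readers able to read DICOM files/series.
loader = LoadImaged(keys="image", reader="ITKReader")
data = loader({"image": "/path/to/dicom_series_dir"})  # placeholder path
print(data["image"].shape)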

“Invertd” was a highlighted feature of MONAI v0.6; almost every spatial transform and pad/crop transform now supports inverse operations, including random rotate, random flip, random crop/pad, resize, spacing, orientation, etc.

Users just need to add an “Invertd” post transform, which extracts the transform context from the specified input field (image or label) and applies the inverse transforms to the expected data (pred, label, etc.):

{ "name": "Invertd", "args": { "keys": "pred", "transform": "@pre_transforms", "orig_keys": "image", "meta_keys": "pred_meta_dict", "nearest_interp": false, "to_tensor": true, "device": "cuda" } }

To enable cuDNN benchmark mode, there is a variable that can be set in the MMAR config: “cudnn_benchmark”: true.

For more information on this variable, see: https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936
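For reference, the MMAR variable maps to the standard PyTorch flag below; it speeds up convolutions when input sizes do not change between iterations:

# Let cuDNN benchmark and cache the fastest convolution algorithms for fixed input sizes.
import torch

torch.backends.cudnn.benchmark = True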

For more information on accelerating training, see this PyTorch performance tuning guide.

To have DistributedDataParallel look for unused parameters, there is a variable that can be set in the MMAR config: “ddp_find_unused_params”: true.

Please note that this must be set to true for DynUNet, DiNTS, and some other networks for multi-GPU training, but the setting will slow down training for most networks in general. For more information about distributed data parallel, see: https://pytorch.org/docs/stable/notes/ddp.html
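In raw PyTorch, this corresponds to the “find_unused_parameters” argument of DistributedDataParallel; a minimal sketch, assuming the process group has already been initialized by the launcher:

# Hedged sketch: wrapping a model in DDP with find_unused_parameters enabled.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(16, 2).cuda()
ddp_model = DDP(model, device_ids=[torch.cuda.current_device()], find_unused_parameters=True)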

According to the PyTorch tutorial, users can sync batch normalization during training: https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm

Clara users can achieve this by setting the config variable: “sync_batchnorm”: true.
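The config variable corresponds to the conversion shown in the linked PyTorch docs; roughly:

# Replace BatchNorm layers with SyncBatchNorm so statistics are synchronized across ranks.
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3), torch.nn.BatchNorm2d(8))
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)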

Yes, you can set a list of losses, a list of models, and a list of optimizers for debugging purposes, but only one item in each list can be enabled, so the others should have “disabled=true” set.

If you want to save a final model, set “save_final=True” on the CheckpointSaver of the trainer, not the evaluator. The evaluator only runs at validation time, which depends on how you configure the validation intervals.

To avoid repeating the model config in config_validation.json, users can also save the “config_train.json” content into “model.pt” and load the model config from it during evaluation.

Just put the “train_conf” key in the “CheckpointSaver” of the train and validate sections of config_train.json:

{ "name": "CheckpointSaver", "rank": 0, "args": { "save_dir": "{MMAR_CKPT_DIR}", "save_dict": {"model": "@model", "train_conf": "@conf"}, "save_key_metric": true } }

Then put the path of the checkpoint in the model config of config_validation.json:

"model": [ { "ts_path": "{MMAR_TORCHSCRIPT}", "disabled": "{dont_load_ts_model}" }, { "ckpt_path": "{MMAR_CKPT}", "disabled": "{dont_load_ckpt_model}" } ]

The “EarlyStopHandler” can be configured to stop training once a result is good enough.

To use the loss value to trigger early stopping, add the handler to the “train” section of config_train.json (the score is negated with “-” because a smaller loss is better):

{ "name": "EarlyStopHandler", "args": { "trainer": "@trainer", "patience": 100, "score_function": "lambda x: -x.state.output['loss']", "epoch_level": false } }

To use a validation metric (for example, MeanDice) to trigger early stopping, add the handler to the “validate” section of config_train.json:

{ "name": "EarlyStopHandler", "args": { "trainer": "@trainer", "patience": 1, "score_function": "lambda x: x.state.metrics['val_mean_dice']" } }

For example, the prostate dataset has 3 output classes; to compute MeanDice on all 3 classes, just add the “SplitChannelsD” transform to the post-transform list. Refer to the Brats MMAR for more details.

Handlers in the MMAR have a “rank” parameter; set it to the expected rank index. Make sure you understand how each handler works and whether it should run on every rank.

For example:

{ "name": "TensorBoardStatsHandler", "rank": 0, "args": { "log_dir": "{MMAR_CKPT_DIR}", "tag_name": "train_loss", "output_transform": "lambda x: x['loss']" } }

There are 2 args in SupervisedTrainer and SupervisedEvaluator: “event_names” and “event_to_attr”. You can register your own event names with the “event_names” arg:

"trainer": { "name": "SupervisedTrainer", "args": { "max_epochs": "{epochs}", "event_names": ["foo_event", "bar_event"] } }

Then attach handler logic to these events and fire them at some point. For example, here we attach the “test1()” and “test2()” functions to the events and fire both custom events on the “EPOCH_COMPLETED” event:

from ignite.engine import Events

class TestHandler:
    def attach(self, engine):
        engine.add_event_handler("foo_event", self.test1)
        engine.add_event_handler("bar_event", self.test2)
        engine.add_event_handler(Events.EPOCH_COMPLETED, self)

    def __call__(self, engine):
        engine.fire_event("foo_event")
        engine.fire_event("bar_event")

    def test1(self, engine):
        pass

    def test2(self, engine):
        pass

The other arg, “event_to_attr”, registers new attributes on “engine.state” associated with the event names. For example, here we register two attributes and use them in the handlers:

"trainer": { "name": "SupervisedTrainer", "args": { "max_epochs": "{epochs}", "event_names": ["foo_event", "gar_event"], "event_to_attr": {"foo_event": "foo", "gar_event": "gar"} } } class TestHandler: def attach(self, engine): engine.add_event_handler("foo_event", self.test1) engine.add_event_handler("gar_event", self.test2) engine.add_event_handler(Events.EPOCH_COMPLETED, self) def __call__(self, engine): engine.fire_event("foo_event") engine.fire_event("gar_event") def test1(self, engine): print(engine.state.foo) def test2(self, engine): print(engine.state.gar)

If the transfer learning task has different input or output data classes, we can skip the network layers that have incompatible shapes when loading the model checkpoint:

{ "name": "CheckpointLoader", "args": { "load_path": "{MMAR_CKPT}", "load_dict": {"model": "@model"}, "strict": false, "strict_shape": false } }

You can enable the network summary function by adding the following section into config_train.json or config_validation.json:

{ "train": {...}, "validate" : {...}, "network_summary": { "network": "@model", "input_size": [2, 1, 96, 96, 96] } }

During evaluation, users usually save the metrics for every input image and then analyze the bad cases to improve the deep learning pipeline. To save detailed metric information, MONAI provides the “MetricsSaver” handler, which can save the final metric values, the raw metric for every model output channel of every input image, and a metrics summary report with operations such as mean, median, max, min, 90th percentile, and std. The MeanDice reports of a validation run on the prostate dataset are shown below:

[Figure: sample validation report (sample_validation_report.png)]

Just set the handler in “config_validation.json”:

{ "name": "MetricsSaver", "args": { "save_dir": "{MMAR_EVAL_OUTPUT_PATH}", "metrics": ["val_mean_dice", "val_acc"], "metric_details": ["val_mean_dice"], "batch_transform": "lambda x: x['image_meta_dict']", "summary_ops": "*", "save_rank": 0 } }

If you have needs for custom training logic that is not covered by the training workflow, you can bring your own workflow.

The temporary directory configured for caching must be accessible from where the training command is run. A workaround is to add the following to set_env.sh, as in the chest X-ray MMAR example:

export TMPDIR=$(pwd)

There is a hard-to-reproduce memory allocation issue in PyTorch that may cause training to get stuck. To avoid this, set “num_workers” to 0 for your “dataloader” in the train config. If training is running on an FL client and gets stuck, you may have to abort the task or abort the client run.

© Copyright 2021, NVIDIA. Last updated on Feb 2, 2023.