Essential concepts

This section describes the concepts you need to understand Clara Train. Clara Train v4.0 and later uses MONAI’s training workflows, which are based on PyTorch Ignite’s engine. Check out an example notebook to get started.


Clara Train is built on a component-based architecture and, for v4.0 and later, uses components from MONAI:

As we add examples showing how to leverage components in specific ways, we are working on more detailed documentation on how the components impact the training workflow here.

The design principle of MONAI was to split PyTorch and Ignite dependencies into different layers, so most of MONAI’s components just follow regular PyTorch APIs. This way, for networks, loss functions, metrics, optimizers, and transforms, users can easily bring MONAI components into their regular PyTorch program or bring their own PyTorch components into a MONAI program.

In Clara Train v4.0 and later, the training and validation workflows are based on PyTorch and PyTorch-Ignite:


With a standard workflow, it is much easier to develop Clara products. PyTorch-Ignite is lightweight and flexible, with a feature set that fits this purpose, so the MONAI workflow was built on it.

As you can see in the above chart, there are three concepts at the core of Ignite: engines, events, and handlers.

An engine is a trainer, validator, or evaluator. Once initialized with PyTorch components and registered with the predefined events, an engine can execute a training loop. During the loop, it triggers different events at different times: for example, when training is started, when an iteration is completed, when an exception is raised, and more. The handlers attached to each of those events are then called.

For example, the log print logic defined in the StatsHandler will be called when an iteration is completed and when an epoch is completed.
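The relationship between engines, events, and handlers can be sketched in plain Python. The following is a toy model only, not Ignite's or MONAI's implementation: `ToyEngine`, the string event names, and the two logging handlers are hypothetical stand-ins chosen to mirror the description above, and only a handful of events are modeled.

```python
from collections import defaultdict


class ToyEngine:
    """A tiny stand-in for an Ignite-style engine (hypothetical, for illustration)."""

    def __init__(self, process_fn):
        self.process_fn = process_fn       # called once per batch
        self.handlers = defaultdict(list)  # event name -> attached handlers
        self.iteration = 0
        self.epoch = 0
        self.output = None

    def add_event_handler(self, event, handler):
        self.handlers[event].append(handler)

    def fire_event(self, event):
        # Every handler attached to this event receives the engine itself.
        for handler in self.handlers[event]:
            handler(self)

    def run(self, data, max_epochs):
        self.fire_event("STARTED")
        for _ in range(max_epochs):
            self.epoch += 1
            self.fire_event("EPOCH_STARTED")
            for batch in data:
                self.iteration += 1
                self.output = self.process_fn(batch)
                self.fire_event("ITERATION_COMPLETED")
            self.fire_event("EPOCH_COMPLETED")
        self.fire_event("COMPLETED")


log = []  # messages are collected rather than printed so they are easy to inspect


def log_iteration(engine):
    log.append(f"iter {engine.iteration}: output={engine.output}")


def log_epoch(engine):
    log.append(f"epoch {engine.epoch} done")


engine = ToyEngine(process_fn=lambda batch: sum(batch))
engine.add_event_handler("ITERATION_COMPLETED", log_iteration)
engine.add_event_handler("EPOCH_COMPLETED", log_epoch)
engine.run(data=[[1, 2], [3, 4]], max_epochs=2)
```

The two handlers play the role that the StatsHandler plays in the example above: one reacts to every completed iteration, the other to every completed epoch, and neither needs to know how the loop itself is written.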

Data pipelines are responsible for producing batched data items during training. Typically, two data pipelines are used: one for producing training data and another for producing validation data.

A data pipeline contains a chain of transforms that are applied to the input image and label data to produce the data in the format required by the model.

See MONAI Datasets for more information.
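As a rough illustration of the idea of two pipelines that each produce batched data items, here is a minimal sketch in plain Python. It does not use MONAI's dataset or PyTorch's `DataLoader` classes; `batched`, `make_pipeline`, `scale_image`, and the item layout are all hypothetical names invented for this example.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def make_pipeline(items, transform, batch_size):
    """Apply `transform` to every item, then group the results into batches."""
    return list(batched([transform(item) for item in items], batch_size))


def scale_image(item):
    # Return a new dict so the source item is left untouched.
    return {**item, "image": [value / 10 for value in item["image"]]}


# Hypothetical data items: each one pairs an image with its label.
train_items = [{"image": [i, i + 1], "label": i % 2} for i in range(5)]
val_items = [{"image": [i, i + 1], "label": i % 2} for i in range(2)]

# One pipeline for training data, another for validation data.
train_batches = make_pipeline(train_items, scale_image, batch_size=2)
val_batches = make_pipeline(val_items, scale_image, batch_size=2)
```

With five training items and a batch size of 2, the last training batch holds a single item. A real loader usually offers an option to drop or keep such a partial batch.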

Data pipelines contain chains of transformations.

For a list of available transforms, see the MONAI transforms section.
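A chain of transforms can be pictured as a list of callables applied in order to a dictionary holding the image and label. The sketch below is a simplified stand-in written in plain Python, not MONAI's `Compose` or any of its transforms; `ToyCompose`, `scale_intensity`, and `binarize_label` are hypothetical names used only for illustration.

```python
class ToyCompose:
    """Apply a sequence of callables in order (simplified stand-in)."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)
        return data


def scale_intensity(data):
    # Min-max scale the image values into the range [0, 1].
    image = data["image"]
    lowest, highest = min(image), max(image)
    span = (highest - lowest) or 1  # avoid dividing by zero on constant images
    return {**data, "image": [(value - lowest) / span for value in image]}


def binarize_label(data):
    # Map every positive label value to 1 and everything else to 0.
    return {**data, "label": [1 if value > 0 else 0 for value in data["label"]]}


pipeline = ToyCompose([scale_intensity, binarize_label])
sample = {"image": [0, 5, 10], "label": [0, 2, 3]}
result = pipeline(sample)
```

Because each transform takes a dictionary and returns a dictionary, transforms that touch only the image, only the label, or both can be mixed freely in one chain.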

Here is a list of the events in MONAI:

Event name                   Description
STARTED                      triggered when engine’s run is started
EPOCH_STARTED                triggered when the epoch is started
GET_BATCH_STARTED            triggered before next batch is fetched
GET_BATCH_COMPLETED          triggered after the batch is fetched
ITERATION_STARTED            triggered when an iteration is started
FORWARD_COMPLETED            triggered when network(image, label) completed (MONAI)
LOSS_COMPLETED               triggered when loss(pred, label) completed (MONAI)
BACKWARD_COMPLETED           triggered when loss.backward() completed (MONAI)
ITERATION_COMPLETED          triggered when the iteration is ended
DATALOADER_STOP_ITERATION    triggered when dataloader has no more data to provide
EXCEPTION_RAISED             triggered when an exception is encountered
TERMINATE_SINGLE_EPOCH       triggered when the run is about to end the current epoch
TERMINATE                    triggered when the run is about to end completely
EPOCH_COMPLETED              triggered when the epoch is ended
COMPLETED                    triggered when engine’s run is completed
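To make the ordering of these events concrete, the sketch below walks through a toy training run and records each event as it would fire. It is an assumption-laden model, not the actual engine: the event names are written as plain strings that follow Ignite's `Events` and MONAI's `IterationEvents` naming, `run_toy_trainer` is a hypothetical function, and the conditional events (dataloader exhaustion, exceptions, early termination) are left out.

```python
def run_toy_trainer(num_epochs, batches, fire):
    """Walk through one run, calling fire(name) at each point of interest."""
    fire("STARTED")
    for _ in range(num_epochs):
        fire("EPOCH_STARTED")
        iterator = iter(batches)
        for _ in range(len(batches)):
            fire("GET_BATCH_STARTED")
            _batch = next(iterator)       # the batch is fetched here
            fire("GET_BATCH_COMPLETED")
            fire("ITERATION_STARTED")
            fire("FORWARD_COMPLETED")     # after network(image)
            fire("LOSS_COMPLETED")        # after loss(pred, label)
            fire("BACKWARD_COMPLETED")    # after loss.backward()
            fire("ITERATION_COMPLETED")
        fire("EPOCH_COMPLETED")
    fire("COMPLETED")


trace = []
run_toy_trainer(num_epochs=1, batches=["batch_a", "batch_b"], fire=trace.append)
```

For one epoch over two batches, `trace` holds 18 entries: two at the start, seven per iteration, and two at the end.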

MONAI workflows use the engine as context data to communicate with handlers: each handler accepts the engine as an argument. Here is all the available data in the engine:


properties:
    network,  # the model in use
    optimizer,  # (only in trainer) the optimizer for training progress
    loss_function,  # (only in trainer) loss function for training progress
    amp,  # flag that whether we are in AMP mode
    should_terminate,  # flag that whether we should terminate the program
    should_terminate_single_epoch,  # flag that whether we should terminate current epoch
    state:
        rank,  # rank index of current process
        iteration,  # current iteration index, count from the first epoch
        epoch,  # current epoch index
        max_epochs,  # target epochs to complete current round
        epoch_length,  # count of iterations in every epoch
        output,  # output dict of current iteration, for trainer: ({"image": x, "label": x, "pred": x, "loss": x})
                 # for evaluator: ({"image": x, "label": x, "pred": x})
        batch,  # input dict for current iteration
        metrics,  # dict of metrics values, if we run metrics on training data to check overfitting,
                  # trainer will also have metrics results
        metric_details,  # dict of the temp data of metrics if set `save_details=True`,
                         # for example, MeanDice metric can save the mean dice of every channel in every image
        dataloader,  # DataLoader object to provide data
        device,  # device to run the program, cuda or cpu, etc.
        key_metric_name,  # name of the key metric
        best_metric,  # best metric value of the key metric
        best_metric_epoch,  # epoch index that we got the best metric
public functions:
    terminate(),  # send terminate signal to the engine, will terminate completely after current iteration
    terminate_epoch(),  # send terminate signal to the engine, will terminate current epoch after current iteration

Users can also add more data into engine.state at runtime.
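The sketch below illustrates, under simplifying assumptions, how a handler might read data from the engine, attach extra data to `engine.state` at runtime, and request termination. `ToyEngine` is a hypothetical stand-in that models only a few of the fields listed above, and both `early_stop_on_small_loss` and the `loss_history` field are invented for this example rather than part of MONAI.

```python
from types import SimpleNamespace


class ToyEngine:
    """Hypothetical stand-in exposing a small subset of the engine data."""

    def __init__(self):
        self.state = SimpleNamespace(iteration=0, epoch=0, output=None)
        self.should_terminate = False

    def terminate(self):
        # Signal that the run should stop after the current iteration.
        self.should_terminate = True


def early_stop_on_small_loss(engine, threshold=0.1):
    """Handler: record each loss and ask the engine to stop once it is small."""
    loss = engine.state.output["loss"]
    # `loss_history` is a custom field added to the state at runtime.
    if not hasattr(engine.state, "loss_history"):
        engine.state.loss_history = []
    engine.state.loss_history.append(loss)
    if loss < threshold:
        engine.terminate()


engine = ToyEngine()
for loss in [0.9, 0.4, 0.05, 0.01]:
    engine.state.iteration += 1
    engine.state.output = {"loss": loss}  # what a trainer step would produce
    early_stop_on_small_loss(engine)      # called as if on iteration completed
    if engine.should_terminate:
        break
```

The loop stops at the third iteration, so the last loss value is never processed. The handler needs no arguments beyond the engine, because everything it reads and writes lives on the engine or its state.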

Any of the metrics already implemented and included in PyTorch Ignite may be used, as well as custom metrics implemented in the same way.
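Ignite metrics follow a reset, update, compute life cycle, and a custom metric is expected to provide the same three methods. The following plain-Python sketch mirrors that shape without depending on Ignite: `ToyAccuracy` is a hypothetical class, and a real custom metric would normally subclass Ignite's `Metric` base class instead.

```python
class ToyAccuracy:
    """Mirrors the reset/update/compute life cycle of an Ignite-style metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Called at the start of each evaluation run to clear accumulated state.
        self.correct = 0
        self.total = 0

    def update(self, output):
        # `output` is a (predictions, targets) pair for one batch.
        y_pred, y = output
        self.correct += sum(int(p == t) for p, t in zip(y_pred, y))
        self.total += len(y)

    def compute(self):
        # Called once all batches have been seen.
        if self.total == 0:
            raise ValueError("compute() called before any update()")
        return self.correct / self.total


metric = ToyAccuracy()
metric.update(([1, 0, 1], [1, 1, 1]))  # 2 of 3 correct
metric.update(([0, 0], [0, 0]))        # 2 of 2 correct
accuracy = metric.compute()            # 4 correct out of 5
```

Keeping the accumulation in `update` and the final arithmetic in `compute` lets the metric be evaluated batch by batch without holding all predictions in memory.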

© Copyright 2021, NVIDIA. Last updated on Feb 2, 2023.