NVIDIA Clara Train 4.1

Essential concepts

This section describes essential concepts necessary for understanding Clara Train. Clara Train v4.0 and later uses MONAI's training workflows, which are based on PyTorch Ignite's engine. Check out an example notebook for getting started.

Figure: Clara Train components built on MONAI.

Clara Train is built using a component-based architecture; for v4.0 and later, its components come from MONAI.

As we add examples showing how to leverage components in specific ways, we are also working on more detailed documentation on how the components affect the training workflow.

A design principle of MONAI was to split the PyTorch and Ignite dependencies into different layers, so most of MONAI's components simply follow regular PyTorch APIs. As a result, for networks, loss functions, metrics, optimizers, and transforms, users can easily bring MONAI components into a regular PyTorch program, or bring their own PyTorch components into a MONAI program.
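For example, a MONAI network and loss function can be dropped into an ordinary PyTorch training loop. The following is only a minimal sketch, not code from Clara Train itself: the UNet hyperparameters and the random tensors are placeholders, and constructor argument names such as spatial_dims can differ slightly between MONAI versions.

    import torch
    from monai.losses import DiceLoss
    from monai.networks.nets import UNet

    # MONAI components used as plain PyTorch modules
    net = UNet(spatial_dims=3, in_channels=1, out_channels=2,
               channels=(16, 32, 64, 128), strides=(2, 2, 2))
    loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

    # one step of a regular PyTorch training loop with dummy data
    image = torch.rand(2, 1, 64, 64, 64)             # batch of 3D volumes
    label = torch.randint(0, 2, (2, 1, 64, 64, 64))  # integer class labels
    optimizer.zero_grad()
    loss = loss_fn(net(image), label)
    loss.backward()
    optimizer.step()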

In Clara Train v4.0 and later, the training and validation workflows are based on PyTorch and PyTorch Ignite:

Figure: MONAI training workflow based on PyTorch Ignite.

A standard workflow makes it much easier to develop Clara products. PyTorch Ignite is lightweight and flexible, with a feature set that fits this need, so the MONAI workflow was built on top of it.

As the chart above shows, there are three concepts at the core of Ignite: engines, events, and handlers.

An engine is a trainer, validator, or evaluator. Once initialized with PyTorch components and registered predefined events, an engine can execute a training loop. During the loop, it triggers different events at different times, for example when training starts, when an iteration completes, or when an exception is raised. The handlers attached to those specific events are then called.

For example, the log-printing logic defined in the StatsHandler is called when an iteration completes and when an epoch completes.
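The sketch below illustrates these three concepts together, assuming the MONAI workflow APIs shipped with Clara Train 4.x (SupervisedTrainer and StatsHandler); the network size, learning rate, and random in-memory data are placeholders, and the StatsHandler output_transform assumes the dict-style iteration output described later in this section, which may differ in other MONAI versions.

    import torch
    from monai.data import DataLoader, Dataset
    from monai.engines import SupervisedTrainer
    from monai.handlers import StatsHandler
    from monai.losses import DiceLoss
    from monai.networks.nets import UNet

    device = torch.device("cpu")
    net = UNet(spatial_dims=3, in_channels=1, out_channels=2,
               channels=(8, 16, 32), strides=(2, 2)).to(device)

    # tiny in-memory dataset of random 3D volumes (placeholder data)
    data = [{"image": torch.rand(1, 32, 32, 32),
             "label": torch.randint(0, 2, (1, 32, 32, 32)).float()} for _ in range(4)]
    loader = DataLoader(Dataset(data), batch_size=2)

    trainer = SupervisedTrainer(
        device=device,
        max_epochs=1,
        train_data_loader=loader,
        network=net,
        optimizer=torch.optim.Adam(net.parameters(), lr=1e-3),
        loss_function=DiceLoss(to_onehot_y=True, softmax=True),
        # StatsHandler prints the loss when the trainer fires iteration/epoch events;
        # the output_transform assumes engine.state.output is a dict containing "loss"
        train_handlers=[StatsHandler(tag_name="train_loss",
                                     output_transform=lambda x: x["loss"])],
    )
    trainer.run()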

Data pipelines are responsible for producing batched data items during training. Typically, two data pipelines are used: one for producing training data and another for producing validation data.

A data pipeline contains a chain of transforms that are applied to the input image and label data to produce the data in the format required by the model.

See MONAI Datasets for more information.

For a list of the transforms available for building these chains, see the MONAI transforms section.
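As a rough sketch of such a chain (the file names below are placeholders, and the exact dictionary-based transform names, for example EnsureChannelFirstd, can vary between MONAI versions):

    from monai.data import DataLoader, Dataset
    from monai.transforms import (Compose, EnsureChannelFirstd, LoadImaged,
                                  ScaleIntensityd, ToTensord)

    # hypothetical file list; each item is a dict of image/label paths
    files = [{"image": "img0.nii.gz", "label": "seg0.nii.gz"},
             {"image": "img1.nii.gz", "label": "seg1.nii.gz"}]

    # chain of transforms applied to every dict item
    transforms = Compose([
        LoadImaged(keys=["image", "label"]),
        EnsureChannelFirstd(keys=["image", "label"]),
        ScaleIntensityd(keys="image"),
        ToTensord(keys=["image", "label"]),
    ])

    dataset = Dataset(data=files, transform=transforms)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)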

Here is a list of the events in MONAI:

STARTED: triggered when the engine's run is started

EPOCH_STARTED: triggered when an epoch is started

GET_BATCH_STARTED: triggered before the next batch is fetched

GET_BATCH_COMPLETED: triggered after a batch is fetched

ITERATION_STARTED: triggered when an iteration is started

FORWARD_COMPLETED: triggered when network(image, label) has completed (MONAI)

LOSS_COMPLETED: triggered when loss(pred, label) has completed (MONAI)

BACKWARD_COMPLETED: triggered when loss.backward() has completed (MONAI)

ITERATION_COMPLETED: triggered when an iteration has ended

DATALOADER_STOP_ITERATION: triggered when the dataloader has no more data to provide

EXCEPTION_RAISED: triggered when an exception is encountered

TERMINATE_SINGLE_EPOCH: triggered when the run is about to end the current epoch

TERMINATE: triggered when the run is about to end completely

EPOCH_COMPLETED: triggered when an epoch has ended

COMPLETED: triggered when the engine's run is completed
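Handlers can be attached to any of these events. The snippet below is a plain PyTorch Ignite sketch with a dummy training step; the events marked (MONAI) above are added by MONAI's workflow engines on top of the built-in Ignite events and can be used in the same way.

    from ignite.engine import Engine, Events

    def train_step(engine, batch):
        # stand-in for one real training iteration
        return {"loss": 0.0}

    trainer = Engine(train_step)

    def on_epoch_completed(engine):
        print(f"epoch {engine.state.epoch} done after {engine.state.iteration} iterations")

    # equivalent to decorating on_epoch_completed with @trainer.on(Events.EPOCH_COMPLETED)
    trainer.add_event_handler(Events.EPOCH_COMPLETED, on_epoch_completed)

    trainer.run([[0, 1], [2, 3]], max_epochs=2)  # any iterable of batches works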

MONAI workflows use the engine as context data to communicate with handlers; handlers accept the engine as an argument. Here is the data available in the engine:


    properties:
        network,                        # the model in use
        optimizer,                      # (trainer only) the optimizer for training
        loss_function,                  # (trainer only) the loss function for training
        amp,                            # flag indicating whether AMP mode is enabled
        should_terminate,               # flag indicating whether the program should terminate
        should_terminate_single_epoch,  # flag indicating whether the current epoch should terminate

    state:
        rank,               # rank index of the current process
        iteration,          # current iteration index, counted from the first epoch
        epoch,              # current epoch index
        max_epochs,         # target number of epochs for the current run
        epoch_length,       # number of iterations in every epoch
        output,             # output dict of the current iteration,
                            # for a trainer: {"image": x, "label": x, "pred": x, "loss": x}
                            # for an evaluator: {"image": x, "label": x, "pred": x}
        batch,              # input dict for the current iteration
        metrics,            # dict of metric values; if metrics are run on training data to check
                            # overfitting, the trainer will also have metric results
        metric_details,     # dict of intermediate metric data if `save_details=True` is set,
                            # for example, the MeanDice metric can save the mean dice of every channel of every image
        dataloader,         # DataLoader object that provides the data
        device,             # device to run the program on, e.g. cuda or cpu
        key_metric_name,    # name of the key metric
        best_metric,        # best value of the key metric
        best_metric_epoch,  # epoch index at which the best metric was obtained

    public functions:
        terminate(),        # send a terminate signal to the engine; it terminates completely after the current iteration
        terminate_epoch(),  # send a terminate signal to the engine; it terminates the current epoch after the current iteration

Users can also add more data into engine.state at runtime.
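For example (again a plain Ignite sketch with a dummy step, not Clara-specific code), a handler can initialize its own field on engine.state and update it from later events:

    from ignite.engine import Engine, Events

    engine = Engine(lambda e, batch: {"loss": 0.0})  # dummy iteration step

    @engine.on(Events.STARTED)
    def init_custom_state(engine):
        # add custom data to engine.state; later handlers can read it back
        engine.state.my_running_total = 0.0

    @engine.on(Events.ITERATION_COMPLETED)
    def accumulate(engine):
        engine.state.my_running_total += engine.state.output["loss"]

    engine.run(range(4), max_epochs=1)
    print(engine.state.my_running_total)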

Any of the metrics already implemented in PyTorch Ignite can be used, as well as custom metrics implemented in the same way.
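As a hedged sketch, assuming the evaluator's iteration output is the dict described above with "pred" and "label" entries, an Ignite metric only needs an output_transform that maps that dict to the (y_pred, y) pair the metric expects:

    import torch
    from ignite.engine import Engine
    from ignite.metrics import Accuracy

    # dummy evaluation step producing the kind of output dict described above
    def eval_step(engine, batch):
        pred = torch.tensor([[0.1, 0.9], [0.8, 0.2]])  # per-class scores
        label = torch.tensor([1, 0])                   # ground-truth classes
        return {"pred": pred, "label": label}

    evaluator = Engine(eval_step)

    # map the output dict to the (y_pred, y) pair that Ignite metrics expect
    Accuracy(output_transform=lambda out: (out["pred"], out["label"])).attach(evaluator, "acc")

    evaluator.run([0], max_epochs=1)
    print(evaluator.state.metrics["acc"])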
