Essential concepts
This section describes essential concepts necessary for understanding Clara Train. Clara Train v4.0 and later uses MONAI’s training workflows, which are based on PyTorch Ignite’s engine. Check out an example notebook to get started.
Clara Train is built on a component-based architecture; for v4.0 and later, the components come from MONAI:
Training Data Pipeline
Validation Data Pipeline
Applications <https://docs.monai.io/en/latest/apps.html>
Transforms <https://docs.monai.io/en/latest/transforms.html>
Inference methods <https://docs.monai.io/en/latest/inferers.html>
Event handlers <https://docs.monai.io/en/latest/handlers.html>
Network architectures <https://docs.monai.io/en/latest/networks.html>
Loss functions <https://docs.monai.io/en/latest/losses.html>
Optimizers <https://docs.monai.io/en/latest/optimizers.html>
Visualizations <https://docs.monai.io/en/latest/visualize.html>
Utilities <https://docs.monai.io/en/latest/utils.html>
As examples showing how to leverage components in specific ways are added, more detailed documentation on how the components affect the training workflow will be provided here.
A design principle of MONAI was to split the PyTorch and Ignite dependencies into different layers, so most of MONAI’s components simply follow regular PyTorch APIs. For networks, loss functions, metrics, optimizers, and transforms, users can therefore easily bring MONAI components into their regular PyTorch programs or bring their own PyTorch components into a MONAI program.
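To make this interoperability concrete, here is a minimal sketch of using a MONAI network and loss inside an ordinary PyTorch training step. The network configuration, tensor shapes, and learning rate are illustrative only, and argument names such as `spatial_dims` may differ slightly between MONAI versions.

```python
# Minimal sketch: a MONAI network and loss used in a plain PyTorch training step.
# Shapes, channels, and the learning rate are illustrative.
import torch
from monai.networks.nets import UNet
from monai.losses import DiceLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MONAI network, used exactly like any torch.nn.Module
net = UNet(
    spatial_dims=2,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64),
    strides=(2, 2),
).to(device)

loss_function = DiceLoss(to_onehot_y=True, softmax=True)  # MONAI loss
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)   # regular PyTorch optimizer

# one ordinary PyTorch training step on a fake batch
image = torch.rand(4, 1, 96, 96, device=device)
label = torch.randint(0, 2, (4, 1, 96, 96), device=device)
optimizer.zero_grad()
pred = net(image)
loss = loss_function(pred, label)
loss.backward()
optimizer.step()
```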
From Clara Train v4.0 onward, the training and validation workflows are based on PyTorch and PyTorch Ignite.
A standard workflow makes it much easier to develop Clara products. PyTorch Ignite is lightweight and flexible with a suitable feature set, so the MONAI workflow was built on top of it.
Three concepts are at the core of Ignite: engines, events, and handlers.
An engine is a trainer, validator, or evaluator. Once initialized with PyTorch components and registered predefined events, an engine can execute a training loop. During the loop, it triggers different events at different times, for example when training starts, when an iteration completes, or when an exception is raised. The handlers attached to those specific events are then called.
For example, the log print logic defined in the StatsHandler will be called when an iteration is completed and when an epoch is completed.
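The following sketch shows the engine/event/handler pattern using PyTorch Ignite directly; the training step body is a placeholder rather than a real training iteration.

```python
# Minimal sketch of Ignite's engine / event / handler pattern.
from ignite.engine import Engine, Events

def train_step(engine, batch):
    # a real step would run forward/backward here; the return value becomes engine.state.output
    return {"loss": 0.0}

trainer = Engine(train_step)

@trainer.on(Events.ITERATION_COMPLETED)
def log_iteration(engine):
    print(f"iteration {engine.state.iteration}, loss {engine.state.output['loss']}")

@trainer.on(Events.EPOCH_COMPLETED)
def log_epoch(engine):
    print(f"epoch {engine.state.epoch} completed")

# run for 2 epochs over a toy "dataset" of 4 items
trainer.run(data=range(4), max_epochs=2)
```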
Data pipelines are responsible for producing batched data items during training. Typically, two data pipelines are used: one for producing training data and another for producing validation data.
A data pipeline contains a chain of transforms that are applied to the input image and label data to produce the data in the format required by the model.
See MONAI Datasets for more information.
Data pipelines contain chains of transformations.
For a list of available transforms, see the MONAI transforms section.
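As an illustration, here is a minimal training data pipeline sketch: a chain of dictionary-based MONAI transforms applied by a Dataset and batched by a DataLoader. The file paths are placeholders, and the particular transform names assume a recent MONAI release.

```python
# Sketch of a training data pipeline: a chain of dictionary-based transforms
# applied by a Dataset and batched by a DataLoader. File paths are placeholders.
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd, ScaleIntensityd, EnsureTyped
from monai.data import Dataset, DataLoader

train_files = [
    {"image": "img_0.nii.gz", "label": "seg_0.nii.gz"},  # hypothetical files
    {"image": "img_1.nii.gz", "label": "seg_1.nii.gz"},
]

train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),           # read images from disk
    EnsureChannelFirstd(keys=["image", "label"]),  # move/add the channel dimension
    ScaleIntensityd(keys=["image"]),               # normalize image intensities
    EnsureTyped(keys=["image", "label"]),          # convert to tensors
])

train_ds = Dataset(data=train_files, transform=train_transforms)
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```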
Here is a list of the events in MONAI:
| Event name | Description |
| --- | --- |
| STARTED | triggered when the engine’s run is started |
| EPOCH_STARTED | triggered when the epoch is started |
| GET_BATCH_STARTED | triggered before the next batch is fetched |
| GET_BATCH_COMPLETED | triggered after the batch is fetched |
| ITERATION_STARTED | triggered when an iteration is started |
| FORWARD_COMPLETED | triggered when network(image, label) is completed (MONAI) |
| LOSS_COMPLETED | triggered when loss(pred, label) is completed (MONAI) |
| BACKWARD_COMPLETED | triggered when loss.backward() is completed (MONAI) |
| ITERATION_COMPLETED | triggered when the iteration is ended |
| DATALOADER_STOP_ITERATION | triggered when the dataloader has no more data to provide |
| EXCEPTION_RAISED | triggered when an exception is encountered |
| TERMINATE_SINGLE_EPOCH | triggered when the run is about to end the current epoch |
| TERMINATE | triggered when the run is about to end completely |
| EPOCH_COMPLETED | triggered when the epoch is ended |
| COMPLETED | triggered when the engine’s run is completed |
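As a sketch of how these events are consumed, handlers can also be attached with `add_event_handler` (equivalent to the `@engine.on` decorator shown earlier), optionally with an event filter; the toy step function below is only a placeholder.

```python
# Sketch: attaching handlers to specific events from the table above.
from ignite.engine import Engine, Events

def step(engine, batch):
    return batch

engine = Engine(step)

def on_get_batch_completed(engine):
    print("fetched a batch")

def on_exception(engine, exception):
    # EXCEPTION_RAISED handlers also receive the exception object
    print("caught:", exception)

engine.add_event_handler(Events.GET_BATCH_COMPLETED, on_get_batch_completed)
engine.add_event_handler(Events.EXCEPTION_RAISED, on_exception)

# handlers can also fire on a filtered schedule, e.g. every 2 iterations
engine.add_event_handler(Events.ITERATION_COMPLETED(every=2), lambda e: print("2 more iterations"))

engine.run(data=range(5), max_epochs=1)
```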
MONAI workflows use `engine` as context data to communicate with handlers; handlers accept `engine` as an argument. Here is all the available data in `engine`:
properties:
network, # the model in use
optimizer, # (only in trainer) the optimizer for training progress
loss_function, # (only in trainer) loss function for training progress
amp, # flag indicating whether AMP mode is enabled
should_terminate, # flag indicating whether the run should terminate
should_terminate_single_epoch, # flag indicating whether the current epoch should terminate
state:
rank, # rank index of current process
iteration, # current iteration index, count from the first epoch
epoch, # current epoch index
max_epochs, # target epochs to complete current round
epoch_length, # count of iterations in every epoch
output, # output dict of current iteration, for trainer: ({"image": x, "label": x, "pred": x, "loss": x})
# for evaluator: ({"image": x, "label": x, "pred": x})
batch, # input dict for current iteration
metrics, # dict of metrics values, if we run metrics on training data to check overfitting,
# trainer will also have metrics results
metric_details, # dict of the temp data of metrics if set `save_details=True`,
# for example, MeanDice metric can save the mean dice of every channel in every image
dataloader, # DataLoader object to provide data
device, # device to run the program, cuda or cpu, etc.
key_metric_name, # name of the key metric
best_metric, # best metric value of the key metric
best_metric_epoch, # epoch index at which the best metric was achieved
public functions:
terminate(), # send terminate signal to the engine, will terminate completely after current iteration
terminate_epoch(), # send terminate signal to the engine, will terminate current epoch after current iteration
Users can also add more data into `engine.state` at runtime.
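For example, here is a sketch of a handler that reads `engine.state`, adds a custom field to it at runtime, and calls `terminate()` for early stopping. The toy step function and loss threshold are illustrative; with a MONAI trainer the loss would instead come from the output dict shown above.

```python
# Sketch: extending engine.state at runtime and terminating early.
from ignite.engine import Engine, Events

def step(engine, batch):
    return {"loss": 1.0 / engine.state.iteration}   # fake, decreasing loss

trainer = Engine(step)

@trainer.on(Events.ITERATION_COMPLETED)
def track_best_loss(engine):
    loss = engine.state.output["loss"]
    best = getattr(engine.state, "best_loss", float("inf"))
    engine.state.best_loss = min(best, loss)        # custom field added to engine.state
    if engine.state.best_loss < 0.05:
        engine.terminate()                          # stop completely after this iteration

trainer.run(data=range(100), max_epochs=1)
print(trainer.state.best_loss, trainer.state.iteration)
```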
Any of the metrics already implemented in PyTorch Ignite can be used, as can custom metrics implemented in the same way.
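For instance, an Ignite metric such as Accuracy can be attached to an evaluator engine with an `output_transform` that maps `engine.state.output` to the `(y_pred, y)` pair the metric expects; MONAI metric handlers (for example MeanDice) are attached in the same way. The toy evaluation step below is a placeholder.

```python
# Sketch: attaching an Ignite metric to an evaluator engine.
import torch
from ignite.engine import Engine
from ignite.metrics import Accuracy

def eval_step(engine, batch):
    # a real evaluator would run the network here; this returns fixed fake output
    return {"pred": torch.tensor([[0.1, 0.9]]), "label": torch.tensor([1])}

evaluator = Engine(eval_step)

# map the output dict to the (y_pred, y) pair expected by the metric
Accuracy(output_transform=lambda out: (out["pred"], out["label"])).attach(evaluator, "accuracy")

evaluator.run(data=range(4), max_epochs=1)
print(evaluator.state.metrics["accuracy"])
```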