NVIDIA Clara Train 4.1

Medical Model Archive (MMAR)

The MMAR (Medical Model ARchive) defines a standard structure for organizing all artifacts produced during the model development life cycle.

NVIDIA has example MMARs for spleen, brain tumor, and others that can be found at NGC.

You can experiment with different configurations for the same MMAR. Create a new MMAR by cloning from an existing MMAR by using the cp -r OS command.

MMAR defines the standard structure for storing artifacts (files) needed and produced by the model development workflow (training, validation, inference, etc.). The MMAR includes all the information about the model including configurations and scripts to provide a work space to perform all model development tasks:

    ROOT
        config
            config_train.json
            config_finetune.json
            config_inference.json
            config_validation.json
            config_validation_ckpt.json
            environment.json
        commands
            set_env.sh
            train.sh                    train with single GPU
            train_multi_gpu.sh          train with 2 GPUs
            finetune.sh                 transfer learning with CKPT
            infer.sh                    inference with TS model
            validate.sh                 validate with TS model
            validate_ckpt.sh            validate with CKPT
            validate_multi_gpu.sh       validate with TS model on 2 GPUs
            validate_multi_gpu_ckpt.sh  validate with CKPT on 2 GPUs
            export.sh                   export CKPT to TS model
            prepare_dataset.sh          if want to use AIAA's continuous learning feature
        resources
            log.config
            ...
        docs
            license.txt
            Readme.md
            ...
        models
            model.pt
            model.ts
            final_model.pt
        eval
            all evaluation outputs: segmentation / classification results metrics reports, etc.

Note

“CKPT” means regular PyTorch “.pt” weights; “TS” means a PyTorch TorchScript model.
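As a minimal sketch of the layout above, the following Python snippet creates an empty MMAR skeleton and reports which of the standard top-level folders are missing. The helper names (create_mmar_skeleton, missing_mmar_folders) are hypothetical and not part of Clara Train; only the folder names come from the structure shown above.

```python
import tempfile
from pathlib import Path

# Top-level folders taken from the MMAR structure described above.
MMAR_FOLDERS = ("config", "commands", "resources", "docs", "models", "eval")


def create_mmar_skeleton(root: Path) -> Path:
    """Create the standard top-level folders under root (hypothetical helper)."""
    for name in MMAR_FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def missing_mmar_folders(root: Path) -> list:
    """Return the standard folders that do not exist under root."""
    return [name for name in MMAR_FOLDERS if not (root / name).is_dir()]


with tempfile.TemporaryDirectory() as tmp:
    mmar_root = create_mmar_skeleton(Path(tmp) / "my_mmar")
    complete = missing_mmar_folders(mmar_root)    # nothing missing yet
    (mmar_root / "eval").rmdir()                  # simulate an incomplete MMAR
    incomplete = missing_mmar_folders(mmar_root)

print(complete, incomplete)
```

A check like this can be useful before running the commands, since they expect the config and models folders to be present.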

Clara workflows are made of different types of components. For each type, there are usually multiple choices. To put together a workflow, you specify and configure the components to be used.

Clara offers training and validation workflows. Workflow configurations are defined in JSON files: for example config_train.json for training workflows and config_validation.json for validation workflows.

Note

For federated learning, MMARs can be extended with additional configurations to work with NVIDIA FLARE through config_fed_server.json and config_fed_client.json. Additionally, if cross site model evaluation is configured to run, config_cross_validaiton.json is needed.

Training configuration

The training config file config_train.json defines the configuration of the training workflow. The config contains three sections: global variables, train, and validate.

You can define global variables in the configuration JSON file. These variables can be overridden by environment.json, and then by the shell command or command line, which has the highest priority.
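The override order can be pictured as a layered merge in which later sources win. The snippet below is only an illustration of that precedence; the function name resolve_variables and the sample values are assumptions, and it does not show how Clara Train implements the merge internally.

```python
def resolve_variables(config_vars, environment_vars, command_line_vars):
    """Merge variables so that later sources win:
    config file < environment.json < command line."""
    resolved = dict(config_vars)
    resolved.update(environment_vars)
    resolved.update(command_line_vars)
    return resolved


# Illustrative values only.
config_vars = {"epochs": 100, "multi_gpu": False, "DATASET_JSON": "dataset_a.json"}
environment_vars = {"DATASET_JSON": "dataset_b.json"}
command_line_vars = {"epochs": 1260}

resolved = resolve_variables(config_vars, environment_vars, command_line_vars)
print(resolved)
```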

For an example to reference, see the config_train.json of the clara_pt_spleen_ct_segmentation MMAR.

The “train” section defines the components for the training process, including “loss”, “optimizer”, “lr_scheduler”, “model”, “pre_transforms”, “dataset”, “dataloader”, “inferer”, “handlers”, “post_transforms”, “metrics” and “trainer”. Each component is constructed by providing the component’s class “name” and the init arguments “args”.

Similarly, the “validate” section defines the components for validation process, including “pre_transforms”, “dataset”, “dataloader”, “inferer”, “handlers”, “post_transforms”, “metrics”, and “evaluator”. Each component is constructed the same way by providing the component class “name” and the corresponding init arguments “args”.
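To make the “name” plus “args” convention concrete, here is a small self-contained sketch of how a component specification could be turned into an object. ToyLoss, REGISTRY, and build_component are hypothetical stand-ins used for illustration; real configurations reference MONAI classes, and the actual Clara Train builder is not shown in this document.

```python
class ToyLoss:
    """Stand-in for a real component class such as a MONAI loss."""

    def __init__(self, to_onehot_y=False, softmax=False):
        self.to_onehot_y = to_onehot_y
        self.softmax = softmax


# Hypothetical lookup table from class name to class.
REGISTRY = {"ToyLoss": ToyLoss}


def build_component(spec, registry):
    """Instantiate the class named by spec["name"] with spec["args"] as keyword arguments."""
    cls = registry[spec["name"]]
    return cls(**spec.get("args", {}))


# Shape of one entry in the "train" section, written as a Python dict.
train_section = {
    "loss": {"name": "ToyLoss", "args": {"to_onehot_y": True, "softmax": True}}
}

loss = build_component(train_section["loss"], REGISTRY)
print(type(loss).__name__, loss.to_onehot_y, loss.softmax)
```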

Starting with Clara v4.0, and unlike v3.X, almost all components come from the open source project MONAI. The MONAI API docstrings are available at: https://docs.monai.io/en/latest/.

The training and validation workflows are based on MONAI workflows, which are built on PyTorch Ignite.

The provided commands perform model development work based on the configurations in the config folder. The only command you may need to change is set_env.sh, where you can set the PYTHONPATH to the proper value.

You don’t need to change any other commands for default behavior, but you can and should study them to understand how they are defined.

Note

Please see the commands in the example MMARs for the most up to date features and settings that can be used.

train.sh

This command is used to do basic single-gpu training from scratch. When finished, you should see the following files in the “models” folder:

  • model.pt - the best model obtained

  • final_model.pt - the final model when the training is done. It is usually NOT the best model.

  • Event files - these are TensorBoard events that you can view with TensorBoard.

Example: train.sh

    1   #!/usr/bin/env bash
    2   my_dir="$(dirname "$0")"
    3   . $my_dir/set_env.sh
    4   echo "MMAR_ROOT set to $MMAR_ROOT"
    5
    6   CONFIG_FILE=config/config_train.json
    7   ENVIRONMENT_FILE=config/environment.json
    8   python3 -u -m medl.apps.train \
    9       -m $MMAR_ROOT \
    10      -c $CONFIG_FILE \
    11      -e $ENVIRONMENT_FILE \
    12      --set \
    13      DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
    14      epochs=1260 \
    15      multi_gpu=false \
    16      ${additional_options}

Note

Line numbers are not part of the command.


Explanation: train.sh

    Line 1: this is a bash script
    Lines 2 to 3: resolve and set the absolute directory path for MMAR_ROOT
    Line 6: set the training config file
    Line 7: set the environment file that defines commonly used variables such as DATA_ROOT
    Lines 8 to 16: invoke the training program
    Lines 9 to 11: set the arguments required by the training program
    Line 12: the --set directive allows certain training parameters to be overwritten
    Line 13: set the DATASET_JSON to use the dataset_0.json in the “config” folder of the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.
    Lines 14 to 16: overwrite the training variables as defined in config_train.json
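The key=value pairs that follow --set can be thought of as a small dictionary of overrides. The sketch below parses such pairs in Python. The function parse_set_overrides and its value-typing rules (booleans and numbers are converted, everything else stays a string) are assumptions made for illustration; the parsing done by the training program itself is not documented here.

```python
import json


def parse_set_overrides(tokens):
    """Turn ["epochs=1260", "multi_gpu=false"] into {"epochs": 1260, "multi_gpu": False}."""
    overrides = {}
    for token in tokens:
        key, sep, raw = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        # Accept True/False in any letter case, as both spellings appear in the scripts.
        candidate = raw.lower() if raw.lower() in ("true", "false") else raw
        try:
            value = json.loads(candidate)   # numbers and booleans
        except json.JSONDecodeError:
            value = raw                     # anything else stays a plain string
        overrides[key] = value
    return overrides


overrides = parse_set_overrides(
    ["DATASET_JSON=/mmar/config/dataset_0.json", "epochs=1260", "multi_gpu=false"]
)
print(overrides)
```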

train_multi_gpu.sh

Example: train_multi_gpu.sh

    #!/usr/bin/env bash

    my_dir="$(dirname "$0")"
    . $my_dir/set_env.sh

    echo "MMAR_ROOT set to $MMAR_ROOT"
    additional_options="$*"

    CONFIG_FILE=config/config_train.json
    ENVIRONMENT_FILE=config/environment.json

    python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 \
        --master_addr="localhost" --master_port=1234 \
        -m medl.apps.train \
        -m $MMAR_ROOT \
        -c $CONFIG_FILE \
        -e $ENVIRONMENT_FILE \
        --write_train_stats \
        --set \
        print_conf=True \
        epochs=1260 \
        multi_gpu=True \
        ${additional_options}


Explanation: train_multi_gpu.sh

This command uses torch.distributed to launch multi-GPU training.
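With --nproc_per_node=2, --nnodes=1, and --node_rank=0, the launcher starts two worker processes on a single machine. The following sketch shows the rank arithmetic behind those options. The helper functions are illustrative names, not torch APIs, and the snippet does not start any processes.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of worker processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of the process with the given local rank on the given node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Values from the script above: one node, two processes per node, node rank 0.
size = world_size(nnodes=1, nproc_per_node=2)
ranks = [global_rank(node_rank=0, nproc_per_node=2, local_rank=r) for r in range(2)]
print(size, ranks)
```

Each process typically drives one GPU, so training on more GPUs on the same machine means raising --nproc_per_node accordingly.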

finetune.sh

This command is used to continue training from a previous model. Before running this command, you must have a previous model in the models folder. The output of this command is the same as train.sh.

Example: finetune.sh

    #!/usr/bin/env bash
    my_dir="$(dirname "$0")"
    . $my_dir/set_env.sh
    echo "MMAR_ROOT set to $MMAR_ROOT"
    additional_options="$*"

    CONFIG_FILE=config/config_finetune.json
    ENVIRONMENT_FILE=config/environment.json

    python3 -u -m medl.apps.train \
        -m $MMAR_ROOT \
        -c $CONFIG_FILE \
        -e $ENVIRONMENT_FILE \
        --write_train_stats \
        --set \
        print_conf=True \
        epochs=1260 \
        multi_gpu=False \
        ${additional_options}

export.sh

Export a trained model checkpoint from .pt to .ts so it can be used for validation.

Note

Training must have been done before you run this command.

Example: export.sh

    1   #!/usr/bin/env bash
    2   my_dir="$(dirname "$(readlink -f "$0")")"
    3   . $my_dir/set_env.sh
    4
    5   CONFIG_FILE=config/config_train.json
    6   ENVIRONMENT_FILE=config/environment.json
    7   export CKPT_DIR=$MMAR_ROOT/models
    8
    9   INPUT_CKPT="${MMAR_ROOT}/models/model.pt"
    10  OUTPUT_CKPT="${MMAR_ROOT}/models/model.ts"
    11  python3 -u -m medl.apps.export \
    12      -m $MMAR_ROOT \
    13      -c $CONFIG_FILE \
    14      -e $ENVIRONMENT_FILE \
    15      --model_path "${INPUT_CKPT}" \
    16      --output_path "${OUTPUT_CKPT}" \
    17      --input_shape "[1, 1, 160, 160, 160]"


Explanation: export.sh

    Lines 9 to 10: set the input checkpoint (model.pt) and the output TorchScript model (model.ts)
    Lines 11 to 17: invoke the export program

infer.sh

Perform inference against the model, based on the configuration of config_validation.json in the config folder. Inference output is saved in the eval folder.

Example: infer.sh

    1   #!/usr/bin/env bash
    2   my_dir="$(dirname "$0")"
    3   . $my_dir/set_env.sh
    4   echo "MMAR_ROOT set to $MMAR_ROOT"
    5
    6   CONFIG_FILE=config/config_validation.json
    7   ENVIRONMENT_FILE=config/environment.json
    8   python3 -u -m medl.apps.evaluate \
    9       -m $MMAR_ROOT \
    10      -c $CONFIG_FILE \
    11      -e $ENVIRONMENT_FILE \
    12      --set \
    13      DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
    14      output_infer_result=true \
    15      do_validation=false


Explanation: infer.sh

    Line 1: this is a bash script
    Lines 2 to 3: resolve and set the absolute directory path for MMAR_ROOT
    Line 6: set the validation config file
    Line 7: set the environment file that defines commonly used variables such as DATA_ROOT
    Lines 8 to 15: invoke the evaluate program
    Lines 9 to 11: set the arguments required by the program
    Line 12: the --set directive allows certain parameters to be overwritten
    Line 13: set the DATASET_JSON to use the dataset_0.json in the “config” folder of the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.

validate.sh

Perform validation against the model, based on the configuration of config_validation.json in the config folder. Validation output is saved in the eval folder.

Example: validate.sh

    #!/usr/bin/env bash
    my_dir="$(dirname "$0")"
    . $my_dir/set_env.sh
    echo "MMAR_ROOT set to $MMAR_ROOT"

    CONFIG_FILE=config/config_validation.json
    ENVIRONMENT_FILE=config/environment.json
    python3 -u -m medl.apps.evaluate \
        -m $MMAR_ROOT \
        -c $CONFIG_FILE \
        -e $ENVIRONMENT_FILE


Explanation: validate.sh

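This command is very similar to infer.sh. The only difference is that validate.sh has no --set directive: it does not override DATASET_JSON, output_infer_result, or do_validation, so the values from the configuration files are used.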

Note

infer.sh and validate.sh use the same evaluate program but different configurations.

The JSON files in the config folder define configurations of workflow tasks (training, inference, and validation).

config_train.json

This file defines components that make up the training workflow. It is used by all four training commands (single-gpu training and finetuning, multi-gpu training and finetuning). See Training configuration for details.

config_validation.json

This file defines the configuration used by both validate.sh and infer.sh. The only differences between the two commands are the do_validation and output_infer_result options. See Validation configuration for details on the configuration file for validation.

environment.json

This file defines the common parameters for all model work. The most important are DATA_ROOT and DATASET_JSON.

  • DATA_ROOT specifies the directory that contains the training data.

  • DATASET_JSON specifies the config file that contains the default training data split (usually dataset_0.json).

Note

Since an MMAR does not contain training data, you must ensure that these two parameters are set to the right values. Do not change any other parameters.

Example: environment.json

    {
        "DATA_ROOT": "/workspace/data/Task09_Spleen_nii",
        "DATASET_JSON": "/workspace/data/Task09_Spleen_nii/dataset_0.json",
        "PROCESSING_TASK": "segmentation",
        "MMAR_EVAL_OUTPUT_PATH": "eval",
        "MMAR_CKPT_DIR": "models",
        "MMAR_CKPT": "models/model.pt",
        "MMAR_TORCHSCRIPT": "models/model.ts"
    }

    Variable               Description
    DATA_ROOT              The location of training data
    DATASET_JSON           The data split config file
    PROCESSING_TASK        The task type of the training: segmentation or classification
    MMAR_EVAL_OUTPUT_PATH  Directory for saving evaluation (validate or infer) results. Always the “eval” folder in the MMAR
    MMAR_CKPT_DIR          Directory for saving training results. Always the “models” folder in the MMAR
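Because DATA_ROOT and DATASET_JSON must point at real locations, a quick sanity check before training can save time. The sketch below is a hypothetical helper (check_environment is not a Clara Train tool) that loads an environment.json and reports whether the two paths exist; it is exercised here against a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def check_environment(env_file: Path) -> list:
    """Return a list of problems found with DATA_ROOT and DATASET_JSON."""
    env = json.loads(env_file.read_text())
    problems = []
    data_root = env.get("DATA_ROOT")
    dataset_json = env.get("DATASET_JSON")
    if not data_root or not Path(data_root).is_dir():
        problems.append("DATA_ROOT is not an existing directory")
    if not dataset_json or not Path(dataset_json).is_file():
        problems.append("DATASET_JSON is not an existing file")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp) / "data"
    data_root.mkdir()
    env_file = Path(tmp) / "environment.json"
    env_file.write_text(json.dumps({
        "DATA_ROOT": str(data_root),
        "DATASET_JSON": str(data_root / "dataset_0.json"),
    }))
    problems_before = check_environment(env_file)   # data split file not created yet
    (data_root / "dataset_0.json").write_text("{}")
    problems_after = check_environment(env_file)

print(problems_before, problems_after)
```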

Data split config file

This is a JSON file that defines the data split used for training and validation. For classification models, this file is usually named “plco.json”; for other models, it is usually named “dataset_0.json”.

The following is dataset_0.json of the model segmentation_ct_spleen:

    {
        "description": "Spleen Segmentation",
        "labels": {
            "0": "background",
            "1": "spleen"
        },
        "licence": "CC-BY-SA 4.0",
        "modality": {
            "0": "CT"
        },
        "name": "Spleen",
        "numTest": 20,
        "numTraining": 41,
        "reference": "Memorial Sloan Kettering Cancer Center",
        "release": "1.0 06/08/2018",
        "tensorImageSize": "3D",
        "training": [
            {
                "image": "imagesTr/spleen_29.nii.gz",
                "label": "labelsTr/spleen_29.nii.gz"
            },
            … <<more data here>>…..
            {
                "image": "imagesTr/spleen_49.nii.gz",
                "label": "labelsTr/spleen_49.nii.gz"
            }
        ],
        "validation": [
            {
                "image": "imagesTr/spleen_19.nii.gz",
                "label": "labelsTr/spleen_19.nii.gz"
            },
            … <<more data here>>…..
            {
                "image": "imagesTr/spleen_9.nii.gz",
                "label": "labelsTr/spleen_9.nii.gz"
            }
        ]
    }

There is a lot of information in this file, but the only sections needed by the training and validation programs are the “training” and “validation” sections, which define sample/label pairs of data items for training and validation respectively.
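As a rough illustration of how the two sections can be consumed, the snippet below loads a shortened data split and pairs each image with its label. It assumes that the relative paths in the file are resolved against DATA_ROOT, which matches the example environment.json but is an assumption here; resolve_pairs is a hypothetical helper, not a Clara Train function.

```python
import json
from pathlib import Path

# Shortened data split with the same shape as dataset_0.json.
DATALIST_TEXT = """
{
    "training": [
        {"image": "imagesTr/spleen_29.nii.gz", "label": "labelsTr/spleen_29.nii.gz"},
        {"image": "imagesTr/spleen_49.nii.gz", "label": "labelsTr/spleen_49.nii.gz"}
    ],
    "validation": [
        {"image": "imagesTr/spleen_19.nii.gz", "label": "labelsTr/spleen_19.nii.gz"}
    ]
}
"""


def resolve_pairs(datalist, section, data_root):
    """Return (image_path, label_path) tuples for one section of the data split."""
    root = Path(data_root)
    return [(root / item["image"], root / item["label"]) for item in datalist[section]]


datalist = json.loads(DATALIST_TEXT)
train_pairs = resolve_pairs(datalist, "training", "/workspace/data/Task09_Spleen_nii")
val_pairs = resolve_pairs(datalist, "validation", "/workspace/data/Task09_Spleen_nii")
print(len(train_pairs), len(val_pairs))
```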

Checking the dataset file

The container has a tool to check the dataset file to help catch potential problems.

Run “check-dataset -h” inside the Docker container to see the usage.

Some examples of using this tool follow (change the paths to match your configuration and data locations):

  1. Decathlon dataset:

    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_spleen_ct_annotation/config/dataset_0.json --folder /workspace/data/Task09_Spleen_nii --is_decathlon_dataset
    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_spleen_ct_segmentation/config/dataset_0.json --folder /workspace/data/Task09_Spleen_nii --is_decathlon_dataset
    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_prostate_mri_segmentation/config/dataset_0.json --folder /workspace/Downloads/MSDdata/Task05_Prostate/ --is_decathlon_dataset

  2. Data without test section:

    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_deepgrow_2d_annotation/config/dataset_0.json --folder /workspace/data/52432/2D/ --sections training,validation
    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_chest_xray_classification/config/plco.json --folder /workspace/data/CXR/ --keys image --sections training,validation

  3. Test section does not have “label”:

    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_brain_mri_segmentation/config/seg_brats18_datalist_0.json --folder /workspace/data/brats2018challenge/ --sections training,validation --keys_to_check image,label
    check-dataset --dataset /workspace/clara-mmars/pt/clara_pt_brain_mri_segmentation/config/seg_brats18_datalist_0.json --folder /workspace/data/brats2018challenge/ --sections testing --keys_to_check image

Note that the command to check the spleen or prostate datasets from the Medical Decathlon Dataset needs to have the flag “--is_decathlon_dataset”.

If checking testing data, usually you do not have “label”, so you need to specify “--keys_to_check image”.

If checking for classification tasks, your label is not a file, so you need to specify “--keys_to_check image”.
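The reason sections and keys matter is easier to see in code. The sketch below is not the implementation of check-dataset; it is a simplified, hypothetical file-existence check (find_missing_files) showing that a testing section without labels only passes when “image” is the only key inspected.

```python
import tempfile
from pathlib import Path


def find_missing_files(datalist, folder, sections, keys):
    """List 'section:key:path' entries whose file does not exist under folder."""
    missing = []
    for section in sections:
        for item in datalist.get(section, []):
            for key in keys:
                if not (Path(folder) / item[key]).is_file():
                    missing.append(f"{section}:{key}:{item[key]}")
    return missing


# Tiny illustrative data split: the testing entry has no "label" key.
datalist = {
    "training": [{"image": "imagesTr/a.nii.gz", "label": "labelsTr/a.nii.gz"}],
    "testing": [{"image": "imagesTs/b.nii.gz"}],
}

with tempfile.TemporaryDirectory() as tmp:
    # Create the image files but deliberately leave out the training label.
    for rel in ("imagesTr/a.nii.gz", "imagesTs/b.nii.gz"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    missing_training = find_missing_files(datalist, tmp, ["training"], ["image", "label"])
    missing_testing = find_missing_files(datalist, tmp, ["testing"], ["image"])

print(missing_training, missing_testing)
```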

In the models folder, model.pt is the best model resulting from training. final_model.pt is created when training finishes normally; it is a snapshot of the model at the end of training and is usually not the best model that can be obtained. Both model.pt and final_model.pt can be used for further training or fine-tuning. Here are two typical use cases:

  • Continued training: Use final_model.pt as the starting point for fine-tuning if you think the model has not converged, for example because the number of epochs was set too low.

  • Transfer learning: Use the model.pt as the starting point for fine-tuning on a different dataset, which may be your own dataset, to obtain the model that is best for your data. This is also called adaptation.

An MMAR is a self-contained workspace for model development work. If you want to experiment with different configurations for the same MMAR, you should create a new MMAR by cloning from an existing MMAR.
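Cloning an MMAR is a plain recursive copy of its folder. For readers who prefer to script this in Python rather than use cp -r, a minimal sketch with the standard library follows; clone_mmar is a hypothetical helper name and the folder names are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def clone_mmar(source, destination):
    """Recursively copy an MMAR folder; the destination must not exist yet."""
    return Path(shutil.copytree(source, destination))


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in MMAR to copy.
    source = Path(tmp) / "mmar_a"
    (source / "config").mkdir(parents=True)
    (source / "config" / "config_train.json").write_text("{}")

    clone = clone_mmar(source, Path(tmp) / "mmar_b")
    clone_name = clone.name
    cloned_config_exists = (clone / "config" / "config_train.json").is_file()

print(clone_name, cloned_config_exists)
```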

The MMAR API is a new feature in Clara Train 4.1 that allows you to create an MMAR from Python code. For users who already know how to write regular training and validation pipelines, it can be a convenient way to create an MMAR: define all the components and supply the necessary information, such as the root path and the data list file name.

Normal version

The following code shows an example of how to define a single component, in this case the loss:

    loss = Component(
        name=DiceLoss,
        args={"to_onehot_y": True, "softmax": True}
    )

Note

The code above requires importing DiceLoss (from monai.losses import DiceLoss). Alternatively, you can pass the string “DiceLoss” as the name, in which case no import is needed.

If the component needs other objects as arguments, you can use the “@” notation:

    lr_scheduler = Component(
        name=StepLR,
        args={"optimizer": "@optimizer", "step_size": 5000, "gamma": 0.1}
    )
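One way to picture the “@” notation is as a lookup into the components that have already been built. The following toy resolver illustrates that idea only; resolve_references is a hypothetical function, the placeholder object stands in for a real optimizer, and the actual resolution logic inside Clara Train is not shown in this document.

```python
def resolve_references(args, built):
    """Replace string values of the form "@name" with the already-built object "name"."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("@"):
            resolved[key] = built[value[1:]]
        else:
            resolved[key] = value
    return resolved


optimizer = object()                  # placeholder for a real optimizer instance
built = {"optimizer": optimizer}

scheduler_args = resolve_references(
    {"optimizer": "@optimizer", "step_size": 5000, "gamma": 0.1}, built
)
print(scheduler_args["optimizer"] is optimizer)
```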

There are two kinds of variables: variables in config (for config_train.json) and environment variables (for environment.json).

To pass in config variables and environment variables as arguments in the component, use curly brackets to wrap the variable name:

    train_model = Component(
        name=UNet,
        args={
            "spatial_dims": 3,
            "in_channels": "{INPUT_CHANNELS}",
            "out_channels": "{OUTPUT_CHANNELS}",
            "channels": [16, 32, 64, 128, 256],
            "strides": [2, 2, 2, 2],
            "num_res_units": 2,
            "norm": "batch"
        }
    )
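The curly-bracket placeholders behave like named slots that are filled from the variable definitions. The sketch below shows one possible substitution scheme. It is an illustration under assumptions: substitute_variables is a hypothetical helper, only whole-string placeholders are handled, and the channel values are made-up examples.

```python
import re

# Matches a value that consists of exactly one placeholder, such as "{INPUT_CHANNELS}".
PLACEHOLDER = re.compile(r"^\{(\w+)\}$")


def substitute_variables(args, variables):
    """Replace "{NAME}" string values with variables["NAME"]; leave other values untouched."""
    substituted = {}
    for key, value in args.items():
        match = PLACEHOLDER.match(value) if isinstance(value, str) else None
        substituted[key] = variables[match.group(1)] if match else value
    return substituted


model_args = substitute_variables(
    {
        "spatial_dims": 3,
        "in_channels": "{INPUT_CHANNELS}",
        "out_channels": "{OUTPUT_CHANNELS}",
        "norm": "batch",
    },
    {"INPUT_CHANNELS": 1, "OUTPUT_CHANNELS": 2},   # illustrative values
)
print(model_args)
```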

Config related variables should be added into the created config instance (TrainConfig below), then all components can be added as well:

    # create train config based on above components
    train_conf = TrainConfig()
    # set variables in config
    train_conf.add_vars(
        epochs=1260,
        num_interval_per_valid=20,
        learning_rate=2e-4,
        multi_gpu=False,
        amp=True,
        cudnn_benchmark=False
    )
    train_conf.add_loss(loss)
    train_conf.add_optimizer(optimizer)
    train_conf.add_model(train_model)

When all components are added into the corresponding config instances, the MMAR can be defined:

    # create MMAR
    mmar = MMAR(root_path="spleen_segmentation", datalist_name="dataset_0.json")
    mmar.set_train_config(train_conf)
    mmar.set_validate_config(eval_conf)

Other parts of the MMAR such as resources, environment variables, data list, and commands need to be set as well. See the following examples for more details:

Easy version

The normal version of using the MMAR API is similar to directly writing the MMAR, but the easy version provides the following ways to simplify the MMAR creation process:

  1. For a single component that includes other objects, you can also input the object directly:

    lr_scheduler = Component(
        name=StepLR,
        args={"optimizer": optimizer, "step_size": 5000, "gamma": 0.1}
    )

2. For variables, MMAR API provides a class called ConfVars that you can use to create an environment variable instance and a configuration variable instance to contain all the necessary variables:

    # train configuration variables
    train_conf_vars = ConfVars(
        {
            "epochs": 2000,
            "learning_rate": 2e-4,
            "num_interval_per_valid": 10,
            "multi_gpu": False,
            "amp": True,
            "determinism": {"random_seed": 0},
            "cudnn_benchmark": False,
            "dont_load_ckpt_model": True,
        }
    )

Then, the variables can be used in other components by calling the get method:

    train_model = Component(
        name=UNet,
        args={
            "spatial_dims": 3,
            "in_channels": env_vars.get("INPUT_CHANNELS"),
            "out_channels": env_vars.get("OUTPUT_CHANNELS"),
            "channels": [16, 32, 64, 128, 256],
            "strides": [2, 2, 2, 2],
            "num_res_units": 2,
            "norm": "batch",
        },
    )

3. The following two special components defined in the API can be used for convenience:

    class OptimizerComponent(Component):
        def __init__(
            self,
            name: Optional[Union[str, object]] = None,
            vars: Optional[Dict[str, Any]] = None,
            args: Optional[Dict[str, Any]] = None,
            data: Optional[Any] = None,
        ):
            super().__init__(name, vars, args, data)
            if "args" not in self._config:
                self._config["args"] = {}
            if "params" not in self._config["args"]:
                self._config["args"]["params"] = "#@model.parameters()"


    class CheckpointSaverComponent(Component):
        def __init__(
            self,
            name: Optional[Union[str, object]] = None,
            vars: Optional[Dict[str, Any]] = None,
            args: Optional[Dict[str, Any]] = None,
            data: Optional[Any] = None,
        ):
            super().__init__(name, vars, args, data)
            if "args" not in self._config:
                self._config["args"] = {}
            if "save_dict" not in self._config["args"]:
                self._config["args"]["save_dict"] = {}
            if "train_conf" not in self._config["args"]["save_dict"]:
                self._config["args"]["save_dict"]["train_conf"] = "@conf"

4. After defining all necessary components, instead of calling different add methods, you can also define a dictionary that contains all the components and use it in the create_train_config and create_validate_config functions:

    val_section = {
        "pre_transforms": val_pre_transforms,
        "dataset": val_dataset,
        "dataloader": val_dataloader,
        "inferer": val_inferer,
        "iteration": val_iteration,
        "handlers": val_handlers,
        "post_transforms": post_transforms,
        "key_metric": val_mean_dice,
        "evaluator": train_validator,
    }
    train_conf = create_train_config(
        train_section=train_section,
        val_section=val_section,
        conf_vars=train_conf_vars
    )

5. For the commands and resources, if providing a template folder, the method mmar.set_commands_and_resources_from_template_dir can be used to set them directly:

    # create MMAR
    mmar = MMAR(root_path="spleen_annotation", datalist_name="dataset_0.json")
    mmar.set_train_config(train_conf)
    mmar.set_validate_config(eval_conf)
    mmar.set_commands_and_resources_from_template_dir(template_dir="./template")
    mmar.set_environment(env=env_vars.var_dict)
    mmar.set_datalist(datalist=DATALIST)

6. For custom code, if a user provides a directory to their custom files, they can use the method add_custom_files so all the files in the source directory will be copied over to the destination MMAR’s “custom” folder:

mmar.add_custom_files(source_dir="./custom")


© Copyright 2021, NVIDIA. Last updated on Feb 2, 2023.