Medical Model Archive

The MMAR (Medical Model ARchive) defines a standard structure for organizing all artifacts produced during the model development life cycle.

NVIDIA provides example MMARs for liver, spleen, and others on NGC.

You can experiment with different configurations for the same MMAR by cloning it: create a new MMAR from an existing one with the cp -r OS command.

MMAR defines the standard structure for storing artifacts (files) needed and produced by the model development workflow (training, validation, inference, etc.). The MMAR includes all the information about the model, and the work space to perform all model development tasks:

ROOT
    config
        config_train.json
        config_validation.json
        environment.json
    commands
        set_env.sh
        train.sh
        train_finetune.sh
        train_2gpu.sh
        train_2gpu_finetune.sh
        infer.sh
        validate.sh
        export.sh
    resources
        log.config
        ...
    docs
        license.txt
        Readme.md
        ...
    models (all forms of model: checkpoint, frozen graphs, saved model, TRTIS manifest)
        model.ckpt.meta, model.ckpt.index, model.ckpt.data
        tensorboard event files
        model.fzn.pb, model.trt.pb

Commands

The provided commands perform model development work based on the configurations in the config folder. The only command you may need to change is set_env.sh, where you can set the PYTHONPATH to the proper value.

You don’t need to change any other commands for default behavior, but you can and should study them to understand how they are defined.
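The contents of set_env.sh are not reproduced here. As a rough sketch of what such a script has to accomplish, assuming it sits in the commands folder of the MMAR, the function below derives the MMAR root from the location of the commands directory and optionally prepends a directory to PYTHONPATH. The name setup_mmar_env and its arguments are hypothetical and not part of the SDK.

```shell
# Hypothetical helper, not the shipped set_env.sh.
# $1: path to the MMAR "commands" directory
# $2: (optional) extra directory to put on PYTHONPATH, e.g. custom code
setup_mmar_env() {
    commands_dir="$1"
    extra_path="${2:-}"
    # The MMAR root is the parent of the commands directory.
    MMAR_ROOT="$(cd "$commands_dir/.." && pwd)" || return 1
    export MMAR_ROOT
    if [ -n "$extra_path" ]; then
        # Prepend so that custom modules are found first.
        PYTHONPATH="$extra_path${PYTHONPATH:+:$PYTHONPATH}"
        export PYTHONPATH
    fi
}
```

Resolving the root with cd and pwd yields an absolute path, so the commands behave the same regardless of the directory they are launched from.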

train.sh

This command performs basic single-GPU training from scratch. When it finishes, you should see the following files in the “models” folder:

  • model.ckpt - the best model obtained

  • model_final.ckpt - the final model when the training is done. It is usually NOT the best model.

  • Event files - these are tensorboard events that you can view with tensorboard.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m nvmidl.apps.train \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       epochs=1260 \
15       learning_rate=0.0001 \
16       num_training_epoch_per_valid=20 \
17       multi_gpu=false

Note

Line numbers are not part of the command.

Explanation

Line 1:     this is a bash script
Lines 2 to 3:     resolve and set the absolute directory path for MMAR_ROOT
Line 6:     set the training config file
Line 7:     set the environment file that defines commonly used variables
            such as DATA_ROOT.
Lines 8 to 17:     invokes the training program.
Lines 9 to 11:     set the arguments required by the training program
Line 12:     the --set directive allows certain training parameters to be overwritten
Line 13:     set the DATASET_JSON to use the dataset_0.json in the “config” folder of
             the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.
Lines 14 to 17: overwrite the training variables as defined in config_train.json
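To try a different setting without editing config_train.json, the same invocation can be reused with other key=value pairs after --set. The sketch below only assembles and prints the command line (a dry run), so the overrides can be inspected before anything is launched. The helper print_train_cmd and the path /workspace/mmar are hypothetical; the flags are the ones used in train.sh above.

```shell
# Hypothetical dry-run helper: prints the training command instead of running it.
# $1: MMAR root; remaining arguments: key=value overrides passed after --set
print_train_cmd() {
    mmar_root="$1"
    shift
    printf '%s' "python3 -u -m nvmidl.apps.train"
    printf ' %s' -m "$mmar_root" \
        -c config/config_train.json \
        -e config/environment.json \
        --set "$@"
    printf '\n'
}

# Example: a short trial run that validates every 2 epochs.
print_train_cmd /workspace/mmar epochs=10 num_training_epoch_per_valid=2
```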

train_finetune.sh

This command continues training from a previous checkpoint (model.ckpt). Before running it, you must have a previously generated checkpoint in the models folder. The output of this command is the same as train.sh.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m nvmidl.apps.train \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       PRETRAIN_WEIGHTS_FILE="" \
15       epochs=1000 \
16       learning_rate=0.0001 \
17       num_training_epoch_per_valid=20 \
18       MMAR_CKPT=$MMAR_ROOT/models/model.ckpt \
19       multi_gpu=false

Explanation

This command is very similar to train.sh. The only differences are:

Line 14: set PRETRAIN_WEIGHTS_FILE to an empty string. When fine-tuning a model, there is no need to download the pretrained weights from the web.
Line 18: set the location of the checkpoint to continue training from
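Because fine-tuning fails without an existing checkpoint, a small pre-flight check can be run first. The sketch below is a hypothetical helper, not part of the MMAR commands; it assumes the checkpoint files listed in the models folder layout (model.ckpt.index and model.ckpt.meta) are the ones to look for.

```shell
# Hypothetical pre-flight check, not part of the MMAR commands.
# $1: MMAR root. Succeeds if a model.ckpt checkpoint appears to be present.
has_checkpoint() {
    models_dir="$1/models"
    # The checkpoint is stored as several files sharing the model.ckpt
    # prefix; the .index and .meta files are checked here.
    [ -f "$models_dir/model.ckpt.index" ] && [ -f "$models_dir/model.ckpt.meta" ]
}

# Usage: has_checkpoint "$MMAR_ROOT" || { echo "run train.sh first" >&2; exit 1; }
```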

train_2gpu.sh

This command does Horovod-based training with 2 GPUs from scratch. The output of this command is the same as train.sh. You can use it as an example for multi-GPU training; see the Horovod documentation for tips. In general, the learning rate should be scaled up based on the number of GPUs.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
9       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 \
10       -mca btl ^openib --allow-run-as-root \
11       python3 -u  -m nvmidl.apps.train \
12       -m $MMAR_ROOT \
13       -c $CONFIG_FILE \
14       -e $ENVIRONMENT_FILE \
15       --set \
16       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
17       epochs=1250 \
18       learning_rate=0.0003 \
19       num_training_epoch_per_valid=10 \
20       multi_gpu=true

Explanation

This file is very similar to train.sh. The differences are:

Lines 8 to 20:  run 2-GPU training with mpirun, a third-party program that manages cross-process communication
Lines 8 to 10:  set the mpirun arguments for running 2 processes
Line 20:        multi_gpu must be set to true
Lines 11 to 20: the training program setup, same as in train.sh. Note that the learning rate is scaled up, as suggested by Horovod.
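As a worked example of the scaling rule, the sketch below multiplies a single-GPU learning rate by the number of GPUs (linear scaling, the rule of thumb commonly given for Horovod). It is only a starting point: the script above uses 0.0003 for 2 GPUs against 0.0001 in train.sh, which is more than a strict linear multiple, so treat the result as a value to tune. scale_lr is a hypothetical helper.

```shell
# Hypothetical helper: linear learning-rate scaling.
# $1: single-GPU learning rate, $2: number of GPUs
scale_lr() {
    awk -v lr="$1" -v n="$2" 'BEGIN { printf "%g\n", lr * n }'
}

scale_lr 0.0001 2   # prints 0.0002
```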

train_2gpu_finetune.sh

This command does Horovod-based training with 2 GPUs from a previous checkpoint. The output of this command is the same as train.sh.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT is set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_train.json
7    ENVIRONMENT_FILE=config/environment.json
8    mpirun -np 2 -H localhost:2 -bind-to none -map-by slot \
9       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 \
10       -mca btl ^openib --allow-run-as-root \
11       python3 -u  -m nvmidl.apps.train \
12       -m $MMAR_ROOT \
13       -c $CONFIG_FILE \
14       -e $ENVIRONMENT_FILE \
15       --set \
16       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
17       PRETRAIN_WEIGHTS_FILE="" \
18       MMAR_CKPT=$MMAR_ROOT/models/model.ckpt \
19       learning_rate=0.0003 \
20       num_training_epoch_per_valid=10 \
21       epochs=1000 \
22       multi_gpu=true

Explanation

This command is basically a combination of train_finetune.sh and train_2gpu.sh:

Lines 8 to 22: run the training program with mpirun
Lines 11 to 22: invoke the training program, combining the fine-tuning parameters (lines 17 and 18) with the 2-GPU parameters (lines 19 and 22)

export.sh

Export a trained model checkpoint to frozen graphs. Two frozen graphs will be generated into the models folder:

  • model.fzn.pb - the regular frozen graph

  • model.trt.pb - TRT-optimized frozen graph

Note

Training must have been done before you run this command.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    # Data list containing all data
5    export CKPT_DIR=$MMAR_ROOT/models
6    python3 -u -m nvmidl.apps.export \
7       --model_file_format CKPT \
8       --model_file_path $CKPT_DIR \
9       --model_name model \
10       --input_node_names "NV_MODEL_INPUT" \
11       --output_node_names NV_MODEL_OUTPUT \
12       --trt_min_seg_size 50

Explanation

Line 5:     set the location of the “models” directory. The checkpoint must already exist there.
Lines 6 to 12:  invoke the export program.
Line 7:     set the source model format: it is a checkpoint format (CKPT)
Line 8:     set the path to the checkpoint
Line 9:     set the base name of the model’s checkpoint file
Lines 10 and 11:  set the input and output node names
Line 12:    set the minimum segment size for TensorRT optimization

infer.sh

Perform inference against the model, based on the configuration of config_validation.json in the config folder. Inference output is saved in the eval folder.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_validation.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m nvmidl.apps.evaluate \
9       -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       output_infer_result=true \
15       do_validation=false

Explanation

Line 1:   this is a bash script
Lines 2 to 3:  resolve and set the absolute directory path for MMAR_ROOT
Line 6:   set the validation config file
Line 7:   set the environment file that defines commonly used variables such as DATA_ROOT.
Lines 8 to 15:  invokes the evaluate program.
Lines 9 to 11:  set the arguments required by the program
Line 12:  the --set directive allows certain parameters to be overwritten
Line 13:  set the DATASET_JSON to use the dataset_0.json in the “config” folder of the MMAR. This overwrites the DATASET_JSON defined in the environment.json file.
Lines 14 to 15:  overwrite default values of the evaluation variables
Line 14:  instructs the program to generate inference results
Line 15:  instructs the program not to do validation

validate.sh

Perform validation against the model, based on the configuration of config_validation.json in the config folder. Validation output is saved in the eval folder.

Example

1    #!/usr/bin/env bash
2    my_dir="$(dirname "$0")"
3    . $my_dir/set_env.sh
4    echo "MMAR_ROOT set to $MMAR_ROOT"
5    # Data list containing all data
6    CONFIG_FILE=config/config_validation.json
7    ENVIRONMENT_FILE=config/environment.json
8    python3 -u  -m nvmidl.apps.evaluate \
9        -m $MMAR_ROOT \
10       -c $CONFIG_FILE \
11       -e $ENVIRONMENT_FILE \
12       --set \
13       DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
14       do_validation=true \
15       output_infer_result=false

Explanation

This command is very similar to infer.sh. The only differences are:

Line 14: instructs the program to do validation
Line 15: instructs the program not to generate inference results

Note

infer.sh and validate.sh use the same evaluate program.
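Since the two commands differ only in two overrides, the relationship can be written down as a small lookup. The sketch below is a hypothetical helper that prints the --set overrides for each mode, using the values shown in infer.sh and validate.sh.

```shell
# Hypothetical helper: print the --set overrides for each evaluation mode.
# $1: "infer" or "validate"
evaluate_overrides() {
    case "$1" in
        infer)    echo "output_infer_result=true do_validation=false" ;;
        validate) echo "do_validation=true output_infer_result=false" ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```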

Configuration

The JSON files in the config folder define configurations of workflow tasks (training, inference, and validation).

config_train.json

This file defines the components that make up the training workflow. It is used by all four training commands (single-GPU training and fine-tuning, multi-GPU training and fine-tuning). See Training configuration for details.

config_validation.json

This file defines the configuration used by both validate.sh and infer.sh. The only differences between the two commands are the values of the do_validation and output_infer_result options. See Validation configuration for details on the configuration file for validation.

environment.json

This file defines the common parameters for all model work. The most important are DATA_ROOT and DATASET_JSON.

  • DATA_ROOT specifies the directory that contains the training data.

  • DATASET_JSON specifies the config file that contains the default training data split (usually dataset_0.json).

Note

Since the MMAR does not contain training data, you must ensure that these two parameters are set to the right values. Do not change any other parameters.

Example

{
   "DATA_ROOT": "/workspace/data/Task09_Spleen_nii",
   "DATASET_JSON": "/workspace/data/Task09_Spleen_nii/dataset_0.json",
   "PROCESSING_TASK": "segmentation",
   "MMAR_EVAL_OUTPUT_PATH": "eval",
   "MMAR_CKPT_DIR": "models",
   "PRETRAIN_WEIGHTS_FILE": "/var/tmp/resnet50_weights_tf_dim_ordering_tf_kernels.h5"
}

Variable                 Description

DATA_ROOT                The location of training data
DATASET_JSON             The data split config file
PROCESSING_TASK          The task type of the training: segmentation or classification
MMAR_EVAL_OUTPUT_PATH    Directory for saving evaluation (validate or infer) results.
                         Always the “eval” folder in the MMAR
MMAR_CKPT_DIR            Directory for saving training results.
                         Always the “models” folder in the MMAR
PRETRAIN_WEIGHTS_FILE    Location of the pre-trained weights file. NOTE: if the file does
                         not exist and is needed, the training program will download it
                         from a predefined URL.
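Because DATA_ROOT and DATASET_JSON must point at real locations, checking them before training can save a failed run. The sketch below is a hypothetical pre-flight check, not part of the SDK; it assumes python3 is on the PATH (the MMAR commands already require it) and that both values are paths readable from where the check runs.

```shell
# Hypothetical pre-flight check; assumes python3 is on PATH.
# $1: path to environment.json
check_environment() {
    python3 - "$1" <<'EOF'
import json, os, sys

with open(sys.argv[1]) as f:
    env = json.load(f)

ok = True
if not os.path.isdir(env.get("DATA_ROOT", "")):
    print("DATA_ROOT is not an existing directory", file=sys.stderr)
    ok = False
if not os.path.isfile(env.get("DATASET_JSON", "")):
    print("DATASET_JSON is not an existing file", file=sys.stderr)
    ok = False
sys.exit(0 if ok else 1)
EOF
}

# Usage: check_environment "$MMAR_ROOT/config/environment.json"
```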

Data split config file

This is a JSON file that defines the data split used for training and validation. For classification models, this file is usually named “plco.json”; for other models, it is usually named “dataset_0.json”.

The following is dataset_0.json of the model segmentation_ct_spleen:

{
   "description": "Spleen Segmentation",
   "labels": {
       "0": "background",
       "1": "spleen"
   },
   "licence": "CC-BY-SA 4.0",
   "modality": {
       "0": "CT"
   },
   "name": "Spleen",
   "numTest": 20,
   "numTraining": 41,
   "reference": "Memorial Sloan Kettering Cancer Center",
   "release": "1.0 06/08/2018",
   "tensorImageSize": "3D",
   "training": [
       {
           "image": "imagesTr/spleen_29.nii.gz",
           "label": "labelsTr/spleen_29.nii.gz"
       },
… <<more data here>>…..
       {
           "image": "imagesTr/spleen_49.nii.gz",
           "label": "labelsTr/spleen_49.nii.gz"
       }
   ],
   "validation": [
       {
           "image": "imagesTr/spleen_19.nii.gz",
           "label": "labelsTr/spleen_19.nii.gz"
       },

… <<more data here>>…..

       {
           "image": "imagesTr/spleen_9.nii.gz",
           "label": "labelsTr/spleen_9.nii.gz"
       }
   ]
}

There is a lot of information in this file, but the only sections needed by the training and validation programs are the “training” and “validation” sections, which define sample/label pairs of data items for training and validation respectively.
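To confirm that a data split file contains the two sections that matter, the item counts can be printed. The sketch below is a hypothetical helper that assumes python3 is on the PATH and that the file is complete, valid JSON (the listing above is abbreviated, so it cannot be parsed as shown).

```shell
# Hypothetical helper; assumes python3 is on PATH.
# $1: path to the data split file (e.g. dataset_0.json)
# Prints the number of training and validation items.
count_split() {
    python3 - "$1" <<'EOF'
import json, sys

with open(sys.argv[1]) as f:
    split = json.load(f)

# Only these two sections are needed by the training and validation programs.
print("training:", len(split.get("training", [])))
print("validation:", len(split.get("validation", [])))
EOF
}

# Usage: count_split "$MMAR_ROOT/config/dataset_0.json"
```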

Using model.ckpt or model_final.ckpt

In the models folder, model.ckpt is the best model resulting from training. model_final.ckpt is created when training finishes normally; it is a snapshot of the model at the last moment and is usually not the best model that can be obtained. Both model.ckpt and model_final.ckpt can be used for further training or fine-tuning. Here are two typical use cases:

  • Continued training: Use model_final.ckpt as the starting point for fine-tuning if you think the model has not converged, for example because the number of epochs was set too low.

  • Transfer learning: Use the model.ckpt as the starting point for fine-tuning on a different dataset, which may be your own dataset, to obtain the model that is best for your data. This is also called adaptation.
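The two use cases map onto the MMAR_CKPT override used by the fine-tuning commands. The sketch below is a hypothetical helper that returns the checkpoint path for each case; the use-case names "continue" and "transfer" are labels chosen for this illustration.

```shell
# Hypothetical helper: choose the checkpoint to pass as MMAR_CKPT.
# $1: MMAR root
# $2: "continue" (resume training that has not converged) or
#     "transfer" (fine-tune the best model on another dataset)
pick_checkpoint() {
    case "$2" in
        continue) echo "$1/models/model_final.ckpt" ;;
        transfer) echo "$1/models/model.ckpt" ;;
        *)        echo "unknown use case: $2" >&2; return 1 ;;
    esac
}

# Usage in a fine-tuning command (after --set):
#   MMAR_CKPT="$(pick_checkpoint "$MMAR_ROOT" continue)"
```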

Cloning MMAR

An MMAR is a self-contained workspace for model development work. If you want to experiment with different configurations for the same MMAR, you should create a new MMAR by cloning from an existing MMAR. We will provide a “mmar-clone” command in the future; until then, you can use the “cp -r” OS command to do it yourself.
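A minimal sketch of such a clone, wrapped in a function that refuses to overwrite an existing directory, might look like the following. clone_mmar is a hypothetical stand-in, not the planned mmar-clone command.

```shell
# Hypothetical stand-in for the planned mmar-clone command.
# $1: existing MMAR directory, $2: new MMAR directory (must not exist yet)
clone_mmar() {
    src="$1"
    dst="$2"
    if [ ! -d "$src" ]; then
        echo "source MMAR not found: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        # cp -r into an existing directory would nest the copy inside it.
        echo "destination already exists: $dst" >&2
        return 1
    fi
    cp -r "$src" "$dst"
}

# Usage: clone_mmar segmentation_ct_spleen segmentation_ct_spleen_exp1
```

Note that the copy includes the models folder, so the clone starts with the same checkpoints as the original; remove them from the clone if a clean start is wanted.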