Clara Train FAQ - NVIDIA Docs

Here is a list of frequently asked questions. For more questions, or to ask your own, see the NVIDIA Developer Forums.

1. Why should I use Clara Train?

Researchers can go back to being data scientists and focus on the problem rather than on coding.
NVIDIA’s Clara Train SDK is optimized, as it is written by experienced software developers.
Science becomes reproducible by sharing simple configuration files.
Training speedups:
- Multi-GPU acceleration with MPI and Horovod
- AMP (automatic mixed precision) trains faster and reduces the memory footprint by up to 40%
- Smart caching
Simple deployment in a clinical setting: models are easily exported with the export tools and imported into Clara Deploy.
Additional Features

2. How can I debug my data preparation / augmentation output?

You can add the SaveAsNifti transform at the end of your pre-transforms to see samples of the output. Also set the interrupt flag to true so that training stops and waits for a key press before generating new samples:

{
    "name": "SaveAsNifti",
    "args": {
        "fields": ["image", "label"],
        "out_dir": "/data/_debugPatches/",
        "interrupt": true
    }
},

3. For multi-label segmentation problems, how do you make the network focus on the labels and not the background?

Set skip_background to true in the Dice loss as below, or write your own loss function as described in the documentation:

"loss": {
    "name": "Dice",
    "args": {
        "skip_background": true
    }
},
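
To see why skipping the background matters, here is a minimal pure-Python sketch of a per-label Dice score. It is an illustration only, not the SDK's Dice loss: the helper names `dice_per_label` and `mean_dice` are hypothetical, and label 0 is assumed to be the background. When the background dominates the volume, its near-perfect score inflates the average and hides errors on the small foreground labels.

```python
from typing import List, Sequence


def dice_per_label(pred: Sequence[int], target: Sequence[int], num_labels: int) -> List[float]:
    """Dice score for each label over flat lists of integer class ids."""
    scores = []
    for label in range(num_labels):
        p = [x == label for x in pred]
        t = [x == label for x in target]
        intersection = sum(a and b for a, b in zip(p, t))
        denominator = sum(p) + sum(t)
        # A label absent from both prediction and target counts as a perfect match.
        scores.append(2.0 * intersection / denominator if denominator else 1.0)
    return scores


def mean_dice(pred, target, num_labels, skip_background=False) -> float:
    """Average Dice over labels; optionally ignore label 0 (assumed background)."""
    scores = dice_per_label(pred, target, num_labels)
    if skip_background:
        scores = scores[1:]
    return sum(scores) / len(scores)


# Mostly-background toy example: one of the two foreground voxels is missed.
target = [0] * 8 + [1, 1]
pred = [0] * 8 + [1, 0]
with_background = mean_dice(pred, target, num_labels=2)
without_background = mean_dice(pred, target, num_labels=2, skip_background=True)
print(with_background, without_background)
```

With the background included, the average looks better than the foreground segmentation really is; with it skipped, the score reflects only the labels of interest.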

4. What is the difference between pre-transforms in the train and validate sections in the config_train.json training configuration?

Pre-transforms for the validate section should be a subset of the train pre-transforms. Simply remove the augmentation transforms from the train pre-transforms.
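
As a rough sketch of that rule, the validate pre-transforms can be derived by filtering the train list. The helper `validation_pre_transforms` is hypothetical, the `args` are left empty for brevity, and which transforms count as augmentation is an assumption you would set for your own configuration.

```python
from typing import Dict, List, Set


def validation_pre_transforms(train_pre_transforms: List[Dict], augmentation_names: Set[str]) -> List[Dict]:
    """Keep every train pre-transform except those listed as augmentations."""
    return [t for t in train_pre_transforms if t["name"] not in augmentation_names]


# Transform names taken from this FAQ; args omitted (they depend on your data).
train_pre_transforms = [
    {"name": "NPResize3D", "args": {}},
    {"name": "NPRandomFlip3D", "args": {}},
    {"name": "NPRandomZoom3D", "args": {}},
]

# Assumption: the random transforms are the augmentations to drop for validation.
augmentations = {"NPRandomFlip3D", "NPRandomZoom3D"}

validate_pre_transforms = validation_pre_transforms(train_pre_transforms, augmentations)
print([t["name"] for t in validate_pre_transforms])
```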

5. Should image_pipeline be the same in the train and validate sections in the config_train.json training configuration?

Most of the time, yes, except when using a pipeline with caching. Caching should NOT be used in the validation pipeline as it will give the wrong values by only running on the cached data and not the entire validation dataset.

6. What are the differences between the different cropping transformations?

FastPosNegRatioCropROI performs augmentation in addition to cropping. It uses pos and neg to decide whether a crop is taken from the foreground or from anywhere in the image. Foreground and background are determined from the label only, never from the image. FastPosNegRatioCropROI does not have batches_to_gen_at_once caching.

7. What are some supported 3D data transforms?

TransformVolumeCropROI is a 3D elastic transform with parameters for:

3D crop
3D rotation
3D scaling
Sampling ratio

Other 3D transforms include:

NPResize3D
NPRandomFlip3D
NPRandomZoom3D

Other transforms can be found in the API docs: ai4med.components.transforms

8. How do I use batches_to_gen_at_once to speed up training in the CropByPosNegRatio cropping transformations?

This parameter exists solely to reduce the overhead of generating crop locations, and only makes sense as an integer greater than 1. It is not directly related to pos or neg: the ratio that determines where centers are picked stays the same. The overhead comes from loading the volume and performing the calculations needed to determine where centers can be picked. Instead of reloading the volume each time the data is used, setting batches_to_gen_at_once to a high number lets a single load of the volume pre-generate many centers in advance, which speeds up cropping each time the data is used later. Realistically, the value can be as high as the total number of times this piece of data will be used; memory is usually not an issue because only the center coordinates are cached. Each subsequent crop consumes one pre-generated center until all are used up, at which point another batches_to_gen_at_once centers are generated and cached.
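
The caching behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SDK's implementation: `CenterCache` and `load_candidates` are hypothetical names, the expensive volume load is simulated by a callable, and the pos/neg ratio logic is left out.

```python
import random
from typing import Callable, List, Tuple

Center = Tuple[int, int, int]


class CenterCache:
    """Hypothetical cache of pre-generated crop centers for a single volume."""

    def __init__(self, load_candidates: Callable[[], List[Center]],
                 batches_to_gen_at_once: int = 1, seed: int = 0):
        self._load_candidates = load_candidates  # stands in for the expensive volume load
        self._batch = batches_to_gen_at_once
        self._rng = random.Random(seed)
        self._centers: List[Center] = []
        self.loads = 0  # number of times the volume had to be loaded

    def next_center(self) -> Center:
        # Only reload the volume when all cached centers have been consumed.
        if not self._centers:
            candidates = self._load_candidates()
            self.loads += 1
            self._centers = [self._rng.choice(candidates) for _ in range(self._batch)]
        return self._centers.pop()


candidates = [(10, 10, 10), (20, 20, 20), (30, 30, 30)]

cached = CenterCache(lambda: candidates, batches_to_gen_at_once=8)
uncached = CenterCache(lambda: candidates, batches_to_gen_at_once=1)
for _ in range(16):
    cached.next_center()
    uncached.next_center()

print(cached.loads, uncached.loads)
```

In this sketch, producing 16 crops costs two volume loads with a batch of 8 and sixteen loads with a batch of 1, while only the coordinates are held in memory.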

9. How can I fix the error below?

Exception: <class 'ValueError'>: Cannot feed value of shape (1, 128, 128, 128) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, 1, 128, 128, 128)'

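The input is missing its batch dimension: the model expects a 5-D tensor (batch, channel, and three spatial dimensions) but received a 4-D value. Batching should be done either by a transform or by the pipeline.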
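
The shape mismatch can be reproduced with a small, framework-free sketch. The helpers `is_feedable` and `add_batch_dim` are hypothetical and only mimic the shape check; `None` stands for the unknown batch dimension shown as `?` in the error message.

```python
from typing import Optional, Sequence, Tuple


def is_feedable(value_shape: Sequence[int], tensor_shape: Sequence[Optional[int]]) -> bool:
    """True if a value of value_shape fits a placeholder of tensor_shape (None = any size)."""
    if len(value_shape) != len(tensor_shape):
        return False
    return all(t is None or v == t for v, t in zip(value_shape, tensor_shape))


def add_batch_dim(value_shape: Sequence[int]) -> Tuple[int, ...]:
    """Shape of the same sample once wrapped in a batch of size 1."""
    return (1, *value_shape)


model_input_shape = (None, 1, 128, 128, 128)  # (batch, channel, x, y, z)
sample_shape = (1, 128, 128, 128)             # (channel, x, y, z): no batch dimension

print(is_feedable(sample_shape, model_input_shape))                 # unbatched sample
print(is_feedable(add_batch_dim(sample_shape), model_input_shape))  # batched sample
```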

10. How can I show train / validation dice per label on tensorboard?

Change the metrics in the train.json file. Below is an example for a total of 4 labels: 3 organs and the background. Note that the stopping flag in the validation section targets the 3rd organ (dice_d03), not the average:

"aux_ops": [
    {
        "name": "DiceMaskedOutput",
        "args": {
            "is_onehot_targets": false,
            "skip_background": false,
            "is_independent_predictions": false,
            "tags": [
                "dice",
                "dice_d00",
                "dice_d01",
                "dice_d02",
                "dice_d03"
            ]
        },
        "do_summary": true,
        "do_print": false
    }
],

Similarly in the validation section:

"validate": {
    "metrics": [
        {"name": "ComputeAverageDice", "args": {"name": "mean_dice", "field": "model", "is_key_metric": true}},
        {"name": "ComputeAverage", "args": {"name": "val_dice", "field": "dice"}},
        {"name": "ComputeAverage", "args": {"name": "val_dice_00", "field": "dice_d00"}},
        {"name": "ComputeAverage", "args": {"name": "val_dice_01", "field": "dice_d01"}},
        {"name": "ComputeAverage", "args": {"name": "val_dice_02", "field": "dice_d02"}},
        {"name": "ComputeAverage", "args": {"stopping_metric": true, "name": "val_dice_03", "field": "dice_d03"}}
    ],
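
For more labels, writing these entries by hand gets error-prone. The sketch below generates the tag list and the per-label validation entries with the same naming pattern as the example above. The helper names `dice_tags` and `validation_metrics` are hypothetical, and the output is plain JSON meant to be pasted into your own configuration.

```python
import json
from typing import Dict, List


def dice_tags(num_labels: int) -> List[str]:
    """Tags for the aux op: the overall 'dice' plus one tag per label."""
    return ["dice"] + [f"dice_d{i:02d}" for i in range(num_labels)]


def validation_metrics(num_labels: int, stopping_label: int) -> List[Dict]:
    """ComputeAverage entries for the overall Dice and each per-label Dice."""
    metrics = [{"name": "ComputeAverage", "args": {"name": "val_dice", "field": "dice"}}]
    for i in range(num_labels):
        args = {"name": f"val_dice_{i:02d}", "field": f"dice_d{i:02d}"}
        if i == stopping_label:
            args["stopping_metric"] = True  # serialized as JSON true
        metrics.append({"name": "ComputeAverage", "args": args})
    return metrics


tags = dice_tags(4)
metrics = validation_metrics(4, stopping_label=3)
print(json.dumps(tags, indent=4))
print(json.dumps(metrics, indent=4))
```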

11. How can I have a metric just on the loss?

Add a custom auxiliary operation that returns the negated loss:

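# AuxiliaryOperation and BuildContext are provided by the Clara Train SDK (ai4med);
# import them from the modules listed in the API docs.
class NegLoss(AuxiliaryOperation):
    """Auxiliary operation that exposes the negated training loss under a tag."""

    def __init__(self, tag: str, do_summary=True, do_print=True):
        AuxiliaryOperation.__init__(self, do_summary, do_print)
        self.tag = tag

    def get_output_tensors(self, predictions, label, build_ctx: BuildContext):
        # Fetch the loss tensor from the build context and negate it,
        # so that a lower loss gives a higher metric value.
        loss = build_ctx.must_get(BuildContext.KEY_LOSS)
        neg_loss = -loss
        return {self.tag: neg_loss}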

Then add the following to the metrics section of “validate”:

"metrics": [
    {
        "name": "ComputeAverage",
        "args": {
            "name": "negloss",
            "is_key_metric": true,
            "field": "negloss"
        }
    }
],

and the following in the “train” section:

"aux_ops": [
    {
        "path": "your.path.to.NegLoss",
        "args": {
            "tag": "negloss"
        }
    }
],

12. What can I do if AMP doesn’t show me any difference in the model memory footprint?

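Profile the training run to see what is happening. The Docker image comes with the dlprof tool; see details at https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling.

Run your training script under dlprof, for example dlprof train.sh.

This produces a set of files. Open nsys_profile.qdrep in the Nsight Systems GUI, for example over VNC, outside your Docker container.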

13. How can I fix the failure below in training which occurs right after loading the data list file?

File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 729088, sblock->base_addr = 0, stored_eof = 102853048)

This happens when the pretrained model file was not downloaded completely, for example because training was killed during the download. The next time training runs, the program tries to read the truncated file and raises this error. The fix is to delete the file; it will be downloaded again on the next run. The file path is stored in environment.json under the key “PRETRAIN_WEIGHTS_FILE”.
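
The cleanup can be scripted. The sketch below reads the path from environment.json and deletes the file if it exists. The helper `remove_pretrained_weights` is hypothetical, and it assumes that a relative path should be resolved against a base directory you pass in; adjust that to match how your setup resolves paths.

```python
import json
import os


def remove_pretrained_weights(env_json_path: str, base_dir: str = ".") -> bool:
    """Delete the file named by PRETRAIN_WEIGHTS_FILE; return True if a file was removed."""
    with open(env_json_path) as f:
        env = json.load(f)

    weights_path = env.get("PRETRAIN_WEIGHTS_FILE")
    if not weights_path:
        return False  # key missing or empty: nothing to delete

    # Assumption: relative paths are resolved against base_dir.
    if not os.path.isabs(weights_path):
        weights_path = os.path.join(base_dir, weights_path)

    if os.path.isfile(weights_path):
        os.remove(weights_path)
        return True
    return False


# Example call (paths are placeholders for your own setup):
# remove_pretrained_weights("config/environment.json", base_dir=".")
```

After the file is removed, the next training run downloads the pretrained weights again.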