Clara Train FAQ

Here is a list of frequently asked questions. For more questions, or to ask your own, see the NVIDIA Developer Forums.

  • Researchers can go back to being data scientists and focus on the problem rather than on coding

  • NVIDIA’s Clara Train SDK is optimized, as it is written by experienced software developers

  • Reproducible science through sharing simple configuration files

  • Training speedups:

    • Multi-GPU acceleration with MPI and Horovod

    • Automatic mixed precision (AMP) trains faster and reduces the memory footprint by up to 40%

    • Smart caching

  • Simple to deploy in a clinical setting: models are easily exported with the export tools and imported into Clara Deploy

  • Additional Features

You can add the SaveAsNifti transform at the end of your pre-transforms to write out samples of their output. Also set the interrupt flag to true so that training stops and waits for a key press before generating new samples:

{ "name": "SaveAsNifti", "args": { "fields": ["image","label"], "out_dir": "/data/_debugPatches/", "interrupt": true } },

Set skip_background to true in the Dice loss, as shown below, or write your own loss function as described in the documentation:

"loss": { "name": "Dice", "args": { "skip_background": true } },

Pre-transforms for the validate section should be a subset of the train pre-transforms. Simply remove the augmentation transforms from the train pre-transforms.
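As a minimal sketch of that rule, the validation list can be derived from the train list by filtering out augmentations by name. The helper name and the transform names in the toy list are illustrative assumptions; which transforms count as augmentations depends on your own configuration.

```python
def validation_pre_transforms(train_pre_transforms, augmentation_names):
    """Return the train pre-transforms with the augmentation transforms removed."""
    return [t for t in train_pre_transforms if t.get("name") not in augmentation_names]


# Toy train pre-transforms; names are placeholders for illustration.
train_pre = [
    {"name": "LoadNifti", "args": {"fields": ["image", "label"]}},
    {"name": "ScaleIntensityRange", "args": {"fields": "image"}},
    {"name": "RandomAxisFlip", "args": {"fields": ["image", "label"]}},
]
augmentations = {"RandomAxisFlip"}

val_pre = validation_pre_transforms(train_pre, augmentations)
```

The order of the remaining transforms is preserved, so the validation data goes through the same deterministic preprocessing as the training data.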

Most of the time, yes, except when using a pipeline with caching. Caching should NOT be used in the validation pipeline: validation would then run only on the cached data rather than on the entire validation dataset, and so report wrong values.

FastPosNegRatioCropROI performs augmentations in addition to cropping. It uses pos and neg to decide whether a crop is centered on the foreground or taken anywhere in the image. Foreground and background are determined only from the label, never from the image. FastPosNegRatioCropROI does not have batches_to_gen_at_once caching.

TransformVolumeCropROI is a 3D elastic transform that has parameters to do:

  • 3D crop

  • 3D rotation

  • 3D scaling

  • Sampling ratio

  • NPResize3D

  • NPRandomFlip3D

  • NPRandomZoom3D

Other transforms can be found in the API docs: ai4med.components.transforms

batches_to_gen_at_once is used solely to reduce the overhead of generating crop locations, and only makes sense as an integer greater than 1. It is not directly related to pos or neg, because the ratio that determines where the centers are placed stays the same.

The overhead comes from having to load the volume and perform calculations to determine where centers can be picked. Instead of reloading the volume each time the data is used, setting batches_to_gen_at_once to a high number lets a single load of the volume pre-generate many centers in advance, which speeds up cropping each time the data is used later. Realistically, the value can be as high as the total number of times this piece of data will be used; memory is usually not an issue, because only the center coordinates are cached.

Each subsequent crop region consumes one pre-generated center. Once all of them are used up, another batches_to_gen_at_once centers are generated and cached.
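The caching behavior described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the SDK's implementation: the class name CenterCache, the load_label callable, and the choice to sample centers only from foreground voxels (ignoring the pos/neg ratio) are all assumptions made for the example, which also assumes the label contains at least one foreground voxel.

```python
import random
from collections import deque


class CenterCache:
    """Pre-generate crop centers in bulk so the volume is loaded less often."""

    def __init__(self, load_label, batches_to_gen_at_once, seed=None):
        self.load_label = load_label            # expensive: reads the label volume
        self.batches_to_gen_at_once = batches_to_gen_at_once
        self.centers = deque()                  # only coordinates are cached
        self.loads = 0                          # counts how often the volume was loaded
        self.rng = random.Random(seed)

    def _refill(self):
        label = self.load_label()
        self.loads += 1
        # Collect (z, y, x) coordinates of all foreground voxels.
        foreground = [
            (z, y, x)
            for z, plane in enumerate(label)
            for y, row in enumerate(plane)
            for x, value in enumerate(row)
            if value > 0
        ]
        for _ in range(self.batches_to_gen_at_once):
            self.centers.append(self.rng.choice(foreground))

    def next_center(self):
        """Consume one cached center, regenerating a batch when the cache is empty."""
        if not self.centers:
            self._refill()
        return self.centers.popleft()
```

With batches_to_gen_at_once set to 4, requesting 8 centers triggers only 2 loads of the volume instead of 8.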

Exception: <class 'ValueError'>: Cannot feed value of shape (1, 128, 128, 128) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, 1, 128, 128, 128)'

The input is missing its batch dimension. Batching must be done either by a transform or by the pipeline.
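The shapes in the message show the mismatch: the model input has five dimensions, the first being the batch size (shown as ?), while the fed value has only four. The sketch below reproduces that check in plain Python with a hypothetical helper, and shows that adding a leading batch dimension resolves it; it is not how the framework performs the check internally.

```python
def matches_placeholder(shape, placeholder):
    """True if a concrete shape fits a placeholder shape (None means any size)."""
    return len(shape) == len(placeholder) and all(
        p is None or p == s for s, p in zip(shape, placeholder)
    )


placeholder = (None, 1, 128, 128, 128)  # '?' in the error message is an unknown batch size
sample = (1, 128, 128, 128)             # a single unbatched sample: channel + 3 spatial dims
batched = (1,) + sample                 # the same sample as a batch of one

unbatched_ok = matches_placeholder(sample, placeholder)   # False: rank 4 vs rank 5
batched_ok = matches_placeholder(batched, placeholder)    # True
```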

Change the metrics in the train.json file. Below is an example for a total of 4 labels: 3 organs and the background. Also note that the stopping flag in the validation section targets the 3rd label, not the average:

"aux_ops": [ { "name": "DiceMaskedOutput", "args": { "is_onehot_targets": false, "skip_background": false, "is_independent_predictions": false, "tags": [ "dice", "dice_d00", "dice_d01", "dice_d02", "dice_d03" ] }, "do_summary": true, "do_print": false } ],

Similarly in the validation section:

"validate": { "metrics": [ { {"name": "ComputeAverageDice", "args": {"name": "mean_dice", "field": "model", "is_key_metric": true}}, {"name": "ComputeAverage", "args": {"name": "val_dice", "field": "dice"}}, {"name": "ComputeAverage", "args": {"name": "val_dice_00", "field": "dice_d00"}}, {"name": "ComputeAverage", "args": {"name": "val_dice_01", "field": "dice_d01"}}, {"name": "ComputeAverage", "args": {"name": "val_dice_02", "field": "val_dice_d02"}}, {"name": "ComputeAverage", "args": {"stopping_metric": true, "name": "val_dice_03", "field": "dice_d03"} } ],

Add custom code:

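    # AuxiliaryOperation and BuildContext are provided by the Clara Train SDK
    # (ai4med package); import them as described in the API docs.
    class NegLoss(AuxiliaryOperation):

        def __init__(self, tag: str, do_summary=True, do_print=True):
            AuxiliaryOperation.__init__(self, do_summary, do_print)
            self.tag = tag

        def get_output_tensors(self, predictions, label, build_ctx: BuildContext):
            # Fetch the training loss and negate it, so that a higher value is better.
            loss = build_ctx.must_get(BuildContext.KEY_LOSS)
            neg_loss = -loss
            return {self.tag: neg_loss}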

Then in the metric section of “validate”:

"metrics": [ { "name": "ComputeAverage", "args": { "name": "negloss", "is_key_metric": true, "field": "negloss" } } ],

and the following in the “train” section:

"aux_ops": [ { "path": "your.path.to.NegLoss", "args": { "tag": "negloss" } } ],

The Docker image comes with the DLProf tool; see details at https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling.

Run dlprof train.sh.

This produces a set of files. Open nsys_profile.qdrep in the Nsight Systems GUI, using VNC (outside your Docker container).

File "h5py/h5f.pyx", line 88, in h5py.h5f.open OSError: Unable to open file (truncated file: eof = 729088, sblock->base_addr = 0, stored_eof = 102853048)

This happens when the pretrained model file was not downloaded completely, for example because training was killed during the download. The next time training runs, the program tries to read the truncated file and raises this error. The fix is to delete the file; it will be downloaded again on the next run. The path to the file is stored in environment.json under “PRETRAIN_WEIGHTS_FILE”.
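The cleanup can also be scripted. The sketch below reads the path from environment.json and deletes the file if it exists. It assumes PRETRAIN_WEIGHTS_FILE is a top-level key in that file; the helper name and the optional base_dir argument (for resolving a relative path) are hypothetical.

```python
import json
import os


def remove_pretrained_weights(env_path, base_dir=None):
    """Delete the file named by PRETRAIN_WEIGHTS_FILE; return True if one was removed."""
    with open(env_path) as f:
        env = json.load(f)
    weights_path = env["PRETRAIN_WEIGHTS_FILE"]
    # Assumption: a relative path is resolved against base_dir when one is given.
    if base_dir is not None and not os.path.isabs(weights_path):
        weights_path = os.path.join(base_dir, weights_path)
    if os.path.isfile(weights_path):
        os.remove(weights_path)
        return True
    return False


# Example call (paths are placeholders):
# remove_pretrained_weights("config/environment.json", base_dir=".")
```

Check the resolved path before running this against real data, since the deletion is not reversible.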

© Copyright 2020, NVIDIA. Last updated on Feb 2, 2023.