Clara Train FAQ
Here is a list of frequently asked questions. For more questions, or to ask your own, see the NVIDIA Developer Forums.
Researchers can go back to being data scientists and focus on the problem rather than on coding
NVIDIA’s Clara Train SDK is optimized, as it is written by experienced software developers
Science becomes reproducible by sharing simple configuration files
Training speed ups:
Multiple GPU acceleration with MPI and Horovod
Automatic mixed precision (AMP) trains faster and reduces the memory footprint by up to 40%
Smart caching
Simple to deploy in a clinical setting by exporting models and importing them into the Clara Deploy export tools
You can use the SaveAsNifti transform at the end of your pre-transforms to see samples of the output. Also make sure the interrupt flag is set to true so that training stops and waits for a key press before generating new samples:
{ "name": "SaveAsNifti",
"args": {
"fields": ["image","label"],
"out_dir": "/data/_debugPatches/",
"interrupt": true
}
},
Set skip_background to true in the Dice loss as shown below, or write your own loss function as described in the documentation:
"loss": {
"name": "Dice",
"args": {
"skip_background": true
}
},
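To illustrate why skipping the background matters, here is a minimal sketch of a mean Dice score computed over flat label lists. It is not Clara Train's implementation: the function name mean_dice is hypothetical, and the sketch assumes that label 0 is the background.

```python
def mean_dice(pred, target, num_classes, skip_background=False):
    """Average per-class Dice over flat lists of integer labels.

    Assumption for this sketch: class 0 is the background.
    """
    first_class = 1 if skip_background else 0
    scores = []
    for c in range(first_class, num_classes):
        pred_c = [p == c for p in pred]
        target_c = [t == c for t in target]
        intersection = sum(p and t for p, t in zip(pred_c, target_c))
        total = sum(pred_c) + sum(target_c)
        # A class absent from both prediction and target counts as a perfect match.
        scores.append(1.0 if total == 0 else 2.0 * intersection / total)
    return sum(scores) / len(scores)


# Mostly-background example: the single foreground voxel is missed.
target = [0, 0, 0, 0, 0, 0, 0, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0]

with_background = mean_dice(pred, target, num_classes=2)
without_background = mean_dice(pred, target, num_classes=2, skip_background=True)
```

With the background included, the large background class lifts the score even though the foreground was missed entirely; with it skipped, the score reflects only the foreground.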
Pre-transforms for the validate section should be a subset of the train pre-transforms. Simply remove the augmentation transforms from the train pre-transforms.
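As a rough sketch of that step, the validate pre-transforms can be derived by filtering the train list. The helper derive_validate_transforms is hypothetical, the example list is illustrative with args left empty, and deciding which transforms count as augmentation is up to you.

```python
import copy


def derive_validate_transforms(train_transforms, augmentation_names):
    """Return a copy of the train pre-transforms without the augmentation ones."""
    return [
        copy.deepcopy(t)
        for t in train_transforms
        if t["name"] not in augmentation_names
    ]


# Hypothetical train pre-transforms; args are left empty for brevity.
train_pre_transforms = [
    {"name": "NPResize3D", "args": {}},
    {"name": "NPRandomFlip3D", "args": {}},
    {"name": "NPRandomZoom3D", "args": {}},
]
augmentations = {"NPRandomFlip3D", "NPRandomZoom3D"}

validate_pre_transforms = derive_validate_transforms(train_pre_transforms, augmentations)
validate_names = [t["name"] for t in validate_pre_transforms]
```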
Most of the time, yes, except when using a pipeline with caching. Caching should NOT be used in the validation pipeline as it will give the wrong values by only running on the cached data and not the entire validation dataset.
FastPosNegRatioCropROI performs augmentations in addition to just cropping, and uses pos and neg to determine whether or not to crop to foreground based only on label, or crop anywhere in the image. Image is never used to determine foreground and background, only the label. FastPosNegRatioCropROI does not have batches_to_gen_at_once caching.
TransformVolumeCropROI is a 3D elastic transform that has parameters to do:
3d crop
3d rotation
3d scaling
Sampling ratio
NPResize3D
NPRandomFlip3D
NPRandomZoom3D
Other transforms can be found in the API docs: ai4med.components.transforms
This is used solely to reduce the overhead of generating crop locations, and it only makes sense as an integer greater than 1. It is not directly related to pos or neg, because the ratio that governs where the centers fall stays the same. The overhead comes from loading the volume and performing the calculations that determine where centers can be picked. Instead of reloading the volume every time the data is used, setting batches_to_gen_at_once to a high number lets a single load of the volume pre-generate many centers in advance, which speeds up every later crop of that data. Realistically, the value can be as high as the total number of times this piece of data will be used; memory is usually not an issue because only the center coordinates are cached. Each subsequent crop region consumes one pre-generated center until all are used up, and then another batches_to_gen_at_once centers are generated and cached.
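The caching behavior can be sketched as follows. This is an illustration only, not the transform's actual code: the CenterCache class and fake_load function are hypothetical, and centers are drawn uniformly here, whereas the real transform uses the label and the pos/neg ratio.

```python
import random


class CenterCache:
    """Pre-generates crop centers in groups so the volume is loaded less often."""

    def __init__(self, load_volume, batches_to_gen_at_once, seed=0):
        self.load_volume = load_volume  # expensive callable returning the volume shape
        self.batches_to_gen_at_once = batches_to_gen_at_once
        self.rng = random.Random(seed)
        self.centers = []
        self.loads = 0

    def next_center(self):
        if not self.centers:
            # One expensive load pays for many future crops.
            shape = self.load_volume()
            self.loads += 1
            self.centers = [
                tuple(self.rng.randrange(dim) for dim in shape)
                for _ in range(self.batches_to_gen_at_once)
            ]
        return self.centers.pop()


def fake_load():
    # Stand-in for reading a volume from disk; returns its shape only.
    return (128, 128, 128)


cache = CenterCache(fake_load, batches_to_gen_at_once=4)
crops = [cache.next_center() for _ in range(10)]
loads_needed = cache.loads
```

Ten crops with four centers per load need only three loads instead of ten.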
Exception: <class 'ValueError'>: Cannot feed value of shape (1, 128, 128, 128) for Tensor 'NV_MODEL_INPUT:0', which has shape '(?, 1, 128, 128, 128)'
You are missing batching. You should either have batching done by a transform or by the pipeline.
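The shape mismatch in the error can be reproduced with plain tuples. This is a minimal sketch with hypothetical helper names; None stands in for the "?" batch dimension in the error message.

```python
def matches_placeholder(shape, placeholder):
    """True if a concrete shape fits a placeholder where None means any size."""
    return len(shape) == len(placeholder) and all(
        expected is None or expected == actual
        for actual, expected in zip(shape, placeholder)
    )


def add_batch_dim(shape):
    """Shape of a single sample after it is wrapped in a batch of size 1."""
    return (1,) + tuple(shape)


model_input = (None, 1, 128, 128, 128)  # '(?, 1, 128, 128, 128)' in the error
sample = (1, 128, 128, 128)             # one channel-first volume, no batch axis

fits_before = matches_placeholder(sample, model_input)
fits_after = matches_placeholder(add_batch_dim(sample), model_input)
```

The unbatched sample has four dimensions where the model expects five, so it only fits once a batch axis is added in front.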
Change the metrics in the train.json file. Below is an example for a total of 4 labels: 3 organs and the background. Also note that the stopping flag in the validation section is set on the 3rd label (dice_d03), not on the average:
"aux_ops": [
{
"name": "DiceMaskedOutput",
"args": {
"is_onehot_targets": false,
"skip_background": false,
"is_independent_predictions": false,
"tags": [
"dice",
"dice_d00",
"dice_d01",
"dice_d02",
"dice_d03"
]
},
"do_summary": true,
"do_print": false
}
],
Similarly in the validation section:
"validate": {
"metrics": [
{"name": "ComputeAverageDice", "args": {"name": "mean_dice", "field": "model", "is_key_metric": true}},
{"name": "ComputeAverage", "args": {"name": "val_dice", "field": "dice"}},
{"name": "ComputeAverage", "args": {"name": "val_dice_00", "field": "dice_d00"}},
{"name": "ComputeAverage", "args": {"name": "val_dice_01", "field": "dice_d01"}},
{"name": "ComputeAverage", "args": {"name": "val_dice_02", "field": "dice_d02"}},
{"name": "ComputeAverage", "args": {"stopping_metric": true, "name": "val_dice_03", "field": "dice_d03"}}
],
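Because each "field" in the validation metrics must match a tag emitted by the aux op, it can help to generate both from the number of labels. The sketch below uses hypothetical helper names and covers only the ComputeAverage entries; the ComputeAverageDice entry from the example is left out.

```python
import json


def dice_tags(num_labels):
    """Tags for the overall Dice plus one per label, e.g. dice_d00."""
    return ["dice"] + [f"dice_d{i:02d}" for i in range(num_labels)]


def validation_metrics(num_labels, stopping_label):
    """ComputeAverage entries whose 'field' values match dice_tags()."""
    metrics = [{"name": "ComputeAverage", "args": {"name": "val_dice", "field": "dice"}}]
    for i in range(num_labels):
        args = {"name": f"val_dice_{i:02d}", "field": f"dice_d{i:02d}"}
        if i == stopping_label:
            args["stopping_metric"] = True
        metrics.append({"name": "ComputeAverage", "args": args})
    return metrics


tags = dice_tags(4)
metrics = validation_metrics(4, stopping_label=3)
print(json.dumps(metrics, indent=2))
```

Generating the names this way avoids a field that refers to a tag the training section never produces.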
Add custom code:
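# AuxiliaryOperation and BuildContext are provided by the ai4med package;
# import them as described in the API docs.
class NegLoss(AuxiliaryOperation):
    """Auxiliary operation that exposes the negated loss under a tag."""

    def __init__(self, tag: str, do_summary=True, do_print=True):
        AuxiliaryOperation.__init__(self, do_summary, do_print)
        self.tag = tag

    def get_output_tensors(self, predictions, label, build_ctx: BuildContext):
        # Fetch the loss tensor from the build context and negate it.
        loss = build_ctx.must_get(BuildContext.KEY_LOSS)
        neg_loss = -loss
        return {self.tag: neg_loss}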
Then in the metric section of “validate”:
"metrics":
[
{
"name": "ComputeAverage",
"args": {
"name": "negloss",
"is_key_metric": true,
"field": "negloss"
}
}
],
and the following in the “train” section:
"aux_ops": [
{
"path": "your.path.to.NegLoss",
"args": {
"tag": "negloss"
}
}
],
The Docker image comes with the dlprof tool; see details at https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/#profiling.
Run dlprof train.sh.
You will get a set of files. Open nsys_profile.qdrep with the Nsight Systems GUI over VNC (outside your Docker container).
13. How can I fix the failure below in training which occurs right after loading the data list file?
File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 729088, sblock->base_addr = 0, stored_eof = 102853048)
This happens when the pretrained model file does not download completely, for example because training was killed. The next time training runs, the program tries to read the incomplete file and gives this error. The fix is to delete the file; it will be downloaded again on the next run. The path is in environment.json, under “PRETRAIN_WEIGHTS_FILE”.
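The cleanup can be scripted with a small sketch like the one below. The helper name remove_pretrained_weights is hypothetical, and it assumes that the path stored under PRETRAIN_WEIGHTS_FILE can be resolved as written from the directory where the script runs.

```python
import json
import os


def remove_pretrained_weights(environment_json_path):
    """Delete the file named by PRETRAIN_WEIGHTS_FILE; return True if removed."""
    with open(environment_json_path) as f:
        env = json.load(f)
    weights_path = env["PRETRAIN_WEIGHTS_FILE"]
    # Assumption: the path can be resolved as written from the current directory.
    if os.path.isfile(weights_path):
        os.remove(weights_path)
        return True
    return False
```

Calling remove_pretrained_weights with the path to your environment.json before restarting training removes the truncated file so that it is downloaded again.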