Date: 2025-07-18

Does the batch size affect the accuracy of the model? How should I choose an appropriate batch size for my training?

As a common practice, a small batch size or a single GPU is preferred for a small dataset, while a large batch size or multiple GPUs are preferred for a large dataset.

I am seeing lower accuracy with multi-GPU vs. single GPU. Can multi-GPU training affect the accuracy of the model? How do I improve the accuracy in multi-GPU training?

To improve accuracy in a multi-GPU environment, increase the learning rate parameters, for example max_learning_rate. Multi-GPU training is preferred only when the training dataset is large.
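
One common heuristic for choosing the higher value is to scale the learning rate linearly with the number of GPUs, because the effective batch size grows by the same factor. The sketch below assumes that rule and a hypothetical helper name, scale_learning_rate; neither comes from the TAO documentation, so treat the result as a starting point to validate experimentally rather than a recommended setting.

```python
def scale_learning_rate(single_gpu_lr: float, num_gpus: int) -> float:
    """Scale a single-GPU learning rate linearly with the GPU count.

    Assumption: the per-GPU batch size is unchanged, so the effective
    batch size (and, by this heuristic, the learning rate) grows with num_gpus.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return single_gpu_lr * num_gpus


# Hypothetical example: a max_learning_rate of 5e-4 tuned on one GPU, moved to 4 GPUs.
print(scale_learning_rate(5e-4, 4))
```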

Distribute the dataset class: How do I balance the weight between classes if the dataset has significantly more samples for one class than for another?

To account for imbalance, increase the class_weight for classes with fewer samples. You can also try disabling enable_autoweighting; in this case initial_weight is used to control cov/regression weighting. It is important to keep the number of samples of different classes balanced, which helps improve mAP.
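
One simple way to pick class_weight values is inverse class frequency, sketched below. The helper name, the sample counts, and the normalization (the most frequent class gets a weight of 1.0) are illustrative assumptions, not TAO defaults.

```python
from typing import Dict


def inverse_frequency_weights(sample_counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by (largest class count / its own count)."""
    if not sample_counts or any(count <= 0 for count in sample_counts.values()):
        raise ValueError("every class needs a positive sample count")
    largest = max(sample_counts.values())
    return {name: largest / count for name, count in sample_counts.items()}


# Hypothetical object counts per class in the training set.
counts = {"car": 9000, "pedestrian": 3000, "cyclist": 1500}
for name, weight in inverse_frequency_weights(counts).items():
    print(f"{name}: class_weight = {weight:.1f}")
```

The resulting values would then be entered by hand as the class_weight of each class in the training configuration file and tuned from there.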

How do I save checkpoints in TAO?

The train command for every DNN supports saving checkpoints by default. By default, a checkpoint is saved every 10th epoch. For DetectNet_v2, the interval at which checkpoints are saved is configured using the checkpoint_interval parameter in the training_config section of the DetectNet_v2 training configuration file.

In DetectNet_v2, are there any parameters that can help improve AP (average precision) on small objects?

The following parameters can help you improve AP on smaller objects:

Increase num_layers of resnet

Increase class_weight for small objects

Increase the coverage_radius_x and coverage_radius_y parameters of the bbox_rasterizer_config section for the small objects class

Decrease minimum_detection_ground_truth_overlap

Lower minimum_height to cover more small objects for evaluation.

Why do I get this error when running tasks in the PyTorch container?

pytorch_lightning.utilities.exceptions.MisconfigurationException: you restored a checkpoint with current_epoch=10 but the Trainer(max_epochs=1)

After you have trained a model for a number of epochs, you cannot continue the training by setting the number of epochs (max_epochs) to a lower number than the one the model has already been trained for.
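
The condition behind this error can be checked before launching a resumed run. The sketch below is a plain-Python illustration with a hypothetical helper name, remaining_epochs; it does not call PyTorch Lightning and assumes you already know the epoch stored in the checkpoint.

```python
def remaining_epochs(checkpoint_epoch: int, max_epochs: int) -> int:
    """Return how many epochs are left, or fail if max_epochs is too low."""
    if max_epochs < checkpoint_epoch:
        raise ValueError(
            f"max_epochs={max_epochs} is lower than the checkpoint's "
            f"current_epoch={checkpoint_epoch}; raise max_epochs to resume training"
        )
    return max_epochs - checkpoint_epoch


# Resuming the checkpoint from the error message above (current_epoch=10)
# with a hypothetical new target of 30 epochs.
print(remaining_epochs(checkpoint_epoch=10, max_epochs=30))
```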

Can I run TAO on systems without elevated user privileges?

Running TAO via the TAO Launcher requires docker-ce to be installed, because the launcher interacts with the Docker service on the local host to run commands. Installing Docker requires elevated user privileges (root access). If you don’t have elevated user privileges on your compute machine, you can run TAO using Singularity. This requires you to bypass the launcher and interact directly with the component Docker containers. For information on which tasks are implemented in which container, run the tao info --verbose command. Once you have derived the task-to-docker mapping, you can run the tasks by following the steps below.

Pull the required Docker image using the following Singularity command:

singularity pull tao-toolkit-tf:v3.21.08-py3.sif docker://nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

Note: For this command to work, the latest version of Singularity must be installed.



Instantiate the container using the following command:

singularity run --nv -B /path/to/workspace:/path/to/workspace tao-toolkit-tf:v3.21.08-py3.sif



Run the commands inside the container without the tao prefix. For example, to run a detectnet_v2 training in the tao-toolkit-tf container, the command would be as follows:

detectnet_v2 train -e /path/to/workspace/specs/file.txt \
                   -k $KEY \
                   -r /path/to/workspace/results \
                   -n name_of_final_model \
                   --gpus $NUM_GPUS
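
The pull and run steps above can also be assembled from a script, which helps when the image tag or workspace path changes. This is a minimal sketch that only builds and prints the two Singularity command lines shown in the steps; the helper names pull_command and run_command are hypothetical, and nothing is executed.

```python
import shlex
from typing import List

IMAGE = "tao-toolkit-tf:v3.21.08-py3"  # image name and tag taken from the steps above


def pull_command(image: str = IMAGE) -> List[str]:
    """Build the 'singularity pull' command for a TAO image on nvcr.io."""
    return ["singularity", "pull", f"{image}.sif", f"docker://nvcr.io/nvidia/tao/{image}"]


def run_command(workspace: str, image: str = IMAGE) -> List[str]:
    """Build the 'singularity run' command, binding the workspace at the same path."""
    return ["singularity", "run", "--nv", "-B", f"{workspace}:{workspace}", f"{image}.sif"]


for command in (pull_command(), run_command("/path/to/workspace")):
    print(shlex.join(command))
```

To execute a command instead of printing it, the list could be passed to subprocess.run(command, check=True).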



Can I run TAO without a network connection?

Please see https://github.com/NVIDIA-AI-IOT/tao_toolkit_recipes/blob/main/tao_training_without_network/Guide.md