Date: 2026-03-03

tao model visual_changenet train -e <experiment_spec_file> task=segment [results_dir=<global_results_dir>] [model.<model_option>=<model_option_value>] [dataset.<dataset_option>=<dataset_option_value>] [train.<train_option>=<train_option_value>] [train.gpu_ids=<gpu indices>] [train.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required.

-e, --experiment_spec_file: The path to the experiment spec file.

task: The task (‘segment’ or ‘classify’) for the visual_changenet training. Default: segment.

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.
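As an illustration of how such overrides are laid out on the command line, the following is a minimal sketch that assembles the train command from a dictionary of option values. The helper build_train_command, the spec path, and the results directory are hypothetical, and the bracketed list form for train.gpu_ids assumes Hydra-style override syntax; the sketch only builds and prints the command, it does not run it.

```python
import shlex


def build_train_command(spec_file, task="segment", overrides=None):
    """Assemble the visual_changenet train command as an argument list.

    Hypothetical helper for illustration; not part of the TAO launcher.
    """
    cmd = ["tao", "model", "visual_changenet", "train",
           "-e", spec_file, f"task={task}"]
    for key, value in (overrides or {}).items():
        if isinstance(value, (list, tuple)):
            # Assumption: lists are written in Hydra-style form, e.g. [0,1].
            value = "[" + ",".join(str(v) for v in value) + "]"
        # Each override is a single key=value token with no spaces around "=".
        cmd.append(f"{key}={value}")
    return cmd


# Placeholder paths; replace them with your own spec file and results directory.
cmd = build_train_command(
    "/path/to/experiment_spec.yaml",
    overrides={
        "results_dir": "/results",
        "train.num_gpus": 2,
        "train.gpu_ids": [0, 1],
    },
)

# shlex.join quotes tokens such as the bracketed list so the shell passes them intact.
print(shlex.join(cmd))
```

Building the command as a list keeps every override as one token, which avoids the shell splitting or globbing values that contain brackets.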

Note: For training, evaluation, and inference, each task exposes two variables, num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent, for example num_gpus=1 and gpu_ids=[0,1], they are modified to follow the setting that implies more GPUs; in this example, num_gpus is changed from 1 to 2.
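The reconciliation rule in this note can be sketched as follows. This is not the TAO implementation: reconcile_gpus is a hypothetical helper, and the case where num_gpus is larger than the gpu_ids list is an assumption (the note only gives the opposite case), handled here by using indices 0 through num_gpus - 1.

```python
def reconcile_gpus(num_gpus=1, gpu_ids=None):
    """Return a consistent (num_gpus, gpu_ids) pair.

    Hypothetical helper mirroring the rule in the note: when the two
    settings disagree, the one that implies more GPUs wins.
    """
    gpu_ids = list(gpu_ids) if gpu_ids is not None else [0]
    if len(gpu_ids) > num_gpus:
        # Case from the note: num_gpus=1, gpu_ids=[0, 1] -> num_gpus becomes 2.
        num_gpus = len(gpu_ids)
    elif num_gpus > len(gpu_ids):
        # Assumption: fill gpu_ids with the first num_gpus device indices.
        gpu_ids = list(range(num_gpus))
    return num_gpus, gpu_ids


print(reconcile_gpus(1, [0, 1]))  # the example from the note
print(reconcile_gpus())           # defaults are already consistent
```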

In some cases, multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you can use one of the following methods to set this variable:

CLI Launcher: You can set the environment variable by adding the following entry to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 of the section Running the launcher.
{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}

Docker: You can set environment variables in Docker by passing the -e flag on the Docker command line.
docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e
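For the CLI Launcher method, the Envs entry can also be added programmatically. The following is a minimal sketch that works on an in-memory dictionary with the same layout as the JSON fragment above; set_launcher_env is a hypothetical helper, not a TAO API, and reading or writing ~/.tao_mounts.json itself is left to the caller.

```python
import json


def set_launcher_env(config, variable, value):
    """Add or update one entry in the "Envs" list of a launcher config dict.

    Hypothetical helper for illustration; it mutates and returns `config`.
    """
    envs = config.setdefault("Envs", [])
    for entry in envs:
        if entry.get("variable") == variable:
            # Variable already listed: update it instead of adding a duplicate.
            entry["value"] = value
            break
    else:
        envs.append({"variable": variable, "value": value})
    return config


# Start from a config with no Envs field; a real file would first be loaded
# with json.load and later written back with json.dump.
config = set_launcher_env({}, "OMP_NUM_THREADS", "1")
print(json.dumps(config, indent=4))
```

Updating an existing entry in place keeps the file free of duplicate variables when the helper is run more than once.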

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved. Checkpoints are saved in train.results_dir, like this:

$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is also saved as changenet_model_segment_latest.pth. Training automatically resumes from changenet_model_segment_latest.pth if it exists in train.results_dir. This behavior is superseded by train.resume_training_checkpoint_path if it is provided.
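The precedence described here can be sketched as a small selection function. This illustrates the documented behavior under stated assumptions and is not the TAO source: pick_resume_checkpoint is a hypothetical helper, and the example checkpoint path is a placeholder.

```python
import tempfile
from pathlib import Path

LATEST_NAME = "changenet_model_segment_latest.pth"


def pick_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint to resume from, or None for a fresh run.

    Hypothetical helper mirroring the documented precedence.
    """
    # An explicitly provided path takes precedence over the latest checkpoint.
    if resume_training_checkpoint_path:
        return Path(resume_training_checkpoint_path)
    # Otherwise resume from the latest checkpoint if one exists in results_dir.
    latest = Path(results_dir) / LATEST_NAME
    return latest if latest.exists() else None


with tempfile.TemporaryDirectory() as results_dir:
    # Empty results directory: nothing to resume from.
    print(pick_resume_checkpoint(results_dir))
    # Once the latest checkpoint exists, it is picked up automatically.
    (Path(results_dir) / LATEST_NAME).touch()
    print(pick_resume_checkpoint(results_dir).name)
    # An explicit path overrides the latest checkpoint (placeholder path).
    print(pick_resume_checkpoint(results_dir, "/ckpts/model_epoch_002.pth"))
```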

The major implication of this logic is that, if you wish to trigger fresh training from scratch, either