Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed, but are inconsistent, for example num_gpus = 1, gpu_ids = [0, 1], then they are modified to follow the setting that implies more GPUs; in the same example num_gpus is modified from 1 to 2.

In some cases multi-GPU training may result in a segmentation fault. You can circumvent this by setting the enviroment variable OMP_NUM_THREADS to 1. Depending upon your model of execution, you may use the following methods to set this variable:

  • CLI Launcher:

    You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file as mentioned in bullet 3 in ths section Running the launcher.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADSR",
                "value": "1"
            }
    
    }
    
  • Docker:

    You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
        -e OMP_NUM_THREADS=1 \
        -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e