Note

For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent, for example num_gpus = 1 and gpu_ids = [0, 1], they are modified to follow the setting with more GPUs; in this example, num_gpus = 1 becomes num_gpus = 2.
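The reconciliation rule described above can be sketched in Python. This is only an illustration, not the toolkit's actual implementation: reconcile_gpu_settings is a hypothetical helper, and the branch that extends gpu_ids when num_gpus is the larger setting is an assumption, since the text only documents the case where gpu_ids lists more GPUs.

```python
from typing import List, Optional, Tuple


def reconcile_gpu_settings(
    num_gpus: int = 1, gpu_ids: Optional[List[int]] = None
) -> Tuple[int, List[int]]:
    """Return a consistent (num_gpus, gpu_ids) pair that follows the larger setting.

    Hypothetical helper for illustration only.
    """
    if gpu_ids is None:
        gpu_ids = [0]  # documented default

    if num_gpus < len(gpu_ids):
        # Documented case: gpu_ids lists more GPUs, so num_gpus is raised to match.
        num_gpus = len(gpu_ids)
    elif num_gpus > len(gpu_ids):
        # Assumption: use the first num_gpus device indices.
        gpu_ids = list(range(num_gpus))

    return num_gpus, list(gpu_ids)


print(reconcile_gpu_settings(1, [0, 1]))  # (2, [0, 1])
```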

In some cases, multi-GPU training may fail with a segmentation fault. You may circumvent this by setting the OMP_NUM_THREADS environment variable to 1. Depending on your mode of execution, use one of the following methods to set this variable:

CLI Launcher

You may set this environment variable by adding the following entry to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 in this section:

{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}
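If the launcher configuration already contains other fields, the entry can be merged programmatically instead of edited by hand. The following is a minimal sketch, assuming the Envs field is a list of objects with "variable" and "value" keys as in the snippet above. set_launcher_env is a hypothetical helper, not part of the launcher, and the "Mounts" entry in the sample configuration is an illustrative placeholder.

```python
import json


def set_launcher_env(config: dict, variable: str, value: str) -> dict:
    """Add or update one entry in the "Envs" list of a launcher config dict.

    Hypothetical helper for illustration only.
    """
    envs = config.setdefault("Envs", [])
    for entry in envs:
        if entry.get("variable") == variable:
            entry["value"] = value  # update the existing entry in place
            return config
    envs.append({"variable": variable, "value": value})
    return config


# Illustrative in-memory configuration; a real one would be loaded from
# ~/.tao_mounts.json with json.load and written back with json.dump.
sample_config = {"Mounts": []}
set_launcher_env(sample_config, "OMP_NUM_THREADS", "1")
print(json.dumps(sample_config, indent=4))
```

Because the helper updates a matching entry rather than appending a duplicate, running it more than once leaves a single OMP_NUM_THREADS entry in the Envs list.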

Docker

You may set environment variables in the Docker container by passing the -e flag on the docker command line:

docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e