Troubleshooting Guide

To share feedback or ask questions about this release, visit the NVIDIA TAO Developer Forum.

NGC

  • Before pulling assets from NGC, run the following commands and follow the prompts:

    ngc config set
    docker login nvcr.io
    
  • When running ngc config set, the NGC CLI may not prompt you to configure the team and org. In this case, you may see the following error when downloading models:

    Missing org - If apikey is set, org is also required.
    

    If this happens, back up your existing NGC API key from the NGC config at ~/.ngc/config, clear the config by running the following command, and then run ngc config set again.

    ngc config clear
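
The backup step can be sketched as a small shell helper. This is a minimal sketch, not part of the NGC CLI: the function name backup_ngc_config and the timestamped .bak naming are assumptions made for illustration. It only copies the file; clearing and reconfiguring are still done with the ngc commands from this section.

```shell
# Hypothetical helper: copy the NGC config to a timestamped backup file.
# The default path ~/.ngc/config comes from this guide; the backup naming
# scheme is an assumption.
backup_ngc_config() {
    local config="${1:-$HOME/.ngc/config}"
    if [ ! -f "$config" ]; then
        echo "No NGC config found at $config" >&2
        return 1
    fi
    local backup
    backup="${config}.$(date +%Y%m%d_%H%M%S).bak"
    cp -p "$config" "$backup" || return 1
    # Print the backup location so it can be recorded or restored later.
    echo "$backup"
}

# Usage: back up first, then clear and reconfigure.
#   backup_ngc_config && ngc config clear && ngc config set
```

Because the function returns a non-zero status when no config file exists, chaining it with && keeps ngc config clear from running unless a backup was actually written.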
    

TAO Launcher

The launcher CLI abstracts the user’s interaction with the container and exposes the entrypoints inside the respective dockers.

  • When installing the TAO Toolkit Launcher, install nvidia-pyindex before installing nvidia-tao. The TAO Toolkit Launcher package is hosted on the NVIDIA Python Package Index, not on PyPI. If you skip this step, you may run into this error:

    Collecting nvidia-tao
    Downloading nvidia-tao-0.0.1.dev4.tar.gz (4.0 kB)
    Complete output (15 lines):
    Traceback (most recent call last):
    ..
    ..
    RuntimeError:
    ###########################################################################################
    The package you are trying to install is only a placeholder project on PyPI.org repository.
    This package is hosted on NVIDIA Python Package Index.
    
    This package can be installed as:
    ```
    $ pip install nvidia-pyindex
    $ pip install nvidia-tao
    ```
    ###########################################################################################
    ----------------------------------------
    
  • Make sure to set your python to python3 when running the launcher. The TAO Toolkit Launcher is strictly a python3 package.

  • If you install the TAO Toolkit Launcher into your host machine’s native python3 instead of a virtual environment (the recommended route), you may get an error saying that the tao binary wasn’t found. This happens because the directory where pip installed the tao binary isn’t in the PATH environment variable on your machine. In this case, run the following command:

    export PATH=$PATH:/home/$USER/.local/bin
    
  • Make sure that all paths required by the TAO docker are exposed to it via ~/.tao_mounts.json. By default, the launcher does not map any paths into the docker.

  • When running the TAO Toolkit Launcher for CV applications, we recommend configuring the launcher to run as your host user account so that you have permission to edit the results directories and collaterals generated by the TAO dockers. By default, the dockers are instantiated as root, so you would otherwise need sudo access to edit the results path and similar files. For more information on configuring the user, refer to the Configuring the launcher section.

  • When running any TAO command for the first time, the launcher pulls the container from the docker registry. This process can take a few minutes. The log looks as follows:

    2021-02-24 08:16:04,270 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
    2021-02-24 08:16:04,270 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
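
For the mount and user settings described in the bullets above, a starting ~/.tao_mounts.json can be generated with a short script. This is a minimal sketch under stated assumptions: the Mounts, source, destination, DockerOptions, and user keys are assumed to match the layout described in the Configuring the launcher section, so verify them there for your launcher version. The function name write_tao_mounts and the example paths are hypothetical.

```shell
# Hypothetical helper: write a launcher mounts file that maps one host
# directory into the container and runs the container as the current user.
# Key names (Mounts, source, destination, DockerOptions, user) are assumed
# from the "Configuring the launcher" section; check them for your version.
# Paths containing quotes or backslashes would need JSON escaping.
write_tao_mounts() {
    local host_dir="$1"        # directory on the host
    local container_dir="$2"   # path the container should see
    local target="${3:-$HOME/.tao_mounts.json}"

    # Never clobber an existing configuration.
    if [ -e "$target" ]; then
        echo "Refusing to overwrite existing $target" >&2
        return 1
    fi

    cat > "$target" <<EOF
{
    "Mounts": [
        {
            "source": "${host_dir}",
            "destination": "${container_dir}"
        }
    ],
    "DockerOptions": {
        "user": "$(id -u):$(id -g)"
    }
}
EOF
}

# Usage (example paths are placeholders):
#   write_tao_mounts "$HOME/tao-experiments" "/workspace/tao-experiments"
```

Filling the user entry from id -u and id -g keeps files written by the container owned by the account that launched it, which is the reason given above for not running as root.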
    

MaskRCNN

In the MaskRCNN notebook, two dataset-conversion scripts are provided: download_and_preprocess_coco.sh and create_coco_tf_record.py. As currently structured, the conversion scripts expect the standard COCO JSON labels (instances and captions) and data splits (train/val/test). If you don’t have the “captions” JSON label for your data, you may modify the scripts by commenting out the unused caption-related variables.

When running tao-converter, if you encounter the following error, ensure you are using the correct output node names (generate_detections, mask_fcn_logits/BiasAdd):

[ERROR] UffParser: Output error: Output mask_head/mask_fcn_logits/BiasAdd not found
[ERROR] Failed to parse the model, please check the encoding key to make sure it's correct

The old node names, used in the Developer Preview, are generate_detections and mask_head/mask_fcn_logits/BiasAdd.
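
If an older script still carries the Developer Preview node names, the mask_head/ prefix can be stripped before the list is handed to tao-converter. A minimal sketch: the function name strip_mask_head_prefix is hypothetical, and supplying the result as the converter's comma-separated output node list is an assumption about how your command is written.

```shell
# Hypothetical helper: convert Developer Preview output node names to the
# current ones by removing the "mask_head/" prefix from each entry of a
# comma-separated list.
strip_mask_head_prefix() {
    printf '%s\n' "$1" | sed -e 's#^mask_head/##' -e 's#,mask_head/#,#g'
}

OLD_NODES="generate_detections,mask_head/mask_fcn_logits/BiasAdd"
NEW_NODES="$(strip_mask_head_prefix "$OLD_NODES")"
echo "$NEW_NODES"   # prints: generate_detections,mask_fcn_logits/BiasAdd
```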

DetectNet_v2

  • When running tao detectnet_v2 train ..., if you encounter errors similar to those shown below, where the error reports missing variables in the checkpoint, delete the latest .ckzip file and restart the training with the same command.

    Traceback (most recent call last):
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata)
    tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
    (0) Not found: Key cost_sums/cyclist-bbox not found in checkpoint [[{{node save/RestoreV2}}]]
    (1) Not found: Key cost_sums/cyclist-bbox not found in checkpoint [[{{node save/RestoreV2}}]] [[save/RestoreV2/_877]]
    

    This error can be raised for the following reasons:

    • The checkpoint wasn’t saved properly.

    • The backend framework version used to generate the checkpoint did not match the version used to load it.

    • The experiment configuration stored in the checkpoint differs from the training graph that was initialized. For example, the checkpoint was generated in TAO Toolkit 2.0 but was resumed in 3.0.

    If you are starting an experiment with a different configuration (e.g., a different number of classes or training dataset size), ensure that the directory pointed to by the -r parameter of the tao detectnet_v2 train command is empty, with no previously generated checkpoints from older experiments. Resume must be used only to complete a previously launched job with the same command and the same version of TAO Toolkit.

  • When resuming a DetectNet_v2 training from a checkpoint, maintain the same number of GPUs and the same command line to restart the training.
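
The two checks described above (finding the most recent .ckzip file, and confirming that a results directory is empty before a fresh experiment) can be sketched in shell. This is an illustrative sketch: the function names and the placeholder path are hypothetical, and it assumes the .ckzip checkpoints live somewhere under the directory passed with -r. Nothing is deleted automatically; the newest checkpoint is only printed so that it can be inspected first.

```shell
# Hypothetical helpers for inspecting a DetectNet_v2 results directory.

# Print the most recently modified .ckzip file under a directory, if any.
# Assumes checkpoint paths do not contain newlines.
latest_ckzip() {
    find "$1" -type f -name '*.ckzip' -exec ls -t {} + 2>/dev/null | head -n 1
}

# Succeed only if the directory exists and contains no entries at all.
results_dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

RESULTS_DIR="${RESULTS_DIR:-/path/to/results}"   # placeholder path
if results_dir_is_empty "$RESULTS_DIR"; then
    echo "OK to start a new experiment in $RESULTS_DIR"
else
    echo "Not empty (or missing): $RESULTS_DIR"
    echo "Newest checkpoint: $(latest_ckzip "$RESULTS_DIR")"
fi
```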

Natural Language Processing

When running tao <nlp_task> train with the Megatron-BERT or BioMegatron-BERT model, resuming training from a checkpoint can sometimes produce the following error:

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
  return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 368, in <lambda>
  lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 110, in run
  _ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
  raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
  ret.return_value = task_function(task_cfg)
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/nlp/intent_slot_classification/scripts/train.py", line 132, in main
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
  self._run(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in _run
  self.checkpoint_connector.restore_model()
File "/home/jenkins/agent/workspace/tlt-pytorch-main-nightly/tlt_utils/connectors/checkpoint_connector.py", line 67, in restore_model
File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/nlp_model.py", line 248, in on_load_checkpoint
  set_checkpoint_version(checkpoint['checkpoint_version'])
File "/opt/conda/lib/python3.8/site-packages/megatron/checkpointing.py", line 34, in set_checkpoint_version
  assert _CHECKPOINT_VERSION is None, \
AssertionError: checkpoint version already set

The <nlp_task> can be one of the following tasks:

  • intent_slot_classification

  • token_classification

  • punctuation_capitalization

  • text_classification

  • question_answering

This error is triggered when you try to resume training from a job that has already completed its trainer.max_epochs (i.e., the trainer.max_epochs for the new job is greater than the trainer.max_epochs for the previously completed job).

In this situation, point the new training to a new results directory using the -r option.
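
One way to avoid reusing a completed job's directory is to derive a fresh results path for each run. A minimal sketch: the function name new_results_dir, the timestamp suffix, and the placeholder base path are assumptions for illustration. Only the -r option comes from this guide, and the remaining train arguments are left elided.

```shell
# Hypothetical helper: print a results path made unique by a timestamp
# suffix. It only builds the name and does not touch the filesystem, since
# the path is interpreted inside the container (via the launcher mounts).
new_results_dir() {
    printf '%s_%s\n' "$1" "$(date +%Y%m%d_%H%M%S)"
}

# Usage (other train arguments elided; base path is a placeholder):
#   RESULTS_DIR="$(new_results_dir /path/to/results/run)"
#   tao <nlp_task> train ... -r "$RESULTS_DIR"
```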