Troubleshooting Guide

To share feedback or ask questions about this release, visit the NVIDIA TAO Developer Forum.

Finetuning Microservices

  • If minikube still has issues after you have met the prerequisites, run the following command to restart the minikube cluster:

    minikube start --driver=docker --container-runtime=docker --gpus=all --ports=32080:32080
    
  • For any dataset, experiment, or workspace-related errors, check the logs of the tao-api app and workflow pods by running the following commands:

    kubectl logs -f <pod name starting with tao-api-app-pod>
    kubectl logs -f <pod name starting with tao-api-workflow-pod>
    
  • For any job-related errors, look at the logs of the job by running the following command, and override any configuration if necessary:

    kubectl logs -f tao-api-sts-<job_id>-0
    

    The job logs are also automatically uploaded to your cloud workspace under /results/<job_id>/microservices_log.txt.

    Additionally, you can view the logs through the Jobs API endpoint at /api/v1/orgs/<org_name>/<experiments/datasets>/<experiment/dataset_id>/jobs/<job_id>/logs for debugging purposes.
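
The log commands above take pod names that must first be looked up on the cluster. As a minimal sketch, the shell helpers below filter the output of kubectl get pods -o name by prefix and print the pod name and workspace log path that this guide associates with a job ID. The function names pods_with_prefix and job_log_locations are hypothetical and exist only for illustration.

```shell
# Reads `kubectl get pods -o name` output (lines like "pod/<name>") on stdin
# and prints the bare pod names that start with the given prefix.
# pods_with_prefix is a hypothetical helper name, not part of TAO or kubectl.
pods_with_prefix() {
    sed 's|^pod/||' | grep -- "^$1" || true
}

# Prints the job pod name and the workspace log path described in this guide
# for a given job ID. job_log_locations is a hypothetical helper name.
job_log_locations() {
    job_id="$1"
    echo "pod: tao-api-sts-${job_id}-0"
    echo "workspace log: /results/${job_id}/microservices_log.txt"
}

# On a live cluster (requires kubectl access), the helpers could be used as:
#   kubectl get pods -o name | pods_with_prefix tao-api-app-pod
#   kubectl logs -f "$(kubectl get pods -o name | pods_with_prefix tao-api-workflow-pod | head -n 1)"
#   job_log_locations <job_id>
```

The helpers only format and filter text, so they can be tried without a cluster by piping sample pod names into pods_with_prefix.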

NGC

  • Before pulling assets from NGC, run the following commands and follow the prompts:

    ngc config set
    docker login nvcr.io
    
  • When you run ngc config set, the NGC CLI may not prompt you to configure the team and org. In that case, downloading models may fail with the following error:

    Missing org - If apikey is set, org is also required.
    

    To resolve this, back up your existing NGC API key from the NGC config at ~/.ngc/config, then clear the NGC config by running the following command:

    ngc config clear
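
Because clearing the configuration discards the stored API key, it can help to script the backup step. The following is a minimal sketch, assuming the config lives at ~/.ngc/config as stated above. The function name backup_ngc_config and the timestamped backup suffix are hypothetical choices, not part of the NGC CLI.

```shell
# Copies the NGC config to a timestamped backup next to the original and
# prints the backup path. backup_ngc_config is a hypothetical helper name.
backup_ngc_config() {
    config="${1:-$HOME/.ngc/config}"
    if [ -f "$config" ]; then
        backup="${config}.bak.$(date +%Y%m%d%H%M%S)"
        # -p preserves the file mode, which matters for a file holding a key.
        cp -p "$config" "$backup" && echo "$backup"
    else
        echo "No config found at $config" >&2
        return 1
    fi
}

# Typical sequence (the ngc commands are the ones named in this guide):
#   backup_ngc_config
#   ngc config clear
#   ngc config set
```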
    

TAO Launcher

The launcher CLI abstracts the user’s interaction with the container and exposes the entrypoints inside the respective Docker containers.

  • Make sure python points to python3 when running the launcher. The TAO Launcher is strictly a Python 3 package.

  • If you install the TAO Launcher into your host machine’s native python3 instead of using a virtual environment (the recommended route), you may get an error saying that the tao binary wasn’t found. This happens because the directory where pip installed the tao binary was not added to the PATH environment variable on your machine. In this case, run the following command:

    export PATH=$PATH:/home/$USER/.local/bin
    
  • Make sure all the paths required by the TAO container are exposed to it via ~/.tao_mounts.json. By default, the launcher does not map any paths into the container.

  • When running the TAO Launcher for CV applications, we recommend configuring it to run as your host user account so that you have permission to edit the results directories and other artifacts generated by the TAO containers. By default, the containers are instantiated as root, so you would otherwise need sudo access to edit the results path and similar locations. For more information on configuring the user, refer to the Configuring the launcher section.

  • When you run any TAO command for the first time, the launcher pulls the container from the Docker registry. This process can take a few minutes. The log looks like the following:

    2021-02-24 08:16:04,270 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
    2021-02-24 08:16:04,270 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
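
The mount and user recommendations above can be combined in one mounts file. The sketch below writes a sample file to a scratch location for review. It assumes the file layout uses a Mounts list of source/destination pairs and a DockerOptions.user entry; verify the key names against the Configuring the launcher section before copying the result to ~/.tao_mounts.json. The directory names are hypothetical placeholders.

```shell
# Write a sample mounts file to a scratch path so that an existing
# ~/.tao_mounts.json is not overwritten. Review it, then copy it into place.
OUT="${OUT:-${TMPDIR:-/tmp}/tao_mounts.sample.json}"

# Assumed layout: "Mounts" maps host paths to container paths, and
# "DockerOptions.user" runs the container as the given UID:GID.
# The source and destination paths below are hypothetical examples.
cat > "$OUT" <<EOF
{
    "Mounts": [
        {
            "source": "$HOME/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "$(id -u):$(id -g)"
    }
}
EOF

echo "Sample mounts file written to $OUT"
```

Using the output of id -u and id -g ties the container user to the current host account, so files written to the mounted results directory stay editable without sudo.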
    

Other Issues

  • If you encounter an error similar to tao-client: command not found after installing any Python package in TAO Toolkit, this is likely because the path to the installed binary is not in your system’s PATH environment variable. To resolve this, you can add the user-local bin directory to your PATH by running:

    export PATH=$PATH:$HOME/.local/bin
    

    For a permanent solution, add the above line to your shell’s configuration file (e.g. ~/.bashrc for bash or ~/.zshrc for zsh):

    echo 'export PATH=$PATH:$HOME/.local/bin' >> ~/.bashrc
    source ~/.bashrc
    

    We strongly recommend using a Python virtual environment manager such as Miniconda (https://docs.conda.io/en/latest/miniconda.html) to keep dependencies isolated and avoid conflicts with system packages.
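
Appending the export line with echo adds a duplicate entry each time the command is repeated. As a minimal sketch, the helper below appends the line only when the shell configuration file does not already contain it. The function name add_local_bin_to_rc is hypothetical, and the default target of ~/.bashrc assumes bash.

```shell
# Appends the PATH export to a shell configuration file only if that exact
# line is not already present. add_local_bin_to_rc is a hypothetical name.
add_local_bin_to_rc() {
    rc_file="${1:-$HOME/.bashrc}"
    # Single quotes keep $PATH and $HOME unexpanded so the shell evaluates
    # them each time it starts, matching the line shown in this guide.
    line='export PATH=$PATH:$HOME/.local/bin'
    touch "$rc_file"
    if ! grep -qxF -- "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Usage:
#   add_local_bin_to_rc            # updates ~/.bashrc
#   add_local_bin_to_rc ~/.zshrc   # updates the zsh configuration instead
#   source ~/.bashrc
```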