Troubleshooting#

Known issues#

  • Non-verbal human sounds (e.g., “hmmmm…”) and non-human audio do not translate well into facial expressions, resulting in random lip motion. This is an area identified for future improvement.

  • When deploying with the provided Helm chart, the init container responsible for TensorRT engine generation may log errors but still exit with a zero status code, so Kubernetes cannot detect the failure at the init container stage. The pod appears to initialize successfully, but the runtime container subsequently fails with CrashLoopBackOff due to missing engine files. To recover, delete the affected pod to trigger a complete restart of the initialization process.

  • Kubernetes/microk8s GPU detection: Pods may remain in Pending status due to the GPU Operator’s nvidia-container-toolkit-daemonset crashing with a symlink creation error. This affects various GPU types (observed on L40, B200, and other GPUs) and is a known GPU Operator issue #430. Workaround: Patch the ClusterPolicy to set DISABLE_DEV_CHAR_SYMLINK_CREATION=true. See Kubernetes Pod Stuck in Pending Status for the fix command. Docker deployments are not affected.

Lip Sync Issues with Burst Mode (Tokkio)#

If you experience lip sync problems, animation jitter, or AnimGraph buffer overflow errors when using Audio2Face-3D with Tokkio or Unreal Engine, the issue may be caused by burst mode streaming.

Symptoms:

  • Lip movements not synchronized with audio

  • Animation appears jerky or skips frames

  • AnimGraph buffer overflow warnings in logs

Cause:

When burst_mode is enabled in advanced_config.yaml, blendshapes are sent as fast as possible (~20-30ms total for an entire audio clip). This can overwhelm the downstream animation system’s buffer.

Solution:

Disable burst mode and use rate-limited streaming in advanced_config.yaml:

pipeline_parameters:
  burst_mode: false  # Use rate-limited streaming
  blendshape_streaming_fps: 90  # Recommended for Tokkio compatibility

The default blendshape_streaming_fps: 90 (11ms delay per frame) matches Audio2Face-3D 1.3 behavior and provides optimal compatibility with Tokkio pipelines.
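The effect of the two modes can be sketched in a few lines of Python; `frame_delay_ms` and `pace_frames` below are illustrative helpers, not part of the NIM API:

```python
# Sketch: rate-limited streaming paces one blendshape frame every 1/fps
# seconds, while burst mode sends all frames with no delay between them.
import time

def frame_delay_ms(fps: int) -> float:
    """Delay between consecutive blendshape frames, in milliseconds."""
    return 1000.0 / fps

def pace_frames(frames, fps: int, sleep=time.sleep):
    """Yield frames at the configured rate; fps=0 simulates burst mode."""
    delay = 0.0 if fps == 0 else 1.0 / fps
    for frame in frames:
        yield frame
        if delay:
            sleep(delay)

# At 90 fps, frames are spaced ~11.1 ms apart, matching the
# "11ms delay per frame" figure quoted above.
print(round(frame_delay_ms(90), 1))  # → 11.1
```

At 90 fps, the downstream animation buffer receives frames at roughly the rate it consumes them, which is why rate-limited streaming avoids the overflow seen with burst mode.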

Q&A#

The Avatar doesn’t close its mouth at the end of the audio clip. How can I fix it?

There are multiple requirements for the mouth to close; we suggest you try them in this order:

  • The audio clip needs to end with some silence; append silence at the end if that’s not the case.

  • The emotions must be reset to neutral; send a neutral emotion (e.g., an empty map or joy=0) along with some silent audio if that’s not the case.

Note

If you set the emotion to neutral at the very end of the audio clip, you will not see any change, because Audio2Face-3D applies smoothing to make emotion changes less abrupt. You therefore need to push some silent audio at the end.

For example, at the end of an audio clip the Audio2Face-3D NIM sets all emotions to 0 and sends 1.5 seconds of silence.

If preferred_emotion_strength from EmotionPostProcessingParameters is close to zero, the neutral emotion will not be taken into account. You might need to increase that value to allow the mouth to close.

  • The face parameters selected might prevent the mouth from closing; try updating them to make sure the mouth closes.

  • The blend shape multipliers and offsets might prevent the mouth from closing; try updating them to make sure the mouth closes.
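Appending trailing silence can be done with a few lines of code. A minimal sketch; the audio format used here (16 kHz, 16-bit mono PCM) is an assumption for illustration and must match whatever your audio header declares:

```python
# Sketch: append trailing silence to a PCM buffer so the avatar's mouth
# can close. The 16 kHz / 16-bit mono format is an illustrative assumption.
def append_silence(pcm: bytes, seconds: float, sample_rate: int = 16000,
                   bytes_per_sample: int = 2) -> bytes:
    """Return the audio with `seconds` of digital silence appended."""
    silence = b"\x00" * int(seconds * sample_rate * bytes_per_sample)
    return pcm + silence

# The Audio2Face-3D NIM itself resets all emotions to 0 and sends
# 1.5 seconds of silence at the end of each clip.
clip = append_silence(b"", 1.5)
print(len(clip))  # 1.5 * 16000 * 2 = 48000 bytes
```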

Reducing VRAM Usage for Edge Deployments#

By default, the Audio2Face-3D NIM may consume a significant amount of VRAM (e.g., around 5-6GB). For deployments on edge devices or systems with limited VRAM (e.g., targeting <1GB), you can reduce VRAM usage by configuring the TensorRT (TRT) models to use a single stream. This can lower VRAM usage to approximately 0.9GB.

Follow these steps to modify the configuration files:

  1. Modify `advanced_config.yaml`: Update the trt_model_generation section to set min_shape, optimal_shape, and maximum_shape to 1 for both the A2E (Audio2Emotion) and A2F (Audio2Face) models.

    trt_model_generation:
      a2e:
        min_shape: 1
        optimal_shape: 1
        maximum_shape: 1
      a2f:
        precision: "fp16"
        min_shape: 1
        optimal_shape: 1
        maximum_shape: 1
    
  2. Modify `deployment_configs.yaml`: Update the common section to set stream_number to 1.

    common:
      stream_number: 1
    

These changes will ensure that the TRT engines are built and deployed for a single stream, significantly reducing VRAM footprint.

Note

If you have previously generated TensorRT engines, you must delete them from the /tmp/a2x directory within the container. The generation script caches engines and will skip regeneration if it finds existing ones. Deleting these cached engines ensures that new engines are built with the updated single-stream configuration.
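Clearing the cache can be scripted; a small sketch (run inside the container, where the engine cache lives at /tmp/a2x):

```python
# Sketch: remove cached TensorRT engines so they are rebuilt with the
# new single-stream configuration on the next startup.
from pathlib import Path

def clear_engine_cache(cache_dir: str = "/tmp/a2x") -> list:
    """Delete all cached .trt engines and return the names removed."""
    removed = []
    for engine in Path(cache_dir).glob("*.trt"):
        engine.unlink()
        removed.append(engine.name)
    return removed
```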

gRPC Response Status Code#

Users will receive a gRPC error message with the code nvidia_ace::status::v1::Status_Code_ERROR in the following cases, among others:

  • The gRPC request is missing a data field or audio header.

  • The audio header is invalid (e.g., unsupported audio channels, sample rate, or bit depth).

  • The audio buffer is empty in the data field.

  • An id (request_id, stream_id, target_object_id) exceeds the maxLenUUID limit or contains unsupported characters.

  • The number of concurrent streams has reached the streamNumber limit.

  • The FPS is too low.

Otherwise, the gRPC request will return with the code nvidia_ace::status::v1::Status_Code_SUCCESS.
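Several of these conditions can be caught client-side before the request is sent. A minimal sketch; the UUID length limit (36, a canonical UUID) and the allowed character set are assumptions, since the exact maxLenUUID value and character rules are defined by the service:

```python
# Sketch: client-side pre-checks mirroring the server-side error
# conditions above. The limits used here are illustrative assumptions;
# consult the service configuration for the authoritative values.
import re

ASSUMED_MAX_UUID_LEN = 36  # canonical UUID length; actual maxLenUUID may differ
ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")  # assumed character set

def check_request(audio: bytes, stream_id: str) -> list:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not audio:
        problems.append("audio buffer is empty")
    if len(stream_id) > ASSUMED_MAX_UUID_LEN:
        problems.append("stream_id exceeds assumed UUID length limit")
    if not ID_PATTERN.match(stream_id):
        problems.append("stream_id contains unsupported characters")
    return problems
```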

Audio2Face-3D NIM Public API#

Here are some common errors you may encounter when accessing the Audio2Face-3D NIM Public API:

  • grpc_message:’failed to establish link to worker’, grpc_status:13

    Potential reason: No server/instance is available at the moment of the request.

    Possible solution: The current deployment has reached its capacity (instance_count * max_occurrence settings); retry once capacity frees up or increase these settings.

  • grpc_message:’failed to open stateful work request: rpc error: code = Unauthenticated desc = invalid response from UAM’, grpc_status:13

    Potential reason: The API key is invalid.

    Possible solution: Check the API key, and generate a new one if this continues to happen.

  • grpc_message:’failed to open stateful work request: rpc error: code = Internal desc = There was a server error trying to handle an exception’, grpc_status:13

    Potential reason: The API key does not have enough credit, or does not have access to this function.

    Possible solution: Check that the function is public, or that your API key belongs to an account with access to the function.
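Client code can turn these grpc_message patterns into actionable hints. A sketch keyed on the substrings from the errors above; wire it into your own grpc.RpcError handler:

```python
# Sketch: map known grpc_message substrings to remediation hints.
# The substrings come from the error list above.
REMEDIATIONS = {
    "failed to establish link to worker":
        "Deployment at capacity (instance_count * max_occurrence); retry later.",
    "invalid response from UAM":
        "API key is invalid; check it and generate a new one if this persists.",
    "server error trying to handle an exception":
        "API key may lack credit or access to this function.",
}

def diagnose(grpc_message: str) -> str:
    """Return a remediation hint for a known error message."""
    for needle, hint in REMEDIATIONS.items():
        if needle in grpc_message:
            return hint
    return "Unrecognized error; inspect the full gRPC status."
```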

GPU Device ID Mismatch Warning (nim_list_model_profiles)#

When running nim_list_model_profiles, you may see all profiles listed as “Incompatible with system” even on supported GPUs:

$ docker run -it --rm --gpus all --entrypoint nim_list_model_profiles \
    nvcr.io/nim/nvidia/audio2face-3d:2.0

WARNING: No compatible profiles found using selection criteria. Falling back to hardware-based filtering.
SYSTEM INFO
- Free GPUs:
  -  [2b85:10de] (0) NVIDIA GeForce RTX 5090 [current utilization: 1%]
MODEL PROFILES
- Compatible with system:
- Incompatible with system:
    - ... batch_size:100|character:james|gpu:RTX5090|gpu_device:2e02:10de|model_type:tensorrt|precision:fp16
    - ... (all profiles listed as incompatible)

Cause:

The nim_list_model_profiles tool performs strict hardware matching against the gpu_device tag in the model manifest. NVIDIA releases multiple device ID variants of the same GPU architecture (OEM variants, regional SKUs, different production batches). For example, the RTX 5090 may have device ID 2b85:10de on your system, while the manifest profiles were built with device ID 2e02:10de.

This warning does not prevent the NIM from running.

The NIM uses a FallbackProfileSelector that detects your GPU via nvidia-smi and maps it to the appropriate profile by GPU name (not device ID). To verify your GPU works correctly, run the NIM normally:

$ docker run -it --rm --network=host --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    nvcr.io/nim/nvidia/audio2face-3d:2.0

How the NIM selects profiles at runtime:

  1. The NIM detects your GPU using nvidia-smi --query-gpu=name

  2. It maps the GPU name to a manifest profile key (see table below)

  3. It preferentially selects the james character profile to match the default stylization config

  4. If no exact match is found, it falls back to TagsBasedProfileSelector for any available profile

  5. After downloading TRT models, the NIM auto-updates the config to match the selected profile

GPU name to profile mappings:

The NIM uses nvidia-smi to detect your GPU and maps it to a manifest profile key:

  Detected GPU                        Manifest Profile Key    Notes
  RTX 30 series (3080, 3090, etc.)    A10G                    Uses A10G profile as fallback
  RTX 40 series (4080, 4090, etc.)    RTX4090
  RTX 5080                            RTX5080
  RTX 5090                            RTX5090
  RTX 6000 Ada                        RTX6000
  RTX PRO 6000 Blackwell              rtx6000-blackwell-sv
  L4                                  L4
  L40S                                L40S
  A30                                 A30
  A10 / A10G                          A10G
  B200                                B200
  A100 / H100                         None                    No pre-generated profile available
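The mapping above can be mirrored in a small lookup function. This is only a reference sketch of the documented table; the NIM’s internal map_gpu_to_manifest_key() is authoritative:

```python
# Sketch: map an nvidia-smi GPU name to a manifest profile key.
# Rule order matters (e.g., "L40S" before "L4", "A100" before "A10").
def map_gpu_name(name: str):
    """Return a manifest profile key, or None if no profile exists."""
    rules = [
        ("RTX PRO 6000", "rtx6000-blackwell-sv"),
        ("RTX 6000", "RTX6000"),
        ("RTX 5090", "RTX5090"),
        ("RTX 5080", "RTX5080"),
        ("RTX 40", "RTX4090"),   # RTX 40 series maps to the RTX4090 profile
        ("RTX 30", "A10G"),      # RTX 30 series falls back to the A10G profile
        ("L40S", "L40S"),
        ("L4", "L4"),
        ("A100", None),          # no pre-generated profile
        ("H100", None),          # no pre-generated profile
        ("A30", "A30"),
        ("A10", "A10G"),
        ("B200", "B200"),
    ]
    for needle, key in rules:
        if needle in name:
            return key
    return None
```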

GPUs with no dedicated profile (A100, H100, etc.):

GPUs like A100 and H100 are not in the map_gpu_to_manifest_key() function and have no pre-generated profiles in the model manifest. Auto profile selection will not find a match for these GPUs.

To run on these GPUs, set NIM_DISABLE_MODEL_DOWNLOAD=true and generate TRT engines locally:

$ docker run -it --rm --network=host --gpus all \
    -e NGC_API_KEY=$NGC_API_KEY \
    -e NIM_DISABLE_MODEL_DOWNLOAD=true \
    nvcr.io/nim/nvidia/audio2face-3d:2.0

For more details on automatic profile selection, see Support Matrix.

TensorRT Engine Build Fails for Diffusion Models (Insufficient Memory)#

When building TensorRT engines for diffusion models (e.g., multi_v3.2) with high batch sizes, you may encounter the following error:

[TRT] Tactic Device request: 15743MB Available: 9885MB. Device memory is insufficient to use tactic.
[TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 16508051968
Error[10]: IBuilder::buildSerializedNetwork: Error Code 10: Internal Error
    (Could not find any implementation for node...)
Engine could not be created from network
Building engine failed

Cause:

Diffusion models require significantly more GPU memory than regression models during both TensorRT engine generation and inference. When the configured maximum_shape (batch size) is too high for the available GPU memory, TensorRT cannot find a valid implementation and fails to build the engine.

For example, an RTX 3080 with 12GB VRAM cannot build a diffusion model engine with batch size 128.

Solution:

Reduce the maximum batch size in advanced_config.yaml to match your GPU’s memory capacity:

trt_model_generation:
  a2f:
    precision: "fp16"
    min_shape: 1
    optimal_shape: 8
    maximum_shape: 8  # Reduced for diffusion models

A batch size of 8 is recommended for diffusion models on most consumer GPUs. High batch sizes (e.g., 128) offer diminishing returns because diffusion models already generate many frames per inference, so smaller batches are generally enough to saturate the GPU.

Refer to the Support Matrix for recommended batch sizes per GPU. Diffusion models typically support lower batch sizes than regression models on the same hardware (e.g., batch size 35 vs 80 on RTX 4090).

Note

  • After modifying batch size settings, delete any cached TensorRT engines from /tmp/a2x inside the container to force regeneration with the new configuration.

  • We recommend TensorRT 10.13+ with CUDA 12.8 or 12.9 for optimal compatibility.

Note

NVIDIA Fabric Manager Requirement

NVIDIA Fabric Manager is required on systems with multiple GPUs that are connected using NVLink or NVSwitch technology. This typically applies to:

  • Multi-GPU systems with NVLink bridges (e.g., DGX systems, HGX platforms)

  • Systems with NVSwitch fabric interconnects

  • Hosts running NVIDIA B200, H100, A100, V100, or other datacenter GPUs with NVLink

Fabric Manager is not required for:

  • Single GPU systems

  • Multi-GPU systems without NVLink/NVSwitch (PCIe-only configurations)

For installation instructions, refer to the official NVIDIA Fabric Manager documentation.

Verify Fabric Manager status after installation:

$ sudo systemctl status nvidia-fabricmanager

CUDA Error 802 “system not yet initialized” on B200 Multi-GPU Systems#

When running Audio2Face-3D on a B200 system with multiple GPUs (e.g., 8x B200 with NVSwitch), TensorRT engine generation may fail with:

Cuda failure at /_src/samples/common/safeCommon.h:287: system not yet initialized

nvidia-smi works normally and shows all GPUs, but any CUDA compute operation (including trtexec and cuInit()) fails with error code 802 (CUDA_ERROR_SYSTEM_NOT_READY).

Cause:

Multi-GPU B200 systems (NVL5+) use NVSwitch to interconnect GPUs. Three components are required to initialize the NVSwitch fabric before CUDA can use the GPUs:

  • NVIDIA FabricManager – manages the NVSwitch interconnect fabric

  • NVLink Subnet Manager (nvlsm) – required on NVL5+ systems; FabricManager will not start without it

  • InfiniBand diagnostics (ibstat) and the kernel module ib_umad – required by the FabricManager start script

Without these services running, the NVIDIA driver loads (so nvidia-smi works via NVML), but the CUDA compute path cannot initialize.

Diagnosis:

$ systemctl status nvidia-fabricmanager
# If "Unit nvidia-fabricmanager.service could not be found", FabricManager is not installed.

$ python3 -c "import ctypes; lib=ctypes.cdll.LoadLibrary('libcuda.so.1'); print('cuInit:', lib.cuInit(0))"
# cuInit: 802 confirms CUDA cannot initialize.
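The numeric cuInit() result can be translated into a name for readability. A minimal sketch mapping only the codes mentioned in this guide; see the CUresult enum in cuda.h for the full list:

```python
# Sketch: name the cuInit() result instead of printing a bare number.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",  # NVSwitch fabric not initialized
}

def describe_cuinit(code: int) -> str:
    """Return a symbolic name for a cuInit() return code."""
    return CU_RESULT_NAMES.get(code, f"CUresult {code} (see cuda.h)")
```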

Solution:

Install FabricManager, nvlsm, and InfiniBand diagnostics matching your driver version, load the required kernel module, and start the service:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-fabricmanager-<driver-branch> nvlsm infiniband-diags
$ sudo modprobe ib_umad
$ echo "ib_umad" | sudo tee /etc/modules-load.d/ib_umad.conf
$ sudo systemctl enable --now nvidia-fabricmanager

Replace <driver-branch> with your NVIDIA driver branch number (e.g., 570, 580).

Note

The FabricManager package version must match your NVIDIA driver version. If apt reports dependency conflicts (e.g., the available FabricManager version is newer than your installed driver), upgrade the driver to match. It is recommended to install driver and FabricManager from the same apt repository to ensure version consistency:

$ sudo apt-get install -y nvidia-driver-<branch>-server-open nvidia-fabricmanager-<branch> nvlsm infiniband-diags
$ sudo reboot

After reboot, load ib_umad and enable FabricManager as shown above.

After FabricManager is running, verify CUDA works:

$ python3 -c "import ctypes; lib=ctypes.cdll.LoadLibrary('libcuda.so.1'); ret=lib.cuInit(0); print('cuInit:', ret)"
cuInit: 0

$ # Verify all GPUs are visible:
$ python3 -c "import ctypes; lib=ctypes.cdll.LoadLibrary('libcuda.so.1'); lib.cuInit(0); c=ctypes.c_int(); lib.cuDeviceGetCount(ctypes.byref(c)); print('GPUs:', c.value)"
GPUs: 8

Note

This issue applies to NVL5+ multi-GPU NVSwitch systems only (e.g., DGX, HGX, or cloud instances with 8x B200). Single-GPU B200 systems do not require FabricManager.

Sample Scripts Fail with “Connection refused” Traceback#

When running a2f_3d.py, nim_performance_test.py, or other sample scripts without the NIM container running, a long Python traceback appears:

grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN:
        ipv4:127.0.0.1:52000: Failed to connect to remote host: connect: Connection refused (111)"

This is not a bug in the script – it means the gRPC server on port 52000 is not reachable.

Solution:

  1. Start the NIM container and wait for the Application is ready to receive API requests log message.

  2. Verify the service is healthy before running scripts:

    $ curl -s http://localhost:8000/v1/health/ready
    {"ready": true}
    
  3. Re-run the sample script.
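The readiness check in step 2 can be automated. A small sketch that polls the health endpoint (URL and response shape taken from the curl example above) before launching sample scripts:

```python
# Sketch: poll the NIM readiness endpoint until it reports ready,
# instead of letting sample scripts fail with StatusCode.UNAVAILABLE.
import json
import time
import urllib.request

def wait_until_ready(url: str = "http://localhost:8000/v1/health/ready",
                     timeout_s: float = 300.0) -> bool:
    """Return True once the endpoint reports {"ready": true}."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if json.load(resp).get("ready"):
                    return True
        except OSError:
            pass  # server not up yet; keep polling
        time.sleep(2)
    return False
```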

Invalid Model Name Silently Falls Back to Default (james_v2.3.1)#

When an invalid, misspelled, or unrecognized model name is provided, the container logs an error but does not stop. It silently falls back to the default model james_v2.3.1 and starts the inference server normally. For example, providing james_2.3 (missing the v) instead of james_v2.3.1:

PERF_A2F_MODEL value 'james_2.3' is not valid. Must be one of: claire_v2.3.1, mark_v2.3,
james_v2.3.1, multi_v3.1, multi_v3.2_mark, multi_v3.2_james, multi_v3.2_claire

This behavior applies regardless of how the model is specified (PERF_A2F_MODEL, NIM_MANIFEST_PROFILE, or stylization config). The container proceeds to download or generate TRT engines for james_v2.3.1 and starts the inference server with that model.

How to verify which model is running:

Check the startup logs for the IProc line that confirms the active model:

[IProc] [info] Using A2F TRT engine for model 'james_v2.3.1': /tmp/a2x/james_v2.3.1.trt

Solution:

Ensure the model name exactly matches one of the supported names: james_v2.3.1, claire_v2.3.1, mark_v2.3, multi_v3.2_james, multi_v3.2_claire, or multi_v3.2_mark.
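A launcher script can validate the model name up front so a typo fails loudly instead of silently falling back. A sketch using the supported names from the log message above:

```python
# Sketch: reject invalid model names before launching the container.
SUPPORTED_MODELS = {
    "claire_v2.3.1", "mark_v2.3", "james_v2.3.1", "multi_v3.1",
    "multi_v3.2_mark", "multi_v3.2_james", "multi_v3.2_claire",
}

def require_valid_model(name: str) -> str:
    """Return the name unchanged, or raise ValueError for an unknown model."""
    if name not in SUPPORTED_MODELS:
        raise ValueError(
            f"PERF_A2F_MODEL value '{name}' is not valid. "
            f"Must be one of: {', '.join(sorted(SUPPORTED_MODELS))}")
    return name
```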

Cached TRT Engines Don’t Match Requested Model#

When running with NIM_DISABLE_MODEL_DOWNLOAD=true and a cached volume mount, you may see:

✗ User wants multi_v3.2 but multi_v3.2.trt not found in cache!
Available TRTs: ['a2e.trt', 'mark_v2.3.trt']
Error using FallbackProfileSelector: Required TRT engine not found: multi_v3.2.trt
NIMProfileIDNotFound: Could not match a profile in manifest

Cause:

The cache directory contains TRT engines generated for a different model. When the container detects existing .trt files, it attempts to match them against the requested PERF_A2F_MODEL rather than generating the missing engine.

Solution:

Delete the cached engines before switching models:

$ rm $LOCAL_NIM_CACHE/*.trt

Then re-run the container. The missing engines will be regenerated automatically.

Alternatively, omit the -v volume mount to avoid caching entirely – engines will be generated fresh on each launch into the container’s ephemeral /tmp/a2x.
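A pre-flight check can catch this mismatch before launch. A sketch that assumes (as the cache listing above suggests) the shared a2e.trt engine plus one <model>.trt per model is required:

```python
# Sketch: verify the mounted cache contains the engines for the
# requested model before starting with NIM_DISABLE_MODEL_DOWNLOAD=true.
from pathlib import Path

def missing_engines(cache_dir: str, model: str) -> list:
    """Return the .trt files the requested model needs but the cache lacks."""
    have = {p.name for p in Path(cache_dir).glob("*.trt")}
    need = {"a2e.trt", f"{model}.trt"}  # a2e.trt assumed shared across models
    return sorted(need - have)
```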

NIM Fails When Using a Profile for a Different GPU#

When manually setting NIM_MANIFEST_PROFILE to a profile ID that was built for a different GPU (e.g., using an L40S profile on an RTX 5090), the NIM will download TensorRT engines compiled for the wrong GPU architecture and fail at runtime. Errors may include TensorRT deserialization failures or CUDA errors, since TRT engines are not portable across GPU architectures.

Cause:

Each profile in the profile table contains TensorRT engines pre-compiled for a specific GPU. TensorRT engines are architecture-specific – an engine built for Ada Lovelace (L40S) cannot run on Blackwell (RTX 5090), and vice versa.

Solution:

Use nim_list_model_profiles to find the correct profile for your GPU:

$ docker run -it --rm --gpus all --entrypoint nim_list_model_profiles \
    nvcr.io/nim/nvidia/audio2face-3d:2.0

This lists all available profiles with tags showing the target gpu and character for each. Look for profile lines where the gpu tag matches your hardware. For example, on an RTX 5090 you would look for lines containing gpu:RTX5090:

5c09e7a7... batch_size:100|character:james|gpu:RTX5090|model_type:tensorrt|precision:fp16
4e1be9a8... batch_size:100|character:claire|gpu:RTX5090|model_type:tensorrt|precision:fp16

Copy the hash from the row that matches your GPU and desired character model, then set it:

$ export NIM_MANIFEST_PROFILE=<hash from the correct gpu row>

Note

If nim_list_model_profiles shows all profiles as “Incompatible with system”, this is a device ID mismatch warning and does not prevent the NIM from running. See GPU Device ID Mismatch Warning (nim_list_model_profiles) for details.

You can also verify your GPU name manually:

$ nvidia-smi --query-gpu=name --format=csv,noheader

Then match the result to the GPU Type column in the profile table.

Alternatively, omit NIM_MANIFEST_PROFILE entirely and let the NIM auto-select the correct profile for your GPU. Auto-selection is recommended unless you need a specific character model other than james.
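Selecting the hash can also be scripted. A sketch that parses the profile lines shown above (hash followed by pipe-separated tags) and filters by the gpu tag:

```python
# Sketch: filter nim_list_model_profiles output lines by the gpu tag.
# The "hash tag:value|tag:value|..." line shape follows the example above.
def profiles_for_gpu(lines: list, gpu: str) -> list:
    """Return (hash, tags) pairs whose gpu tag matches the given GPU."""
    matches = []
    for line in lines:
        profile_hash, _, tag_str = line.partition(" ")
        tags = dict(t.split(":", 1) for t in tag_str.strip().split("|") if ":" in t)
        if tags.get("gpu") == gpu:
            matches.append((profile_hash.rstrip("."), tags))
    return matches
```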

Permission Denied on Cache Directory (/tmp/a2x)#

When using a volume mount for the model cache (-v "$LOCAL_NIM_CACHE:/tmp/a2x"), the container may fail with a permission error if the host directory is not writable by UID 1000 (the container user). This can manifest in two ways depending on the operation:

When copying downloaded TRT engines to the cache:

PermissionError: [Errno 13] Permission denied: '/tmp/a2x/a2e.trt'

This occurs during shutil.copy when the NIM downloads TRT engines via a manifest profile and tries to copy them into /tmp/a2x.

When generating TRT engines locally:

Engine could not be created from network
Building engine failed
Engine generation failed.

This occurs when trtexec cannot write the generated .trt file to /tmp/a2x. The error message from trtexec does not indicate that permissions are the root cause.

Why this happens:

The container runs as UID 1000. If your host user has a different UID, a directory created with mkdir will be owned by your UID and the default permissions (typically 775) only grant read/execute to “others” – which includes UID 1000 inside the container. For example:

drwxrwxr-x  2 local-user local-user  4096 Feb 28 00:03 a2f-cache
# UID 1000 (container) is "other" → only r-x, no write

Diagnosis:

Check if the container user can write to the mounted directory:

$ docker run --rm --entrypoint bash \
    -v "$LOCAL_NIM_CACHE:/tmp/a2x" \
    nvcr.io/nim/nvidia/audio2face-3d:2.0 \
    -c "id && touch /tmp/a2x/write_test && echo 'WRITE OK' && rm /tmp/a2x/write_test || echo 'WRITE FAILED - check permissions'"

Solution:

Grant write access to UID 1000. Choose one of:

# Option 1: Transfer ownership to the container UID (recommended for production)
$ sudo chown 1000:1000 $LOCAL_NIM_CACHE

# Option 2: Open permissions (quick prototyping only)
$ chmod 777 $LOCAL_NIM_CACHE

Note

If your host user is already UID 1000 (check with id -u), chmod 755 is sufficient since you are the owner. The permission issue only arises when the host UID differs from 1000.

Alternatively, omit the -v volume mount entirely to let the container write to its ephemeral /tmp/a2x (engines will not persist across restarts).
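The diagnosis can also be done host-side, without starting a container. A simplified sketch checking the permission bits against UID 1000 (it ignores supplementary groups and ACLs, so the in-container write test above remains the definitive check):

```python
# Sketch: can the container user (UID 1000) write to a host directory?
# Simplification: only owner/group/other bits are checked.
import os

CONTAINER_UID = 1000

def cache_dir_writable_by_container(path: str) -> bool:
    """Best-effort check of the write bit that applies to UID 1000."""
    st = os.stat(path)
    if st.st_uid == CONTAINER_UID:
        return bool(st.st_mode & 0o200)  # owner write bit
    if st.st_gid == CONTAINER_UID:
        return bool(st.st_mode & 0o020)  # group write bit
    return bool(st.st_mode & 0o002)      # "other" write bit
```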


CUDA Error: No CUDA-Capable Device Detected#

If you encounter the following error when starting the A2F server:

[A2X SDK] [ERROR] CUDA error: no CUDA-capable device is detected
[A2X SDK] [ERROR] Error allocating CUDA memory
[A2F SDK] [ERROR] Unable to initialize emotion database matrix

This typically occurs when the A2F server process was terminated inside a long-running container (e.g., using pkill after 24+ hours of operation), leaving the CUDA context in a corrupted state. Even though nvidia-smi may show the GPU as available, the CUDA runtime cannot initialize a new context.

Recovery Steps:

Restart the Docker container to reset the CUDA state:

$ docker restart <container_name_or_id>

After the container restarts, the A2F server should start normally.

Stopping and Restarting the A2F Server#

When you need to stop or restart the A2F server, use one of the following recommended methods to ensure proper CUDA resource cleanup.

Recommended: Restart the Container

The safest way to stop and restart the A2F server is to restart the entire container:

# From the host machine
$ docker restart <container_name_or_id>

This ensures all CUDA resources are properly released before the new server instance starts.

Alternative: Stop and Start the Container

# Stop the container
$ docker stop <container_name_or_id>

# Start a new container
$ docker run ... # (your original run command)

Not Recommended: Killing the Process In-Container

Avoid terminating the A2F server process inside a long-running container:

# AVOID these approaches in long-running containers
$ docker exec <container_id> kill -9 <a2f_pid>
$ docker exec <container_id> pkill -f a2f_pipeline

In long-running containers (e.g., after 24+ hours of operation), terminating the A2F process can leave the CUDA context in a corrupted state, even with graceful termination signals. This is because the CUDA runtime state may have accumulated resources that cannot be properly cleaned up without a full container restart.

Note

If you have terminated the A2F process inside a long-running container and encounter CUDA errors, restart the container using docker restart to recover.

If container restart is not an option and you need to restart the process in-container, you can add the following NVIDIA device node mounts when starting the container to resolve the CUDA context issue:

$ docker run -d --gpus all --network=host \
    -v /dev/nvidia-caps:/dev/nvidia-caps:ro \
    --device /dev/nvidia-modeset:/dev/nvidia-modeset \
    -v "$LOCAL_NIM_CACHE:/tmp/a2x" \
    nvcr.io/nim/nvidia/audio2face-3d:2.0

Note

These device node paths may not exist on all systems. Only add them if they are present on your host.

Kubernetes Pod Stuck in Pending Status#

When deploying Audio2Face-3D on Kubernetes, the pod may get stuck in Pending status indefinitely:

$ sudo microk8s kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
a2f-a2f-deployment-7d5c8f4b8d-gls4c   0/1     Pending   0          15h

Diagnosing the Issue:

First, check why the pod is pending by examining the pod events:

$ sudo microk8s kubectl describe pod <pod-name>

Look for messages in the Events section such as:

  • 0/1 nodes are available: 1 Insufficient nvidia.com/gpu - GPU resources not detected by Kubernetes

  • FailedScheduling with no nodes available to schedule pods - node issues

  • Unschedulable - general scheduling failure

Common Causes and Solutions:

  1. GPU Operator pods not ready:

    After enabling the GPU addon, wait for all GPU Operator pods to reach Running or Completed status before deploying. This may take a few minutes:

    $ sudo microk8s kubectl get pods -n gpu-operator-resources
    NAME                                                          READY   STATUS      RESTARTS   AGE
    gpu-operator-node-feature-discovery-worker-842s5              1/1     Running     0          92s
    gpu-operator-7df865f694-8jfcp                                 1/1     Running     0          92s
    gpu-operator-node-feature-discovery-master-86d46c7595-x9t86   1/1     Running     0          92s
    nvidia-container-toolkit-daemonset-wcd5k                      1/1     Running     0          54s
    gpu-feature-discovery-htlsp                                   1/1     Running     0          73s
    nvidia-cuda-validator-8sx8l                                   0/1     Completed   0          37s
    nvidia-device-plugin-daemonset-chgzd                          1/1     Running     0          73s
    nvidia-operator-validator-f8cdh                               1/1     Running     0          55s
    nvidia-dcgm-exporter-r472p                                    1/1     Running     0          73s
    

    If any pods are stuck in error state, restart the GPU operator/addon:

    # For microk8s:
    $ sudo microk8s disable gpu
    $ sudo microk8s enable gpu --version v23.6.1
    
  2. GPU resources not advertised by the node:

    Verify that Kubernetes sees GPU resources on the node:

    $ sudo microk8s kubectl describe node | grep -A10 "Allocatable:"
    

    You should see nvidia.com/gpu: 1 (or more) in the output. If not, the GPU operator is not properly detecting your GPU.

  3. GPU Operator container-toolkit-daemonset CrashLoopBackOff:

    The nvidia-container-toolkit-daemonset may crash with the following error on various GPUs (observed on L40, B200, RTX 5090, and others):

    failed to create device node nvidiactl: failed to determine major: invalid device node
    Failed to create symlinks under /dev/char that point to all possible NVIDIA character devices.
    

    This is GPU Operator issue #430 affecting systemd cgroup management. Fix by patching the ClusterPolicy:

    sudo microk8s kubectl patch clusterpolicy cluster-policy --type=merge -p '{
      "spec": {
        "validator": {
          "driver": {
            "env": [
              {
                "name": "DISABLE_DEV_CHAR_SYMLINK_CREATION",
                "value": "true"
              }
            ]
          }
        }
      }
    }'
    

    After patching, the GPU Operator pods will restart and GPUs will become visible to Kubernetes.

  4. Local path provisioner not configured:

    The helm chart may require a storage provisioner. Set it up with:

    $ curl https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.23/deploy/local-path-storage.yaml | sed 's/^  name: local-path$/  name: mdx-local-path/g' | sudo microk8s kubectl apply -f -
    
  5. GPU resource request mismatch:

    This is uncommon with the v2.0 chart (which requests 1 GPU by default), but if your helm values have been customized to request more GPUs than available, the pod will remain pending. Check your values.yaml and ensure the GPU request matches available resources.

kubectl Connection Refused on localhost:8080#

When running sudo kubectl (standalone) instead of sudo microk8s kubectl, you may see:

The connection to the server localhost:8080 was refused - did you specify the right host or port?

Cause:

The standalone kubectl (installed via snap install kubectl) and microk8s kubectl use different kubeconfig files. The MicroK8s setup script writes the kubeconfig to $HOME/.kube/config for the current user, but sudo kubectl runs as root and looks at /root/.kube/config, which does not exist. Without a valid kubeconfig, kubectl falls back to localhost:8080, while MicroK8s runs its API server on localhost:16443.

Solution:

Use microk8s kubectl as shown throughout the deployment guide:

$ sudo microk8s kubectl get pods

If you prefer using the standalone kubectl with sudo, copy the kubeconfig for root:

$ sudo mkdir -p /root/.kube
$ sudo microk8s config | sudo tee /root/.kube/config > /dev/null

To use standalone kubectl without sudo, ensure the user kubeconfig is in place:

$ mkdir -p $HOME/.kube
$ sudo microk8s config > $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

MicroK8s Fails to Start (Containerd Config Version Mismatch)#

When starting MicroK8s, all services may fail because containerd cannot start:

containerd config version `3` is not supported, the max version is `2`,
use `containerd config default` to generate a new config or manually revert to version `2`
containerd: failed to load TOML from /var/snap/microk8s/<revision>/args/containerd.toml:
unsupported config version `3`

The snap.microk8s.daemon-containerd service exits immediately, and snap.microk8s.daemon-kubelite fails to start because it depends on containerd. Running microk8s status reports “microk8s is not running”.

Cause:

The containerd config template (containerd-template.toml) shipped with the MicroK8s snap revision uses version = 3, but the bundled containerd binary only supports up to version 2. This typically happens when MicroK8s is installed without pinning a specific snap revision (e.g., using --channel=1.31/stable instead of --revision=5891).

Diagnosis:

$ sudo journalctl -u snap.microk8s.daemon-containerd --no-pager -n 10
# Look for "config version `3` is not supported"

Solution:

Fix the config template and restart MicroK8s:

$ sudo sed -i 's/^version = 3/version = 2/' /var/snap/microk8s/current/args/containerd-template.toml
$ sudo microk8s stop && sudo microk8s start

Note

If MicroK8s remains stuck after fixing containerd (e.g., k8s-dqlite shows repeated errors like failed to list /registry/serviceaccounts/), the datastore may have been corrupted during the failed startup attempts. In this case, a clean reinstall is the fastest recovery:

$ sudo snap remove microk8s --purge
$ sudo snap install microk8s --revision=5891 --classic

Then follow the setup steps in Kubernetes Deployment.

Snap Tools Fail with “Home Directories Outside of /home”#

When running microk8s, helm, kubectl, or other snap-installed tools, you may see:

Sorry, home directories outside of /home needs configuration.
See https://snapcraft.io/docs/home-outside-home for details.

Cause:

Snap’s confinement restricts access to home directories outside the standard /home path. This occurs when the home directory is symlinked, mounted elsewhere (e.g., /localhome/username, /data/home/username), or uses a non-standard path. The error affects all snap-installed tools, not just microk8s.

Solution:

Configure snap to allow your home directory path:

$ sudo snap set system homedirs=/your/actual/home/path

See https://snapcraft.io/docs/home-outside-home for details.

Workarounds (without reconfiguring snap):

  • For microk8s commands: prefix with sudo (e.g., sudo microk8s kubectl get pods).

  • For helm fetch / helm pull: download charts with curl instead:

    $ curl -o audio2face-3d-2.0.tgz \
        -u "\$oauthtoken:${NGC_API_KEY}" \
        https://helm.ngc.nvidia.com/nim/nvidia/charts/audio2face-3d-2.0.tgz
    
  • For ngc CLI: if installed via snap, use pip install ngc-cli as an alternative.

Kubernetes Pod Fails with “CUDA-capable device(s) is/are busy or unavailable”#

When deploying Audio2Face-3D on Kubernetes, the pod may get stuck in Init:Error or Init:CrashLoopBackOff with the following error in the init container logs:

CUDA error: no CUDA-capable device is detected
CUDA-capable device(s) is/are busy or unavailable

This can occur even when the GPU Operator is running correctly and nvidia-smi works on the host.

Cause:

Starting with nvidia-container-toolkit 1.18.0, the default runtime mode changed from legacy to jit-cdi (just-in-time CDI spec generation). In legacy mode, the nvidia-container-runtime-hook allowed containers to access GPUs via the NVIDIA_VISIBLE_DEVICES environment variable without an explicit resource request. In jit-cdi mode, device requests are mapped to CDI specifications at container creation time, and pods must explicitly request nvidia.com/gpu resources in their spec for GPU access to work.

  nvidia-container-toolkit Version    Default Runtime Mode                        Result
  < 1.18.0 (e.g., 1.17.8)             legacy (nvidia-container-runtime-hook)      Pod works without an explicit GPU request
  >= 1.18.0 (e.g., 1.18.1)            jit-cdi (just-in-time CDI spec generation)  Pod fails unless a GPU is explicitly requested

You can check your host’s nvidia-container-toolkit version with:

$ dpkg -l | grep nvidia-container-toolkit
$ nvidia-container-toolkit --version

Note

The GPU Operator daemonset container versions may be identical across machines — the difference is in the host-level nvidia-container-toolkit package installed via apt or your system’s package manager.

Solution:

The Audio2Face-3D v2.0 Helm chart already includes nvidia.com/gpu: 1 in its resource limits, so no additional override is needed with the current chart version.

If you are using an older chart version that does not include GPU resource limits, override Helm values to explicitly request GPU resources. Create a file values-gpu.yaml:

a2f:
  resources:
    limits:
      nvidia.com/gpu: 1

Then install or upgrade with the override:

$ sudo microk8s helm upgrade --install a2f-3d-nim audio2face-3d/ -f values-gpu.yaml

Alternatively, use --set on the command line:

$ sudo microk8s helm upgrade --install a2f-3d-nim audio2face-3d/ \
    --set 'a2f.resources.limits.nvidia\.com/gpu=1'

After applying the fix, verify the pod spec includes GPU resources:

$ sudo microk8s kubectl get pod -l app=a2f -o yaml | grep -A5 "resources:"
resources:
  limits:
    nvidia.com/gpu: "1"