Advanced Usage#

Persistent Storage#

The NIM service chart supports two optional persistent volume claims (PVCs) for storage that survives pod restarts and helm uninstall. Both are disabled by default. The PVCs are annotated with helm.sh/resource-policy: keep, so Helm retains them when the release is removed.

When you use the end-to-end demo chart, configure these keys under nvidia-active-speaker-detection-h4m-service in the values file. For more information, refer to Common Helm Configuration.
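As a rough sketch of that nesting, the two storage blocks covered in the Model Cache and NIM Log Files sections would sit under the subchart key as follows. The values are the illustrative ones used on this page, not sizing recommendations, and other keys of the demo chart are omitted.

```yaml
# values.yaml for the end-to-end demo chart (sketch; only the storage keys are shown)
nvidia-active-speaker-detection-h4m-service:
  nimModelCache:
    enabled: true
    create: true
    size: "10Gi"
    storageClassName: ""          # empty string: use the cluster's default StorageClass
    mountPath: "/opt/nim/.cache"
  nimLogs:
    enabled: true
    create: true
    size: "5Gi"
    storageClassName: ""
    mountPath: "/workspace/nim-logs"
```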

For the operator, equivalent settings are available under spec.parameters.nimModelCache and spec.parameters.nimLogs in the NvidiaActiveSpeakerDetectionMediaFunction custom resource. For details, refer to Configuration Reference.
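A hedged sketch of where those settings might sit in the custom resource is shown below. It assumes the sub-keys mirror the chart's values keys (enabled, create, size, storageClassName, mountPath), which this page does not confirm. The apiVersion, metadata, and all other fields are deliberately omitted; take them from the Configuration Reference.

```yaml
# Fragment of a NvidiaActiveSpeakerDetectionMediaFunction resource.
# Assumption: sub-keys match the chart's nimModelCache / nimLogs values keys.
spec:
  parameters:
    nimModelCache:
      enabled: true
      create: true
      size: "10Gi"
    nimLogs:
      enabled: true
      create: true
      size: "5Gi"
```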

Model Cache#

The model cache stores NGC model artifacts locally so that models are not re-downloaded on every deployment. When the chart creates the PVC, its name is <appName>-model-cache. By default, the container mounts the model cache at /opt/nim/.cache; use mountPath to change the location.

nimModelCache:
  enabled: true
  create: true
  size: "10Gi"
  storageClassName: ""
  mountPath: "/opt/nim/.cache"

When model cache is enabled, NIM_CACHE_PATH is set to the configured mountPath on the NIM pod.

NIM Log Files#

The NIM logs volume persists time-stamped NIM log files under the configured directory. When the chart creates the PVC, its name is <appName>-nim-logs. The default mount path is /workspace/nim-logs; use mountPath to change it.

nimLogs:
  enabled: true
  create: true
  size: "5Gi"
  storageClassName: ""
  mountPath: "/workspace/nim-logs"

Using a Pre-Existing Persistent Volume Claim#

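To attach a PVC that already exists in the namespace instead of creating one with the chart, set create: false in the corresponding block. Before deploying, ensure that a PVC with the expected name exists: <appName>-model-cache for the model cache, or <appName>-nim-logs for the NIM logs. For example, for the model cache: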

nimModelCache:
  enabled: true
  create: false
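A pre-created claim for the model cache could look roughly like the following manifest. This is a minimal sketch, not the chart's own template: the name assumes a hypothetical appName of asd-nim, the size simply reuses the 10Gi example value from above, and the access mode follows the ReadWriteOnce requirement noted in the StorageClass section.

```yaml
# Sketch of a pre-existing PVC for the model cache.
# "asd-nim" is a hypothetical <appName>; replace it with your own.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: asd-nim-model-cache       # must follow the <appName>-model-cache pattern
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  # storageClassName: my-class    # hypothetical; set this if the cluster has no default StorageClass
```

Apply the manifest in the same namespace as the release before installing the chart.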

StorageClass#

By default, storageClassName is set to "", which uses the cluster’s default StorageClass. To use a specific StorageClass, set storageClassName to the name of an existing StorageClass in your cluster under the nimModelCache or nimLogs block.

If no default StorageClass is configured in your cluster and storageClassName is left empty, the PVC remains in Pending state. In that case, either set storageClassName to a valid class or pre-create the PVC and use create: false.
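For instance, pinning both volumes to a named class might look like the sketch below. The class name fast-ssd is purely hypothetical; substitute a StorageClass that exists in your cluster.

```yaml
# Sketch: explicit StorageClass for both PVCs ("fast-ssd" is a hypothetical class name)
nimModelCache:
  enabled: true
  create: true
  storageClassName: "fast-ssd"
nimLogs:
  enabled: true
  create: true
  storageClassName: "fast-ssd"
```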

Note

The model cache PVC requires a StorageClass that supports ReadWriteOnce access mode. When using a shared filesystem, ensure only one pod writes to the cache concurrently.

Troubleshooting#

End-to-End Demo Chart and NIM Service Chart#

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| ImagePullBackOff | Image pull secret missing or incorrect. | kubectl get secret <image.secret>. |
| Pod crash / NGC errors | Model pull secret missing or invalid. | kubectl get secret <ngcModelDownload.secretName>; confirm key is NGC_API_KEY. |
| Pod Pending | Node selector, GPU, or resource constraints. | kubectl describe pod <pod>; check node labels and capacity. |
| Pod Pending | Insufficient hugepages. | kubectl describe node <node>; check hugepages availability. |
| No output | Multicast IP addresses or ports misaligned. | Ensure “sender → NIM service → receiver IP address and port” chain is consistent. |
| Rivermax errors | Rivermax license secret missing. | kubectl get secret rivermax-license. |
| Startup probe failures | Model download slow or NGC key invalid. | kubectl logs deploy/<appName>; increase startup probe failureThreshold. |
| PVC Pending | No default StorageClass. | Set nimModelCache.storageClassName or create a matching StorageClass. |

Kubernetes Operator#

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| ImagePullBackOff on controller or NIM pod | Image pull secret missing or incorrect. | Check imagePullSecrets / mediaFunction.imagePullSecrets; kubectl describe pod <pod>. |
| Custom resource Provisioned false | Invalid spec, missing secrets, or scheduling. | kubectl describe nvidiaactivespeakerdetectionmediafunction <name> -n <namespace>; check operator logs. |
| NIM pod crash / NGC errors | Model pull secret missing or invalid. | kubectl get secret <spec.parameters.ngcModelDownload.secretName>; confirm key matches secretKey. |
| Pod Pending | Node selector, GPU, or hugepages. | kubectl describe pod <pod>; verify node labels and capacity. |
| Rivermax errors | License secret missing. | Confirm Rivermax license secret is mounted at /opt/mellanox/rivermax. |
| CRD not found | Operator chart not installed or failed. | helm status ai4m-asd-operator; reinstall chart. |

On Red Hat OpenShift, replace kubectl with oc.

See Also#