Troubleshooting#

Troubleshooting guide for Nsight Operator.

General Errors#

Sometimes you may find that a Pod is not injected with a sidecar container as expected. Check the following items:

  1. The nsight-injector webhook Pod in the nsight-operator namespace is in Running state and its logs do not contain errors:

    kubectl get pods -n nsight-operator -l app.kubernetes.io/name=nsight-injector
    kubectl logs -n nsight-operator -l app.kubernetes.io/name=nsight-injector
    
  2. The target Pod has (or inherits) the nvidia-nsight-profile=enabled label, and was created after the webhook was installed. Existing Pods are not mutated retroactively; see Enabling Profiling on Target Resources for restart procedures.
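    For example, to confirm the label and re-create the Pods (a Deployment is assumed here; adjust for StatefulSet or DaemonSet):

    kubectl get pod <pod-name> -n <namespace> --show-labels
    kubectl rollout restart deployment -n <namespace> <deployment-name>
    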

  3. The target Pod’s main process matches an injectionIncludePatterns regex and does not match any injectionExcludePatterns (including the cluster-wide default exclusions for shells and common utilities).

For deeper diagnosis, see Sidecar Injection Issues.

GPU Metrics Collection Error#

  1. Check that no other application is collecting GPU metrics on the target Pod's GPUs. Common conflicts include:

    • Another injection with the --gpu-metrics-devices option enabled. In that case, use the report from that injection, or adjust the configurations so that only one Pod runs with the GPU metrics option.

    • If you have the NVIDIA GPU Operator installed, its nvidia-dcgm-exporter DaemonSet also collects GPU metrics. If you are not using it, you can temporarily disable it:

    kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
    

    To enable it back:

    kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
    

Sidecar Injection Issues#

If a Pod that should be profiled does not start with a sidecar or a readiness waiter init container, work through the checks below.

1. Verify the Pod is eligible#

The admission webhook mutates a Pod only if all of these are true:

  • The Pod is created after the webhook was installed. Existing Pods are never retroactively mutated. Restart the owning Deployment / StatefulSet (see Enabling Profiling on Target Resources) to trigger new Pods.

  • The Pod’s namespace is not excluded by the cluster-wide filter. By default, kube-system, kube-node-lease, and kube-public are excluded.

  • The Pod, or a NsightOperatorProfileConfig in its namespace, has a rule that matches via one of: the nvidia-nsight-profile=enabled label, a namespaceSelector, an objectSelector, or a CEL matchConditions expression.

  • The main process command line matches at least one injectionIncludePatterns regex and does not match any injectionExcludePatterns. Note that the cluster-wide defaultInjectionExcludePatterns list (shells, coreutils, nsys, ncu) is always applied even when you override injectionExcludePatterns; see Configuration Values.

Quickly inspect the current rules:

kubectl get nsightoperatorprofileconfig -A
kubectl describe nsightoperatorprofileconfig <name> -n <namespace>
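To confirm that a specific Pod actually carries the label, and to inspect the namespace labels evaluated by a namespaceSelector:

kubectl get pod <name> -n <namespace> --show-labels
kubectl get namespace <namespace> --show-labels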

2. Verify the webhook is running#

kubectl get pods -n nsight-operator -l app.kubernetes.io/name=nsight-injector
kubectl logs -n nsight-operator -l app.kubernetes.io/name=nsight-injector --tail=200

Common failures:

  • ImagePullBackOff on the webhook image – check imagePullSecrets in Helm values, especially in air-gapped clusters.

  • CrashLoopBackOff with certificate errors – see the next section.

3. Verify the webhook certificate#

The admission webhook requires a valid serving certificate that the Kubernetes API server trusts. The operator creates a self-signed certificate Secret and patches the MutatingWebhookConfiguration with the corresponding CA bundle.

kubectl get mutatingwebhookconfigurations | grep nsight
kubectl describe mutatingwebhookconfiguration <name> | grep -A5 caBundle
kubectl get secret -n nsight-operator | grep injector

If the caBundle is empty or out of sync with the Secret, delete the webhook Secret and restart the operator controller Pod so the certificate is regenerated:

kubectl delete secret nsight-injector -n nsight-operator
kubectl rollout restart deployment -n nsight-operator nsight-operator

4. Verify CEL expressions and selectors#

If you use matchConditions with CEL expressions, an error in an expression causes the admission request to fail; depending on the webhook's failurePolicy, the Pod is then admitted without mutation (fail-open) or rejected (fail-closed). Check the webhook logs for cel or expression messages.

To exercise the webhook, including its CEL evaluation, against a Pod manifest without creating the Pod, use a server-side dry run and inspect the returned object for the mutation:

kubectl apply --dry-run=server -f mypod.yaml -o yaml

5. Inspect a mutated Pod#

Once a Pod is admitted, verify that it actually received the injection:

kubectl get pod <name> -n <namespace> -o yaml | grep -E "nsight|nsys"

You should see:

  • A readiness waiter init container (if enabled, see Configuration Values).

  • A shared volume named nsight-injector-config or similar.

  • Additional environment variables prefixed NVDT_ on the main container, injected by the process hook.

If none of these are present, the webhook did not mutate the Pod. Revisit step 1.

Gateway Connectivity and Authentication#

The nsight_operator.py CLI connects to the NsightGateway (Envoy) over HTTP or HTTPS. Connection failures usually fall into one of the categories below.

1. Gateway is not reachable#

Check that the gateway Pod is Running and the service has endpoints:

kubectl get pods -n <namespace> -l app.kubernetes.io/component=gateway
kubectl get endpoints -n <namespace> nsight-operator-gateway

ClusterIP (default): You must port-forward, or use autoconfigure, which sets up port-forwarding automatically:

kubectl port-forward -n <namespace> svc/nsight-operator-gateway 8888:8888 &

LoadBalancer: Wait for the external IP to be assigned:

kubectl get svc -n <namespace> nsight-operator-gateway -w

If the external IP stays <pending>, your cloud provider’s load balancer controller is not running or is out of quota – check the controller logs.
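The Service's events usually state why the IP is still pending:

kubectl describe svc -n <namespace> nsight-operator-gateway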

NodePort: Ensure the node security group / firewall allows the assigned port:

kubectl get svc -n <namespace> nsight-operator-gateway \
    -o jsonpath='{.spec.ports[0].nodePort}'
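A quick reachability check from outside the cluster (the scheme depends on your TLS configuration, and the root path is illustrative only):

curl -vk https://<node-ip>:<node-port>/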

2. TLS trust errors#

If you see SSL: CERTIFICATE_VERIFY_FAILED or similar errors with HTTPS:

  • For self-signed certificates, prefer autoconfigure – it reads the CA certificate from the Kubernetes TLS Secret automatically. See TLS Configuration.

  • For manually configured CLIs, add the CA to your local trust store or set REQUESTS_CA_BUNDLE to point at the CA file.

  • Verify the certificate CN / SAN matches the host you are using to connect (the LoadBalancer hostname, NodePort IP, or localhost for port-forward).
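A sketch of both checks, assuming the serving certificate lives in a Kubernetes TLS Secret with a ca.crt key (the Secret name below is a placeholder):

kubectl get secret -n <namespace> <gateway-tls-secret> \
    -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
export REQUESTS_CA_BUNDLE=$PWD/ca.crt

# Inspect the CN / SAN the gateway actually presents
openssl s_client -connect <host>:8888 -servername <host> </dev/null 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'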

3. Authentication errors#

401 Unauthorized:

  • For API-key auth, ensure you configured the CLI with --apikey matching the value in the NsightGateway.spec.authentication.apikey field or its referenced Secret. The header must be Authorization: Bearer <apikey>.

  • For OAuth2 / JWT, run python3 nsight_operator.py login again. Tokens expire; running login refreshes them.
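To test API-key authentication without the CLI, send a request with the expected header (host and route are placeholders; any authenticated route such as /coordinator/ will do):

curl -k -H "Authorization: Bearer <apikey>" https://<host>:8888/coordinator/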

403 Forbidden:

  • Your token is valid but missing a required scope. For OAuth2 configurations, verify that the scopes field in the gateway CR includes openid profile email (the defaults).

OIDC clock skew:

  • JWT tokens are time-sensitive. If the client’s and gateway’s clocks differ by more than 60 seconds, validation fails. Ensure NTP is running on both ends.

4. autoconfigure fails#

autoconfigure requires kubectl access to the target namespace. Common failures:

  • forbidden errors – your kubeconfig does not have get / list permissions on NsightGateway, Secret, or Service.

  • not found errors – the gateway service name differs from the default nsight-operator-gateway. Pass -s <servicename> to autoconfigure.
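You can verify the required permissions up front with kubectl auth can-i (the lowercase resource name for the NsightGateway CR may differ in your cluster):

kubectl auth can-i get nsightgateways -n <namespace>
kubectl auth can-i list secrets -n <namespace>
kubectl auth can-i get services -n <namespace>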

5. MinIO storage proxy is not reachable#

When autoconfigure is used with the integrated MinIO backend, storage flows through the gateway on port 9000. If download fails with a connection refused error, check that:

  • The gateway’s port-forwarding includes both 8888 and 9000 (the CLI handles this automatically for ClusterIP gateways).

  • For LoadBalancer gateways, port 9000 is exposed by the Service. If you customized spec.service, include an additional port for MinIO.
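When port-forwarding manually, a single command can expose both ports:

kubectl port-forward -n <namespace> svc/nsight-operator-gateway 8888:8888 9000:9000 &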

Storage and Download Issues#

Profiling results are uploaded by each collection agent to the cloud storage backend configured by NsightCloudStorageConfig. If nsight_operator.py ls or download fails, check the items below.

1. Reports are not yet available#

Reports are uploaded after profiler-stop, and the upload can take a few seconds. If ls shows no files immediately after stopping, wait 5-10 seconds and try again.

2. endpoint_url mismatch#

When you use configure (not autoconfigure), you must manually set NSIGHT_CLOUD_STORAGE_CONFIG_FILE and point its endpoint_url at a reachable MinIO or S3 host. See Configuring Storage Access for Downloads.

The most common failure is leaving the in-cluster MinIO service name (e.g. nsight-operator-minio.nsight-operator.svc:9000) in the config file and trying to download from outside the cluster. Either port-forward the gateway (which proxies MinIO on port 9000) and edit endpoint_url to http://localhost:9000, or re-run with autoconfigure which does this automatically.
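A minimal sketch of the edited config file (field names as listed under External S3 credentials below; the values are placeholders and the exact set depends on your backend):

# storage-config.yaml (sketch)
storage_type: s3
endpoint_url: http://localhost:9000
bucket_name: <bucket>
aws_access_key_id: <access-key>
aws_secret_access_key: <secret-key>
region_name: <region>
local_cache_dir: /tmp/nsight-cache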

3. MinIO credentials#

The storage config Secret contains the access key and secret key used by the coordinator, injector, and CLI. If a download fails with InvalidAccessKeyId or SignatureDoesNotMatch, decode the Secret and confirm that the keys match the MinIO StatefulSet:

kubectl get secret nsight-operator-cloud-storage-secret -n <namespace> \
    -o jsonpath='{.data.storage-config\.yaml}' | base64 -d
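To see which credentials MinIO itself is running with, dump the container environment and check the credential variables (commonly MINIO_ROOT_USER / MINIO_ROOT_PASSWORD, though the chart's wiring may differ):

kubectl get statefulset -n <namespace> nsight-operator-minio \
    -o jsonpath='{.spec.template.spec.containers[0].env}'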

If the Secret was re-created but MinIO was not restarted, the Pod still holds the old credentials – restart MinIO:

kubectl rollout restart statefulset -n <namespace> nsight-operator-minio

4. local storage type#

When spec.storage_type: local is set on NsightCloudStorageConfig, reports are written to the Pod’s local filesystem, not to a shared backend. The CLI cannot download them directly. Use kubectl cp:

kubectl cp <namespace>/<pod>:<path-in-pod> <local-path>

The path inside the Pod is configurable via NsightCloudStorageConfig.spec.storageDir and defaults to /mnt/nv/reports.
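For example, with the default path:

kubectl cp <namespace>/<pod>:/mnt/nv/reports ./reports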

5. External S3 credentials#

When using external S3 (not operator-managed MinIO), the NsightCloudStorageConfig references a user-provided Secret named by spec.secretRef.name. Verify that the Secret exists in the same namespace as the CR and that its storage-config.yaml key contains valid JSON/YAML:

kubectl get secret <secret-name> -n <namespace> \
    -o jsonpath='{.data.storage-config\.yaml}' | base64 -d

Required fields include storage_type, bucket_name, aws_access_key_id, aws_secret_access_key, region_name, and local_cache_dir.

6. Disk full on MinIO#

Operator-managed MinIO deployments use ephemeral storage (emptyDir) by default. Long profiling sessions can exhaust node disk space. Either enable persistent storage (cloudStorage.minio.persistence.enabled: true – see Storage Configuration) or periodically run session-end and delete old sessions to reclaim space.
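To enable persistence on an existing installation, an upgrade along these lines should work (the release and chart references are placeholders):

helm upgrade <release> <chart> -n <namespace> --reuse-values \
    --set cloudStorage.minio.persistence.enabled=true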

GPU Metrics#

Nsight Systems can collect GPU metrics alongside traces. With the current chart, enabling GPU metrics requires granting extra privileges to profiled containers and ensuring no other GPU metrics collectors are active on the same GPUs during profiling windows.

Prerequisites#

  • Ensure no conflicting GPU metrics collectors are running simultaneously on the same node/GPUs. If you have the NVIDIA GPU Operator installed, temporarily disable its nvidia-dcgm-exporter DaemonSet during profiling windows.

  • Update node configuration if needed (e.g. kernel.perf_event_paranoid). See machineConfig in Configuration Values.
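To read the current value directly on a node, you can use an ephemeral debug container (kubectl debug mounts the node filesystem at /host):

kubectl debug node/<node-name> -it --image=busybox -- \
    cat /host/proc/sys/kernel/perf_event_paranoid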

Security Context#

GPU metrics often require elevated privileges. In the manifest-based example below, securityContext.privileged: true is set on the test container. If your environment disallows privileged, add capabilities.add: ["SYS_ADMIN"] instead.
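A sketch of the alternative container securityContext:

securityContext:
  capabilities:
    add: ["SYS_ADMIN"]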

Conflict Avoidance (DCGM exporter)#

If the NVIDIA GPU Operator is installed, its nvidia-dcgm-exporter DaemonSet also collects metrics. To avoid conflicts during profiling runs, temporarily disable it before profiling and re-enable afterward:

kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
# ... run profiling ...
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'

Notes#

  • Some environments may require privileged: true rather than only SYS_ADMIN to collect the full set of GPU metrics.

  • Do not run multiple profilers with GPU metrics on the same GPU concurrently.

GPU Metrics DaemonSet and Config (YAML)#

Apply the following manifests to deploy a simple GPU-metrics collector and a matching NsightOperatorProfileConfig that enables GPU metrics:

apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: gpu-metrics-config
spec:
  nsightToolConfigs:
    - name: "gpu-metrics"
      nsightToolArgs: "--gpu-metrics-devices=all"
      injectionIncludePatterns:
        - ".*sleep infinity.*"
  injectionRules:
    - name: "gpu-metrics"
      nsightToolConfigRef: "gpu-metrics"
      matchConditions:
        - name: "gpu-metrics"
          expression: |
            ((has(object.metadata.generateName) &&
            object.metadata.generateName.contains('nsight-operator-gpu-metrics-collector')) ||
            (has(object.metadata.name) &&
            object.metadata.name.contains('nsight-operator-gpu-metrics-collector')))
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nsight-operator-gpu-metrics-collector
  labels:
    app: nsight-operator-gpu-metrics-collector
spec:
  selector:
    matchLabels:
      app: nsight-operator-gpu-metrics-collector
  template:
    metadata:
      labels:
        app: nsight-operator-gpu-metrics-collector
        nvidia-nsight-profile: enabled
    spec:
      runtimeClassName: nvidia
      containers:
      - name: gpu-metrics-ubuntu-container
        image: nvidia/cuda:13.0.0-base-ubuntu24.04
        command: ["sleep", "infinity"]
        securityContext:
          privileged: true
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Tip

If you are on OpenShift, you may need to add an SCC annotation to the Pod template to allow privileged or SYS_ADMIN capability.

OpenShift#

Slower First-Time Pod Startup#

On OpenShift clusters, Pods with the profiler injected can start more slowly during first-time setup (post-configuration). This is due to the more complex mechanism required for node configuration, specifically updating kernel.perf_event_paranoid. Subsequent Pod starts are not affected.

Security Context Constraints (SCCs)#

OpenShift enforces Security Context Constraints in addition to standard Kubernetes SecurityContext policies. Because Nsight Operator injects volumes and (optionally) elevated capabilities, you may need to grant SCC privileges to the operator’s ServiceAccount and, for GPU metrics workloads, to the target Pod’s ServiceAccount.

Grant the anyuid SCC to the operator ServiceAccount:

oc adm policy add-scc-to-user anyuid \
    system:serviceaccount:nsight-operator:nsight-operator

For GPU metrics workloads, grant the privileged SCC (or create a custom SCC with the SYS_ADMIN capability only):

oc adm policy add-scc-to-user privileged \
    system:serviceaccount:<target-namespace>:<target-sa>

Alternatively, annotate the Pod template with the required SCC:

spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: privileged

For the GPU-metrics DaemonSet example in GPU Metrics, the target Pod template already sets securityContext.privileged: true – an SCC that permits privileged containers must be bound to the DaemonSet’s ServiceAccount.

Node Configuration with MachineConfig#

The operator updates kernel.perf_event_paranoid on nodes via a DaemonSet that writes the sysctl value directly. On OpenShift, some clusters prefer to manage all node parameters declaratively via MachineConfig. To prevent the operator from modifying node configuration, set machineConfig: null in your Helm values and manage the sysctl yourself:

# MachineConfig example (apply cluster-wide)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-perf-event-paranoid
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/sysctl.d/99-perf-event-paranoid.conf
        mode: 0644
        contents:
          source: data:,kernel.perf_event_paranoid%20%3D%202

Image Pull Secrets#

If your OpenShift cluster uses a disconnected registry, add imagePullSecrets for every sub-chart (nsight-coordinator, nsight-gateway, cloudStorage.minio, etc.) and mirror all operator images.
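Create the pull Secret in the operator namespace and reference it from the Helm values; the exact value keys depend on the chart, so treat the snippet below as a sketch:

oc -n nsight-operator create secret docker-registry disconnected-registry \
    --docker-server=<mirror-registry> \
    --docker-username=<user> \
    --docker-password=<password>

# values.yaml sketch – adjust keys to the chart's actual structure
nsight-coordinator:
  imagePullSecrets:
    - name: disconnected-registry
nsight-gateway:
  imagePullSecrets:
    - name: disconnected-registry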

Collecting Logs for Support#

When reporting an issue, include logs and CR definitions from all involved components. The commands below collect a minimum useful bundle.

Replace <ns> with the namespace where Nsight Operator is installed (nsight-operator by default) or, for multi-tenant setups, the tenant namespace of interest.

Operator Controller#

kubectl get pods -n <ns> -l app.kubernetes.io/name=nsight-operator
kubectl logs -n <ns> -l app.kubernetes.io/name=nsight-operator --tail=500

Also include events:

kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 100

Injector Webhook#

kubectl logs -n <ns> -l app.kubernetes.io/name=nsight-injector --tail=500

If injection is failing for specific Pods, capture the admission webhook configuration too:

kubectl get mutatingwebhookconfigurations -o yaml \
    > mutatingwebhookconfigurations.yaml

Coordinator#

kubectl logs -n <ns> -l app.kubernetes.io/component=coordinator --tail=500

Gateway#

kubectl logs -n <ns> -l app.kubernetes.io/component=gateway --tail=500

Gateway access logs are the fastest way to confirm whether the CLI actually reached the gateway, and which route was taken (/coordinator/, /analysis/, /streamer/, etc.).
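To filter the access log for the CLI's requests by route prefix:

kubectl logs -n <ns> -l app.kubernetes.io/component=gateway --tail=500 \
    | grep -E '/coordinator/|/analysis/|/streamer/'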

Analysis Service#

kubectl logs -n <ns> -l app.kubernetes.io/component=analysis --tail=500

OTel Collector#

kubectl logs -n <ns> -l app.kubernetes.io/component=otel-collector --all-containers=true --tail=500

Target Pod / Collection Agent#

The Nsight Systems agent logs are printed to the target container's standard output. If logOutput is set on the nsight tool config (see NsightOperatorProfileConfig), logs may also appear at the configured path inside the Pod.

kubectl logs -n <target-ns> <pod> --all-containers=true --tail=500

Custom Resource Snapshots#

kubectl get nsightcoordinator,nsightgateway,nsightanalysis,\
nsightcloudstorageconfig,nsightotelcollector,otlpproxyconfig,\
nsightstreamer,nsighttenantoperator,nsightcloudui -A -o yaml \
    > nsight-crs.yaml
kubectl get nsightoperatorprofileconfig -A -o yaml \
    > nsight-profile-configs.yaml

CLI Configuration (redacted)#

The CLI stores its connection info in ~/.nsight-cloud.conf. Include a redacted copy – strip API keys, tokens, and client secrets before sharing.
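If the file is key=value style, a quick redaction pass before attaching it (the key names are assumptions; adjust to the actual contents of your file):

sed -E 's/(apikey|token|secret)[^=]*=.*/\1=REDACTED/' ~/.nsight-cloud.conf \
    > nsight-cloud.conf.redacted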

Version Information#

helm list -A | grep nsight
kubectl get crds | grep nvidia.com
python3 nsight_operator.py --version   # if available