Troubleshooting#
Troubleshooting guide for Nsight Operator.
General Errors#
Sometimes a Pod is not injected with a sidecar container as expected. Check the following items:

- The `nsight-injector` webhook Pod in the `nsight-operator` namespace is in `Running` state and its logs do not contain errors:

  ```shell
  kubectl get pods -n nsight-operator -l app.kubernetes.io/name=nsight-injector
  kubectl logs -n nsight-operator -l app.kubernetes.io/name=nsight-injector
  ```

- The target Pod has (or inherits) the `nvidia-nsight-profile=enabled` label, and was created after the webhook was installed. Existing Pods are not mutated retroactively; see Enabling Profiling on Target Resources for restart procedures.
- The target Pod’s main process matches an `injectionIncludePatterns` regex and does not match any `injectionExcludePatterns` (including the cluster-wide default exclusions for shells and common utilities).
For deeper diagnosis, see Sidecar Injection Issues.
GPU Metrics Collection Error#
Check that no other application is collecting GPU metrics on the target Pod. Possible sources include:

- Another injection with the `--gpu-metrics-devices` option enabled. In that case, use the report from that injection, or adjust the configuration so that only one Pod runs with the GPU metrics option.
- The `nvidia-dcgm-exporter` (documentation) DaemonSet installed by the GPU Operator, which collects GPU metrics. If you are not using it, you can temporarily disable it:
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
To re-enable it:
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
Sidecar Injection Issues#
If a Pod that should be profiled does not start with a sidecar or a readiness waiter init container, work through the checks below.
1. Verify the Pod is eligible#
The admission webhook mutates a Pod only if all of these are true:
- The Pod was created after the webhook was installed. Existing Pods are never retroactively mutated; restart the owning Deployment / StatefulSet (see Enabling Profiling on Target Resources) to trigger new Pods.
- The Pod’s namespace is not excluded by the cluster-wide filter. By default, `kube-system`, `kube-node-lease`, and `kube-public` are excluded.
- The Pod, or a NsightOperatorProfileConfig in its namespace, has a rule that matches via one of: the `nvidia-nsight-profile=enabled` label, a `namespaceSelector`, an `objectSelector`, or a CEL `matchConditions` expression.
- The main process command line matches at least one `injectionIncludePatterns` regex and does not match any `injectionExcludePatterns`. Note that the cluster-wide `defaultInjectionExcludePatterns` list (shells, coreutils, `nsys`, `ncu`) is always applied even when you override `injectionExcludePatterns`; see Configuration Values.
Quickly inspect the current rules:
kubectl get nsightoperatorprofileconfig -A
kubectl describe nsightoperatorprofileconfig <name> -n <namespace>
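You can approximate the pattern check in the last rule above locally with `grep -E`. This is only a local sketch: the include/exclude values below are made-up placeholders, not your cluster's configuration, so copy the real regexes from your NsightOperatorProfileConfig before relying on the result.

```shell
# Local approximation of the injection pattern check.
# These pattern values are illustrative placeholders only.
include='.*python3 train\.py.*'
exclude='.*/bin/sh.*|.*nsys.*'

matches_injection() {
  # $1: the container's full command line
  if ! printf '%s' "$1" | grep -Eq "$include"; then
    echo "no-include-match"
  elif printf '%s' "$1" | grep -Eq "$exclude"; then
    echo "excluded"
  else
    echo "would-inject"
  fi
}

matches_injection "python3 train.py --epochs 5"    # would-inject
matches_injection "/bin/sh -c 'sleep infinity'"    # no-include-match
```

Remember that the cluster-wide default exclusions are applied in addition to anything you configure, so add them to `exclude` when reproducing a decision.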
2. Verify the webhook is running#
kubectl get pods -n nsight-operator -l app.kubernetes.io/name=nsight-injector
kubectl logs -n nsight-operator -l app.kubernetes.io/name=nsight-injector --tail=200
Common failures:
- `ImagePullBackOff` on the webhook image – check `imagePullSecrets` in Helm values, especially in air-gapped clusters.
- `CrashLoopBackOff` with certificate errors – see the next section.
3. Verify the webhook certificate#
The admission webhook requires a valid serving certificate that the Kubernetes
API server trusts. The operator creates a self-signed certificate Secret and
patches the MutatingWebhookConfiguration with the corresponding CA bundle.
kubectl get mutatingwebhookconfigurations | grep nsight
kubectl describe mutatingwebhookconfiguration <name> | grep -A5 caBundle
kubectl get secret -n nsight-operator | grep injector
If the caBundle is empty or out of sync with the Secret, delete the
webhook Secret and restart the operator controller Pod so the certificate is
regenerated:
kubectl delete secret nsight-injector -n nsight-operator
kubectl rollout restart deployment -n nsight-operator nsight-operator
4. Verify CEL expressions and selectors#
If you use matchConditions with CEL expressions, errors in the expression
cause the webhook to reject the Pod (fail-open or fail-closed depending on
configuration). Check the webhook logs for cel or expression messages.
Test a CEL expression against a Pod manifest with a server-side dry run, which evaluates admission webhooks without persisting the Pod:
kubectl apply --dry-run=server -f mypod.yaml -o yaml
5. Inspect a mutated Pod#
Once a Pod is admitted, verify that it actually received the injection:
kubectl get pod <name> -n <namespace> -o yaml | grep -E "nsight|nsys"
You should see:
- A readiness waiter init container (if enabled; see Configuration Values).
- A shared volume named `nsight-injector-config` or similar.
- Additional environment variables prefixed `NVDT_` on the main container, injected by the process hook.
If none of these are present, the webhook did not mutate the Pod. Revisit step 1.
Gateway Connectivity and Authentication#
The nsight_operator.py CLI connects to the NsightGateway
(Envoy) over HTTP or HTTPS. Connection failures usually fall into one of the
categories below.
1. Gateway is not reachable#
Check that the gateway Pod is Running and the service has endpoints:
kubectl get pods -n <namespace> -l app.kubernetes.io/component=gateway
kubectl get endpoints -n <namespace> nsight-operator-gateway
ClusterIP (default): you must port-forward, or use `autoconfigure`, which sets up port-forwarding automatically:
kubectl port-forward -n <namespace> svc/nsight-operator-gateway 8888:8888 &
LoadBalancer: Wait for the external IP to be assigned:
kubectl get svc -n <namespace> nsight-operator-gateway -w
If the external IP stays <pending>, your cloud provider’s load balancer
controller is not running or is out of quota – check the controller logs.
NodePort: Ensure the node security group / firewall allows the assigned port:
kubectl get svc -n <namespace> nsight-operator-gateway \
-o jsonpath='{.spec.ports[0].nodePort}'
2. TLS trust errors#
If you see SSL: CERTIFICATE_VERIFY_FAILED or similar errors with HTTPS:
- For self-signed certificates, prefer `autoconfigure` – it reads the CA certificate from the Kubernetes TLS Secret automatically. See TLS Configuration.
- For manually configured CLIs, add the CA to your local trust store or set `REQUESTS_CA_BUNDLE` to point at the CA file.
- Verify that the certificate CN / SAN matches the host you are using to connect (the LoadBalancer hostname, NodePort IP, or `localhost` for port-forward).
3. Authentication errors#
401 Unauthorized:
- For API-key auth, ensure the CLI is configured with an `--apikey` matching the value in the `NsightGateway.spec.authentication.apikey` field or its referenced Secret. The header must be `Authorization: Bearer <apikey>`.
- For OAuth2 / JWT, run `python3 nsight_operator.py login` again. Tokens expire; running `login` refreshes them.
403 Forbidden:
Your token is valid but missing a required scope. For OAuth2 configurations, verify that `scopes` in the gateway CR includes `openid profile email` (the defaults).
OIDC clock skew:
JWT tokens are time-sensitive. If the client’s and gateway’s clocks differ by more than 60 seconds, validation fails. Ensure NTP is running on both ends.
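As a quick sanity check, you can compare two timestamps and flag skew beyond the 60-second window. The helper below is a local sketch; how you obtain the gateway's clock (for example from an HTTP `Date` response header) depends on your setup and is left as an assumption.

```shell
# Flag clock skew larger than the 60-second JWT validation window.
skew_check() {
  # $1: local epoch seconds, $2: gateway epoch seconds
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  if [ "$diff" -gt 60 ]; then
    echo "skew ${diff}s exceeds 60s"
  else
    echo "skew ${diff}s ok"
  fi
}

skew_check "$(date +%s)" "$(date +%s)"   # skew 0s ok
```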
4. autoconfigure fails#
autoconfigure requires kubectl access to the target namespace. Common
failures:
- `forbidden` errors – your kubeconfig does not have `get`/`list` permissions on `NsightGateway`, `Secret`, or `Service`.
- `not found` errors – the gateway service name differs from the default `nsight-operator-gateway`. Pass `-s <servicename>` to `autoconfigure`.
5. MinIO storage proxy is not reachable#
When autoconfigure is used with the integrated MinIO backend, storage
flows through the gateway on port 9000. If download fails with a
connection refused error, check that:
- The gateway’s port-forwarding includes both `8888` and `9000` (the CLI handles this automatically for ClusterIP gateways).
- For LoadBalancer gateways, port `9000` is exposed by the Service. If you customized `spec.service`, include an additional port for MinIO.
Storage and Download Issues#
Profiling results are uploaded by each collection agent to the cloud storage
backend configured by NsightCloudStorageConfig.
If nsight_operator.py ls or download fails, check the items below.
1. Reports are not yet available#
Reports are uploaded after profiler-stop, and the upload can take a few
seconds. If ls shows no files immediately after stopping, wait 5-10
seconds and try again.
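A small poll loop saves re-typing the command. The helper below is plain shell; the `nsight_operator.py ls` invocation in the trailing comment is simply the CLI call from this guide, and the helper works with any command that prints output once results exist.

```shell
# Retry a command until it produces output, up to N attempts.
poll_nonempty() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    out="$("$@")"
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example: poll_nonempty 5 2 python3 nsight_operator.py ls
```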
2. endpoint_url mismatch#
When you use configure (not autoconfigure), you must manually set
NSIGHT_CLOUD_STORAGE_CONFIG_FILE and point its endpoint_url at a
reachable MinIO or S3 host. See Configuring Storage Access for Downloads.
The most common failure is leaving the in-cluster MinIO service name
(e.g. nsight-operator-minio.nsight-operator.svc:9000) in the config file
and trying to download from outside the cluster. Either port-forward the
gateway (which proxies MinIO on port 9000) and edit endpoint_url to
http://localhost:9000, or re-run with autoconfigure which does this
automatically.
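The edit itself is a one-line rewrite. The sketch below demonstrates it on a throwaway copy of the config; the in-cluster service address is the example from above, and GNU `sed -i` is assumed (on macOS use `sed -i ''`).

```shell
# Demonstrate rewriting endpoint_url on a scratch copy of the config.
cfg="$(mktemp)"
printf 'endpoint_url: http://nsight-operator-minio.nsight-operator.svc:9000\n' > "$cfg"

# Point the CLI at the port-forwarded gateway instead of the in-cluster name.
sed -i 's|^endpoint_url:.*|endpoint_url: http://localhost:9000|' "$cfg"
cat "$cfg"   # endpoint_url: http://localhost:9000
```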
3. MinIO credentials#
The storage config Secret contains the access key and secret key used by the
coordinator, injector, and CLI. If a download fails with
InvalidAccessKeyId or SignatureDoesNotMatch, decode the Secret and
confirm that the keys match the MinIO StatefulSet:
kubectl get secret nsight-operator-cloud-storage-secret -n <namespace> \
-o jsonpath='{.data.storage-config\.yaml}' | base64 -d
If the Secret was re-created but MinIO was not restarted, the pod still holds the old credentials – restart MinIO:
kubectl rollout restart statefulset -n <namespace> nsight-operator-minio
4. local storage type#
When spec.storage_type: local is set on
NsightCloudStorageConfig, reports
are written to the Pod’s local filesystem, not to a shared backend. The CLI
cannot download them directly. Use kubectl cp:
kubectl cp <namespace>/<pod>:<path-in-pod> <local-path>
The path inside the Pod is configurable via
NsightCloudStorageConfig.spec.storageDir and defaults to
/mnt/nv/reports.
5. External S3 credentials#
When using external S3 (not operator-managed MinIO), the
NsightCloudStorageConfig references a user-provided Secret named by
spec.secretRef.name. Verify that the Secret exists in the same namespace
as the CR and that its storage-config.yaml key contains valid JSON/YAML:
kubectl get secret <secret-name> -n <namespace> \
-o jsonpath='{.data.storage-config\.yaml}' | base64 -d
Required fields include storage_type, bucket_name,
aws_access_key_id, aws_secret_access_key, region_name, and
local_cache_dir.
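Put together, a minimal external-S3 `storage-config.yaml` might look like the sketch below; every value is a placeholder to replace with your own:

```yaml
# Illustrative storage-config.yaml -- all values are placeholders.
storage_type: s3
bucket_name: nsight-reports
aws_access_key_id: AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key: REPLACE_ME
region_name: us-east-1
local_cache_dir: /tmp/nsight-cache
```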
6. Disk full on MinIO#
Operator-managed MinIO deployments use ephemeral storage (emptyDir) by
default. Long profiling sessions can exhaust node disk space. Either enable
persistent storage (cloudStorage.minio.persistence.enabled: true – see
Storage Configuration) or periodically run session-end and delete
old sessions to reclaim space.
GPU Metrics#
Nsight Systems can collect GPU metrics alongside traces. With the current chart, enabling GPU metrics requires granting extra privileges to profiled containers and ensuring no other GPU metrics collectors are active on the same GPUs during profiling windows.
Prerequisites#
- Ensure no conflicting GPU metrics collectors are running simultaneously on the same node/GPUs. If you have the NVIDIA GPU Operator installed, temporarily disable its `nvidia-dcgm-exporter` DaemonSet during profiling windows.
- Update node configuration if needed (e.g. `kernel.perf_event_paranoid`). See `machineConfig` in Configuration Values.
Security Context#
GPU metrics often require elevated privileges. In the manifest-based example
below, securityContext.privileged: true is set on the test container. If your
environment disallows privileged, add capabilities.add: ["SYS_ADMIN"]
instead.
Conflict Avoidance (DCGM exporter)#
If the NVIDIA GPU Operator is installed, its nvidia-dcgm-exporter DaemonSet
also collects metrics. To avoid conflicts during profiling runs, temporarily
disable it before profiling and re-enable afterward:
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter -p '{"spec": {"template": {"spec": {"nodeSelector": {"non-existing": "true"}}}}}'
# ... run profiling ...
kubectl -n gpu-operator-resources patch daemonset nvidia-dcgm-exporter --type json -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector/non-existing"}]'
Notes#
- Some environments may require `privileged: true` rather than only `SYS_ADMIN` to collect the full set of GPU metrics.
- Do not run multiple profilers with GPU metrics on the same GPU concurrently.
GPU Metrics DaemonSet and Config (YAML)#
Apply the following manifests to deploy a simple GPU-metrics collector and a
matching NsightOperatorProfileConfig that enables GPU metrics:
```yaml
apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: gpu-metrics-config
spec:
  nsightToolConfigs:
    - name: "gpu-metrics"
      nsightToolArgs: "--gpu-metrics-devices=all"
  injectionIncludePatterns:
    - ".*sleep infinity.*"
  injectionRules:
    - name: "gpu-metrics"
      nsightToolConfigRef: "gpu-metrics"
      matchConditions:
        - name: "gpu-metrics"
          expression: |
            ((has(object.metadata.generateName) &&
              object.metadata.generateName.contains('nsight-operator-gpu-metrics-collector')) ||
             (has(object.metadata.name) &&
              object.metadata.name.contains('nsight-operator-gpu-metrics-collector')))
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nsight-operator-gpu-metrics-collector
  labels:
    app: nsight-operator-gpu-metrics-collector
spec:
  selector:
    matchLabels:
      app: nsight-operator-gpu-metrics-collector
  template:
    metadata:
      labels:
        app: nsight-operator-gpu-metrics-collector
        nvidia-nsight-profile: enabled
    spec:
      runtimeClassName: nvidia
      containers:
        - name: gpu-metrics-ubuntu-container
          image: nvidia/cuda:13.0.0-base-ubuntu24.04
          command: ["sleep", "infinity"]
          securityContext:
            privileged: true
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
```
Tip
If you are on OpenShift, you may need to add an SCC annotation to the Pod
template to allow privileged or SYS_ADMIN capability.
OpenShift#
Slower First-Time Pod Startup#
Initialization of Pods with the profiler injected can be slower on OpenShift
clusters during the first-time setup (post-configuration). This is due to the
more complex mechanism required for node configuration, specifically the
updating of kernel.perf_event_paranoid. Subsequent Pod starts are not
affected.
Security Context Constraints (SCCs)#
OpenShift enforces Security Context Constraints in addition to standard
Kubernetes SecurityContext policies. Because Nsight Operator injects
volumes and (optionally) elevated capabilities, you may need to grant SCC
privileges to the operator’s ServiceAccount and, for GPU metrics workloads,
to the target Pod’s ServiceAccount.
Grant the anyuid SCC to the operator ServiceAccount:
oc adm policy add-scc-to-user anyuid \
system:serviceaccount:nsight-operator:nsight-operator
For GPU metrics workloads, grant the privileged SCC (or create a
custom SCC with the SYS_ADMIN capability only):
oc adm policy add-scc-to-user privileged \
system:serviceaccount:<target-namespace>:<target-sa>
Alternatively, annotate the Pod template with the required SCC:
```yaml
spec:
  template:
    metadata:
      annotations:
        openshift.io/required-scc: privileged
```
For the GPU-metrics DaemonSet example in GPU Metrics,
the target Pod template already sets securityContext.privileged: true
– an SCC that permits privileged containers must be bound to the
DaemonSet’s ServiceAccount.
Node Configuration with MachineConfig#
The operator updates kernel.perf_event_paranoid on nodes via a DaemonSet
that writes the sysctl value directly. On OpenShift, some clusters prefer to
manage all node parameters declaratively via MachineConfig. To prevent
the operator from modifying node configuration, set machineConfig: null
in your Helm values and manage the sysctl yourself:
```yaml
# MachineConfig example (apply cluster-wide)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-perf-event-paranoid
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-perf-event-paranoid.conf
          mode: 0644
          contents:
            source: data:,kernel.perf_event_paranoid%20%3D%202
```
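The `source` value in the MachineConfig above is a `data:` URL, so the sysctl line must be percent-encoded. One quick way to produce that encoding (calling `python3` from the shell, which is assumed to be available) is:

```shell
# Percent-encode a sysctl line for an Ignition data: URL.
encode_data_url() {
  python3 -c 'import sys, urllib.parse; print("data:," + urllib.parse.quote(sys.argv[1]))' "$1"
}

encode_data_url 'kernel.perf_event_paranoid = 2'
# data:,kernel.perf_event_paranoid%20%3D%202
```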
Image Pull Secrets#
If your OpenShift cluster uses a disconnected registry, add
imagePullSecrets for every sub-chart (nsight-coordinator,
nsight-gateway, cloudStorage.minio, etc.) and mirror all operator
images.
Collecting Logs for Support#
When reporting an issue, include logs and CR definitions from all involved components. The commands below collect a minimum useful bundle.
Replace <ns> with the namespace where Nsight Operator is installed
(nsight-operator by default) or, for multi-tenant setups, the tenant
namespace of interest.
Operator Controller#
kubectl get pods -n <ns> -l app.kubernetes.io/name=nsight-operator
kubectl logs -n <ns> -l app.kubernetes.io/name=nsight-operator --tail=500
Also include events:
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 100
Injector Webhook#
kubectl logs -n <ns> -l app.kubernetes.io/name=nsight-injector --tail=500
If injection is failing for specific Pods, capture the admission webhook configuration too:
kubectl get mutatingwebhookconfigurations -o yaml \
> mutatingwebhookconfigurations.yaml
Coordinator#
kubectl logs -n <ns> -l app.kubernetes.io/component=coordinator --tail=500
Gateway#
kubectl logs -n <ns> -l app.kubernetes.io/component=gateway --tail=500
Gateway access logs are the fastest way to confirm whether the CLI actually
reached the gateway, and which route was taken (/coordinator/,
/analysis/, /streamer/, etc.).
Analysis Service#
kubectl logs -n <ns> -l app.kubernetes.io/component=analysis --tail=500
OTel Collector#
kubectl logs -n <ns> -l app.kubernetes.io/component=otel-collector --all-containers=true --tail=500
Target Pod / Collection Agent#
The Nsight Systems agent logs are printed by the target container process. If
logOutput is set on the nsight tool config (see
NsightOperatorProfileConfig), logs may also appear at the
configured path inside the Pod.
kubectl logs -n <target-ns> <pod> --all-containers=true --tail=500
Custom Resource Snapshots#
kubectl get nsightcoordinator,nsightgateway,nsightanalysis,\
nsightcloudstorageconfig,nsightotelcollector,otlpproxyconfig,\
nsightstreamer,nsighttenantoperator,nsightcloudui -A -o yaml \
> nsight-crs.yaml
kubectl get nsightoperatorprofileconfig -A -o yaml \
> nsight-profile-configs.yaml
CLI Configuration (redacted)#
The CLI stores its connection info in ~/.nsight-cloud.conf. Include a
redacted copy – strip API keys, tokens, and client secrets before sharing.
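A simple way to redact is to mask the values of sensitive keys with `sed`. The key names below (`apikey`, `token`, `client_secret`) are guesses at what the file may contain, not a documented schema; adjust them to match your actual config before sharing.

```shell
# Mask values of likely-sensitive keys before sharing a config file.
redact_config() {
  sed -E 's/^([[:space:]]*(apikey|token|client_secret)[[:space:]]*[:=]).*/\1 REDACTED/' "$1"
}

# Usage: redact_config ~/.nsight-cloud.conf > nsight-cloud.redacted.conf
```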
Version Information#
helm list -A | grep nsight
kubectl get crds | grep nvidia.com
python3 nsight_operator.py --version # if available