Installation Guide

Nsight Operator installation and configuration guide.

Installation and Configuration

Quick Install

For a stock installation with the defaults, run:

helm install --wait \
    --namespace nsight-operator \
    --create-namespace \
    nsight-operator \
    https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz

By default, profiling is enabled in Coordinator Mode for all applications running in Pods labeled nvidia-nsight-profile=enabled. Continue with the sections below to customize the installation.

Customizing Helm Values

The NVIDIA Nsight Operator can be customized to suit particular needs. You will most likely want to set the nsight-injector.nsightToolConfig.nsightToolArgs and nsight-injector.nsightToolConfig.injectionIncludePatterns values. Put these parameters in a values file and pass it to helm install with the -f option.

Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp running in injected Pods, with Python sampling, node-level CUDA graph tracing, and PyTorch tracing turned on. It uses Coordinator Mode.

# Nsight Systems profiling configuration (under nsight-injector sub-chart)
nsight-injector:
  nsightToolConfig:
    nsightToolArgs: "--python-sampling=true --cuda-graph-trace=node --pytorch=autograd-nvtx"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"

helm install -f custom_values.yaml \
    --namespace nsight-operator \
    --create-namespace \
    nsight-operator https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz

Sample custom_values_launch.yaml. This configuration switches to Launch Mode: profiling starts automatically when the target process starts and runs for a fixed duration, without any coordinator involvement. Use this mode for unattended captures with a bounded profiling window.

# Nsight Systems profiling configuration in launch mode
nsight-injector:
  nsightToolConfig:
    coordinator: false
    nsightToolArgs: "-t cuda,nvtx,osrt --python-sampling=true --duration=20 --kill=none"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"

Sample custom_values_extended.yaml. This configuration enables profiling for any instance of yourawesomeapp running in injected Pods, except for those started with the argumenttoskip argument. The nsightToolArgs value enables Python sampling. The nsys-output-volume volume is mounted at /mnt/nsys/output in all profiled Pods, so a PersistentVolumeClaim with the referenced claimName must exist in each target namespace. Additionally, kernel.perf_event_paranoid is set to -1 on all nodes where profiling is performed.

# Nsight Systems profiling configuration
nsight-injector:
  nsightToolConfig:
    volumes:
      - name: nsys-output-volume
        persistentVolumeClaim:
          claimName: CSP-managed-disk
    volumeMounts:
      - name: nsys-output-volume
        mountPath: /mnt/nsys/output
    nsightToolArgs: "--python-sampling=true"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"
    injectionExcludePatterns:
      - ".*yourawesomeapp.*argumenttoskip.*"

  machineConfig:
    - name: kernel.perf_event_paranoid
      value: -1
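
To see how the include and exclude patterns in this sample interact, the following Python sketch applies them to a few command lines. It assumes, for illustration only, that each pattern is tested against the full command line with a regular-expression search and that an exclude match overrides an include match. The operator's actual matching logic is not described in this guide, and should_profile is a hypothetical helper, not part of the product.

```python
import re

# Patterns copied from the custom_values_extended.yaml sample.
INCLUDE_PATTERNS = [r".*yourawesomeapp.*"]
EXCLUDE_PATTERNS = [r".*yourawesomeapp.*argumenttoskip.*"]


def should_profile(command_line, include=INCLUDE_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Hypothetical helper: True if an include pattern matches and no exclude pattern does.

    Assumption: patterns are applied to the whole command line with re.search.
    """
    included = any(re.search(pattern, command_line) for pattern in include)
    excluded = any(re.search(pattern, command_line) for pattern in exclude)
    return included and not excluded


if __name__ == "__main__":
    for cmd in (
        "/usr/bin/yourawesomeapp --serve",
        "/usr/bin/yourawesomeapp --argumenttoskip",
        "/bin/bash -c ls",
    ):
        print(f"{cmd!r}: {should_profile(cmd)}")
```

Checking candidate patterns this way against real command lines from your workload can catch an overly broad include pattern before you install the chart.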

Configuration Values

The NVIDIA Nsight Operator Helm chart groups its values as follows:

Operator-level values: Configure the operator controller and general settings.
Coordinator values (nsight-coordinator.*): Configure the coordinator deployment for profiling session control.
Gateway values (nsight-gateway.*): Configure the Envoy gateway for REST API access.
Cloud Storage values (cloudStorage.*): Configure storage for profiling results.
OTLP Collector values (nsight-otel-collector.*): Configure OTLP collector for trace mirroring.
OTLP Proxy values (otlpProxyConfig.*): Configure OTLP proxy injection.
Analysis values (nsight-analysis.*): Configure the analysis service for running recipes.
Injector sub-chart values (nsight-injector.*): Configure injection behavior.
STUNner TURN gateway values (nsight-tenant-operator.stunner.* and stunner-gateway-operator.*): Configure the bundled STUNner TURN gateway for Nsight Streamer WebRTC relay.

Operator-level Configuration

Variable	Description	Default
`installation.multitenant`	Enable multi-tenant mode. When `true`, the operator controller runs cluster-wide but control-plane components (Coordinator, Gateway, Storage, OTel Collector, Analysis) are provisioned per tenant namespace rather than in the operator namespace. With default values, the operator auto-provisions them the first time a matching Pod is admitted into a namespace; namespace admins can also pre-deploy their own resources.	`false`
`leaderElection.enabled`	Enable Kubernetes leader election for the operator controller. Safe and recommended for multi-replica deployments; harmless with a single replica.	`true`
`leaderElection.resourceName`	Name of the Lease used for leader election.	`nsight-operator.nvidia.com`
`hostNetwork`	Run the operator controller in the host network namespace. Rarely needed.	`false`
`global.nsightCloud.schedulerConfig`	Default `nodeSelector`, `affinity`, `tolerations`, and `topologySpreadConstraints` inherited by all sub-components (coordinator, gateway, analysis, streamer, OTel collector, tenant operator, cloud UI) when they do not set their own.	(empty)
`global.nsightCloud.securityContext.pod`	Default Pod-level `securityContext` inherited by all sub-components. Defaults provide `runAsNonRoot: true` and the `RuntimeDefault` seccomp profile.	non-root / seccomp
`global.nsightCloud.securityContext.container`	Default container-level `securityContext` inherited by all sub-components. Drops all capabilities and disables privilege escalation.	drop ALL

Coordinator Configuration (nsight-coordinator.*):

Variable	Description	Default
`nsight-coordinator.enabled`	Enable deployment of the NsightCoordinator CR in the operator namespace.	`true`

Additional configuration options can be specified through Helm values that map to fields of the default NsightCoordinator custom resource (CR) created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Gateway Configuration (nsight-gateway.*):

Variable	Description	Default
`nsight-gateway.enabled`	Enable deployment of the NsightGateway CR for REST API gateway access to Coordinator and Analysis services.	`true`

Additional configuration options can be specified through Helm values that map to fields of the default NsightGateway custom resource (CR) created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Cloud Storage Configuration (cloudStorage.*):

Variable	Description	Default
`cloudStorage.enabled`	Enable cloud storage integration for profiling results.	`true`

Additional configuration options can be specified through Helm values that map to fields of the default NsightCloudStorageConfig custom resource (CR) created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

OTLP Collection Configuration (nsight-otel-collector.* and otlpProxyConfig.*):

Variable	Description	Default
`nsight-otel-collector.enabled`	Enable OTLP collector for trace mirroring.	`true`
`otlpProxyConfig.enabled`	Enable OTLP proxy injection for trace mirroring.	`true`

Additional configuration options can be specified through Helm values that map to fields of the default NsightOtelCollector and OTLPProxyConfig custom resources (CRs) created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

STUNner TURN Gateway Configuration:

STUNner install-time settings, such as enabled, installGatewayApiCRDs, protocol, and gatewayAnnotations, live under nsight-tenant-operator.stunner.*. The TURN listener port is configured on nsight-gateway.service.turnPort. The STUNner relay Pod settings, such as resources, tolerations, affinity, securityContext, and containerSecurityContext, live under stunner-gateway-operator.stunnerGatewayOperator.dataplane.spec.*. See STUNner TURN Gateway for examples.

Analysis Configuration (nsight-analysis.*):

Variable	Description	Default
`nsight-analysis.enabled`	Enable analysis service for running recipes.	`true`

Additional configuration options can be specified through Helm values that map to fields of the default NsightAnalysis custom resource (CR) created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Nsight Streamer:

Nsight Streamer is deployed separately by creating a NsightStreamer custom resource rather than through Helm chart values. See Nsight Streamer for deployment instructions and the NsightStreamer CRD Reference for all available fields.

Injector Sub-chart Configuration

These values are prefixed with nsight-injector. when using the parent chart:

Variable	Description	Default
`nsight-injector.nsightToolConfig.nsightToolArgs`	The parameters for Nsight Systems used during profiling. See the Nsight Systems User Guide for available parameters. Placeholders within these parameters will be substituted with their actual values during execution.	`--python-sampling=true --trace-fork-before-exec=true`
`nsight-injector.nsightToolConfig.injectionIncludePatterns`	Regex patterns that specify which processes or commands in the container should be profiled.	`[".*"]`
`nsight-injector.nsightToolConfig.injectionExcludePatterns`	Regex patterns that specify which processes or commands in the container should NOT be profiled.	`[]`
`nsight-injector.defaultInjectionExcludePatterns`	Cluster-wide default regex patterns that are always excluded from injection, even when custom `injectionExcludePatterns` are provided. Intended to skip shells, coreutils, Nsight tools, etc. Set to `[]` to disable.	A preset list including shells (bash, sh, zsh, dash), common utilities, and Nsight tools. See chart `values.yaml`.
`nsight-injector.nsightToolConfig.volumes`	Additional volumes that will be injected into profiled containers.
`nsight-injector.nsightToolConfig.volumeMounts`	Volume mounts that will be injected into profiled containers.
`nsight-injector.nsightToolConfig.env`	Environment variables injected only into the profiled process (not added to container spec). Each item must have `name` and `value` fields.
`nsight-injector.nsightToolConfig.containerEnv`	Environment variables to inject into the target container (added to the Pod spec). Visible to all processes in the container. If a variable is present in both `containerEnv` and `env`, the value from `env` takes precedence for profiled process execution.
`nsight-injector.nsightToolConfig.enableDefault`	Whether to enable the default profiling configuration that is included in the installation.	`true`
`nsight-injector.nsightToolConfig.coordinator`	Enable coordinator mode for on-demand profiling.	`true`
`nsight-injector.nsightToolConfig.otlpMirroringEnabled`	Enable OTLP mirroring for this profile.	`true`
`nsight-injector.machineConfig`	Array of name/value pairs (system configurations) which should be updated before profiling on target nodes (currently, only `kernel.perf_event_paranoid` is supported). See Requirements for x86_64 and Arm SBSA targets on Linux. To prevent the operator from updating node configurations, set `machineConfig: null` in the custom values file.	`[{ name: kernel.perf_event_paranoid, value: 2 }]`
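
The precedence between containerEnv and env described in the table can be pictured with a short Python sketch. It models only the documented rule (for the profiled process, a value from env overrides the same variable from containerEnv). The effective_env helper and the LOG_LEVEL variable are hypothetical, and the injector does not necessarily work this way internally.

```python
def effective_env(container_env, env):
    """Return (container view, profiled-process view) of the environment.

    Both arguments are lists of {"name": ..., "value": ...} items. That shape
    is documented for `env`; using it for `containerEnv` is an assumption.
    """
    container_view = {item["name"]: item["value"] for item in container_env}
    # The profiled process starts from containerEnv, then env overrides on top.
    profiled_view = dict(container_view)
    profiled_view.update({item["name"]: item["value"] for item in env})
    return container_view, profiled_view


if __name__ == "__main__":
    container_env = [{"name": "LOG_LEVEL", "value": "info"}]  # hypothetical variable
    env = [
        {"name": "LOG_LEVEL", "value": "debug"},
        {"name": "NSYS_NVTX_PROFILER_REGISTER_ONLY", "value": "0"},
    ]
    container_view, profiled_view = effective_env(container_env, env)
    print("all processes in the container:", container_view)
    print("profiled process only:", profiled_view)
```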

Readiness Waiter Configuration

The readiness waiter is an init container that the injector adds to profiled Pods. It blocks the main container from starting until the storage configuration, MinIO service, and Coordinator service are reachable. This prevents a profiling session from missing the early stages of a workload when the operator is still starting up (for example, immediately after a rolling restart of the control plane).

Variable	Description	Default
`nsight-injector.readinessWaiter.enabled`	Enable the readiness waiter init container.	`true`
`nsight-injector.readinessWaiter.image`	Python image used to run the readiness checks.	(operator default)
`nsight-injector.readinessWaiter.imagePullPolicy`	Image pull policy.	`IfNotPresent`
`nsight-injector.readinessWaiter.timeout`	Maximum time to wait for dependencies (seconds). Applies per Pod start; after this elapses the waiter either fails or exits successfully depending on `failOnTimeout`.	`300`
`nsight-injector.readinessWaiter.interval`	Seconds between readiness checks while the waiter is active.	`5`
`nsight-injector.readinessWaiter.failOnTimeout`	When `true`, the Pod fails to start if dependencies are not ready within `timeout`. When `false`, the waiter exits successfully on timeout and the container starts – profiling for that container degrades gracefully and is retried on the next collection.	`false`
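
The interplay of timeout, interval, and failOnTimeout can be sketched as a simple polling loop. This is an illustration of the documented behavior only, not the waiter's actual implementation: wait_for_dependencies, its check callback, and the injectable clock and sleep parameters are all hypothetical names introduced here.

```python
import time


def wait_for_dependencies(check, timeout=300, interval=5, fail_on_timeout=False,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Returns True when the dependencies became ready, and False when the wait
    timed out with fail_on_timeout=False (the container then starts anyway).
    Raises TimeoutError when the wait timed out with fail_on_timeout=True.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            if fail_on_timeout:
                raise TimeoutError("dependencies not ready within timeout")
            return False
        sleep(interval)


if __name__ == "__main__":
    # Simulated clock so the example finishes instantly.
    now = [0.0]

    def fake_sleep(seconds):
        now[0] += seconds

    ready = wait_for_dependencies(
        check=lambda: now[0] >= 12,  # pretend dependencies come up after 12 s
        clock=lambda: now[0],
        sleep=fake_sleep,
    )
    print("ready:", ready)
```

With the default failOnTimeout value of false, a timeout corresponds to the False return path: the workload is not blocked indefinitely by an unavailable control plane.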

Supported Placeholders

Placeholder	Replacement
`{NVDT_UID}`	A random alphanumeric string (8 characters)
`{NVDT_PROCESS_NAME}`	The profiled process name
`{NVDT_PROCESS_ID}`	The profiled process id
`{NVDT_TIMESTAMP}`	The UNIX timestamp (in ms)
`%{ANY ENVIRONMENT VARIABLE}`	The value of the named environment variable inside the container, for example `%{NVDT_POD_FULLNAME}`. The NVIDIA Nsight Operator sets the `NVDT_POD_FULLNAME` and `NVDT_CONTAINER_NAME` environment variables.
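
To preview what an output file name might look like after substitution, the placeholders in the table can be emulated in a few lines of Python. This is a sketch under stated assumptions: expand_placeholders is a hypothetical helper, unset environment variables are replaced with an empty string, and unknown {NAME} tokens are left untouched. None of these edge cases are specified in this guide.

```python
import os
import re
import secrets
import string
import time


def expand_placeholders(template, process_name, process_id, environ=None):
    """Hypothetical emulation of the substitutions listed in the table above."""
    environ = os.environ if environ is None else environ
    alphabet = string.ascii_letters + string.digits
    builtins = {
        "NVDT_UID": "".join(secrets.choice(alphabet) for _ in range(8)),
        "NVDT_PROCESS_NAME": process_name,
        "NVDT_PROCESS_ID": str(process_id),
        "NVDT_TIMESTAMP": str(int(time.time() * 1000)),  # UNIX time in ms
    }

    def replace(match):
        is_env, name = match.group(1) == "%", match.group(2)
        if is_env:
            # %{VAR}: environment variable; empty string if unset (assumption).
            return environ.get(name, "")
        # {NVDT_*}: built-in value; unknown names are kept as written.
        return builtins.get(name, match.group(0))

    # A single pass, so substituted values are never expanded a second time.
    return re.sub(r"(%?)\{(\w+)\}", replace, template)


if __name__ == "__main__":
    template = ("/home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_"
                "%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep")
    fake_env = {"NVDT_POD_FULLNAME": "example-pod", "NVDT_CONTAINER_NAME": "main"}
    print(expand_placeholders(template, "tritonserver", 1234, environ=fake_env))
```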

Enabling Profiling on Target Resources

To enable automatic injection for all Pods in a namespace, add the nvidia-nsight-profile=enabled label to the namespace.

kubectl label namespaces <namespace name> nvidia-nsight-profile=enabled

To enable automatic injection for a specific workload in a namespace, add the nvidia-nsight-profile=enabled label to the workload’s Pod template. The injector evaluates newly created Pods, so labeling only the workload metadata does not cause Pods generated by that workload to match the default injection rule.

# Example for a deployment
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'

# Example for a statefulset
kubectl patch statefulset <statefulset-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'

At this point, any new Pod will be considered for injection based on labels and injectionIncludePatterns.

Existing Resources

An already started Pod cannot be injected. After you add or remove namespace labels, Pod template labels, or injection rules, recreate existing Pods so the admission webhook can evaluate the updated configuration.

Sample commands to restart a Pod:

Resource with more than one replica

kubectl rollout restart <resource type>/<resource name>

For example:

kubectl rollout restart deployment/amazing-service

Resource with only one replica

kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1

For example:

kubectl scale deployment/amazing-service --replicas=0
kubectl scale deployment/amazing-service --replicas=1

Advanced Configuration

In Kubernetes environments, managing sidecar injection and profiling configurations can be challenging, particularly in dynamic scenarios where Pods are created by custom resources or controllers. The process requires more than just filtering Pods – it requires selecting the appropriate configuration for each Pod, application, or namespace. While labels offer a basic level of control, they often lack the granularity required for precise targeting and configuration.

NVIDIA Nsight Operator supports the following mechanisms for filtering and targeting Pods for injection:

matchConditions: Specify complex logic using CEL expressions that evaluate Pod metadata and dynamically determine whether a Pod should be injected.
namespaceSelector: Filter Pods based on labels applied to their namespaces.
objectSelector: Filter Pods based on labels applied directly to the Pod objects.

Example Configuration

Below is a sample custom_values_fine_grained.yaml configuration demonstrating the use of these mechanisms for fine-grained injection control.

# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false

  injectionConfig:
    defaultNsightToolConfigRef: "triton-profile"
    nsightToolConfigs:
      - name: "triton-profile"
        nsightToolArgs: "--duration 20 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^/opt/tritonserver/bin/tritonserver.*$"
      - name: "other-profile"
        nsightToolArgs: "--duration 30 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^python MaxText/train.py.*$"
        env:
          - name: NSYS_NVTX_PROFILER_REGISTER_ONLY
            value: "0"
    injectionRules:
      - name: "has-injection-label-or-demo-injection-name"
        matchConditions:
          - name: "has-injection-label-or-demo-injection-name"
            expression: >
              ((has(object.metadata.labels) &&
              'nvidia-nsight-profile' in object.metadata.labels &&
              object.metadata.labels['nvidia-nsight-profile'] == 'enabled') ||
              object.metadata.name.contains('demo-injection'))
      - name: "train-injection"
        nsightToolConfigRef: "other-profile"
        matchConditions:
          - name: "fine-grained"
            expression: |
              (
                object.metadata.generateName.startsWith("example-deployment-name-") &&
                object.metadata.namespace == "example-ns"
              ) ||
              (
                object.metadata.ownerReferences.exists(ref, ref.kind == "DaemonSet" &&
                ref.name == "example-daemonset")
              )
      - name: "namespace-selector-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "object-selector-filter"
        nsightToolConfigRef: "other-profile"
        objectSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "combined-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        objectSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        matchConditions:
          - name: "combined"
            expression: 'object.metadata.name.startsWith("example-pod-prefix-")'

The above configuration customizes profiling parameters for different applications and Pods based on their metadata.

It enables profiling for 20 seconds for all /opt/tritonserver/bin/tritonserver processes in Pods that have the nvidia-nsight-profile=enabled label or whose name contains demo-injection.
It enables profiling for 30 seconds for all python MaxText/train.py processes in:
- Pods in the example-ns namespace whose generated name starts with example-deployment-name-, and Pods owned by the example-daemonset DaemonSet
- Pods in namespaces labeled custom-injection-label=enabled
- Pods labeled custom-injection-label=enabled
- Pods that satisfy all three conditions: the namespace is labeled combined-custom-injection-label=enabled, the Pod is labeled combined-custom-injection-label=enabled, and the Pod name starts with example-pod-prefix-
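
The way a matching rule selects its profile (an explicit nsightToolConfigRef, otherwise defaultNsightToolConfigRef) can be sketched in Python. The dictionary below mirrors part of the sample configuration with the nsightToolArgs values abbreviated, and resolve_profile is a hypothetical helper written for this illustration, not operator code.

```python
# Abbreviated copy of the sample injectionConfig (output paths omitted).
INJECTION_CONFIG = {
    "defaultNsightToolConfigRef": "triton-profile",
    "nsightToolConfigs": [
        {"name": "triton-profile", "nsightToolArgs": "--duration 20 --kill none"},
        {"name": "other-profile", "nsightToolArgs": "--duration 30 --kill none"},
    ],
    "injectionRules": [
        {"name": "has-injection-label-or-demo-injection-name"},
        {"name": "train-injection", "nsightToolConfigRef": "other-profile"},
    ],
}


def resolve_profile(config, rule_name):
    """Return the profile used by a rule, falling back to the default reference."""
    rule = next(r for r in config["injectionRules"] if r["name"] == rule_name)
    ref = rule.get("nsightToolConfigRef", config["defaultNsightToolConfigRef"])
    profiles = {p["name"]: p for p in config["nsightToolConfigs"]}
    return profiles[ref]


if __name__ == "__main__":
    for rule in INJECTION_CONFIG["injectionRules"]:
        profile = resolve_profile(INJECTION_CONFIG, rule["name"])
        print(f"{rule['name']} -> {profile['name']}")
```

The first rule names no profile, so it resolves to triton-profile, which is why the Triton processes get the 20-second capture.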

Multi-Tenant Configuration

NVIDIA Nsight Operator supports multi-tenant environments where different teams or users require separate configurations. The operator can apply different profiles and injection rules based on the namespace or Pod name. Below is a sample custom_values_multi_tenant.yaml configuration that prepares a cluster for this setup. It makes profiling available in all namespaces labeled nvidia-nsight-profile=enabled, but does not by itself enable profiling after installation:

# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false

  clusterWideInjectionFilter:
    matchConditions:
      - name: "is-pod"
        expression: "object.kind == 'Pod'"
      - name: "not-self-managed"
        expression: "!(has(object.metadata.labels) && 'app' in object.metadata.labels && object.metadata.labels['app'] in ['nvidia-nsight-operator', 'nsight-operator'])"
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: "NotIn"
          values:
            - kube-system
            - kube-node-lease
            - kube-public
        - key: nvidia-nsight-profile
          operator: "In"
          values:
            - enabled
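
The namespaceSelector above uses standard Kubernetes label-selector semantics: In requires the label to be present with one of the listed values, while NotIn also matches when the label is absent. The following Python sketch evaluates the same expressions against namespace labels. namespace_matches is a hypothetical helper, and the team-a and team-b namespace names are made up for the example.

```python
# matchExpressions copied from the clusterWideInjectionFilter sample.
CLUSTER_WIDE_EXPRESSIONS = [
    {"key": "kubernetes.io/metadata.name", "operator": "NotIn",
     "values": ["kube-system", "kube-node-lease", "kube-public"]},
    {"key": "nvidia-nsight-profile", "operator": "In", "values": ["enabled"]},
]


def namespace_matches(labels, expressions):
    """Return True if the namespace labels satisfy every expression (logical AND)."""
    for expr in expressions:
        key, operator, values = expr["key"], expr["operator"], expr.get("values", [])
        if operator == "In":
            ok = key in labels and labels[key] in values
        elif operator == "NotIn":
            ok = key not in labels or labels[key] not in values
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    namespaces = {
        "team-a": {"kubernetes.io/metadata.name": "team-a", "nvidia-nsight-profile": "enabled"},
        "team-b": {"kubernetes.io/metadata.name": "team-b"},
        "kube-system": {"kubernetes.io/metadata.name": "kube-system",
                        "nvidia-nsight-profile": "enabled"},
    }
    for name, labels in namespaces.items():
        print(name, namespace_matches(labels, CLUSTER_WIDE_EXPRESSIONS))
```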

To enable profiling in a specific namespace, a user of that namespace adds an NsightOperatorProfileConfig resource containing the profiling configuration. The spec can include any sub-value of nsightToolConfig or injectionProfileConfig (see Advanced Configuration Values) that the installation configuration supports.

Sample custom_installation_injection_config.yaml configuration (deploy it with kubectl apply -n example-ns -f custom_installation_injection_config.yaml):

apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: custom-profile-config
spec:
  defaultNsightToolConfigRef: "update-profile"
  nsightToolConfigs:
    - name: "update-profile"
      nsightToolArgs: "--duration 2 --kill none -o /home/separate_auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
      injectionIncludePatterns:
        - "^/cuda-samples/vectorAdd_forever.*$"
      logOutput: /mnt/nv/out.log
  injectionRules:
    - name: "has-injection-label-or-demo-injection-name"
      enabled: false
    - name: "starts-with-name"
      matchConditions:
      - name: "starts-with-name"
        expression: >
          (has(object.metadata.generateName) &&
          object.metadata.generateName.contains('cuda-vector-add-forever'))

The configuration above enables profiling for 2 seconds for all the /cuda-samples/vectorAdd_forever processes in all the Pods with the generated name containing cuda-vector-add-forever. It also disables the “has-injection-label-or-demo-injection-name” injection configuration (if it was specified in the default cluster-wide configuration or any other NsightOperatorProfileConfig in the Pod’s namespace).

A Pod’s namespace can contain multiple NsightOperatorProfileConfig resources, and NVIDIA Nsight Operator applies the configuration from all of them. Within one namespace, injection configurations must have unique names; the same name may be reused only for a configuration that is disabled.
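
Before applying several NsightOperatorProfileConfig resources to one namespace, the naming rule above can be checked with a small script. This sketch assumes the specs are already loaded as Python dictionaries and that an injection rule without an enabled field counts as enabled. Both are assumptions, and duplicate_rule_names is a hypothetical helper rather than a tool shipped with the operator.

```python
from collections import Counter


def duplicate_rule_names(profile_config_specs):
    """Return names used by more than one enabled injection rule.

    `profile_config_specs` is a list of NsightOperatorProfileConfig specs
    (dicts) from the same namespace. Disabled rules are ignored because the
    guide allows a name to be repeated when the configuration is disabled.
    """
    counts = Counter(
        rule["name"]
        for spec in profile_config_specs
        for rule in spec.get("injectionRules", [])
        if rule.get("enabled", True)  # assumption: missing field means enabled
    )
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    first = {"injectionRules": [
        {"name": "has-injection-label-or-demo-injection-name", "enabled": False},
        {"name": "starts-with-name"},
    ]}
    second = {"injectionRules": [{"name": "starts-with-name"}]}
    print(duplicate_rule_names([first, second]))
```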

Advanced Configuration Values

Variable	Type	Description
`defaultNsightToolConfigRef`	string	The default profile name from the `nsightToolConfigs` section. Used if the `injectionRule` does not explicitly specify a profile name.
`nsightToolConfigs`	list	Defines one or more “profiles”. Each profile describes how the Nsight tool injection should be performed.
`nsightToolConfigs[].name`	string	A unique name identifying the profile.
`nsightToolConfigs[].nsightToolArgs`	string	Parameters for Nsight Systems. See the Nsight Systems User Guide. Placeholders will be substituted with actual values during execution.
`nsightToolConfigs[].injectionIncludePatterns`	list of strings	Regex patterns specifying which processes should be profiled.
`nsightToolConfigs[].injectionExcludePatterns`	list of strings	Regex patterns specifying which processes should NOT be profiled.
`nsightToolConfigs[].env`	list	Environment variables injected only into the profiled process.
`nsightToolConfigs[].containerEnv`	list	Environment variables to inject into the target container (added to Pod spec). Visible to all processes in the container.
`nsightToolConfigs[].volumes`	list	Additional volumes that will be injected into profiled containers.
`nsightToolConfigs[].volumeMounts`	list	Volume mounts that will be injected into profiled containers.
`nsightToolConfigs[].logOutput`	string	Logging output destination. Can be `stdout`, `stderr` or a file path. By default, logging is disabled.
`nsightToolConfigs[].coordinator`	boolean	When true, enables coordinator mode for this profile.
`nsightToolConfigs[].otlpMirroringEnabled`	boolean	Enable or disable OTLP mirroring for this profile.
`nsightToolConfigs[].cloudStorageConfigRef`	string	Reference to a NsightCloudStorageConfig resource in the same namespace.
`injectionRules`	list	A list of rules that determines which Pods should receive injection.
`injectionRules[].name`	string	A unique name identifying this injection rule.
`injectionRules[].nsightToolConfigRef`	string	The name of a specific profile to use if this rule matches. If omitted, the `defaultNsightToolConfigRef` is used.
`injectionRules[].enabled`	boolean	When `false`, disables the injection rule with this name, including a rule of the same name defined in the cluster-wide configuration or in another NsightOperatorProfileConfig in the same namespace.
`injectionRules[].matchConditions`	list	A list of conditions to evaluate using CEL expressions.
`injectionRules[].matchConditions[].name`	string	A name for the match condition.
`injectionRules[].matchConditions[].expression`	string (CEL)	A CEL expression that returns `true` if the Pod should be injected.
`injectionRules[].namespaceSelector`	object	Label selector to match namespace labels.
`injectionRules[].namespaceSelector.matchLabels`	map	Key-value pairs that must be present on the namespace.
`injectionRules[].objectSelector`	object	Label selector to match Pod labels.
`injectionRules[].objectSelector.matchLabels`	map	Key-value pairs that must be present on the Pod.