Installation Guide#

Nsight Operator installation and configuration guide.

Installation and Configuration#

Quick Install#

For a stock installation with the defaults, run:

helm install --wait \
    --namespace nsight-operator \
    --create-namespace \
    nsight-operator \
    https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz

By default, profiling is enabled in Coordinator Mode for all applications running in Pods labeled with nvidia-nsight-profile: enabled. Continue with the sections below to customize the installation.

Customizing Helm Values#

The NVIDIA Nsight Operator can be customized to suit particular needs. You will most likely want to set the nsight-injector.nsightToolConfig.nsightToolArgs and nsight-injector.nsightToolConfig.injectionIncludePatterns values. A values file can be used to set these parameters.

Sample custom_values.yaml. This configuration enables profiling for any instance of yourawesomeapp found in injected Pods, with Python sampling, node-level CUDA graph tracing, and PyTorch tracing turned on. This configuration uses Coordinator Mode.

# Nsight Systems profiling configuration (under nsight-injector sub-chart)
nsight-injector:
  nsightToolConfig:
    nsightToolArgs: "--python-sampling=true --cuda-graph-trace=node --pytorch=autograd-nvtx"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"

Install the chart with this values file:

helm install -f custom_values.yaml \
    --namespace nsight-operator \
    --create-namespace \
    nsight-operator https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz

Sample custom_values_launch.yaml. This configuration switches to Launch Mode: profiling starts automatically when the target process starts and runs for a fixed duration, without any coordinator involvement. Use this mode for unattended captures with a bounded profiling window.

# Nsight Systems profiling configuration in launch mode
nsight-injector:
  nsightToolConfig:
    coordinator: false
    nsightToolArgs: "-t cuda,nvtx,osrt --python-sampling=true --duration=20 --kill=none"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"

Sample custom_values_extended.yaml. This configuration enables profiling, with Python sampling, for any instance of yourawesomeapp running in injected Pods, except for those started with the argumenttoskip argument. The nsys-output-volume volume is mounted at /mnt/nsys/output in all profiled Pods; the Persistent Volume Claim it references must exist in each target namespace for successful operation. Additionally, kernel.perf_event_paranoid is set to -1 on all nodes where profiling is performed.

# Nsight Systems profiling configuration
nsight-injector:
  nsightToolConfig:
    volumes:
      [
        {
          "name": "nsys-output-volume",
          "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
        },
      ]
    volumeMounts:
      [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
    nsightToolArgs: "--python-sampling=true"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"
    injectionExcludePatterns:
      - ".*yourawesomeapp.*argumenttoskip.*"

  machineConfig:
    - name: kernel.perf_event_paranoid
      value: -1

Configuration Values#

The NVIDIA Nsight Operator Helm chart exposes the following groups of values:

  • Operator-level values: Configure the operator controller and general settings.

  • Coordinator values (nsight-coordinator.*): Configure the coordinator deployment for profiling session control.

  • Gateway values (nsight-gateway.*): Configure the Envoy gateway for REST API access.

  • Cloud Storage values (cloudStorage.*): Configure storage for profiling results.

  • OTLP Collector values (nsight-otel-collector.*): Configure the OTLP collector for trace mirroring.

  • OTLP Proxy values (otlpProxyConfig.*): Configure OTLP proxy injection.

  • Analysis values (nsight-analysis.*): Configure the analysis service for running recipes.

  • Injector sub-chart values (nsight-injector.*): Configure injection behavior.

  • STUNner TURN gateway values (nsight-tenant-operator.stunner.* and stunner-gateway-operator.*): Configure the bundled STUNner TURN gateway for Nsight Streamer WebRTC relay.

Operator-level Configuration#

| Variable | Description | Default |
| --- | --- | --- |
| installation.multitenant | Enable multi-tenant mode. When true, the operator controller runs cluster-wide but control-plane components (Coordinator, Gateway, Storage, OTel Collector, Analysis) are provisioned per tenant namespace rather than in the operator namespace. With default values, the operator auto-provisions them the first time a matching Pod is admitted into a namespace; namespace admins can also pre-deploy their own resources. | false |
| leaderElection.enabled | Enable Kubernetes leader election for the operator controller. Safe and recommended for multi-replica deployments; harmless with a single replica. | true |
| leaderElection.resourceName | Name of the Lease used for leader election. | nsight-operator.nvidia.com |
| hostNetwork | Run the operator controller in the host network namespace. Rarely needed. | false |
| global.nsightCloud.schedulerConfig | Default nodeSelector, affinity, tolerations, and topologySpreadConstraints inherited by all sub-components (coordinator, gateway, analysis, streamer, OTel collector, tenant operator, cloud UI) when they do not set their own. | (empty) |
| global.nsightCloud.securityContext.pod | Default Pod-level securityContext inherited by all sub-components. Defaults provide runAsNonRoot: true and the RuntimeDefault seccomp profile. | non-root / seccomp |
| global.nsightCloud.securityContext.container | Default container-level securityContext inherited by all sub-components. Drops all capabilities and disables privilege escalation. | drop ALL |
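
As an illustration, the operator-level values above might be combined in a values file like the sketch below. The node label and the taint key are hypothetical placeholders, and the layout under schedulerConfig is an assumption: it uses the standard Kubernetes Pod scheduling fields that the table names, so check the chart's values.yaml for the exact shape.

```yaml
# Sketch of operator-level overrides; adjust to your cluster.
installation:
  multitenant: true            # provision control-plane components per tenant namespace
leaderElection:
  enabled: true
  resourceName: nsight-operator.nvidia.com
hostNetwork: false
global:
  nsightCloud:
    schedulerConfig:
      nodeSelector:
        node-role.example.com/tools: "true"   # hypothetical node label
      tolerations:
        - key: "example.com/dedicated"        # hypothetical taint key
          operator: "Equal"
          value: "tools"
          effect: "NoSchedule"
```

Because these scheduling defaults are inherited only by sub-components that do not set their own, a component-specific override would take priority over this block.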

Coordinator Configuration (nsight-coordinator.*):

| Variable | Description | Default |
| --- | --- | --- |
| nsight-coordinator.enabled | Enable deployment of the NsightCoordinator CR in the operator namespace. | true |

Additional configuration options can be specified through Helm values that map to the fields of the default NsightCoordinator CR created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Gateway Configuration (nsight-gateway.*):

| Variable | Description | Default |
| --- | --- | --- |
| nsight-gateway.enabled | Enable deployment of the NsightGateway CR for REST API gateway access to Coordinator and Analysis services. | true |

Additional configuration options can be specified through Helm values that map to the fields of the default NsightGateway CR created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Cloud Storage Configuration (cloudStorage.*):

| Variable | Description | Default |
| --- | --- | --- |
| cloudStorage.enabled | Enable cloud storage integration for profiling results. | true |

Additional configuration options can be specified through Helm values that map to the fields of the default NsightCloudStorageConfig CR created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

OTLP Collection Configuration (nsight-otel-collector.* and otlpProxyConfig.*):

| Variable | Description | Default |
| --- | --- | --- |
| nsight-otel-collector.enabled | Enable OTLP collector for trace mirroring. | true |
| otlpProxyConfig.enabled | Enable OTLP proxy injection for trace mirroring. | true |

Additional configuration options can be specified through Helm values that map to the fields of the default NsightOtelCollector and OTLPProxyConfig CRs created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
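
For clusters that do not need trace mirroring, the switches documented in this guide could be turned off together, as in the sketch below. It assumes that the three switches are independent and that the chart tolerates disabling all of them at once, which this guide does not state explicitly.

```yaml
# Sketch: turn off OTLP trace mirroring (assumes the switches can be combined).
nsight-otel-collector:
  enabled: false                  # do not deploy the OTLP collector
otlpProxyConfig:
  enabled: false                  # do not inject the OTLP proxy
nsight-injector:
  nsightToolConfig:
    otlpMirroringEnabled: false   # default profile does not mirror traces
```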

STUNner TURN Gateway Configuration:

STUNner install-time settings, such as enabled, installGatewayApiCRDs, protocol, and gatewayAnnotations, live under nsight-tenant-operator.stunner.*. The TURN listener port is configured on nsight-gateway.service.turnPort. The STUNner relay Pod settings, such as resources, tolerations, affinity, securityContext, and containerSecurityContext, live under stunner-gateway-operator.stunnerGatewayOperator.dataplane.spec.*. See STUNner TURN Gateway for examples.

Analysis Configuration (nsight-analysis.*):

| Variable | Description | Default |
| --- | --- | --- |
| nsight-analysis.enabled | Enable analysis service for running recipes. | true |

Additional configuration options can be specified through Helm values that map to the fields of the default NsightAnalysis CR created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.

Nsight Streamer:

Nsight Streamer is deployed separately using the NsightStreamer CRD rather than through Helm chart values. See Nsight Streamer for deployment instructions and the NsightStreamer CRD Reference for all available fields.

Injector Sub-chart Configuration#

These values are prefixed with nsight-injector. when using the parent chart:

| Variable | Description | Default |
| --- | --- | --- |
| nsight-injector.nsightToolConfig.nsightToolArgs | The parameters for Nsight Systems used during profiling. See the Nsight Systems User Guide for available parameters. Placeholders within these parameters are substituted with their actual values during execution. | --python-sampling=true --trace-fork-before-exec=true |
| nsight-injector.nsightToolConfig.injectionIncludePatterns | Regex patterns that specify which processes or commands in the container should be profiled. | [".*"] |
| nsight-injector.nsightToolConfig.injectionExcludePatterns | Regex patterns that specify which processes or commands in the container should NOT be profiled. | [] |
| nsight-injector.defaultInjectionExcludePatterns | Cluster-wide default regex patterns that are always excluded from injection, even when custom injectionExcludePatterns are provided. Intended to skip shells, coreutils, Nsight tools, etc. Set to [] to disable. | A preset list including shells (bash, sh, zsh, dash), common utilities, and Nsight tools. See chart values.yaml. |
| nsight-injector.nsightToolConfig.volumes | Additional volumes that will be injected into profiled containers. | |
| nsight-injector.nsightToolConfig.volumeMounts | Volume mounts that will be injected into profiled containers. | |
| nsight-injector.nsightToolConfig.env | Environment variables injected only into the profiled process (not added to container spec). Each item must have name and value fields. | |
| nsight-injector.nsightToolConfig.containerEnv | Environment variables to inject into the target container (added to the Pod spec). Visible to all processes in the container. If a variable is present in both containerEnv and env, the value from env takes precedence for profiled process execution. | |
| nsight-injector.nsightToolConfig.enableDefault | Whether to enable the default profiling configuration (included in setup). | true |
| nsight-injector.nsightToolConfig.coordinator | Enable coordinator mode for on-demand profiling. | true |
| nsight-injector.nsightToolConfig.otlpMirroringEnabled | Enable OTLP mirroring for this profile. | true |
| nsight-injector.machineConfig | Array of name/value pairs (system configurations) to update on target nodes before profiling (currently, only kernel.perf_event_paranoid is supported). See Requirements for x86_64 and Arm SBSA targets on Linux. To prevent the operator from updating node configurations, set machineConfig: null in the custom values file. | [{ name: kernel.perf_event_paranoid, value: 2 }] |
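
The difference between env and containerEnv, and the machineConfig: null opt-out, may be easier to see in a values fragment. The sketch below is illustrative only: APP_LOG_LEVEL is a hypothetical application variable, while NSYS_NVTX_PROFILER_REGISTER_ONLY is taken from the fine-grained example later in this guide.

```yaml
# Sketch: per-process vs. per-container environment, and no node changes.
nsight-injector:
  nsightToolConfig:
    containerEnv:
      - name: APP_LOG_LEVEL          # hypothetical; visible to every process in the container
        value: "info"
    env:
      - name: APP_LOG_LEVEL          # same name: this value wins for the profiled process only
        value: "debug"
      - name: NSYS_NVTX_PROFILER_REGISTER_ONLY
        value: "0"
  machineConfig: null                # do not update kernel.perf_event_paranoid on nodes
```

With machineConfig set to null, the nodes must already satisfy the Nsight Systems requirements referenced in the table, since the operator no longer adjusts them.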

Readiness Waiter Configuration#

The readiness waiter is an init container that the injector adds to profiled Pods. It blocks the main container from starting until the storage configuration, MinIO service, and Coordinator service are reachable. This prevents a profiling session from missing the early stages of a workload when the operator is still starting up (for example, immediately after a rolling restart of the control plane).

| Variable | Description | Default |
| --- | --- | --- |
| nsight-injector.readinessWaiter.enabled | Enable the readiness waiter init container. | true |
| nsight-injector.readinessWaiter.image | Python image used to run the readiness checks. | (operator default) |
| nsight-injector.readinessWaiter.imagePullPolicy | Image pull policy. | IfNotPresent |
| nsight-injector.readinessWaiter.timeout | Maximum time to wait for dependencies (seconds). Applies per Pod start; after this elapses the waiter either fails or exits successfully depending on failOnTimeout. | 300 |
| nsight-injector.readinessWaiter.interval | Seconds between readiness checks while the waiter is active. | 5 |
| nsight-injector.readinessWaiter.failOnTimeout | When true, the Pod fails to start if dependencies are not ready within timeout. When false, the waiter exits successfully on timeout and the container starts – profiling for that container degrades gracefully and is retried on the next collection. | false |
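
As a sketch of how these settings interact, the fragment below makes the waiter stricter than the defaults: it waits longer, polls less often, and fails the Pod rather than letting it start unprofiled. The specific numbers are arbitrary examples, not recommendations.

```yaml
# Sketch: stricter readiness waiter (values chosen for illustration).
nsight-injector:
  readinessWaiter:
    enabled: true
    timeout: 600          # seconds to wait per Pod start (default 300)
    interval: 10          # seconds between readiness checks (default 5)
    failOnTimeout: true   # fail the Pod if dependencies are not ready in time
```

Setting failOnTimeout to true trades availability for completeness: a workload will not run at all if the control plane stays unreachable for the whole timeout.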

Supported Placeholders#

| Placeholder | Replacement |
| --- | --- |
| {NVDT_UID} | A random alphanumeric string (8 characters) |
| {NVDT_PROCESS_NAME} | The profiled process name |
| {NVDT_PROCESS_ID} | The profiled process ID |
| {NVDT_TIMESTAMP} | The UNIX timestamp (in ms) |
| %{ANY ENVIRONMENT VARIABLE} | The value of the named environment variable inside the container. The NVIDIA Nsight Operator sets the NVDT_POD_FULLNAME and NVDT_CONTAINER_NAME environment variables. |
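
A typical use of these placeholders is to give each report a unique file name. The sketch below reuses the Launch Mode arguments shown earlier and adds an output path; it assumes a volume is mounted at /mnt/nsys/output (as in the extended sample) and that the -o option is honored in this mode.

```yaml
# Sketch: unique report names per process, Pod, and capture.
nsight-injector:
  nsightToolConfig:
    coordinator: false
    # {…} placeholders are filled in by the operator; %{…} reads a container environment variable.
    nsightToolArgs: "-t cuda,nvtx,osrt --duration=20 --kill=none -o /mnt/nsys/output/{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
```

Combining the timestamp with the random {NVDT_UID} suffix is meant to keep reports from overwriting each other when several processes with the same name start at nearly the same time.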

Enabling Profiling on Target Resources#

To enable automatic injection for all Pods in a namespace, add the nvidia-nsight-profile=enabled label to the namespace.

kubectl label namespaces <namespace name> nvidia-nsight-profile=enabled

To enable automatic injection for a specific workload in a namespace, add the nvidia-nsight-profile=enabled label to the workload’s Pod template. The injector evaluates newly created Pods, so labeling only the workload metadata does not cause Pods generated by that workload to match the default injection rule.

# Example for a deployment
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'

# Example for a statefulset
kubectl patch statefulset <statefulset-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'

At this point, any new Pod will be considered for injection based on labels and injectionIncludePatterns.
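
If the workload is managed declaratively, the same label can be set in the manifest instead of patching it. The sketch below is a minimal Deployment; the names and the image are hypothetical, and only the placement of the label under spec.template.metadata.labels matters.

```yaml
# Sketch: the label goes on the Pod template, not on the Deployment metadata.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
        nvidia-nsight-profile: enabled   # evaluated by the injector on Pod creation
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:latest   # hypothetical image
```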

Existing Resources#

An already started Pod cannot be injected. After you add or remove namespace labels, Pod template labels, or injection rules, recreate existing Pods so the admission webhook can evaluate the updated configuration.

Sample commands to restart a Pod:

  1. Resource with more than one replica

    kubectl rollout restart <resource type>/<resource name>
    

    For example:

    kubectl rollout restart deployment/amazing-service
    
  2. Resource with only one replica

    kubectl scale <resource type>/<resource name> --replicas=0
    kubectl scale <resource type>/<resource name> --replicas=1
    

    For example:

    kubectl scale deployment/amazing-service --replicas=0
    kubectl scale deployment/amazing-service --replicas=1
    

Advanced Configuration#

In Kubernetes environments, managing sidecar injection and profiling configurations can be challenging, particularly in dynamic scenarios where Pods are created by custom resources or controllers. The process requires more than just filtering Pods – it requires selecting the appropriate configuration for each Pod, application, or namespace. While labels offer a basic level of control, they often lack the granularity required for precise targeting and configuration.

NVIDIA Nsight Operator supports the following mechanisms for filtering and targeting Pods for injection:

  • matchConditions: Specify complex logic using CEL expressions that evaluate Pod metadata and dynamically determine injection suitability.

  • namespaceSelector: Filter Pods based on labels applied to their namespaces.

  • objectSelector: Filter Pods based on labels applied directly to the Pod objects.

Example Configuration#

Below is a sample custom_values_fine_grained.yaml configuration demonstrating the use of these mechanisms for fine-grained injection control.

# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false

  injectionConfig:
    defaultNsightToolConfigRef: "triton-profile"
    nsightToolConfigs:
      - name: "triton-profile"
        nsightToolArgs: "--duration 20 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^/opt/tritonserver/bin/tritonserver.*$"
      - name: "other-profile"
        nsightToolArgs: "--duration 30 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^python MaxText/train.py.*$"
        env:
          - name: NSYS_NVTX_PROFILER_REGISTER_ONLY
            value: "0"
    injectionRules:
      - name: "has-injection-label-or-demo-injection-name"
        matchConditions:
        - name: "has-injection-label-or-demo-injection-name"
          expression: >
            ((has(object.metadata.labels) &&
            'nvidia-nsight-profile' in object.metadata.labels &&
            object.metadata.labels['nvidia-nsight-profile'] == 'enabled') ||
            object.metadata.name.contains('demo-injection'))
      - name: "train-injection"
        nsightToolConfigRef: "other-profile"
        matchConditions:
          - name: "fine-grained"
            expression: |
                        (
                          object.metadata.generateName.startsWith("example-deployment-name-") &&
                          object.metadata.namespace == "example-ns"
                        ) ||
                        (
                          object.metadata.ownerReferences.exists(ref, ref.kind == "DaemonSet" &&
                          ref.name == "example-daemonset")
                        )
      - name: "namespace-selector-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "object-selector-filter"
        nsightToolConfigRef: "other-profile"
        objectSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "combined-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        objectSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        matchConditions:
          - name: "combined"
            expression: 'object.metadata.name.startsWith("example-pod-prefix-")'

The above configuration customizes profiling parameters for different applications and Pods based on their metadata.

  1. It enables profiling for 20 seconds for all /opt/tritonserver/bin/tritonserver processes in all Pods that have the nvidia-nsight-profile=enabled label or whose name contains demo-injection.

  2. It enables profiling for 30 seconds for all the python MaxText/train.py processes in:

    • all the Pods with generated name starting with example-deployment-name- in the example-ns namespace or Pods owned by the example-daemonset

    • all the Pods in the namespace with the custom-injection-label=enabled label

    • all the Pods with the custom-injection-label=enabled label

    • all the Pods with the namespace label combined-custom-injection-label=enabled and the Pod label combined-custom-injection-label=enabled and the Pod name starting with example-pod-prefix-
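
To make the last case concrete, the sketch below shows a Namespace and a Pod that would satisfy all three parts of the combined-filter rule. The namespace name, Pod suffix, and image are hypothetical, and whether the process is actually profiled still depends on its command line matching the include pattern of other-profile, which is assumed here.

```yaml
# Sketch: objects matching the "combined-filter" rule (names are illustrative).
apiVersion: v1
kind: Namespace
metadata:
  name: example-combined-ns
  labels:
    combined-custom-injection-label: enabled    # matched by namespaceSelector
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod-prefix-trainer              # matched by the startsWith() condition
  namespace: example-combined-ns
  labels:
    combined-custom-injection-label: enabled    # matched by objectSelector
spec:
  containers:
    - name: trainer
      image: registry.example.com/maxtext:latest    # hypothetical image
      command: ["python", "MaxText/train.py"]       # assumed to match "^python MaxText/train.py.*$"
```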

Multi-Tenant Configuration#

NVIDIA Nsight Operator supports multi-tenant environments where different teams or users require separate configurations. The operator can be configured to apply different profiles and injection rules based on the namespace or Pod name. Below is a sample custom_values_multi_tenant.yaml configuration demonstrating the use of profiles and injection rules for multi-tenant environments. It makes profiling possible in all namespaces with the nvidia-nsight-profile=enabled label, but profiling is not yet enabled after installation:

# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false

  clusterWideInjectionFilter:
    matchConditions:
      - name: "is-pod"
        expression: "object.kind == 'Pod'"
      - name: "not-self-managed"
        expression: "!(has(object.metadata.labels) && 'app' in object.metadata.labels && object.metadata.labels['app'] in ['nvidia-nsight-operator', 'nsight-operator'])"
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: "NotIn"
          values:
            - kube-system
            - kube-node-lease
            - kube-public
        - key: nvidia-nsight-profile
          operator: "In"
          values:
            - enabled

To enable profiling in a specific namespace, a user of that namespace adds an NsightOperatorProfileConfig resource containing the profiling configuration. The spec can include any subvalue of nsightToolConfig or injectionProfileConfig (see Advanced Configuration Values) that the installation configuration supports.

Sample custom_installation_injection_config.yaml configuration (can be deployed by the kubectl apply -n example-ns -f custom_installation_injection_config.yaml command):

apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: custom-profile-config
spec:
  defaultNsightToolConfigRef: "update-profile"
  nsightToolConfigs:
    - name: "update-profile"
      nsightToolArgs: "--duration 2 --kill none -o /home/separate_auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
      injectionIncludePatterns:
        - "^/cuda-samples/vectorAdd_forever.*$"
      logOutput: /mnt/nv/out.log
  injectionRules:
    - name: "has-injection-label-or-demo-injection-name"
      enabled: false
    - name: "starts-with-name"
      matchConditions:
      - name: "starts-with-name"
        expression: >
          (has(object.metadata.generateName) &&
          object.metadata.generateName.contains('cuda-vector-add-forever'))

The configuration above enables profiling for 2 seconds for all the /cuda-samples/vectorAdd_forever processes in all the Pods with the generated name containing cuda-vector-add-forever. It also disables the “has-injection-label-or-demo-injection-name” injection configuration (if it was specified in the default cluster-wide configuration or any other NsightOperatorProfileConfig in the Pod’s namespace).

There can be multiple NsightOperatorProfileConfig resources in a Pod’s namespace. NVIDIA Nsight Operator applies the configuration from all of them. Injection configurations in the same namespace must have unique names; a name may be reused only when the configuration is disabled.
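
For example, a second resource in the same namespace could add another rule as long as its name differs from the ones already defined. The sketch below uses only fields documented in this guide; the resource name, profile name, rule name, include pattern, and label are hypothetical.

```yaml
# Sketch: a second NsightOperatorProfileConfig alongside custom-profile-config.
apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: second-profile-config            # hypothetical name
spec:
  nsightToolConfigs:
    - name: "python-profile"
      nsightToolArgs: "--duration 5 --kill none"
      injectionIncludePatterns:
        - "^python .*$"
  injectionRules:
    - name: "python-workers"             # must not collide with "starts-with-name"
      nsightToolConfigRef: "python-profile"
      objectSelector:
        matchLabels:
          custom-injection-label: enabled
```

It could be applied the same way as the first one, with kubectl apply -n example-ns -f followed by the file name.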

Advanced Configuration Values#

| Variable | Type | Description |
| --- | --- | --- |
| defaultNsightToolConfigRef | string | The default profile name from the nsightToolConfigs section. Used if the injectionRule does not explicitly specify a profile name. |
| nsightToolConfigs | list | Defines one or more “profiles”. Each profile describes how the Nsight tool injection should be performed. |
| nsightToolConfigs[].name | string | A unique name identifying the profile. |
| nsightToolConfigs[].nsightToolArgs | string | Parameters for Nsight Systems. See the Nsight Systems User Guide. Placeholders will be substituted with actual values during execution. |
| nsightToolConfigs[].injectionIncludePatterns | list of strings | Regex patterns specifying which processes should be profiled. |
| nsightToolConfigs[].injectionExcludePatterns | list of strings | Regex patterns specifying which processes should NOT be profiled. |
| nsightToolConfigs[].env | list | Environment variables injected only into the profiled process. |
| nsightToolConfigs[].containerEnv | list | Environment variables to inject into the target container (added to Pod spec). Visible to all processes in the container. |
| nsightToolConfigs[].volumes | list | Additional volumes that will be injected into profiled containers. |
| nsightToolConfigs[].volumeMounts | list | Volume mounts that will be injected into profiled containers. |
| nsightToolConfigs[].logOutput | string | Logging output destination. Can be stdout, stderr or a file path. By default, logging is disabled. |
| nsightToolConfigs[].coordinator | boolean | When true, enables coordinator mode for this profile. |
| nsightToolConfigs[].otlpMirroringEnabled | boolean | Enable or disable OTLP mirroring for this profile. |
| nsightToolConfigs[].cloudStorageConfigRef | string | Reference to a NsightCloudStorageConfig resource in the same namespace. |
| injectionRules | list | A list of rules that determines which Pods should receive injection. |
| injectionRules[].name | string | A unique name identifying this set of injection rules. |
| injectionRules[].nsightToolConfigRef | string | The name of a specific profile to use if this rule matches. If omitted, the defaultNsightToolConfigRef is used. |
| injectionRules[].matchConditions | list | A list of conditions to evaluate using CEL expressions. |
| injectionRules[].matchConditions[].name | string | A name for the match condition. |
| injectionRules[].matchConditions[].expression | string (CEL) | A CEL expression that returns true if the Pod should be injected. |
| injectionRules[].namespaceSelector | object | Label selector to match namespace labels. |
| injectionRules[].namespaceSelector.matchLabels | map | Key-value pairs that must be present on the namespace. |
| injectionRules[].objectSelector | object | Label selector to match Pod labels. |
| injectionRules[].objectSelector.matchLabels | map | Key-value pairs that must be present on the Pod. |
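
The per-profile fields in this table can be combined in a single profile. The sketch below is a minimal example under the same nsight-injector.injectionConfig layout used in the fine-grained sample above; the profile and rule names are hypothetical, and the combination of options is illustrative rather than recommended.

```yaml
# Sketch: one profile using several per-profile options from the table.
nsight-injector:
  nsightToolConfig:
    enableDefault: false                 # rely only on the profile defined below
  injectionConfig:
    defaultNsightToolConfigRef: "coordinated"
    nsightToolConfigs:
      - name: "coordinated"              # hypothetical profile name
        coordinator: true                # on-demand profiling through the coordinator
        otlpMirroringEnabled: false
        logOutput: stderr                # logging is disabled unless a destination is set
        nsightToolArgs: "--python-sampling=true"
        injectionIncludePatterns:
          - ".*yourawesomeapp.*"
    injectionRules:
      - name: "labeled-pods"             # no nsightToolConfigRef, so the default profile applies
        objectSelector:
          matchLabels:
            nvidia-nsight-profile: enabled
```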