Installation Guide#
Nsight Operator installation and configuration guide.
Installation and Configuration#
Quick Install#
For a stock installation with the defaults, run:
helm install --wait \
--namespace nsight-operator \
--create-namespace \
nsight-operator \
https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz
By default, profiling is enabled in Coordinator Mode for all applications
running in Pods labeled with nvidia-nsight-profile: enabled. Continue
with the sections below to customize the installation.
Customizing Helm Values#
The NVIDIA Nsight Operator can be customized to suit particular needs. Most
installations will need to set the
nsight-injector.nsightToolConfig.nsightToolArgs and
nsight-injector.nsightToolConfig.injectionIncludePatterns values.
Set these parameters in a values file and pass it to helm install.
Sample custom_values.yaml. This configuration enables profiling for any
instance of yourawesomeapp running in injected Pods, with Python sampling,
node-level CUDA graph tracing, and PyTorch tracing turned on. This
configuration uses Coordinator Mode.
# Nsight Systems profiling configuration (under nsight-injector sub-chart)
nsight-injector:
  nsightToolConfig:
    nsightToolArgs: "--python-sampling=true --cuda-graph-trace=node --pytorch=autograd-nvtx"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"
helm install -f custom_values.yaml \
--namespace nsight-operator \
--create-namespace \
nsight-operator https://helm.ngc.nvidia.com/nvidia/devtools/charts/nsight-operator-26.2.1.tgz
Sample custom_values_launch.yaml. This configuration switches to Launch
Mode: profiling starts automatically when the target process starts and runs
for a fixed duration, without any coordinator involvement. Use this mode for
unattended captures with a bounded profiling window.
# Nsight Systems profiling configuration in launch mode
nsight-injector:
  nsightToolConfig:
    coordinator: false
    nsightToolArgs: "-t cuda,nvtx,osrt --python-sampling=true --duration=20 --kill=none"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"
Sample custom_values_extended.yaml. This configuration enables profiling for
any instance of yourawesomeapp running in injected Pods, except for those
started with the argumenttoskip argument. Python sampling is enabled. The
nsys-output-volume volume is mounted into all profiled Pods at
/mnt/nsys/output; the Persistent Volume Claim it references must exist in each
target namespace. Additionally, kernel.perf_event_paranoid is set to -1 on all
nodes where profiling is performed.
# Nsight Systems profiling configuration
nsight-injector:
  nsightToolConfig:
    volumes:
      [
        {
          "name": "nsys-output-volume",
          "persistentVolumeClaim": { "claimName": "CSP-managed-disk" },
        },
      ]
    volumeMounts:
      [{ "name": "nsys-output-volume", "mountPath": "/mnt/nsys/output" }]
    nsightToolArgs: "--python-sampling=true"
    injectionIncludePatterns:
      - ".*yourawesomeapp.*"
    injectionExcludePatterns:
      - ".*yourawesomeapp.*argumenttoskip.*"
  machineConfig:
    - name: kernel.perf_event_paranoid
      value: -1
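Include and exclude patterns can be sanity-checked against a command line before installing. The sketch below is a hypothetical helper, not part of the operator. It assumes the patterns behave like POSIX extended regular expressions matched against the full command line, which may differ from the regex dialect and matching rules the injector actually uses.

```shell
# Hypothetical helper (not part of Nsight Operator): returns 0 when a command
# line matches the include pattern and does not match the exclude pattern.
# Assumption: patterns are treated as POSIX extended regular expressions.
would_profile() {
  cmdline=$1
  include=$2
  exclude=${3:-}

  # The command line must match the include pattern.
  printf '%s\n' "$cmdline" | grep -Eq -- "$include" || return 1

  # If an exclude pattern is given, a match there vetoes profiling.
  if [ -n "$exclude" ] && printf '%s\n' "$cmdline" | grep -Eq -- "$exclude"; then
    return 1
  fi
  return 0
}

# Check one command line against the patterns from the sample above.
if would_profile "python yourawesomeapp.py --epochs 3" \
    ".*yourawesomeapp.*" ".*yourawesomeapp.*argumenttoskip.*"; then
  echo "profiled"
else
  echo "skipped"
fi
```

Running the same check with a command line that contains argumenttoskip should report "skipped", which is a quick way to confirm that an exclude pattern is not broader than intended.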
Configuration Values#
The NVIDIA Nsight Operator Helm chart exposes the following groups of values:
Operator-level values: Configure the operator controller and general settings.
Coordinator values (nsight-coordinator.*): Configure the coordinator deployment for profiling session control.
Gateway values (nsight-gateway.*): Configure the Envoy gateway for REST API access.
Cloud Storage values (cloudStorage.*): Configure storage for profiling results.
OTLP Collector values (nsight-otel-collector.*): Configure OTLP collector for trace mirroring.
OTLP Proxy values (otlpProxyConfig.*): Configure OTLP proxy injection.
Injector sub-chart values (nsight-injector.*): Configure injection behavior.
STUNner TURN gateway values (nsight-tenant-operator.stunner.* and stunner-gateway-operator.*): Configure the bundled STUNner TURN gateway for Nsight Streamer WebRTC relay.
Operator-level Configuration#
| Variable | Description | Default |
|---|---|---|
| | Enable multi-tenant mode. When | |
| | Enable Kubernetes leader election for the operator controller. Safe and recommended for multi-replica deployments; harmless with a single replica. | |
| | Name of the Lease used for leader election. | |
| | Run the operator controller in the host network namespace. Rarely needed. | |
| | Default | (empty) |
| | Default Pod-level | non-root / seccomp |
| | Default container-level | drop ALL |
Coordinator Configuration (nsight-coordinator.*):
| Variable | Description | Default |
|---|---|---|
| | Enable deployment of the NsightCoordinator CR in the operator namespace. | |
Additional configuration options can be specified through Helm values that map to the default NsightCoordinator CRD created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
Gateway Configuration (nsight-gateway.*):
| Variable | Description | Default |
|---|---|---|
| | Enable deployment of the NsightGateway CR for REST API gateway access to Coordinator and Analysis services. | |
Additional configuration options can be specified through Helm values that map to the default NsightGateway CRD created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
Cloud Storage Configuration (cloudStorage.*):
| Variable | Description | Default |
|---|---|---|
| | Enable cloud storage integration for profiling results. | |
Additional configuration options can be specified through Helm values that map to the default NsightCloudStorageConfig CRD created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
OTLP Collection Configuration (nsight-otel-collector.* and otlpProxyConfig.*):
| Variable | Description | Default |
|---|---|---|
| | Enable OTLP collector for trace mirroring. | |
| | Enable OTLP proxy injection for trace mirroring. | |
Additional configuration options can be specified through Helm values that map to the default NsightOtelCollector and OTLPProxyConfig CRDs created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
STUNner TURN Gateway Configuration:
STUNner install-time settings, such as enabled,
installGatewayApiCRDs, protocol, and gatewayAnnotations, live under
nsight-tenant-operator.stunner.*. The TURN listener port is configured on
nsight-gateway.service.turnPort. The STUNner relay Pod settings, such as
resources, tolerations, affinity, securityContext, and
containerSecurityContext, live under
stunner-gateway-operator.stunnerGatewayOperator.dataplane.spec.*. See
STUNner TURN Gateway for examples.
Analysis Configuration (nsight-analysis.*):
| Variable | Description | Default |
|---|---|---|
| | Enable analysis service for running recipes. | |
Additional configuration options can be specified through Helm values that map to the default NsightAnalysis CRD created during installation. See the Custom Resource Definition (CRD) Reference section for all available fields and their corresponding Helm value paths.
Nsight Streamer:
Nsight Streamer is deployed separately using the NsightStreamer CRD rather
than through Helm chart values. See Nsight Streamer for deployment
instructions and the NsightStreamer CRD Reference
for all available fields.
Injector Sub-chart Configuration#
These values are prefixed with nsight-injector. when using the parent chart:
| Variable | Description | Default |
|---|---|---|
| | The parameters for Nsight Systems used during profiling. See the Nsight Systems User Guide for available parameters. Placeholders within these parameters will be substituted with their actual values during execution. | |
| | Regex patterns that specify which processes or commands in the container should be profiled. | |
| | Regex patterns that specify which processes or commands in the container should NOT be profiled. | |
| | Cluster-wide default regex patterns that are always excluded from injection, even when custom | A preset list including shells (bash, sh, zsh, dash), common utilities, and Nsight tools. See chart |
| | Additional volumes that will be injected into profiled containers. | |
| | Volume mounts that will be injected into profiled containers. | |
| | Environment variables injected only into the profiled process (not added to container spec). Each item must have | |
| | Environment variables to inject into the target container (added to the Pod spec). Visible to all processes in the container. If a variable is present in both | |
| | Should the default (included in setup) profiling configuration be enabled? | |
| | Enable coordinator mode for on-demand profiling. | |
| | Enable OTLP mirroring for this profile. | |
| | Array of name/value pairs (system configurations) which should be updated before profiling on target nodes (currently, only | |
Readiness Waiter Configuration#
The readiness waiter is an init container that the injector adds to profiled Pods. It blocks the main container from starting until the storage configuration, MinIO service, and Coordinator service are reachable. This prevents a profiling session from missing the early stages of a workload when the operator is still starting up (for example, immediately after a rolling restart of the control plane).
| Variable | Description | Default |
|---|---|---|
| | Enable the readiness waiter init container. | |
| | Python image used to run the readiness checks. | (operator default) |
| | Image pull policy. | |
| | Maximum time to wait for dependencies (seconds). Applies per Pod start; after this elapses the waiter either fails or exits successfully depending on | |
| | Seconds between readiness checks while the waiter is active. | |
| | When | |
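The wait-with-timeout behavior described above can be pictured as a simple polling loop. The following is an illustrative sketch only: the real readiness waiter is an init container managed by the injector, and the function name, argument order, and the "fail open" flag used here are hypothetical stand-ins for the chart values in the table.

```shell
# Illustrative sketch only; names and arguments are hypothetical.
# Usage: wait_for_dependencies TIMEOUT_S INTERVAL_S FAIL_OPEN CHECK_CMD [ARGS...]
# INTERVAL_S must be a positive integer number of seconds.
wait_for_dependencies() {
  timeout_s=$1
  interval_s=$2
  fail_open=$3
  shift 3

  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    # Run the caller-supplied readiness check; succeed as soon as it passes.
    if "$@"; then
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done

  # Timeout reached: either let the workload start anyway (fail open)
  # or report failure so the init step does not complete.
  [ "$fail_open" = "true" ]
}

# Example: a check that passes immediately.
wait_for_dependencies 10 1 false true && echo "dependencies ready"
```

With the flag set to true, a timeout lets the workload start without profiling dependencies. With false, the timeout surfaces as a failed init step instead.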
Supported Placeholders#
| Placeholder | Replacement |
|---|---|
| | The random alphanumeric string (8 symbols) |
| | The profiled process name |
| | The profiled process id |
| | The UNIX timestamp (in ms) |
| | The “ANY ENVIRONMENT VARIABLE” environment variable inside a container. |
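To preview what an output file name might look like once placeholders are substituted, a small local helper can stand in for the operator. This is a hypothetical sketch: the placeholder names are taken from the sample nsightToolArgs values elsewhere in this guide, the sample values are made up, and the operator performs the real substitution at profiling time. Environment-variable placeholders are left untouched by this helper.

```shell
# Hypothetical preview helper: substitutes sample values for placeholders
# that appear in this guide's examples. Values must not contain '|' or '&'.
preview_output_name() {
  template=$1
  process_name=$2
  timestamp_ms=$3
  uid=$4
  # '|' is used as the sed delimiter so that values containing '/' still work.
  printf '%s\n' "$template" | sed \
    -e "s|{NVDT_PROCESS_NAME}|$process_name|g" \
    -e "s|{NVDT_TIMESTAMP}|$timestamp_ms|g" \
    -e "s|{NVDT_UID}|$uid|g"
}

# Sample values below are invented for illustration.
preview_output_name \
  "/home/auto_{NVDT_PROCESS_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep" \
  "tritonserver" "1700000000000" "a1b2c3d4"
# prints /home/auto_tritonserver_1700000000000_a1b2c3d4.nsys-rep
```

Including the timestamp and random string in the output path keeps reports from different processes and restarts from overwriting each other.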
Enabling Profiling on Target Resources#
To enable automatic injection for all Pods in a namespace, add the
nvidia-nsight-profile=enabled label to the namespace.
kubectl label namespaces <namespace name> nvidia-nsight-profile=enabled
To enable automatic injection for a specific workload in a namespace, add the
nvidia-nsight-profile=enabled label to the workload’s Pod template. The
injector evaluates newly created Pods, so labeling only the workload metadata
does not cause Pods generated by that workload to match the default injection
rule.
# Example for a deployment
kubectl patch deployment <deployment-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'
# Example for a statefulset
kubectl patch statefulset <statefulset-name> -p '{"spec":{"template":{"metadata":{"labels":{"nvidia-nsight-profile":"enabled"}}}}}'
At this point, any new Pod will be considered for injection based on labels and
injectionIncludePatterns.
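When the same label has to be applied to several workloads, the patch body can be generated rather than typed by hand. The helper below is a hypothetical convenience, not something shipped with the operator. It only builds the JSON shown in the commands above, and the kubectl invocation is left as a comment because it needs cluster access.

```shell
# Hypothetical helper: builds the Pod template label patch for any
# label key and value.
pod_template_label_patch() {
  printf '{"spec":{"template":{"metadata":{"labels":{"%s":"%s"}}}}}\n' "$1" "$2"
}

patch=$(pod_template_label_patch nvidia-nsight-profile enabled)
echo "$patch"

# To apply it (requires cluster access; <deployment-name> is a placeholder):
# kubectl patch deployment <deployment-name> -p "$patch"
```

Because the label is written under spec.template.metadata.labels, it lands on the Pods created by the workload, which is what the injector evaluates.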
Existing Resources#
An already started Pod cannot be injected. After you add or remove namespace labels, Pod template labels, or injection rules, recreate existing Pods so the admission webhook can evaluate the updated configuration.
Sample commands to restart a Pod:
Resource with more than one replica
kubectl rollout restart <resource type>/<resource name>
For example:
kubectl rollout restart deployment/amazing-service
Resource with only one replica
kubectl scale <resource type>/<resource name> --replicas=0
kubectl scale <resource type>/<resource name> --replicas=1
For example:
kubectl scale deployment/amazing-service --replicas=0
kubectl scale deployment/amazing-service --replicas=1
Advanced Configuration#
In Kubernetes environments, managing sidecar injection and profiling configurations can be challenging, particularly in dynamic scenarios where Pods are created by custom resources or controllers. The process requires more than just filtering Pods – it requires selecting the appropriate configuration for each Pod, application, or namespace. While labels offer a basic level of control, they often lack the granularity required for precise targeting and configuration.
NVIDIA Nsight Operator supports the following mechanisms for filtering and targeting Pods for injection:
matchConditions: Specify complex logic using CEL expressions that evaluate Pod metadata and dynamically determine injection suitability.
namespaceSelector: Filter Pods based on labels applied to their namespaces.
objectSelector: Filter Pods based on labels applied directly to the Pod objects.
Example Configuration#
Below is a sample custom_values_fine_grained.yaml configuration demonstrating
the use of these mechanisms for fine-grained injection control.
# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false
  injectionConfig:
    defaultNsightToolConfigRef: "triton-profile"
    nsightToolConfigs:
      - name: "triton-profile"
        nsightToolArgs: "--duration 20 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^/opt/tritonserver/bin/tritonserver.*$"
      - name: "other-profile"
        nsightToolArgs: "--duration 30 --kill none -o /home/auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
        injectionIncludePatterns:
          - "^python MaxText/train.py.*$"
        env:
          - name: NSYS_NVTX_PROFILER_REGISTER_ONLY
            value: "0"
    injectionRules:
      - name: "has-injection-label-or-demo-injection-name"
        matchConditions:
          - name: "has-injection-label-or-demo-injection-name"
            expression: >
              ((has(object.metadata.labels) &&
              'nvidia-nsight-profile' in object.metadata.labels &&
              object.metadata.labels['nvidia-nsight-profile'] == 'enabled') ||
              object.metadata.name.contains('demo-injection'))
      - name: "train-injection"
        nsightToolConfigRef: "other-profile"
        matchConditions:
          - name: "fine-grained"
            expression: |
              (
                object.metadata.generateName.startsWith("example-deployment-name-") &&
                object.metadata.namespace == "example-ns"
              ) ||
              (
                object.metadata.ownerReferences.exists(ref, ref.kind == "DaemonSet" &&
                ref.name == "example-daemonset")
              )
      - name: "namespace-selector-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "object-selector-filter"
        nsightToolConfigRef: "other-profile"
        objectSelector:
          matchLabels:
            custom-injection-label: enabled
      - name: "combined-filter"
        nsightToolConfigRef: "other-profile"
        namespaceSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        objectSelector:
          matchLabels:
            combined-custom-injection-label: enabled
        matchConditions:
          - name: "combined"
            expression: 'object.metadata.name.startsWith("example-pod-prefix-")'
The above configuration customizes profiling parameters for different applications and Pods based on their metadata.
It enables profiling for 20 seconds for all /opt/tritonserver/bin/tritonserver processes in all Pods with the nvidia-nsight-profile=enabled label or with demo-injection in their name.
It enables profiling for 30 seconds for all python MaxText/train.py processes in:
  all Pods with a generated name starting with example-deployment-name- in the example-ns namespace, or Pods owned by the example-daemonset DaemonSet
  all Pods in a namespace with the custom-injection-label=enabled label
  all Pods with the custom-injection-label=enabled label
  all Pods with the namespace label combined-custom-injection-label=enabled, the Pod label combined-custom-injection-label=enabled, and a Pod name starting with example-pod-prefix-
Multi-Tenant Configuration#
NVIDIA Nsight Operator supports multi-tenant environments where different teams or
users require separate configurations. The operator can apply different profiles
and injection rules based on the namespace or Pod name. Below is a sample
custom_values_multi_tenant.yaml configuration that uses profiles and injection
rules for a multi-tenant environment. It makes profiling available in all
namespaces with the nvidia-nsight-profile=enabled label, but profiling is not
yet enabled after installation:
# Disable the default configuration
nsight-injector:
  nsightToolConfig:
    enableDefault: false
  clusterWideInjectionFilter:
    matchConditions:
      - name: "is-pod"
        expression: "object.kind == 'Pod'"
      - name: "not-self-managed"
        expression: "!(has(object.metadata.labels) && 'app' in object.metadata.labels && object.metadata.labels['app'] in ['nvidia-nsight-operator', 'nsight-operator'])"
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: "NotIn"
          values:
            - kube-system
            - kube-node-lease
            - kube-public
        - key: nvidia-nsight-profile
          operator: "In"
          values:
            - enabled
To enable profiling in a specific namespace, a user of that namespace adds an
NsightOperatorProfileConfig resource containing the profiling configuration.
Its spec can include any of the nsightToolConfig or injectionProfileConfig
subvalues (see Advanced Configuration Values) supported by the installation
configuration.
Sample custom_installation_injection_config.yaml configuration. Deploy it with:
kubectl apply -n example-ns -f custom_installation_injection_config.yaml
apiVersion: nvidia.com/v1
kind: NsightOperatorProfileConfig
metadata:
  name: custom-profile-config
spec:
  defaultNsightToolConfigRef: "update-profile"
  nsightToolConfigs:
    - name: "update-profile"
      nsightToolArgs: "--duration 2 --kill none -o /home/separate_auto_{NVDT_PROCESS_NAME}_%{NVDT_POD_FULLNAME}_%{NVDT_CONTAINER_NAME}_{NVDT_TIMESTAMP}_{NVDT_UID}.nsys-rep"
      injectionIncludePatterns:
        - "^/cuda-samples/vectorAdd_forever.*$"
      logOutput: /mnt/nv/out.log
  injectionRules:
    - name: "has-injection-label-or-demo-injection-name"
      enabled: false
    - name: "starts-with-name"
      matchConditions:
        - name: "starts-with-name"
          expression: >
            (has(object.metadata.generateName) &&
            object.metadata.generateName.contains('cuda-vector-add-forever'))
The configuration above enables profiling for 2 seconds for all the
/cuda-samples/vectorAdd_forever processes in all the Pods with the generated
name containing cuda-vector-add-forever. It also disables the
“has-injection-label-or-demo-injection-name” injection configuration (if it was
specified in the default cluster-wide configuration or any other
NsightOperatorProfileConfig in the Pod’s namespace).
A Pod’s namespace can contain multiple NsightOperatorProfileConfig resources;
NVIDIA Nsight Operator applies the configuration from all of them. Within a
namespace, injection configuration names must be unique. The only exception is
a disabled configuration, which may reuse an existing name.
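The naming rule can be checked before applying several resources to one namespace. The sketch below is a hypothetical pre-flight check, not an operator feature. It assumes the rule names and their enabled state have already been collected by hand into "name enabled" pairs, one per line, and it reports every name that is enabled more than once.

```shell
# Hypothetical pre-flight check. Input on stdin: one "name enabled" pair per
# line, where enabled is "true" or "false". Disabled entries may repeat a
# name, so only enabled entries are counted.
find_conflicting_rules() {
  awk '$2 == "true" { count[$1]++ }
       END { for (name in count) if (count[name] > 1) print name }' | sort
}

# Example input: one name enabled twice, another name only disabled.
printf '%s\n' \
  "starts-with-name true" \
  "has-injection-label-or-demo-injection-name false" \
  "has-injection-label-or-demo-injection-name false" \
  "starts-with-name true" | find_conflicting_rules
# prints starts-with-name
```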
Advanced Configuration Values#
| Variable | Type | Description |
|---|---|---|
| | string | The default profile name from the |
| | list | Defines one or more “profiles”. Each profile describes how the Nsight tool injection should be performed. |
| | string | A unique name identifying the profile. |
| | string | Parameters for Nsight Systems. See the Nsight Systems User Guide. Placeholders will be substituted with actual values during execution. |
| | list of strings | Regex patterns specifying which processes should be profiled. |
| | list of strings | Regex patterns specifying which processes should NOT be profiled. |
| | list | Environment variables injected only into the profiled process. |
| | list | Environment variables to inject into the target container (added to Pod spec). Visible to all processes in the container. |
| | list | Additional volumes that will be injected into profiled containers. |
| | list | Volume mounts that will be injected into profiled containers. |
| | string | Logging output destination. Can be |
| | boolean | When true, enables coordinator mode for this profile. |
| | boolean | Enable or disable OTLP mirroring for this profile. |
| | string | Reference to a NsightCloudStorageConfig resource in the same namespace. |
| | list | A list of rules that determines which Pods should receive injection. |
| | string | A unique name identifying this set of injection rules. |
| | string | The name of a specific profile to use if this rule matches. If omitted, the |
| | list | A list of conditions to evaluate using CEL expressions. |
| | string | A name for the match condition. |
| | string (CEL) | A CEL expression that returns |
| | object | Label selector to match namespace labels. |
| | map | Key-value pairs that must be present on the namespace. |
| | object | Label selector to match Pod labels. |
| | map | Key-value pairs that must be present on the Pod. |