Architecture and Concepts#

Architecture overview, component model, and core concepts for Nsight Operator.

Component Overview#

NVIDIA Nsight Operator is a Kubernetes operator that orchestrates the profiling of containerized workloads using NVIDIA Nsight tools. It is composed of a controller that reconciles a set of Custom Resources, a mutating webhook that injects profiling sidecars into target Pods, and several services that together provide collection control, report storage, analysis, and viewing capabilities.

High-Level Diagram#

The operator controller reconciles every CR into concrete Deployments, Services, and webhooks. The injector mutates profiled Pods; the gateway fronts every HTTP client (CLI, Cloud UI, browser):

                    +----------------------+
                    |  Operator Controller |   reconciles all CRs
                    +----------+-----------+
                               |
    +--------------+-----------+------------+--------------+
    v              v                        v              v
+---------+  +-------------+        +-------------+  +-----------+
|Injector |  |Coordinator  |        |  Gateway    |  | Analysis  |
|(webhook)|  |(REST/ZeroMQ)|<------>|  (Envoy)    |  |(recipes)  |
+----+----+  +------+------+        +------+------+  +-----+-----+
     |              ^                      ^               ^
     | mutates      | agent registers,     | HTTP routes   |
     v              | start/stop           | (CLI, UI)     |
+---------+         |                      |               |
|Target   |---------+                      |               |
|Pod      |------.                         +---------------+
|(+sidecar|      |                         |
| +init)  |      | OTLP                    | MinIO proxy :9000
+----+----+      v                         v
     |     +---------------+         +----------------+
     |     |OtelCollector  |         | CloudStorage   |
     |     |(+ converter)  |-------->| (S3 / MinIO)   |
     |     +---------------+  writes +--------+-------+
     |                                        ^     ^
     | .nsys-rep upload                       |     |
     +----------------------------------------+     |  reads
                                                    |
+-------------+        +------------------+        +---------------+
| CloudUI     |------->| TenantOperator   |------->| NsightStreamer|
| (SPA)       |        | (REST, per-ns)   | creates| (nsys)        |
+-------------+        +------------------+        +-------+-------+

Component Responsibilities#

Operator Controller

The core Kubernetes controller deployed by the Helm chart. It reconciles NVIDIA Custom Resources into Deployments, StatefulSets, Services, ConfigMaps, Secrets, and RBAC. The controller supports leader election to protect against split-brain when run with multiple replicas.

NsightInjector (Mutating Admission Webhook)

A cluster-wide mutating webhook that inspects incoming Pods and, when matching rules apply, injects:

  • An init container that waits for dependencies (storage, coordinator) to be ready (see the readiness waiter).

  • A shared volume with the Nsight Systems CLI binaries.

  • A process hook that transparently launches the target binary under nsys (or via the coordinator) when it matches the configured include patterns.

Rules come from the cluster-wide installation values and namespace-scoped NsightOperatorProfileConfig CRs.

NsightCoordinator

A service that brokers profiling sessions and collections. Clients reach it over REST, while collection agents running inside profiled Pods connect to it over ZeroMQ. When the user invokes profiler-start or profiler-stop, the coordinator pushes the corresponding start/stop command to those agents. See the NsightCoordinator CRD.

NsightGateway (Envoy)

An Envoy-based HTTP gateway that provides a unified entry point for the Coordinator, Analysis, Tenant Operator, Cloud UI, and an integrated MinIO proxy. Handles TLS termination and authentication (API key, JWT, OAuth2). See the NsightGateway CRD and the gateway configuration guide.

NsightAnalysis

A service that runs Nsight Systems Recipes on collected profiles, producing Jupyter notebooks, CSVs, and other outputs. Exposed behind the gateway. See the NsightAnalysis CRD and the Analysis Guide.

NsightOtelCollector + OTLPProxyConfig

An OpenTelemetry Collector (StatefulSet with optional OTLP-to-Nsight converter sidecar) that receives mirrored traces from profiled Pods via an injected OTLP proxy sidecar. Enables trace mirroring to external observability backends while still producing native .nsys-rep reports. See the NsightOtelCollector and OTLPProxyConfig CRDs.

NsightCloudStorageConfig

Configures where profiling reports are stored: an operator-managed MinIO deployment (default) or an external S3-compatible backend. Can be namespace-scoped for per-tenant isolation. See the NsightCloudStorageConfig CRD.

NsightStreamer

Deploys a browser-based remote desktop for Nsight Systems with access to reports in cloud storage. Uses WebRTC and optionally a STUNner TURN gateway for external reachability. See the Nsight Streamer documentation and its CRD.

NsightTenantOperator

A tenant-scoped API used by the Cloud UI to launch and manage streamer instances per session. See the NsightTenantOperator CRD.

NsightCloudUI

A web-based UI for sessions, collections, and analysis jobs. See the NsightCloudUI CRD.

Concepts and Terminology#

This section defines terms used throughout the Nsight Operator documentation.

Profiling Modes#

Coordinator Mode

Profiling starts and stops only in response to explicit user commands (profiler-start / profiler-stop). Injected agents connect to the NsightCoordinator and wait for commands. This is the default and the mode used by all examples in Getting Started.

Launch Mode

Profiling starts automatically when the target process starts and continues until nsys exits or the process terminates. No coordinator is required. Use this mode for unattended captures with a fixed duration (via --duration) or for captures whose lifecycle is bound to the process. Enable it by setting coordinator: false on the profile config.

Sessions#

Session

A logical group of profiling collections identified by a UUID. A session is opened (implicitly by profiler-start or explicitly by session-begin) and can span multiple collections before being ended with session-end. Sessions can be given a human-readable title with session-begin --title to help organize downloaded reports.

Collection

A single start/stop cycle within a session. Each profiler-start / profiler-stop pair produces one collection, which in turn produces one .nsys-rep report per collection agent.
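
The relationship between sessions, collections, and reports can be sketched as a small data model. This is an illustrative sketch only: the class and field names (Session, Collection, agent_ids, report_count) are hypothetical and do not reflect the coordinator's actual schema or API.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Collection:
    """One profiler-start / profiler-stop cycle (hypothetical model)."""
    agent_ids: List[str]  # agents that took part in this collection

    def report_count(self) -> int:
        # One .nsys-rep report is produced per collection agent.
        return len(self.agent_ids)


@dataclass
class Session:
    """A logical group of collections identified by a UUID."""
    title: Optional[str] = None
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    collections: List[Collection] = field(default_factory=list)

    def report_count(self) -> int:
        # A session's reports are the sum over all of its collections.
        return sum(c.report_count() for c in self.collections)


session = Session(title="my run")
session.collections.append(Collection(agent_ids=["agent-a", "agent-b"]))
session.collections.append(Collection(agent_ids=["agent-a"]))
total_reports = session.report_count()
```

In this example the session spans two collections, the first with two agents and the second with one, so three reports are expected in total.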

Collection Agent

The profiling process running inside a target Pod that communicates with the coordinator. One agent per injected container. Visible in the output of nsight_operator.py status.

Tag

A label (default: default) used to partition agents and sessions on a single coordinator. All CLI commands accept --tag to target a specific tag. Useful when multiple disjoint profiling workflows share a coordinator.
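
Tag partitioning can be pictured with a toy in-memory model. The sketch below is not the coordinator's implementation; the class CoordinatorModel, its methods, and the agent names are hypothetical and exist only to show that a command issued for one tag reaches only the agents registered under that tag.

```python
from collections import defaultdict

DEFAULT_TAG = "default"  # the documented default tag name


class CoordinatorModel:
    """Toy stand-in for tag partitioning (not the real service)."""

    def __init__(self):
        self._agents_by_tag = defaultdict(set)

    def register(self, agent_id, tag=DEFAULT_TAG):
        # Agents are grouped under the tag they registered with.
        self._agents_by_tag[tag].add(agent_id)

    def broadcast(self, command, tag=DEFAULT_TAG):
        # Only agents registered under the requested tag receive the command.
        agents = sorted(self._agents_by_tag.get(tag, ()))
        return {agent: command for agent in agents}


coordinator = CoordinatorModel()
coordinator.register("trainer-0")
coordinator.register("trainer-1")
coordinator.register("inference-0", tag="team-b")

started_default = coordinator.broadcast("START")
started_team_b = coordinator.broadcast("START", tag="team-b")
```

Here the two trainer agents receive the command sent to the default tag, while inference-0 is only reached when the command targets team-b.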

Configuration#

Profile Config (NsightOperatorProfileConfig)

A namespace-scoped CR that defines one or more nsight tool configs (reusable profile definitions) and injection rules (which Pods to inject and which tool config to use). See the NsightOperatorProfileConfig CRD.

Nsight Tool Config

A named profile describing how to invoke Nsight Systems: CLI arguments, include/exclude regex patterns, environment variables, volumes, coordinator mode, OTLP mirroring, and storage reference.

Injection Rule

A predicate that decides which Pods receive sidecar injection. Rules can use label selectors, namespace selectors, or CEL expressions against the Pod object.
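
As a rough illustration of the label-selector case, the sketch below applies equality-based matching in the style of Kubernetes matchLabels: every required key/value pair must be present on the Pod. The function name matches_labels is hypothetical, and namespace selectors and CEL expressions are not modeled here.

```python
def matches_labels(pod_labels, match_labels):
    """Return True when every required key/value pair is present on the Pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Rule based on the nvidia-nsight-profile=enabled label used elsewhere in this guide.
RULE = {"nvidia-nsight-profile": "enabled"}

labeled_pod = {"app": "trainer", "nvidia-nsight-profile": "enabled"}
unlabeled_pod = {"app": "trainer"}

inject_labeled = matches_labels(labeled_pod, RULE)
inject_unlabeled = matches_labels(unlabeled_pod, RULE)
```

Extra labels on the Pod do not affect the result; only the pairs named in the rule are compared.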

Injection Include/Exclude Patterns

Regex patterns matched against the executable path and arguments of each child process spawned inside an injected container. The process hook uses them to decide whether to launch that specific process under nsys: a process is profiled only if it matches an include pattern and does not match any exclude pattern. Exclude patterns include a cluster-wide default list of common shells and utilities (sh, bash, ls, etc.) so that helper processes run by the container are not profiled; user-supplied excludes are merged with these defaults.
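
The include/exclude decision can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the process hook's actual code: the function should_profile, the pattern in DEFAULT_EXCLUDES, and the use of unanchored search semantics are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the cluster-wide default exclude list.
DEFAULT_EXCLUDES = [r"(^|/)(sh|bash|ls)(\s|$)"]


def should_profile(command_line, includes, user_excludes=()):
    """Profile only if an include matches and no exclude matches."""
    # User-supplied excludes are merged with the defaults, not substituted.
    excludes = list(DEFAULT_EXCLUDES) + list(user_excludes)
    if any(re.search(pattern, command_line) for pattern in excludes):
        return False
    return any(re.search(pattern, command_line) for pattern in includes)


includes = [r"python3? .*train\.py"]

profile_trainer = should_profile("/usr/bin/python3 train.py --epochs 3", includes)
profile_shell = should_profile("/bin/bash -c 'python3 train.py'", includes)
profile_other = should_profile("/usr/bin/python3 eval.py", includes)
profile_dry_run = should_profile(
    "/usr/bin/python3 train.py --dry-run", includes, user_excludes=[r"--dry-run"]
)
```

Only the first command line is profiled. The bash wrapper is rejected by the default excludes even though its arguments mention train.py; the python3 child it spawns is evaluated separately as its own process.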

Deployment#

Single-tenant

Control plane components (coordinator, gateway, storage, OTel collector) run in the operator namespace and serve the whole cluster.

Multi-tenant

Control plane components run per tenant namespace. The operator runs cluster-wide; each tenant namespace has its own Coordinator, Storage, Gateway, etc.

Auto-provisioning

In multi-tenant mode with default values, the operator automatically creates NsightCoordinator, NsightCloudStorageConfig, NsightGateway, NsightOtelCollector, OTLPProxyConfig, and NsightAnalysis resources in a namespace the first time a Pod matching profiling rules is admitted there. Existing resources are respected and not overwritten.
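
The "existing resources are respected" behavior amounts to a create-if-missing plan. The sketch below is a simplification that assumes at most one resource per kind in a namespace; the function kinds_to_create is hypothetical and does not call the Kubernetes API.

```python
# Kinds the operator auto-provisions, as listed above.
AUTO_PROVISIONED_KINDS = [
    "NsightCoordinator",
    "NsightCloudStorageConfig",
    "NsightGateway",
    "NsightOtelCollector",
    "OTLPProxyConfig",
    "NsightAnalysis",
]


def kinds_to_create(existing_kinds):
    """Return the kinds still missing from a namespace, in a stable order."""
    existing = set(existing_kinds)
    # Anything already present is left untouched rather than overwritten.
    return [kind for kind in AUTO_PROVISIONED_KINDS if kind not in existing]


fresh_namespace = kinds_to_create([])
partly_configured = kinds_to_create(["NsightGateway", "NsightCoordinator"])
```

For a namespace where an admin has already created a gateway and a coordinator, only the remaining four kinds would be created.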

Trace Mirroring#

OTLP Mirroring

An optional feature that mirrors Nsight NVTX ranges to the OTLP protocol via an injected Envoy sidecar. Enables sending traces to the NsightOtelCollector (which can in turn export to Jaeger, Prometheus, etc.) while still producing native .nsys-rep reports. Controlled by otlpMirroringEnabled on each tool config.

OTLP Converter

An optional sidecar in the NsightOtelCollector that converts the OTLP spans buffered by the collector back into .nsys-rep files, so the same report data is available for both observability and native Nsight Systems workflows.

Personas#

Cluster Admin

Has cluster-wide privileges. Installs Nsight Operator with Helm, manages CRDs, configures cluster-wide filters and policies. Required for initial setup in both single- and multi-tenant modes.

Namespace Admin

Has admin rights in a specific namespace. In multi-tenant mode, deploys or verifies per-tenant infrastructure (Coordinator, Storage, Gateway) and authors NsightOperatorProfileConfig CRs to define which Pods can be profiled in the namespace.

Profiling User

Runs workloads and captures profiles. Interacts only with the gateway (via nsight_operator.py) and, optionally, the Nsight Streamer or Nsight Cloud UI to view reports. Does not need cluster-admin access.

Profiling Session Data Flow#

The diagram below shows the lifecycle of a coordinator-mode profiling session.

User / CI         Kubernetes       NsightInjector     Target Pod      NsightCoordinator    Storage
---------         ----------       --------------     ----------      -----------------    -------
    |                  |                  |                |                  |               |
    | label Pod        |                  |                |                  |               |
    |----------------->|                  |                |                  |               |
    |                  | admit Pod        |                |                  |               |
    |                  |----------------->|                |                  |               |
    |                  |                  | mutate:        |                  |               |
    |                  |                  |  +init waiter  |                  |               |
    |                  |                  |  +nsys volume  |                  |               |
    |                  |                  |  +process hook |                  |               |
    |                  |<-----------------|                |                  |               |
    |                  | start Pod        |                |                  |               |
    |                  | - - - - - - - - - - - - - - - - ->|                  |               |
    |                  |                  |                | boot             |               |
    |                  |                  |                |                  |               |
    |                  |                  |                | agent registers  |               |
    |                  |                  |                |----------------->|               |
    |                  |                  |                |                  |               |
    |                  |                  |                |                  |               |
    | configure CLI (autoconfigure / configure + login)    |                  |               |
    |------------------------>            |                |                  |               |
    |                  |                  |                |                  |               |
    | session-begin --title "my run"  (optional)           |                  |               |
    |------------------------------------------------------------------------> open session   |
    |                  |                  |                |                  |               |
    | profiler-start   |                  |                |                  |               |
    |------------------------------------------------------------------------>|               |
    |                  |                  |                |                  |               |
    |                  |                   <---------broadcast START----------|               |
    |                  |              collection           |                  |               |
    |                  |              starts in agent      |                  |               |
    |                  |                  |                |                  |               |
    | (workload runs; NVTX ranges optionally mirrored via OTLP proxy sidecar -> OtelCollector |
    |                  |                  |                |                  |               |
    | profiler-stop    |                  |                |                  |               |
    |------------------------------------------------------------------------>|               |
    |                  |                  |                |                  |               |
    |                  |                  |                |                  |               |
    |                  |                   <---------broadcast STOP-----------|               |
    |                  |              agent finalizes      |                  |               |
    |                  |                   -------- .nsys-rep ---- upload ------------------->|
    |                  |                  |                |                  |               |
    |<-------------------------- ls / download --------------- gateway ---- cloud storage ----+

Viewing and Analyzing Reports#

The diagram below shows how a previously captured .nsys-rep is analyzed or viewed after profiling has stopped.

User / CI              Storage            Nsight Streamer        Nsight Analysis
---------              -------            ---------------        ---------------
    |                     |                      |                      |
    | analysis run <recipe>                      |                      |
    |------------------------------------------------------------------>|
    |                     |                      |                      |
    |                     |<---------- read .nsys-rep ------------------|
    |                     |                      |                      |
    |<------------------------- Jupyter notebook  ----------------------|
    |                     |                      |                      |
    | Click View Traces in UI or                 |                      |
    | create NsightStreamer CR                   |                      |
    |------------------------------------------->|                      |
    |                     |                      |                      |
    |                     |<-- read .nsys-rep ---|                      |
    |                     |                      |                      |
    |<-------- browser view of .nsys-rep --------|                      |
    |                     |                      |                      |

Steps in Detail#

  1. Label / target a workload. Add the nvidia-nsight-profile=enabled label (or configure a custom NsightOperatorProfileConfig rule) so the admission webhook matches the Pod. The webhook mutates Pods only at admission time, so Pods that already exist must be restarted before they are injected.

  2. Admission + injection. The webhook mutates the incoming Pod to add:

    • A readiness waiter init container that blocks start-up until storage, MinIO, and the coordinator service are reachable.

    • A volume containing the Nsight Systems CLI binaries.

    • A process hook that wraps matching child processes under nsys (in coordinator mode, under the coordinator-provided command).

  3. Agent registration. Once the Pod starts, the Nsight Systems agent registers with the NsightCoordinator (ZeroMQ, CURVE-encrypted by default). The agent is now visible in nsight_operator.py status.

  4. CLI configuration. The profiling user points the CLI at the gateway using autoconfigure (reads cluster state) or configure --gw (manual). Credentials and endpoints are stored in ~/.nsight-cloud.conf.

  5. Session + collection. profiler-start creates a session (if none is active) and begins a collection. The coordinator broadcasts a start command to all registered agents for the current tag.

  6. Workload runs. While agents capture CPU/GPU samples, optional OTLP mirroring streams NVTX ranges to the NsightOtelCollector via the injected Envoy proxy sidecar.

  7. Stop + upload. profiler-stop ends the collection. Each agent finalizes its .nsys-rep file and uploads it to the configured cloud storage backend.

  8. List / download / analyze. Reports are listed via ls, downloaded via download, or analyzed via analysis run <recipe>. The analysis service outputs Jupyter notebooks that can be rendered by the built-in gateway UI or opened locally.

  9. View reports. Instead of downloading, click View Traces in the UI or create an NsightStreamer CR to browse reports in a web-based Nsight Systems GUI hosted inside the cluster.

  10. End session (optional). session-end closes the session. Ended sessions remain listable and downloadable until the storage backend retention policy reclaims them.
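
The CLI portion of the steps above can be laid out as an ordered list of invocations. The sketch below only builds the command lines and does not run them. The subcommand names come from this guide; the python nsight_operator.py invocation, the placement of --tag after the subcommand, and the helper coordinator_mode_commands are assumptions. The download step is omitted because its arguments are not described here, and "<recipe>" is a placeholder to be replaced with a real recipe name.

```python
import shlex

# Assumption: the CLI script is invoked through the Python interpreter.
CLI = ["python", "nsight_operator.py"]


def coordinator_mode_commands(recipe, title=None, tag=None):
    """Return the ordered CLI invocations for one session with one collection."""
    steps = [["autoconfigure"]]
    if title is not None:
        # Optional: open the session explicitly with a human-readable title.
        steps.append(["session-begin", "--title", title])
    steps += [
        ["status"],          # confirm that agents have registered
        ["profiler-start"],  # opens a session if none is active
        ["profiler-stop"],   # agents finalize and upload their reports
        ["ls"],
        ["analysis", "run", recipe],
        ["session-end"],
    ]
    # Assumption: --tag is appended after the subcommand and its arguments.
    tag_args = ["--tag", tag] if tag else []
    return [CLI + step + tag_args for step in steps]


commands = coordinator_mode_commands("<recipe>", title="my run")
printable = [shlex.join(argv) for argv in commands]
```

Joining each entry with shlex.join yields shell-safe strings that can be reviewed or pasted into a CI script once the placeholder has been filled in.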