Architecture and Concepts#
Architecture overview, component model, and core concepts for Nsight Operator.
Component Overview#
NVIDIA Nsight Operator is a Kubernetes operator that orchestrates the profiling of containerized workloads using NVIDIA Nsight tools. It is composed of a controller that reconciles a set of Custom Resources, a mutating webhook that injects profiling sidecars into target Pods, and several services that together provide collection control, report storage, analysis, and viewing capabilities.
High-Level Diagram#
The operator controller reconciles every CR into concrete Deployments, Services, and webhooks. The injector mutates profiled Pods; the gateway fronts every HTTP client (CLI, Cloud UI, browser):
                     +----------------------+
                     | Operator Controller  |  reconciles all CRs
                     +----------+-----------+
                                |
     +--------------+-----------+----------+----------------+
     v              v                      v                v
+---------+  +-------------+        +-------------+   +-----------+
|Injector |  |Coordinator  |        |   Gateway   |   | Analysis  |
|(webhook)|  |(REST/ZeroMQ)|<------>|   (Envoy)   |   |(recipes)  |
+----+----+  +------+------+        +------+------+   +-----+-----+
     |              ^                      ^                ^
     | mutates      | agent registers,     | HTTP routes    |
     v              | start/stop           | (CLI, UI)      |
+---------+         |                      |                |
|Target   |---------+                      |                |
|Pod      |------.                         +----------------+
|(+sidecar|      |                         |
| +init)  |      | OTLP                    | MinIO proxy :9000
+----+----+      v                         v
     |    +---------------+         +----------------+
     |    |OtelCollector  |         |  CloudStorage  |
     |    |(+ converter)  |-------->|  (S3 / MinIO)  |
     |    +---------------+  writes +--------+-------+
     |                                       ^      ^
     | .nsys-rep upload                      |      |
     +---------------------------------------+      | reads
                                                    |
+-------------+        +------------------+        +---------------+
|   CloudUI   |------->|  TenantOperator  |------->| NsightStreamer|
|    (SPA)    |        |  (REST, per-ns)  | creates|    (nsys)     |
+-------------+        +------------------+        +-------+-------+
Component Responsibilities#
- Operator Controller
The core Kubernetes controller deployed by the Helm chart. It reconciles NVIDIA Custom Resources into Deployments, StatefulSets, Services, ConfigMaps, Secrets, and RBAC. The controller supports leader election to protect against split-brain when run with multiple replicas.
- NsightInjector (Mutating Admission Webhook)
A cluster-wide mutating webhook that inspects incoming Pods and, when matching rules apply, injects:
An init container that waits for dependencies (storage, coordinator) to be ready (see the readiness waiter).
A shared volume with the Nsight Systems CLI binaries.
A process hook that transparently launches the target binary under nsys (or via the coordinator) when it matches the configured include patterns.
Rules come from the cluster-wide installation values and namespace-scoped NsightOperatorProfileConfig CRs.
- NsightCoordinator
A REST service that brokers profiling sessions and collections. Collection agents running inside profiled Pods connect to the coordinator, which pushes start/stop commands when the user invokes profiler-start or profiler-stop. See the NsightCoordinator CRD.
- NsightGateway (Envoy)
An Envoy-based HTTP gateway that provides a unified entry point for the Coordinator, Analysis, Tenant Operator, Cloud UI, and an integrated MinIO proxy. Handles TLS termination and authentication (API key, JWT, OAuth2). See the NsightGateway CRD and the gateway configuration guide.
- NsightAnalysis
A service that runs Nsight Systems Recipes on collected profiles, producing Jupyter notebooks, CSVs, and other outputs. Exposed behind the gateway. See the NsightAnalysis CRD and the Analysis Guide.
- NsightOtelCollector + OTLPProxyConfig
An OpenTelemetry Collector (a StatefulSet with an optional OTLP-to-Nsight converter sidecar) that receives mirrored traces from profiled Pods via an injected OTLP proxy sidecar. It enables trace mirroring to external observability backends while still producing native .nsys-rep reports. See the NsightOtelCollector and OTLPProxyConfig CRDs.
- NsightCloudStorageConfig
Configures where profiling reports are stored: an operator-managed MinIO deployment (default) or an external S3-compatible backend. Can be namespace-scoped for per-tenant isolation. See the NsightCloudStorageConfig CRD.
- NsightStreamer
Deploys a browser-based remote desktop for Nsight Systems with access to reports in cloud storage. Uses WebRTC and optionally a STUNner TURN gateway for external reachability. See the Nsight Streamer documentation and its CRD.
- NsightTenantOperator
A tenant-scoped API used by the Cloud UI to launch and manage streamer instances per session. See the NsightTenantOperator CRD.
- NsightCloudUI
A web-based UI for sessions, collections, and analysis jobs. See the NsightCloudUI CRD.
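The readiness waiter that the injector adds as an init container can be pictured as a simple polling loop. The sketch below is illustrative only and is not the operator's implementation: the function names, the TCP-connect probe, the retry budget, and the host name `coordinator.example` with port 8080 are all assumptions made for the example.

```python
import socket
import time


def tcp_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_dependencies(endpoints, probe=tcp_reachable, interval=1.0,
                          max_attempts=60, sleep=time.sleep):
    """Block until every (host, port) endpoint is reachable.

    Returns True once all endpoints respond, or False when the retry
    budget is exhausted.
    """
    pending = list(endpoints)
    for _ in range(max_attempts):
        # Keep only the endpoints that are still unreachable.
        pending = [ep for ep in pending if not probe(*ep)]
        if not pending:
            return True
        sleep(interval)
    return False


# Demo with a stubbed probe so the sketch runs without a cluster:
# the fake dependency becomes reachable on the third check.
attempts = {"count": 0}


def fake_probe(host, port):
    attempts["count"] += 1
    return attempts["count"] >= 3


# Hypothetical endpoint; real service names depend on the installation.
ready = wait_for_dependencies([("coordinator.example", 8080)],
                              probe=fake_probe, sleep=lambda _: None)
print(ready, attempts["count"])
```

Passing the probe and sleep functions as parameters keeps the loop testable without network access; a real waiter would use the default TCP probe or an HTTP health check.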
Concepts and Terminology#
This page defines terms used throughout the Nsight Operator documentation.
Profiling Modes#
- Coordinator Mode
Profiling starts and stops only in response to explicit user commands (profiler-start / profiler-stop). Injected agents connect to the NsightCoordinator and wait for commands. This is the default and the mode used by all examples in Getting Started.
- Launch Mode
Profiling starts automatically when the target process starts and continues until nsys exits or the process terminates. No coordinator is required. Use it when you want unattended captures with a fixed duration (via --duration) or a capture lifecycle bound to the process. Enable it by setting coordinator: false on the profile config.
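The difference between the two modes can be illustrated by replaying the same lifecycle events under each one. This is a simplified model, not operator code: the event names and the `capture_timeline` helper are assumptions for the example, and it ignores a fixed --duration or nsys exiting early.

```python
from enum import Enum


class Mode(Enum):
    COORDINATOR = "coordinator"  # default: wait for explicit commands
    LAUNCH = "launch"            # coordinator: false on the profile config


def capture_timeline(mode, events):
    """Replay lifecycle events; return whether a capture is active after each."""
    running = False    # is the target process alive?
    capturing = False
    timeline = []
    for event in events:
        if event == "process-start":
            running = True
            # Launch mode begins capturing as soon as the process starts.
            capturing = mode is Mode.LAUNCH
        elif event == "process-exit":
            running = False
            capturing = False
        elif mode is Mode.COORDINATOR and running:
            # Only coordinator mode reacts to user commands.
            if event == "profiler-start":
                capturing = True
            elif event == "profiler-stop":
                capturing = False
        timeline.append(capturing)
    return timeline


events = ["process-start", "profiler-start", "profiler-stop", "process-exit"]
coordinator_timeline = capture_timeline(Mode.COORDINATOR, events)
launch_timeline = capture_timeline(Mode.LAUNCH, events)
print(coordinator_timeline)
print(launch_timeline)
```

In this model the coordinator-mode capture is active only between profiler-start and profiler-stop, while the launch-mode capture spans the whole process lifetime.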
Sessions#
- Session
A logical group of profiling collections identified by a UUID. A session is opened (implicitly by profiler-start or explicitly by session-begin) and can span multiple collections before being ended with session-end. Sessions can be given a human-readable title with session-begin --title to help organize downloaded reports.
- Collection
A single start/stop cycle within a session. Each profiler-start / profiler-stop pair produces one collection, which in turn produces one .nsys-rep report per collection agent.
- Collection Agent
The profiling process running inside a target Pod that communicates with the coordinator. One agent per injected container. Visible in the output of nsight_operator.py status.
- Tag
A label (default: default) used to partition agents and sessions on a single coordinator. All CLI commands accept --tag to target a specific tag. Useful when multiple disjoint profiling workflows share a coordinator.
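The relationship between tags, sessions, collections, and agents can be sketched as a small in-memory model. Everything here is hypothetical and exists only to show the cardinalities described above: the class and method names, the agent names, and the report file names are not the coordinator's actual API or naming scheme.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Collection:
    agents: list  # agents that received the start command

    def reports(self):
        # One .nsys-rep report per collection agent (file names are made up).
        return [f"{agent}.nsys-rep" for agent in self.agents]


@dataclass
class Session:
    tag: str = "default"
    title: str = ""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    collections: list = field(default_factory=list)


class Coordinator:
    def __init__(self):
        self.agents = {}    # tag -> list of registered agent names
        self.sessions = {}  # tag -> active Session

    def register(self, agent, tag="default"):
        self.agents.setdefault(tag, []).append(agent)

    def profiler_start(self, tag="default"):
        # A session is opened implicitly if none is active for this tag.
        session = self.sessions.setdefault(tag, Session(tag=tag))
        session.collections.append(Collection(list(self.agents.get(tag, []))))
        return session

    def profiler_stop(self, tag="default"):
        # Closing a collection yields the reports it produced.
        return self.sessions[tag].collections[-1].reports()


coord = Coordinator()
coord.register("trainer-0")
coord.register("trainer-1")
coord.register("etl-0", tag="nightly")

session = coord.profiler_start()       # implicit session-begin
first_reports = coord.profiler_stop()
coord.profiler_start()                 # second collection, same session
coord.profiler_stop()
nightly_reports = (coord.profiler_start(tag="nightly"), coord.profiler_stop(tag="nightly"))[1]
print(session.id, len(session.collections), first_reports, nightly_reports)
```

Two start/stop cycles under the default tag yield one session with two collections, and the agent registered under the `nightly` tag is never touched by commands sent to the default tag.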
Configuration#
- Profile Config (NsightOperatorProfileConfig)
A namespace-scoped CR that defines one or more nsight tool configs (reusable profile definitions) and injection rules (which Pods to inject and which tool config to use). See the NsightOperatorProfileConfig CRD.
- Nsight Tool Config
A named profile describing how to invoke Nsight Systems: CLI arguments, include/exclude regex patterns, environment variables, volumes, coordinator mode, OTLP mirroring, and storage reference.
- Injection Rule
A predicate that decides which Pods receive sidecar injection. Rules can use label selectors, namespace selectors, or CEL expressions against the Pod object.
- Injection Include/Exclude Patterns
Regex patterns matched against the executable path and arguments of each child process spawned inside an injected container. The process hook uses them to decide whether to launch that specific process under nsys: a process is profiled only if it matches an include pattern and does not match any exclude pattern. Exclude patterns include a cluster-wide default list of common shells and utilities (sh, bash, ls, etc.) so that helper processes run by the container are not profiled; user-supplied excludes are merged with these defaults.
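The include/exclude decision can be written as a short predicate. This is a sketch under stated assumptions rather than the process hook's real logic: the default exclude regex is a stand-in built from the three utilities named above, the patterns are applied with search semantics to the joined command line, and `should_profile` is a hypothetical name.

```python
import re

# Assumed stand-in for the cluster-wide default excludes (sh, bash, ls).
DEFAULT_EXCLUDES = [r"(^|/)(sh|bash|ls)( |$)"]


def should_profile(command_line, includes, user_excludes=()):
    """Return True if the process should be launched under nsys."""
    # User-supplied excludes are merged with the defaults.
    excludes = [*DEFAULT_EXCLUDES, *user_excludes]
    if any(re.search(pattern, command_line) for pattern in excludes):
        return False
    # Profile only when at least one include pattern matches.
    return any(re.search(pattern, command_line) for pattern in includes)


includes = [r"python3? .*train\.py"]
decisions = {
    cmd: should_profile(cmd, includes, user_excludes=[r"--dry-run"])
    for cmd in [
        "/usr/bin/python3 train.py --epochs 3",
        "/bin/bash -c 'python3 train.py'",
        "/usr/bin/python3 prepare_data.py",
        "/usr/bin/python3 train.py --dry-run",
    ]
}
for cmd, profiled in decisions.items():
    print(profiled, cmd)
```

Only the first command is profiled: the bash wrapper is caught by the default excludes even though its arguments mention train.py, the data preparation script matches no include, and the dry run is removed by the user-supplied exclude.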
Deployment#
- Single-tenant
Control plane components (coordinator, gateway, storage, OTel collector) run in the operator namespace and serve the whole cluster.
- Multi-tenant
Control plane components run per tenant namespace. The operator runs cluster-wide; each tenant namespace has its own Coordinator, Storage, Gateway, etc.
- Auto-provisioning
In multi-tenant mode with default values, the operator automatically creates NsightCoordinator, NsightCloudStorageConfig, NsightGateway, NsightOtelCollector, OTLPProxyConfig, and NsightAnalysis resources in a namespace the first time a Pod matching profiling rules is admitted there. Existing resources are respected and not overwritten.
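The create-if-missing behaviour of auto-provisioning can be sketched as a planning function. The list of kinds comes from the description above; the function name and the idea of representing a namespace as a list of existing kinds are simplifications assumed for the example.

```python
AUTO_PROVISIONED_KINDS = [
    "NsightCoordinator",
    "NsightCloudStorageConfig",
    "NsightGateway",
    "NsightOtelCollector",
    "OTLPProxyConfig",
    "NsightAnalysis",
]


def plan_auto_provisioning(existing_kinds):
    """Return the kinds to create in a namespace, skipping existing ones."""
    existing = set(existing_kinds)
    return [kind for kind in AUTO_PROVISIONED_KINDS if kind not in existing]


# A namespace admin pre-created a custom gateway; it is left untouched.
to_create = plan_auto_provisioning(["NsightGateway"])
print(to_create)
```

Because the plan only ever adds missing kinds, running it again after everything exists produces an empty list, which matches the statement that existing resources are not overwritten.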
Trace Mirroring#
- OTLP Mirroring
An optional feature that mirrors Nsight NVTX ranges to the OTLP protocol via an injected Envoy sidecar. Enables sending traces to the NsightOtelCollector (which can in turn export to Jaeger, Prometheus, etc.) while still producing native .nsys-rep reports. Controlled by otlpMirroringEnabled on each tool config.
- OTLP Converter
An optional sidecar in the NsightOtelCollector that converts the OTLP spans buffered by the collector back into .nsys-rep files, so the same report data is available for both observability and native Nsight Systems workflows.
Personas#
- Cluster Admin
Has cluster-wide privileges. Installs Nsight Operator with Helm, manages CRDs, configures cluster-wide filters and policies. Required for initial setup in both single- and multi-tenant modes.
- Namespace Admin
Has admin rights in a specific namespace. In multi-tenant mode, deploys or verifies per-tenant infrastructure (Coordinator, Storage, Gateway) and authors NsightOperatorProfileConfig CRs to define who can be profiled in the namespace.
- Profiling User
Runs workloads and captures profiles. Interacts only with the gateway (via nsight_operator.py) and, optionally, the Nsight Streamer or Nsight Cloud UI to view reports. Does not need cluster-admin access.
Profiling Session Data Flow#
The diagram below shows the lifecycle of a coordinator-mode profiling session.
User / CI          Kubernetes         NsightInjector     Target Pod         NsightCoordinator  Storage
---------          ----------         --------------     ----------         -----------------  -------
|                  |                  |                  |                  |                  |
| label Pod        |                  |                  |                  |                  |
|----------------->|                  |                  |                  |                  |
|                  | admit Pod        |                  |                  |                  |
|                  |----------------->|                  |                  |                  |
|                  |                  | mutate:          |                  |                  |
|                  |                  | +init waiter     |                  |                  |
|                  |                  | +nsys volume     |                  |                  |
|                  |                  | +process hook    |                  |                  |
|                  |<-----------------|                  |                  |                  |
|                  | start Pod        |                  |                  |                  |
|                  | - - - - - - - - - - - - - - - - - ->|                  |                  |
|                  |                  |                  | boot             |                  |
|                  |                  |                  |                  |                  |
|                  |                  |                  | agent registers  |                  |
|                  |                  |                  |----------------->|                  |
|                  |                  |                  |                  |                  |
|                  |                  |                  |                  |                  |
| configure CLI (autoconfigure / configure + login)      |                  |                  |
|------------------------>            |                  |                  |                  |
|                  |                  |                  |                  |                  |
| session-begin --title "my run" (optional)              |                  |                  |
|-------------------------------------------------------------------------->| open session     |
|                  |                  |                  |                  |                  |
| profiler-start   |                  |                  |                  |                  |
|-------------------------------------------------------------------------->|                  |
|                  |                  |                  |                  |                  |
|                  |                     <---------broadcast START----------|                  |
|                  |                     collection      |                  |                  |
|                  |                     starts in agent |                  |                  |
|                  |                  |                  |                  |                  |
| (workload runs; NVTX ranges optionally mirrored via OTLP proxy sidecar -> OtelCollector)     |
|                  |                  |                  |                  |                  |
| profiler-stop    |                  |                  |                  |                  |
|-------------------------------------------------------------------------->|                  |
|                  |                  |                  |                  |                  |
|                  |                  |                  |                  |                  |
|                  |                     <---------broadcast STOP-----------|                  |
|                  |                     agent finalizes |                  |                  |
|                  |                        -------- .nsys-rep ---- upload ------------------->|
|                  |                  |                  |                  |                  |
|<-------------------------- ls / download ----------------- gateway ------ cloud storage -----+
Viewing and Analyzing Reports#
The diagram below shows how a previously captured .nsys-rep report is analyzed or viewed after profiling has stopped.
User / CI             Storage                Nsight Streamer        Nsight Analysis
---------             -------                ---------------        ---------------
|                     |                      |                      |
| analysis run <recipe>                      |                      |
|------------------------------------------------------------------>|
|                     |                      |                      |
|                     |<---------- read .nsys-rep ------------------|
|                     |                      |                      |
|<------------------------- Jupyter notebook -----------------------|
|                     |                      |                      |
| Click View Traces in UI or                 |                      |
| create NsightStreamer CR                   |                      |
|------------------------------------------->|                      |
|                     |                      |                      |
|                     |<-- read .nsys-rep ---|                      |
|                     |                      |                      |
|<-------- browser view of .nsys-rep --------|                      |
|                     |                      |                      |
Steps in Detail#
1. Label / target a workload. Add the nvidia-nsight-profile=enabled label (or configure a custom NsightOperatorProfileConfig rule) so the admission webhook matches the Pod. Existing Pods must be restarted; only new Pods are mutated.
2. Admission + injection. The webhook mutates the incoming Pod to add:
   - A readiness waiter init container that blocks start-up until storage, MinIO, and the coordinator service are reachable.
   - A volume containing the Nsight Systems CLI binaries.
   - A process hook that wraps matching child processes under nsys (in coordinator mode, under the coordinator-provided command).
3. Agent registration. Once the Pod starts, the Nsight Systems agent registers with the NsightCoordinator (ZeroMQ, CURVE-encrypted by default). The agent is now visible in nsight_operator.py status.
4. CLI configuration. The profiling user points the CLI at the gateway using autoconfigure (reads cluster state) or configure --gw (manual). Credentials and endpoints are stored in ~/.nsight-cloud.conf.
5. Session + collection. profiler-start creates a session (if none is active) and begins a collection. The coordinator broadcasts a start command to all registered agents for the current tag.
6. Workload runs. While agents capture CPU/GPU samples, optional OTLP mirroring streams NVTX ranges to the NsightOtelCollector via the injected Envoy proxy sidecar.
7. Stop + upload. profiler-stop ends the collection. Each agent finalizes its .nsys-rep file and uploads it to the configured cloud storage backend.
8. List / download / analyze. Reports are listed via ls, downloaded via download, or analyzed via analysis run <recipe>. The analysis service outputs Jupyter notebooks that can be rendered by the built-in gateway UI or opened locally.
9. View reports. Instead of downloading, create a NsightStreamer CR to browse reports in a web-based Nsight Systems GUI hosted inside the cluster.
10. End session (optional). session-end closes the session. Ended sessions remain listable and downloadable until the storage backend retention policy reclaims them.
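For CI use, the user-facing part of this flow can be scripted. The sketch below only builds the command lines and does not run them. It assumes the CLI is already configured (via autoconfigure or configure --gw), that nsight_operator.py is on the PATH, and that --tag can be appended after the subcommand arguments; the `build_commands` helper and the argument order are assumptions, and `<recipe>` is left as the placeholder used above.

```python
import shlex

CLI = "nsight_operator.py"


def build_commands(title, recipe, tag=None):
    """Return the argv lists for one coordinator-mode capture, in order."""
    # Placement of --tag after the subcommand is an assumption.
    tag_args = ["--tag", tag] if tag else []
    steps = [
        ["session-begin", "--title", title],  # optional explicit session
        ["profiler-start"],
        ["profiler-stop"],
        ["ls"],
        ["analysis", "run", recipe],
        ["session-end"],
    ]
    return [[CLI, *step, *tag_args] for step in steps]


commands = build_commands("my run", "<recipe>", tag="nightly")
for argv in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each argv list could be handed to `subprocess.run(argv, check=True)`, with the workload (or a wait) placed between the profiler-start and profiler-stop steps. The download step is left out because its arguments are not described here.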