Telemetry Requirements
The telemetry requirements consist of two core components that require alignment between DGX Cloud and the NCP:
- Delivery Method: How the NCP will deliver telemetry to DGX Cloud for ingestion
- Telemetry Scope: What telemetry the NCP will deliver to DGX Cloud
Delivery Method
The NCP shall deliver all required telemetry, including metrics and logs, in a manner that allows for ingestion into DGX Cloud systems. The preferred method is native delivery via the OpenTelemetry Protocol (OTLP), with a latency of no more than 120 seconds.
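The 120-second latency limit above can be expressed as a simple check. The following is a minimal sketch only. It assumes latency is measured from the time a data point is observed at the source to the time it is ingested by DGX Cloud, which this document does not define. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Latency limit taken from the delivery requirement (120 seconds).
MAX_DELIVERY_LATENCY = timedelta(seconds=120)


def within_latency_budget(observed_at: datetime, ingested_at: datetime) -> bool:
    """Return True if ingestion happened within the allowed delivery latency.

    Assumption: both timestamps are timezone-aware and refer to the same
    data point (observation at the NCP, ingestion at DGX Cloud).
    """
    return ingested_at - observed_at <= MAX_DELIVERY_LATENCY


observed = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(within_latency_budget(observed, observed + timedelta(seconds=90)))   # True
print(within_latency_budget(observed, observed + timedelta(seconds=121)))  # False
```

A check of this kind could run on either side of the delivery path, provided both parties agree on which timestamps define the measurement.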
Telemetry Scope
DGX Cloud will provide the NCP with a detailed specification document listing the required metrics and logs. Upon receipt, the NCP shall provide a formal written response detailing the following:
- Confirmation of its ability to deliver the specified metrics and logs.
- Projected timelines for delivery.
- Specific technical details, including metric names, label names, and label values.
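The three response items above could be captured in a machine-readable form alongside the written response. The sketch below is illustrative only: the record layout, field names, and the example metric and label names are hypothetical and are not taken from the DGX Cloud specification document.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class MetricResponse:
    """One entry of the NCP response for a single required metric (hypothetical layout)."""
    metric_name: str                      # metric name as the NCP will emit it
    can_deliver: bool                     # confirmation of ability to deliver
    projected_delivery: Optional[date]    # projected timeline; None if not deliverable
    labels: Dict[str, List[str]] = field(default_factory=dict)  # label name -> label values


def to_json(responses: List[MetricResponse]) -> str:
    """Serialize the response entries to JSON, rendering dates in ISO 8601 form."""
    rows = []
    for r in responses:
        row = asdict(r)
        row["projected_delivery"] = (
            r.projected_delivery.isoformat() if r.projected_delivery else None
        )
        rows.append(row)
    return json.dumps(rows, indent=2)


# Example entry; the metric and label names are placeholders.
example = MetricResponse(
    metric_name="example_switch_port_rx_bytes",
    can_deliver=True,
    projected_delivery=date(2025, 6, 30),
    labels={"network_domain": ["east-west"], "port_state": ["up", "down"]},
)
print(to_json([example]))
```

Keeping label names and values explicit per metric makes it easier to compare the response against the specification line by line.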
Network Telemetry
The NCP shall provide network telemetry across the following domains:
- North-South (Front-End) Network (client-facing and external interconnects)
- East-West (Back-End) Network (GPU-to-GPU interconnects)
- Management Network (control plane and orchestration traffic)
- NVSwitch Fabric (intra-node GPU switching, applicable only to GB200 and later clusters)
- Host Network (NIC-level and server connectivity)
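One way to track coverage of the five domains listed above is to tag each telemetry source with a domain and compute which domains are still missing. This is a minimal sketch under stated assumptions: the enum values and the `missing_domains` helper are hypothetical, and whether the NVSwitch Fabric domain applies is passed in as a flag because it depends on the cluster generation.

```python
from enum import Enum
from typing import Iterable, Set


class NetworkDomain(Enum):
    """The network telemetry domains named in this section."""
    NORTH_SOUTH = "north-south"          # front-end: client-facing and external interconnects
    EAST_WEST = "east-west"              # back-end: GPU interconnects
    MANAGEMENT = "management"            # control plane and orchestration traffic
    NVSWITCH_FABRIC = "nvswitch-fabric"  # applicable only to GB200 and later clusters
    HOST = "host"                        # NIC-level and server connectivity


def missing_domains(
    reported: Iterable[NetworkDomain], nvswitch_applicable: bool = True
) -> Set[NetworkDomain]:
    """Return the required domains for which no telemetry has been reported."""
    required = set(NetworkDomain)
    if not nvswitch_applicable:
        required.discard(NetworkDomain.NVSWITCH_FABRIC)
    return required - set(reported)


reported = [NetworkDomain.NORTH_SOUTH, NetworkDomain.EAST_WEST, NetworkDomain.HOST]
print(sorted(d.value for d in missing_domains(reported)))
# ['management', 'nvswitch-fabric']
```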
Logs
DGX Cloud will require the NCP to provide logs from various network technologies, including but not limited to:
- Fabric Manager logs for the NVLink domain (where applicable)
- Subnet Manager logs for the NVLink domain (where applicable)
- VPC Flow logs (all ingress/egress traffic)
- UFM Event logs
- General switch logs
- Switch syslogs
- Switch kernel logs
- BMC SEL logs
- syslogs