Telemetry Requirements

The telemetry requirements comprise two core components, each of which requires alignment between DGX Cloud and the NCP:

  1. Delivery Method: How the NCP will deliver telemetry to DGX Cloud for ingestion
  2. Telemetry Scope: What telemetry the NCP will deliver to DGX Cloud

Delivery Method
The NCP shall deliver all required telemetry, including metrics and logs, in a manner that allows ingestion into DGX Cloud systems. The preferred method is native delivery via the OpenTelemetry Protocol (OTLP), with a latency of no more than 120 seconds.
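
As a rough illustration of the 120-second bound, the following Python sketch builds one gauge data point in the OTLP/HTTP JSON encoding and checks its delivery latency against the limit. The metric name, label, scope name, and helper functions are hypothetical and do not come from the DGX Cloud specification. The sketch also assumes that latency is measured from the sample timestamp to the time of receipt, which should be confirmed with DGX Cloud.

```python
import json
import time

MAX_DELIVERY_LATENCY_S = 120  # bound stated in the Delivery Method requirement


def build_otlp_gauge(name, value, labels, sample_time_ns):
    """Build an OTLP/HTTP JSON payload holding a single gauge data point."""
    attributes = [
        {"key": key, "value": {"stringValue": val}}
        for key, val in sorted(labels.items())
    ]
    return {
        "resourceMetrics": [{
            "resource": {"attributes": []},
            "scopeMetrics": [{
                "scope": {"name": "ncp.telemetry.example"},  # hypothetical scope name
                "metrics": [{
                    "name": name,
                    "gauge": {"dataPoints": [{
                        "asDouble": float(value),
                        # 64-bit integers are encoded as strings in OTLP JSON.
                        "timeUnixNano": str(sample_time_ns),
                        "attributes": attributes,
                    }]},
                }],
            }],
        }]
    }


def within_latency_bound(sample_time_ns, received_time_ns,
                         limit_s=MAX_DELIVERY_LATENCY_S):
    """Return True if the sample was received within the latency limit.

    Assumption: latency = receipt time minus sample time. Clock skew
    between sender and receiver is not handled here.
    """
    latency_s = (received_time_ns - sample_time_ns) / 1e9
    return latency_s <= limit_s


if __name__ == "__main__":
    sampled = time.time_ns()
    payload = build_otlp_gauge(
        "example_link_utilization",      # hypothetical metric name
        0.42,
        {"port": "swp1"},                # hypothetical label
        sampled,
    )
    print(json.dumps(payload, indent=2))
    print("within bound:", within_latency_bound(sampled, time.time_ns()))
```

A check of this kind could run on the sending side before export, or on a test receiver, to catch batching or buffering settings that push delivery past the limit.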

Telemetry Scope
DGX Cloud will provide the NCP with a detailed specification document listing the required metrics and logs. Upon receipt, the NCP shall provide a formal written response that includes the following:

  • Confirmation of its ability to deliver the specified metrics and logs.
  • Projected timelines for delivery.
  • Specific technical details, including metric names, label names, and label values.
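
One way to keep such a response consistent is to track each specified metric as a structured record before writing it up. The sketch below is only an illustration: the `MetricResponse` type, its field names, and the `missing_details` helper are hypothetical and are not part of any DGX Cloud format.

```python
from dataclasses import dataclass, field


@dataclass
class MetricResponse:
    """One entry of the NCP's response for a single specified metric."""
    spec_metric: str        # metric as named in the specification document
    deliverable: bool       # confirmation of ability to deliver
    target_date: str        # projected delivery date, e.g. "2025-01-31"
    metric_name: str = ""   # name the NCP will actually emit
    # Label name mapped to the list of values that label can take.
    labels: dict = field(default_factory=dict)


def missing_details(entry):
    """Return the names of details the entry still lacks."""
    problems = []
    if not entry.target_date:
        problems.append("target_date")
    if entry.deliverable:
        if not entry.metric_name:
            problems.append("metric_name")
        for label, values in entry.labels.items():
            if not values:
                problems.append(f"labels[{label}]")
    return problems


if __name__ == "__main__":
    entry = MetricResponse(
        spec_metric="example_port_rx_bytes",   # hypothetical metric
        deliverable=True,
        target_date="",
        metric_name="example_port_rx_bytes_total",
        labels={"port": ["swp1", "swp2"], "direction": []},
    )
    print(missing_details(entry))
```

In this example the record is flagged for a missing timeline and for a label with no listed values, which are two of the three items the written response must cover.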

Network Telemetry
The NCP shall provide network telemetry across the following domains:

  • North-South (Front-End) Network (client-facing and external interconnects)
  • East-West (Back-End) Network (GPU-to-GPU interconnects)
  • Management Network (control plane and orchestration traffic)
  • NVSwitch Fabric (intra-node GPU switching, applicable only to GB200 and later clusters)
  • Host Network (NIC-level and server connectivity)

Logs
DGX Cloud will require the NCP to provide logs from various network technologies, including but not limited to:

  1. Fabric Manager logs for the NVLink domain (where applicable)
  2. Subnet Manager logs for the NVLink domain (where applicable)
  3. VPC Flow logs (all ingress/egress traffic)
  4. UFM Event logs
  5. General switch logs
  6. Switch syslogs
  7. Switch kernel logs
  8. BMC SEL logs
  9. syslogs