Service Delivery SLAs

To be considered for offtake, NCPs should be able to demonstrate that they can meet the SLAs below, listed by category, as well as the operational requirements.

Service Delivery Timelines

The NCP must demonstrate API readiness and transport establishment at least 12 weeks ahead of GPU delivery, and must be able to provide Dev capacity (CPU nodes only) with the API integrated 6 weeks prior to GPU and cluster delivery.

One key request is for early access to ancillary compute nodes to act as the Data Mover function. This will help us pre-position data into the data center for use when GPUs are available. Access to Data Mover compute (and target storage) should be available ~2 weeks ahead of GPU cluster delivery.
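The lead times above can be turned into calendar dates once a GPU delivery date is known. The following is a minimal sketch, not an official planning tool: the milestone names and the example delivery date are hypothetical, and the Data Mover lead time is approximate because the requirement says "~2 weeks".

```python
from datetime import date, timedelta

# Lead times in weeks before GPU cluster delivery, taken from the text above.
# The milestone names are illustrative labels, not official terms.
LEAD_WEEKS = {
    "api_readiness_and_transport": 12,
    "dev_capacity_cpu_nodes": 6,
    "data_mover_access": 2,  # the requirement says "~2 weeks", so approximate
}


def milestone_dates(gpu_delivery: date) -> dict[str, date]:
    """Return the latest acceptable date for each milestone."""
    return {
        name: gpu_delivery - timedelta(weeks=weeks)
        for name, weeks in LEAD_WEEKS.items()
    }


if __name__ == "__main__":
    # Hypothetical delivery date, used only for illustration.
    for name, due in milestone_dates(date(2025, 9, 1)).items():
        print(f"{name}: no later than {due.isoformat()}")
```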

SLA and SLO

Managed K8s

  • Control Plane SLA target: Financially-backed 99.95%+ uptime for production.
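To make the uptime target concrete, an availability percentage can be converted into a downtime budget. The sketch below assumes a 30-day measurement window, which the requirement does not specify, and the function name is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given availability.

    The 30-day default is an assumption; the SLA text does not state
    the measurement period for the control plane target.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.95)
    print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under that assumption, 99.95% corresponds to roughly 21.6 minutes of downtime per 30 days.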

Storage

  • Performance (QoS): Must provision the requested throughput, meeting the minimum bandwidth and IOPS.
  • Home Directory Storage:
    • Availability: Over 99% availability, measured against unplanned incidents and exclusive of scheduled maintenance.
    • Durability: Over 99.99% for any filesystem smaller than 1 PB
  • High Speed Storage Service Requirements:
    • Availability (SLO): Must meet 99.99% availability over a rolling 30-day window, exclusive of maintenance
  • High-Speed Storage Filesystem Requirements:
    • End to End Availability: Over 99.5% uptime per PB
    • Durability: Over 99.999% durability per PB annually
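One way to evaluate an availability SLO over a rolling 30-day window that excludes maintenance is sketched below. This is an illustration under stated assumptions, not the measurement method defined by NVIDIA: it assumes outage and maintenance intervals are recorded as (start, end) pairs, that intervals do not overlap one another, and that maintenance time is subtracted from the measured period. All function names are hypothetical.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]  # (start, end)


def overlap_seconds(a: Interval, b: Interval) -> float:
    """Seconds for which interval a overlaps interval b (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max((end - start).total_seconds(), 0.0)


def rolling_availability_pct(
    outages: list[Interval],
    maintenance: list[Interval],
    window_end: datetime,
    window_days: int = 30,
) -> float:
    """Availability (%) over the window, with maintenance time excluded.

    Assumes the intervals in each list do not overlap each other and that
    outages do not overlap maintenance windows.
    """
    window = (window_end - timedelta(days=window_days), window_end)
    window_s = timedelta(days=window_days).total_seconds()
    maintenance_s = sum(overlap_seconds(m, window) for m in maintenance)
    outage_s = sum(overlap_seconds(o, window) for o in outages)
    measured_s = window_s - maintenance_s
    if measured_s <= 0:
        raise ValueError("maintenance covers the entire window")
    return 100.0 * (1 - outage_s / measured_s)
```

With these assumptions, a 99.99% target over 30 days leaves a little over four minutes of unplanned downtime, so a single five-minute outage in the window would breach it.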

Operational Requirements

  • Dedicated technical specialist or engineer available to NVIDIA
  • Slack channel monitored by the technical specialist or engineer
  • 24x7 support available per partner standard incident severity procedures
  • Service-impacting incidents and both planned and unplanned maintenance events are communicated to NVIDIA.
  • For planned maintenance, NVIDIA can schedule maintenance windows via APIs or console tools, which avoids unexpected outages and gives NVIDIA the ability to provide feedback.
  • NCP to remediate critical vulnerabilities in a timely manner while providing transparent disclosures of any issues

Telemetry Delivery Method

NCP shall deliver all required telemetry, including metrics and logs, in a manner that allows for ingestion into DGX Cloud systems. The preferred method is native delivery via the OpenTelemetry Protocol, with a latency of no more than 120 seconds.
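A simple check of the 120-second bound could look like the sketch below. The requirement does not define where latency is measured, so this assumes it runs from the time a data point is observed at the source to the time it is received at the ingestion endpoint. The names are hypothetical, and the code does not use any OpenTelemetry library.

```python
from datetime import datetime, timedelta, timezone

# Upper bound from the requirement above.
MAX_DELIVERY_LATENCY = timedelta(seconds=120)


def delivery_latency(observed_at: datetime, received_at: datetime) -> timedelta:
    """Time between observation at the source and receipt at ingestion.

    The measurement points are an assumption; the requirement does not
    specify them.
    """
    return received_at - observed_at


def is_within_latency(
    observed_at: datetime,
    received_at: datetime,
    limit: timedelta = MAX_DELIVERY_LATENCY,
) -> bool:
    """True if the record arrived within the allowed delivery latency."""
    return delivery_latency(observed_at, received_at) <= limit


if __name__ == "__main__":
    observed = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    received = observed + timedelta(seconds=95)
    print(is_within_latency(observed, received))
```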

Exemplar Cloud Workload Performance

NVIDIA Exemplar Cloud seeks to improve performance per TCO with hardware and software recipes, references, tools, and capabilities. Run the latest publicly available release from https://github.com/NVIDIA/dgxc-benchmarking (always pick the latest release version from the GitHub repo). The run must complete successfully on one uniform hardware cluster type. Please run all the workloads for a given release and share the results in the template below.

Test ID | Feature | Min Size | Description
BM01 | Benchmarking for exemplar cloud | 512 GPU cluster | Achieve within 5% of an NVIDIA provided target performance number
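The BM01 criterion can be expressed as a tolerance check. The sketch below is one possible reading, not the official scoring rule: it assumes that exceeding the target also passes, and that some metrics are higher-is-better (such as throughput) while others are lower-is-better (such as step time). The function name and parameters are hypothetical.

```python
def meets_target(
    measured: float,
    target: float,
    tolerance: float = 0.05,
    higher_is_better: bool = True,
) -> bool:
    """True if the measured value is within the tolerance of the target.

    Assumption: beating the target counts as a pass, so only a shortfall
    of more than 5% fails.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    if higher_is_better:
        return measured >= target * (1 - tolerance)
    return measured <= target * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical numbers, used only for illustration.
    print(meets_target(960.0, 1000.0))  # shortfall of 4%
    print(meets_target(940.0, 1000.0))  # shortfall of 6%
```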