ZTR-RTT Congestion Control Algorithm Overview

Version v1.0

Overview

NVIDIA Zero Touch RoCE (ZTR) enables data centers to deploy RDMA over Converged Ethernet (RoCE) without any special switch configuration. ZTR is built according to the InfiniBand Trade Association (IBTA) RDMA standard and is fully compliant with the RoCE specifications. It delivers performance equivalent to traditional switch-enabled RoCE and significantly better than traditional TCP-based memory access. Moreover, with ZTR, RoCE network transport services operate side by side with non-RoCE communications in ordinary TCP/IP environments.

The new NVIDIA congestion control algorithm, Round-Trip Time Congestion Control (RTTCC), allows ZTR to scale to thousands of servers without compromising performance. With ZTR and RTTCC, data center operators get ease of deployment and operation together with the performance of Remote Direct Memory Access (RDMA) at massive scale, without any switch configuration.

RTTCC actively monitors the network round-trip time (RTT) to detect the onset of congestion and adapt to it before packets are dropped. It implements dynamic congestion control through a hardware-based feedback loop, which performs substantially better than software-based congestion control algorithms. RTTCC also supports faster transmission rates and allows ZTR to be deployed at a larger scale.
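The general idea of reacting to rising RTT before packets are lost can be illustrated with a small sketch. This is not the RTTCC algorithm, whose internals this overview does not describe. It is a generic additive-increase/multiplicative-decrease loop driven by RTT samples, and the class name, parameters, and constants (`RttRateController`, `threshold`, `add_step_gbps`, `decrease`) are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RttRateController:
    """Toy RTT-driven rate controller (illustrative only, not RTTCC itself)."""

    base_rtt_us: float           # assumed RTT of the uncongested path
    line_rate_gbps: float        # assumed link speed; upper bound on the rate
    min_rate_gbps: float = 0.1   # assumed floor so the flow never stalls
    threshold: float = 1.5       # RTT/base ratio treated as congestion onset
    add_step_gbps: float = 1.0   # additive increase per uncongested sample
    decrease: float = 0.8        # multiplicative decrease factor
    rate_gbps: float = 0.0       # current sending rate; 0 means "start at line rate"

    def __post_init__(self) -> None:
        if self.rate_gbps <= 0:
            self.rate_gbps = self.line_rate_gbps

    def on_rtt_sample(self, rtt_us: float) -> float:
        """Update and return the sending rate for one RTT measurement."""
        if rtt_us > self.threshold * self.base_rtt_us:
            # RTT inflation suggests queues are building: back off early,
            # before any packet has been dropped.
            self.rate_gbps = max(self.min_rate_gbps, self.rate_gbps * self.decrease)
        else:
            # Path looks uncongested: probe for more bandwidth.
            self.rate_gbps = min(self.line_rate_gbps, self.rate_gbps + self.add_step_gbps)
        return self.rate_gbps


# Feed a few RTT samples (microseconds) and watch the rate react.
demo = RttRateController(base_rtt_us=10.0, line_rate_gbps=100.0)
for sample in (10.0, 12.0, 30.0, 40.0, 11.0):
    print(f"rtt={sample:5.1f} us -> rate={demo.on_rtt_sample(sample):.1f} Gb/s")
```

In this sketch the sender needs only its own RTT measurements, with no congestion marks from switches, which is the property that lets an RTT-based scheme work without switch configuration. A production implementation would run this loop in hardware and use far more careful filtering and gain control.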

The main characteristics of the ZTR-RTT CC algorithm are:

  • Implemented on top of the Data Path Accelerator (DPA)

  • RTT-based congestion control

  • Current default CC algorithm for RoCE

  • Demonstrates better performance than DCQCN on HPC and AI workloads

  • Maintains DCQCN's good performance on storage workloads

© Copyright 2024, NVIDIA. Last updated on Sep 25, 2024.