ZTR-RTT Congestion Control Algorithm Overview v1.0

Congestion Control

Congestion control provides performance isolation when multiple applications are running on the same cluster. It also prevents congestion from spreading when there is a slow receiver, reduces latency in the cluster, improves fairness, and prevents parking-lot effects and packet drops in lossy networks.

The diagram below shows an example of a head-of-line blocking scenario.

Figure: Head-of-Line Blocking Scenario
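As a rough illustration of head-of-line blocking, the toy model below drains a single shared FIFO queue in which packets for a slow receiver sit ahead of packets for a fast one. The function name, the workload, and the service times are all hypothetical and chosen only to make the effect visible; this is a sketch, not a model of any particular switch or NIC.

```python
from collections import deque

def drain_fifo(packets, service_time):
    """Drain one shared FIFO queue and return (destination, finish_time) pairs.

    packets: destination names in arrival order.
    service_time: dict mapping destination -> time units to deliver one packet.
    """
    queue = deque(packets)
    clock = 0
    finished = []
    while queue:
        dest = queue.popleft()
        # The head packet occupies the queue until its receiver accepts it,
        # so everything behind it waits regardless of destination.
        clock += service_time[dest]
        finished.append((dest, clock))
    return finished

# Hypothetical workload: two packets for a slow receiver arrive first.
arrivals = ["slow", "slow", "fast", "fast"]
times = {"slow": 10, "fast": 1}

shared = drain_fifo(arrivals, times)
isolated = drain_fifo([d for d in arrivals if d == "fast"], times)

print(shared)    # fast packets finish at t=21 and t=22
print(isolated)  # without the slow traffic ahead, they finish at t=1 and t=2
```

In this sketch the fast receiver's packets are delayed by traffic they have nothing to do with, which is the kind of congestion spreading that congestion control aims to prevent by slowing the sender of the slow flow at the source.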

Developing a congestion control algorithm for datacenters presents the following challenges:

  • Latency of several microseconds combined with hundreds of Gbps of bandwidth

    • Congestion builds up quickly, so the congestion control loop must be short

  • A wide variety of traffic types, topologies and applications

    • It is hard to develop a single algorithm that suits all of them

    • New congestion control algorithms, using new congestion indications, are constantly being introduced

  • Hardware implementation is not robust enough

  • Software implementation reacts too slowly
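The first challenge above can be made concrete with a back-of-the-envelope calculation. The sketch below computes how many bytes a sender emits during one round trip, and how fast a queue grows when two line-rate senders target one link. The 400 Gbps link speed and 5 µs round-trip time are assumed example values, not figures from this document, and the function names are hypothetical.

```python
def bytes_in_flight(bandwidth_gbps, rtt_us):
    """Bandwidth-delay product: bytes sent at the given rate during one RTT.

    1 Gbps for 1 microsecond is 1e9 * 1e-6 = 1000 bits, hence the factor 1000.
    """
    bits = bandwidth_gbps * rtt_us * 1000
    return bits / 8

def queue_growth_bytes(link_gbps, senders, interval_us):
    """Bytes queued when `senders` line-rate flows share one link of equal speed."""
    excess_gbps = link_gbps * (senders - 1)  # arrival rate minus drain rate
    return bytes_in_flight(excess_gbps, interval_us)

# Assumed example values: 400 Gbps link, 5 microsecond round trip.
bdp = bytes_in_flight(400, 5)
growth = queue_growth_bytes(400, 2, 5)

print(bdp)     # 250000.0 bytes already on the wire before any feedback returns
print(growth)  # 250000.0 bytes of queue built up in a single round trip
```

Under these assumptions, roughly 250 KB of excess data accumulates in a switch buffer within one round trip, before the sender can possibly have reacted. This is why a loop that takes many round trips to respond is too slow at these speeds.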

To address these challenges, the NVIDIA CC algorithm is built on top of an infrastructure with the following characteristics:

Figure: ZTR RTTCC Infrastructure

Figure: RTT Measurement Flow

Figure: ZTR RTTCC Algorithm
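As a generic illustration of how a measured round-trip time can act as a congestion signal, the sketch below adjusts a sending rate from one RTT sample. It is not the ZTR-RTTCC algorithm itself: the function name, the threshold, the multiplicative decrease factor, and the additive increase step are all assumptions made for the example.

```python
def adjust_rate(rate_gbps, rtt_us, base_rtt_us, line_rate_gbps,
                threshold=1.5, decrease=0.5, increase_gbps=1.0,
                min_rate_gbps=0.1):
    """Return the next sending rate given one RTT sample (hypothetical policy).

    An RTT well above the uncongested baseline suggests queues are building,
    so the rate is cut multiplicatively; otherwise it is raised additively.
    """
    if rtt_us > threshold * base_rtt_us:
        new_rate = rate_gbps * decrease        # back off on inflated RTT
    else:
        new_rate = rate_gbps + increase_gbps   # probe for spare bandwidth
    # Keep the result between the minimum rate and the link's line rate.
    return max(min_rate_gbps, min(new_rate, line_rate_gbps))

# Assumed example values: 5 us baseline RTT on a 400 Gbps link.
congested = adjust_rate(100, rtt_us=20, base_rtt_us=5, line_rate_gbps=400)
clear = adjust_rate(100, rtt_us=5, base_rtt_us=5, line_rate_gbps=400)
capped = adjust_rate(400, rtt_us=5, base_rtt_us=5, line_rate_gbps=400)

print(congested, clear, capped)
```

An RTT-based signal of this kind needs no marking support from switches, since the sender infers queueing delay from its own measurements.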

© Copyright 2024, NVIDIA. Last updated on Sep 29, 2024.