ZTR-RTT Congestion Control Algorithm Overview v1.0

Congestion Control Algorithm

Congestion Control provides performance isolation when multiple applications are running on the same cluster. Additionally, it prevents congestion from spreading when there is a slow receiver, reduces latency in the cluster, improves fairness, and prevents parking-lot effects and packet drops in lossy networks.

ZTR_RTTCC is NVIDIA’s default Congestion Control algorithm.

The diagram below shows an example of a head-of-line blocking scenario.

Figure: Head-of-Line Blocking Scenario
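The head-of-line blocking effect can be illustrated with a toy model. The sketch below is not part of ZTR_RTTCC; the function name `drain_times`, the "slow"/"fast" receiver labels, and the tick-based timing are all assumptions made for illustration. It compares one shared FIFO, where a packet for a slow receiver at the head of the queue stalls packets bound for a fast receiver, against separate per-destination queues.

```python
from collections import deque


def drain_times(packets, slow_period, shared_fifo):
    """Return the tick at which each packet is delivered (toy model).

    packets:     destination labels in arrival order, "slow" or "fast"
    slow_period: the "slow" receiver accepts one packet every slow_period ticks;
                 the "fast" receiver accepts one packet every tick
    shared_fifo: True  -> all packets share one FIFO (head-of-line blocking)
                 False -> one queue per destination
    """
    def ready(dest, tick):
        return dest == "fast" or tick % slow_period == 0

    if shared_fifo:
        queues = {"all": deque(enumerate(packets))}
    else:
        queues = {"slow": deque(), "fast": deque()}
        for index, dest in enumerate(packets):
            queues[dest].append((index, dest))

    delivered = {}
    tick = 0
    while any(queues.values()):
        tick += 1
        for queue in queues.values():
            # Only the packet at the head of a queue may be sent.
            if queue and ready(queue[0][1], tick):
                index, _ = queue.popleft()
                delivered[index] = tick
    return [delivered[i] for i in range(len(packets))]


arrivals = ["slow", "fast", "slow", "fast"]
print(drain_times(arrivals, slow_period=4, shared_fifo=True))   # [4, 5, 8, 9]
print(drain_times(arrivals, slow_period=4, shared_fifo=False))  # [4, 1, 8, 2]
```

In the shared FIFO, the packets for the fast receiver are delivered at ticks 5 and 9 even though that receiver was idle, because they sit behind packets for the slow receiver. With separate queues they leave at ticks 1 and 2. Congestion control addresses the same problem from the sender side, by slowing the flow toward the slow receiver before its packets fill shared buffers.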

Congestion control in the data center faces the following challenges:

  • Latency of several microseconds combined with hundreds of Gbps of bandwidth

    • Congestion buildup is fast, so the congestion loop should be short

  • A wide variety of traffic types, topologies and applications

    • Hard to develop an algorithm that suits all

    • Congestion Control algorithms are constantly being introduced with new congestion indications

  • Hardware implementation is not robust enough

  • Software implementation reacts too slowly
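To make the first challenge concrete, the sketch below works through the arithmetic of how quickly a queue builds at these speeds. The figures used (400 Gbps links, a 5 µs round trip, reaction times of 10 µs and 1 ms) are illustrative assumptions, not measurements of any NVIDIA product, and the helper names are hypothetical.

```python
def bytes_in_flight(bandwidth_gbps, rtt_us):
    """Bandwidth-delay product in bytes.

    1 Gbps sustained for 1 microsecond carries 1000 bits,
    so integer arithmetic is exact here.
    """
    return bandwidth_gbps * rtt_us * 1000 // 8


def queue_growth_bytes(offered_gbps, drain_gbps, reaction_us):
    """Bytes that pile up in a buffer before the senders react.

    offered_gbps: total rate arriving at the congested port
    drain_gbps:   rate at which the port can forward traffic
    reaction_us:  time until the congestion control loop slows the senders
    """
    excess_gbps = max(0, offered_gbps - drain_gbps)
    return excess_gbps * reaction_us * 1000 // 8


# One 400 Gbps flow with a 5 us round trip keeps 250 KB in flight.
print(bytes_in_flight(400, 5))            # 250000

# Two 400 Gbps senders converge on one 400 Gbps port.
print(queue_growth_bytes(800, 400, 10))   # 500000   (10 us reaction)
print(queue_growth_bytes(800, 400, 1000)) # 50000000 (1 ms reaction)
```

Under these assumptions, a loop that reacts in 10 µs lets about 500 KB accumulate, while one that takes 1 ms lets about 50 MB accumulate, which is why the congestion loop needs to be short.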

Figure: ZTR RTTCC Infrastructure

Figure: RTT Measurement Flow

Figure: ZTR RTTCC Algorithm

© Copyright 2024, NVIDIA. Last updated on Sep 25, 2024.