Congestion Control Algorithm
Congestion Control provides performance isolation when multiple applications run on the same cluster. Additionally, it prevents congestion from spreading when there is a slow receiver, reduces latency in the cluster, improves fairness, and prevents parking-lot effects and packet drops in lossy networks.
ZTR_RTTCC is NVIDIA’s default Congestion Control algorithm.
The diagram below shows an example of a head-of-line blocking scenario.
The following are the Datacenter Congestion Control challenges:
Latency of several microseconds combined with hundreds of Gbps of bandwidth
Congestion buildup is fast, so the congestion loop should be short
A wide variety of traffic types, topologies and applications
It is hard to develop a single algorithm that suits them all
Congestion Control algorithms are constantly being introduced with new congestion indications
Hardware implementation is not robust enough
Software implementation reacts too slowly
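
To put rough numbers on the first challenge (microsecond latency, hundreds of Gbps, fast congestion buildup), the following is a minimal back-of-the-envelope sketch. It is not part of ZTR_RTTCC or of any NVIDIA tool. The function names and the example values (a 200 Gbps link, a 2-to-1 incast, a 1 MB buffer, a 5 µs round-trip time) are assumptions chosen only for illustration.

```python
# Illustrative arithmetic only: all names and example values are assumptions.

# 1 Gbps = 1e9 bits/s = 125e6 bytes/s = 125 bytes per microsecond.
BYTES_PER_US_PER_GBPS = 125.0


def buffer_fill_time_us(link_gbps: float, senders: int, buffer_bytes: float) -> float:
    """Microseconds until an incast fills a buffer of `buffer_bytes`.

    Assumes `senders` sources each transmit at `link_gbps` toward a single
    egress port that drains at `link_gbps`, with no congestion control.
    """
    if senders <= 1:
        raise ValueError("an incast needs more than one sender to build a queue")
    # Queue growth rate = total arrival rate minus drain rate.
    excess_bytes_per_us = link_gbps * (senders - 1) * BYTES_PER_US_PER_GBPS
    return buffer_bytes / excess_bytes_per_us


def bytes_in_flight(link_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product: bytes one sender emits during one round trip."""
    return link_gbps * BYTES_PER_US_PER_GBPS * rtt_us
```

Under these assumed values, `buffer_fill_time_us(200, 2, 1_000_000)` gives 40 µs and `bytes_in_flight(200, 5)` gives 125,000 bytes. In other words, the buffer fills within about eight round trips, which is why the congestion loop has to be short.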