Fault Tolerance Launcher Guide
The ft_launcher is provided by nvidia-resiliency-ext (included in NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs.
Key Arguments
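| Argument | Description | Example |
|---|---|---|
| --ft-cfg-path | Path to FT YAML config file | examples/ft_launcher/ft_config.yaml |
| --ft-rank-heartbeat-timeout | Heartbeat timeout in seconds | 450 |
| --ft-initial-rank-heartbeat-timeout | Initial timeout (longer for setup) | 1200 |
| --max-restarts | Maximum number of restart attempts | 5 |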
Basic Usage
uv run ft_launcher \
--ft-cfg-path examples/ft_launcher/ft_config.yaml \
--ft-rank-heartbeat-timeout 450 \
--ft-initial-rank-heartbeat-timeout 1200 \
--max-restarts 5 \
examples/run_grpo.py \
--config <your_config.yaml>
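When the launch is driven from a script rather than typed by hand, the same command can be assembled programmatically. The following is a minimal sketch, not part of ft_launcher itself: the helper name `build_ft_launcher_cmd` and its defaults are hypothetical, and the flags are exactly those shown in the command above. The guard that rejects an initial timeout that is not larger than the regular heartbeat timeout is an assumption layered on this guide's recommendation, not something the launcher is known to enforce.

```python
import shlex


def build_ft_launcher_cmd(
    train_config,
    ft_cfg_path="examples/ft_launcher/ft_config.yaml",
    heartbeat_timeout=450,
    initial_heartbeat_timeout=1200,
    max_restarts=5,
    script="examples/run_grpo.py",
):
    """Return the ft_launcher command as an argv list (hypothetical helper)."""
    # Assumption: treat "initial <= regular" as a configuration mistake,
    # since the initial timeout must also cover model loading and setup.
    if initial_heartbeat_timeout <= heartbeat_timeout:
        raise ValueError(
            "initial heartbeat timeout should be higher than the regular one"
        )
    return [
        "uv", "run", "ft_launcher",
        "--ft-cfg-path", ft_cfg_path,
        "--ft-rank-heartbeat-timeout", str(heartbeat_timeout),
        "--ft-initial-rank-heartbeat-timeout", str(initial_heartbeat_timeout),
        "--max-restarts", str(max_restarts),
        script,
        "--config", train_config,
    ]


# "my_config.yaml" is a placeholder for your own training config.
cmd = build_ft_launcher_cmd("my_config.yaml")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell-quoting problems with paths that contain spaces.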
FT Config File (examples/ft_launcher/ft_config.yaml)
fault_tolerance:
  initial_rank_heartbeat_timeout: 360
  restart_policy: any-failed
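If the config file needs to be generated per run (for example with a different timeout on each cluster), it can be written from a small script. This is a sketch under stated assumptions: the helper name `render_ft_config` is hypothetical, only the two keys shown above are emitted, and both keys are assumed to be nested under `fault_tolerance`.

```python
from pathlib import Path


def render_ft_config(initial_rank_heartbeat_timeout=360,
                     restart_policy="any-failed"):
    """Return YAML text for a minimal FT config (hypothetical helper)."""
    if initial_rank_heartbeat_timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Assumption: both keys sit one level below `fault_tolerance`.
    return (
        "fault_tolerance:\n"
        f"  initial_rank_heartbeat_timeout: {initial_rank_heartbeat_timeout}\n"
        f"  restart_policy: {restart_policy}\n"
    )


def write_ft_config(path, **kwargs):
    """Write the rendered config to `path` and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_ft_config(**kwargs))
    return target
```

Calling `write_ft_config("examples/ft_launcher/ft_config.yaml")` with no overrides would reproduce the values shown above.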
Important Notes
Checkpointing: Enable checkpointing for recovery to work:
++checkpointing.enabled=true ++checkpointing.checkpoint_dir=/path/to/checkpoints ++checkpointing.save_period=50
Timeouts: Set --ft-initial-rank-heartbeat-timeout higher than --ft-rank-heartbeat-timeout to allow for model loading and setup time.
Restart Policy: The any-failed restart policy restarts the entire job if any rank fails. Look for these log messages to identify when a restart occurs:
[ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ...
[INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group
[INFO] [ft_launcher...] Stopping workers... Timeout = 30 sec.
[INFO] [ft_launcher...] The node '...' attempts to join the next round of the rendezvous '...'.
[INFO] [ft_launcher...] The node '...' has joined round N of the rendezvous '...' as rank 0 in a world of size 1.
Key indicators:
Worker group FAILED. X/Y attempts left - shows that a restart is happening and how many attempts remain
will restart worker group - confirms that a restart is in progress
has joined round N - the round number increases with each restart
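These indicators can also be picked out of a saved log automatically. The sketch below is illustrative only: the function name `summarize_restarts` is hypothetical, and the regular expressions assume the log lines keep the wording quoted in this guide, which may differ between nvidia-resiliency-ext versions.

```python
import re

# Patterns based on the log wording quoted in this guide (an assumption).
ATTEMPTS_RE = re.compile(r"Worker group FAILED\. (\d+)/(\d+) attempts left")
ROUND_RE = re.compile(r"has joined round (\d+) of the rendezvous")


def summarize_restarts(log_lines):
    """Count restarts and report remaining attempts and the latest round."""
    summary = {
        "restarts": 0,
        "attempts_left": None,
        "max_attempts": None,
        "last_round": None,
    }
    for line in log_lines:
        failed = ATTEMPTS_RE.search(line)
        if failed:
            summary["restarts"] += 1
            summary["attempts_left"] = int(failed.group(1))
            summary["max_attempts"] = int(failed.group(2))
        joined = ROUND_RE.search(line)
        if joined:
            summary["last_round"] = int(joined.group(1))
    return summary


# Illustrative lines; the round number 2 is made up for the example.
sample = [
    "[INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group",
    "[INFO] [ft_launcher...] The node '...' has joined round 2 of the rendezvous '...' as rank 0 in a world of size 1.",
]
result = summarize_restarts(sample)
```

For a real run, pass an open log file instead of the sample list, for example `summarize_restarts(open("train.log"))`, where `train.log` stands in for wherever your job writes its output.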