RAS¶
Since NCCL 2.24, the reliability, availability, and serviceability (RAS) subsystem can be used to query the health of NCCL jobs during execution. This can help with the diagnosis and debugging of crashes and hangs. RAS is a low-overhead infrastructure that NCCL users and developers can use while the application is running. It provides a global view of the state of the running application and can aid in the detection of outliers such as unresponsive processes. With that information, users can then narrow down the suspected root cause(s) through other techniques such as interactive debugging, system log analysis, etc.
Principle of Operation¶
RAS is built into NCCL and launches during NCCL initialization. It consists of a set of threads (one per process) that establish connections with each other, forming a network that the RAS threads then use to exchange information and monitor each other’s health. In a typical configuration, the RAS network traffic (which uses plain TCP/IP sockets on top of the bootstrap/out-of-band network interface that NCCL uses during initialization) should not compete with the main NCCL traffic (which utilizes RDMA networking). RAS is lightweight and should not interfere with the main NCCL job; as such, it is enabled by default (but see NCCL_RAS_ENABLE).
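As a quick illustration (not an excerpt from the NCCL documentation itself), RAS can be disabled for a given run by setting NCCL_RAS_ENABLE in the environment before NCCL initializes; the sketch below assumes the usual NCCL convention that a value of 0 turns a feature off:

import os

# Sketch only: RAS is enabled by default. Setting NCCL_RAS_ENABLE to "0" before
# the first NCCL call is assumed here to disable it (see the NCCL_RAS_ENABLE
# documentation for the authoritative semantics).
os.environ["NCCL_RAS_ENABLE"] = "0"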
The RAS threads communicate with each other about any changes to the job configuration; they also exchange regular keep-alive messages. If an NCCL process crashes or hangs, the RAS threads running on other NCCL processes learn about it when the RAS network connections to that process are shut down or become unresponsive.
RAS Queries¶
The RAS threads also listen for client connections on localhost, port 28028 (these defaults can be changed using NCCL_RAS_ADDR). The ncclras binary client can be used to connect to that socket and query the RAS subsystem for the current job status, which is then printed to standard output. The client accepts the -h and -p arguments to specify the host name and port, -v to produce a more verbose output in case of problems, and -t to specify a different timeout (5 seconds by default; 0 disables the timeout).
As the client communication protocol is fully text-based, standard networking tools such as telnet or netcat can be used instead of the ncclras binary. The relevant commands include STATUS, VERBOSE STATUS (equivalent to the ncclras client's -v argument), and TIMEOUT <seconds> (equivalent to -t); e.g., echo verbose status | nc localhost 28028.
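Since the exchange is plain text over TCP, a query can also be issued programmatically. The following Python sketch (not part of NCCL) assumes the default localhost:28028 endpoint and uses the STATUS/VERBOSE STATUS commands described above:

import socket

# Minimal sketch of a RAS query over the text-based protocol. Assumes the
# default endpoint (localhost, port 28028) and reads until the RAS thread
# closes the connection, which the netcat example above relies on as well.
def ras_query(command="VERBOSE STATUS", host="localhost", port=28028):
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall((command + "\n").encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

if __name__ == "__main__":
    print(ras_query())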
Irrespective of how the query is submitted, the receiving RAS thread sends back the job summary information as well as the summary information about all the NCCL communicators; the latter is collected from all the job’s processes so, for jobs experiencing problems or ones that are particularly large, the response may take several seconds to generate. In case any issues were encountered, additional information is provided.
Sample Output¶
This section contains excerpts of the RAS status output. Please note that the exact format and scope of the information made available are expected to evolve; the excerpts are provided for illustrative purposes only.
Here’s an example output from a job that is progressing normally:
Job summary
===========
  Nodes  Processes         GPUs  Processes       GPUs
(total)   per node  per process    (total)    (total)
      4          8            1         32         32
We’ve got a job consisting of 32 GPUs (1 GPU per process) running on 4 nodes.
Communicators... (0.00s)
=============
  Group     Comms     Nodes     Ranks     Ranks     Ranks  Status    Errors
      #  in group  per comm  per node  per comm  in group
      0         8         4         1         4        32  RUNNING   OK
The GPUs are split into 8 communicators, each spanning all 4 nodes with 1 GPU per node (4 ranks per communicator). RAS attempts to keep the summary output as short as possible by grouping together objects that have the same size and other important properties.
For jobs that are actively communicating during the RAS query, the following output can sometimes be observed:
  Group     Comms     Nodes     Ranks     Ranks     Ranks  Status    Errors
      #  in group  per comm  per node  per comm  in group
      0         1         4         8        32        32  RUNNING   MISMATCH
The output indicates that there is an inconsistency in the information provided by different communicator ranks. Additional information is printed underneath (in this case it’s in the Warnings section, indicating a potentially lower severity):
Warnings
========
#0-0 (27a079b828ff1a75) MISMATCH
  Communicator ranks have different collective operation counts
    26 ranks have launched up to operation 6650
    6 ranks have launched up to operation 6649
      Rank 0 -- GPU 0 managed by process 483072 on node 172.16.64.210
      Rank 2 -- GPU 2 managed by process 483074 on node 172.16.64.210
      Rank 3 -- GPU 3 managed by process 483075 on node 172.16.64.210
      Rank 4 -- GPU 4 managed by process 483076 on node 172.16.64.210
      Rank 5 -- GPU 5 managed by process 483077 on node 172.16.64.210
      Rank 7 -- GPU 7 managed by process 483079 on node 172.16.64.210
Communicators are referred to using the #<x>-<y> identifiers, where <x> is the group number from the summary output and <y> is the communicator number within the group, both starting at 0 (in this example there is only one 32-GPU communicator so, unsurprisingly, the identifier is #0-0). The identifier is followed by the communicator hash (a value that can also be found in NCCL's regular debug output) and by the rank information.

RAS groups together the ranks that share the same relevant property (in this case, the count of issued collective operations). If a group constitutes an outlier, RAS prints additional information about each group member. By default this is done if the group size is at most 25% of the total and the group has no more than 10 members; enabling verbose output relaxes this to under 50% of the total and lifts the group size limit.
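A hypothetical sketch of that reporting rule (the function name and structure are made up; only the thresholds come from the description above):

# Decide whether RAS would list the individual members of an outlier group,
# per the thresholds described above. Illustration only, not NCCL's code.
def list_group_members(group_size, total_ranks, verbose=False):
    if verbose:
        # Verbose output: any group under 50% of the total, no member cap.
        return group_size < 0.5 * total_ranks
    # Default: at most 25% of the total and no more than 10 members.
    return group_size <= 0.25 * total_ranks and group_size <= 10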
The particular case above should not be a cause for concern, as long as the counts increase across repeated queries. NCCL collectives, being optimized for speed, can easily outpace the RAS collective queries, especially if the collective operations are fairly small. An application may also exhibit work imbalance, with certain ranks routinely arriving at collective operations later than others; experience with a particular workload is needed to determine what's normal and what's not. However, if the output does not change across subsequent RAS queries, it may indicate that the communicator is "stuck" for some reason, which could warrant an investigation.
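One simple way to follow this advice is to take two snapshots a few seconds apart and compare them. The sketch below shells out to the ncclras client (assumed to be in the PATH); note that a raw text comparison is crude, since timing lines such as the query duration can differ between snapshots, so a real check would compare the reported operation counts instead:

import subprocess
import time

# Illustration only: take two RAS status snapshots and report whether anything
# changed between them. Unchanged output across repeated queries may indicate
# a stuck communicator, as discussed above.
def ras_snapshot():
    return subprocess.run(["ncclras"], capture_output=True, text=True, timeout=60).stdout

first = ras_snapshot()
time.sleep(10)  # arbitrary interval between the two samples
second = ras_snapshot()
print("status changed" if first != second else "status unchanged (possible hang?)")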
Similar effects can sometimes be observed during communicator initialization or tear-down:
  Group     Comms     Nodes     Ranks     Ranks     Ranks  Status    Errors
      #  in group  per comm  per node  per comm  in group
      0         1         4       1-2        32        32  FINALIZE  MISMATCH
      1         7         4         1         4        28  RUNNING   OK
      2         1         4         1         4         4  INIT      OK
[...]
#0-0 (9e17999afaa87dbb) MISMATCH
  Communicator ranks have different status
    26 ranks have status UNKNOWN
    4 ranks have status RUNNING
      Rank 0 -- GPU 0 managed by process 507285 on node 172.16.64.210
      Rank 8 -- GPU 0 managed by process 1598388 on node 172.16.64.212
      Rank 16 -- GPU 0 managed by process 3500071 on node 172.16.64.213
      Rank 24 -- GPU 0 managed by process 2405067 on node 172.16.64.222
    2 ranks have status FINALIZE
      Rank 4 -- GPU 4 managed by process 507289 on node 172.16.64.210
      Rank 20 -- GPU 4 managed by process 3500075 on node 172.16.64.213
The above snapshot depicts a transitional situation: the initial 32-GPU communicator is being replaced by eight 4-GPU communicators, one of which is still initializing and is therefore listed separately (group #2) from the seven already initialized ones (group #1). The 32-GPU communicator (#0-0) is being torn down, with two ranks in the middle of ncclCommFinalize, four ranks that have not called ncclCommFinalize yet, and the remaining 26 ranks "unknown", meaning that they provided no information about that communicator when RAS was collecting the data, simply because their call to ncclCommFinalize had already completed and they are in fact no longer members of that communicator. Again, as long as the situation is resolved when the query is repeated, it can be ignored.
Here’s an excerpt from an invocation right after artificially creating a problem with one of the job processes:
Communicators... (2.05s)
=============
  Group     Comms     Nodes     Ranks     Ranks     Ranks  Status    Errors
      #  in group  per comm  per node  per comm  in group
      0         1         4       7-8        32        32  RUNNING   INCOMPLETE
Errors
======
INCOMPLETE
  Missing communicator data from 1 job process
    Process 3487984 on node 172.16.64.213 managing GPU 5

#0-0 (cf264af53edbe986) INCOMPLETE
  Missing communicator data from 1 rank
    The missing rank: 21
Warnings
========
TIMEOUT
  Encountered 2 communication timeouts while gathering communicator data
In this case the summary takes a few seconds to generate because RAS waits for the data from the process experiencing problems (the process is unresponsive – it was stopped – but RAS doesn’t know it yet). Repeated queries should be much faster because once RAS determines that a process is unresponsive, it reconfigures the RAS network to route around it.
RAS will attempt to reestablish communication with the unresponsive process; if it’s unable to do so for 60 seconds, it will declare the process dead (permanently):
Errors
======
DEAD
  1 job process is considered dead (unreachable via the RAS network)
    Process 3487984 on node 172.16.64.213 managing GPU 5

#0-0 (cf264af53edbe986) INCOMPLETE
  Missing communicator data from 1 rank
    The missing rank: 21
RAS simply stops attempting to communicate with such processes over the RAS network, leaving it up to the user to determine whether any additional action is warranted.