RAS

Since NCCL 2.24, the reliability, availability, and serviceability (RAS) subsystem can be used to query the health of NCCL jobs during execution. This can help with the diagnosis and debugging of crashes and hangs. RAS is a low-overhead infrastructure that NCCL users and developers can use while the application is running. It provides a global view of the state of the running application and can aid in the detection of outliers such as unresponsive processes. With that information, users can then narrow down the suspected root cause(s) through other techniques such as interactive debugging, system log analysis, etc.

Principle of Operation

RAS is built into NCCL and launches during NCCL initialization. It consists of a set of threads (one per process) that establish connections with each other, forming a network that the RAS threads then use to exchange information and monitor each other’s health. In a typical configuration, the RAS network traffic (which uses plain TCP/IP sockets on top of the bootstrap/out-of-band network interface that NCCL uses during initialization) should not compete with the main NCCL traffic (which utilizes RDMA networking). RAS is lightweight and should not interfere with the main NCCL job; as such, it is enabled by default (but see NCCL_RAS_ENABLE).
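
For example, assuming the usual 0/1 convention for NCCL environment variables (check the NCCL_RAS_ENABLE documentation for the authoritative values), RAS could be turned off for a single run as sketched below; the launcher invocation and application name are hypothetical placeholders:

# Disable the RAS threads for this job only; "mpirun -np 32 ./my_nccl_app"
# stands in for whatever launcher and application are actually used.
NCCL_RAS_ENABLE=0 mpirun -np 32 ./my_nccl_app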

The RAS threads communicate with each other about any changes to the job configuration; they also exchange regular keep-alive messages. If a NCCL process crashes or hangs, the RAS threads running on the other NCCL processes learn about it when their RAS network connections to that process are shut down or become unresponsive.

RAS Queries

The RAS threads also listen for client connections on localhost, port 28028 (these defaults can be changed using NCCL_RAS_ADDR). The ncclras binary client can be used to connect to that socket and query the RAS subsystem for the current job status, which is then printed to standard output. The client accepts the -h and -p arguments to specify the host name and port, -v to produce a more verbose output in case of problems, and -t to specify a different timeout (5 seconds by default; 0 disables the timeout).
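
As a sketch of typical usage based on the options listed above (the host and port shown are simply the defaults and could be omitted):

# Query the local RAS thread, requesting verbose output and a 30-second timeout.
ncclras -h localhost -p 28028 -v -t 30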

As the client communication protocol is fully text-based, standard networking tools such as telnet or netcat can be used instead of the ncclras binary. The relevant commands include STATUS, VERBOSE STATUS (equivalent to the ncclras client’s -v argument), and TIMEOUT <seconds> (equivalent to -t); e.g., echo verbose status | nc localhost 28028.
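
The TIMEOUT command presumably needs to be sent on the same connection before the status request, mirroring the -t option; the netcat invocation below assumes that multiple newline-separated commands are accepted per connection, which the description above does not spell out:

# Verbose status query with a 30-second timeout (assumes the server accepts
# several newline-separated commands on one connection).
printf 'timeout 30\nverbose status\n' | nc localhost 28028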

Irrespective of how the query is submitted, the receiving RAS thread sends back the job summary information as well as summary information about all the NCCL communicators; the latter is collected from all the job's processes, so for jobs that are particularly large or are experiencing problems, the response may take several seconds to generate. If any issues were encountered, additional information is provided.

Sample Output

This section contains excerpts of the RAS status output. Please note that the exact format and scope of the information being made available is expected to evolve; the excerpts are provided for illustrative purposes only.

Here’s an example output from a job that is progressing normally:

Job summary
===========

  Nodes  Processes         GPUs  Processes     GPUs
(total)   per node  per process    (total)  (total)
      4          8            1         32       32

We’ve got a job consisting of 32 GPUs (1 GPU per process) running on 4 nodes.

Communicators... (0.00s)
=============

Group     Comms     Nodes     Ranks     Ranks     Ranks    Status  Errors
    #  in group  per comm  per node  per comm  in group
    0         8         4         1         4        32   RUNNING      OK

The GPUs are split into 8 communicators, each spanning all 4 nodes with 1 GPU per node (i.e., 4 ranks per communicator). RAS attempts to make the summary output as short as possible by grouping together objects that have the same size and other important properties.

For jobs that are actively communicating during the RAS query, the following output can sometimes be observed:

Group     Comms     Nodes     Ranks     Ranks     Ranks    Status  Errors
    #  in group  per comm  per node  per comm  in group
    0         1         4         8        32        32   RUNNING  MISMATCH

The output indicates that there is an inconsistency in the information provided by different communicator ranks. Additional information is printed underneath (in this case it’s in the Warnings section, indicating a potentially lower severity):

Warnings
========

#0-0 (27a079b828ff1a75) MISMATCH
  Communicator ranks have different collective operation counts
  26 ranks have launched up to operation 6650
  6 ranks have launched up to operation 6649
  Rank 0 -- GPU 0 managed by process 483072 on node 172.16.64.210
  Rank 2 -- GPU 2 managed by process 483074 on node 172.16.64.210
  Rank 3 -- GPU 3 managed by process 483075 on node 172.16.64.210
  Rank 4 -- GPU 4 managed by process 483076 on node 172.16.64.210
  Rank 5 -- GPU 5 managed by process 483077 on node 172.16.64.210
  Rank 7 -- GPU 7 managed by process 483079 on node 172.16.64.210

Communicators are referred to using #<x>-<y> identifiers, where <x> is the group number from the summary output and <y> is the communicator number within the group, both starting at 0 (in this example there is only one (32-GPU) communicator, so, unsurprisingly, the identifier is #0-0). The identifier is followed by the communicator hash (a value that can also be found in NCCL's regular debug output) and by the rank information. RAS groups together the ranks that share the same relevant property (in this case, the count of issued collective operations). If a group constitutes an outlier, RAS prints additional information about each group member. By default this is done if the group size is at most 25% of the total and the group has no more than 10 members; enabling verbose output relaxes this to under 50% of the total and lifts the group size limit.

The particular case above should not be a cause for concern as long as the counts increase across repeated queries. NCCL collectives, being optimized for speed, can easily outpace the RAS collective queries, especially if the size of the collectives is fairly small. An application may also exhibit work imbalance, with certain ranks routinely arriving at the collective operations later than others; experience with a particular workload is needed to determine what is normal and what is not. However, if the output does not change across subsequent RAS queries, it may indicate that the communicator is “stuck” for some reason, which could warrant an investigation.
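
One simple way to tell the two situations apart is to repeat the query a few times and compare the reported operation counts; the polling loop below is merely an illustration (the interval, repetition count, and grep filter are arbitrary choices):

# Poll RAS every 10 seconds and show the mismatch details, if any,
# to check whether the operation counts are still advancing.
for i in 1 2 3; do
  echo verbose status | nc localhost 28028 | grep -A 3 MISMATCH
  sleep 10
done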

Similar effects can sometimes be observed during communicator initialization or tear-down:

Group     Comms     Nodes     Ranks     Ranks     Ranks    Status  Errors
    #  in group  per comm  per node  per comm  in group
    0         1         4       1-2        32        32  FINALIZE  MISMATCH
    1         7         4         1         4        28   RUNNING      OK
    2         1         4         1         4         4      INIT      OK

[...]

#0-0 (9e17999afaa87dbb) MISMATCH
  Communicator ranks have different status
  26 ranks have status UNKNOWN
  4 ranks have status RUNNING
  Rank 0 -- GPU 0 managed by process 507285 on node 172.16.64.210
  Rank 8 -- GPU 0 managed by process 1598388 on node 172.16.64.212
  Rank 16 -- GPU 0 managed by process 3500071 on node 172.16.64.213
  Rank 24 -- GPU 0 managed by process 2405067 on node 172.16.64.222
  2 ranks have status FINALIZE
  Rank 4 -- GPU 4 managed by process 507289 on node 172.16.64.210
  Rank 20 -- GPU 4 managed by process 3500075 on node 172.16.64.213

The above snapshot depicts a transitional situation: the initial 32-GPU communicator is being replaced by eight 4-GPU communicators, one of which is still initializing and is therefore listed separately (group #2) from the seven already initialized ones (group #1). The 32-GPU communicator (#0-0) is being torn down, with two ranks in the middle of ncclCommFinalize, four ranks that have not called ncclCommFinalize yet, and the remaining 26 ranks reported as “unknown”, meaning that they provided no information about that communicator when RAS was collecting data: their calls to ncclCommFinalize had already completed, so they are in fact no longer that communicator’s members. Again, as long as the situation resolves itself when the query is repeated, it can be ignored.

Here’s an excerpt from an invocation right after artificially creating a problem with one of the job processes:

Communicators... (2.05s)
=============

Group     Comms     Nodes     Ranks     Ranks     Ranks    Status  Errors
    #  in group  per comm  per node  per comm  in group
    0         1         4       7-8        32        32   RUNNING  INCOMPLETE

Errors
======

INCOMPLETE
  Missing communicator data from 1 job process
  Process 3487984 on node 172.16.64.213 managing GPU 5

#0-0 (cf264af53edbe986) INCOMPLETE
  Missing communicator data from 1 rank
  The missing rank: 21

Warnings
========

TIMEOUT
  Encountered 2 communication timeouts while gathering communicator data

In this case the summary takes a few seconds to generate because RAS waits for the data from the process experiencing problems (the process is unresponsive – it was stopped – but RAS doesn’t know it yet). Repeated queries should be much faster because once RAS determines that a process is unresponsive, it reconfigures the RAS network to route around it.

RAS will attempt to reestablish communication with the unresponsive process; if it’s unable to do so for 60 seconds, it will declare the process dead (permanently):

Errors
======

DEAD
  1 job process is considered dead (unreachable via the RAS network)
  Process 3487984 on node 172.16.64.213 managing GPU 5

#0-0 (cf264af53edbe986) INCOMPLETE
  Missing communicator data from 1 rank
  The missing rank: 21

RAS will simply stop attempting to communicate with such processes over the RAS network, leaving it up to the user to determine whether any additional action is warranted.