***
RAS
***

.. highlight:: none

Since NCCL 2.24, the reliability, availability, and serviceability (RAS) subsystem can be used to
query the health of NCCL jobs during execution. This can help with the diagnosis and debugging of
crashes and hangs. RAS is a low-overhead infrastructure that NCCL users and developers can use
while the application is running. It provides a global view of the state of the running application
and can aid in the detection of outliers such as unresponsive processes. With that information,
users can then narrow down the suspected root cause(s) through other techniques such as interactive
debugging, system log analysis, etc.

Principle of Operation
----------------------

RAS is built into NCCL and launches during NCCL initialization. It consists of a set of threads
(one per process) that establish connections with each other, forming a network that the RAS
threads then use to exchange information and monitor each other's health. In a typical
configuration, the RAS network traffic (which uses plain TCP/IP sockets on top of the
bootstrap/out-of-band network interface that NCCL uses during initialization) should not compete
with the main NCCL traffic (which utilizes RDMA networking). RAS is lightweight and should not
interfere with the main NCCL job; as such, it is enabled by default (but see
:ref:`env_NCCL_RAS_ENABLE`).

The RAS threads communicate with each other about any changes to the job configuration; they also
exchange regular keep-alive messages. If a NCCL process crashes or hangs, the RAS threads running
on other NCCL processes learn about it when their RAS network connections to that process are shut
down or become unresponsive.

RAS Queries
-----------

The RAS threads also listen for client connections on ``localhost``, port ``28028`` (these defaults
can be changed using :ref:`env_NCCL_RAS_ADDR`). The ``ncclras`` binary client can be used to
connect to that socket and query the RAS subsystem for the current job status, which is then
printed to standard output. The client accepts the ``-h`` and ``-p`` arguments to specify the host
name and port, ``-v`` to produce more verbose output in case of problems, and ``-t`` to specify a
different timeout (``5`` seconds by default; ``0`` disables the timeout).

As the client communication protocol is fully text-based, standard networking tools such as telnet
or netcat can be used instead of the ``ncclras`` binary. The relevant commands include ``STATUS``,
``VERBOSE STATUS`` (equivalent to the ``ncclras`` client's ``-v`` argument), and
``TIMEOUT <seconds>`` (equivalent to ``-t``); e.g., ``echo verbose status | nc localhost 28028``.

Irrespective of how the query is submitted, the receiving RAS thread sends back the job summary
information as well as summary information about all the NCCL communicators; the latter is
collected from all the job's processes so, for jobs experiencing problems or ones that are
particularly large, the response may take several seconds to generate. In case any issues were
encountered, additional information is provided.
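Because the protocol is plain text over TCP, a query can also be issued from a short script. The
following sketch is not part of NCCL and only illustrates the idea: it assumes the default
``localhost:28028`` endpoint and that the RAS thread closes the connection once the full response
has been sent (the behavior suggested by the ``nc`` one-liner above); the ``ras_query`` helper name
is made up for this example.

.. code:: python

   import socket

   def ras_query(command="STATUS", host="localhost", port=28028, timeout=30.0):
       """Send one text command to the RAS socket and return the reply as a string.

       Illustrative sketch only: assumes the default RAS endpoint and that the
       RAS thread closes the connection after sending the complete response.
       """
       with socket.create_connection((host, port), timeout=timeout) as sock:
           sock.sendall((command + "\n").encode())
           reply = bytearray()
           while chunk := sock.recv(4096):  # an empty chunk means the peer closed
               reply += chunk
       return reply.decode()

   if __name__ == "__main__":
       print(ras_query("VERBOSE STATUS"))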
Sample Output
-------------

This section contains excerpts of the RAS status output. Please note that the exact format and
scope of the information being made available are expected to evolve; the excerpts are provided for
illustrative purposes only.

Here's an example output from a job that is progressing normally:

.. code::

   Job summary
   ===========

   Nodes    Processes  GPUs         Processes  GPUs
   (total)  per node   per process  (total)    (total)
   4        8          1            32         32

We've got a job consisting of 32 GPUs (1 GPU per process) running on 4 nodes.

.. code::

   Communicators... (0.00s)
   =============

   Group  Comms     Nodes     Ranks     Ranks     Ranks     Status   Errors
   #      in group  per comm  per node  per comm  in group
   0      8         4         1         4         32        RUNNING  OK

The GPUs are split into 8 communicators, each using 1 GPU per node (4 ranks per communicator). RAS
attempts to make the summary output as short as possible by grouping together objects having the
same size and other important properties.

For jobs that are actively communicating during the RAS query, the following output can sometimes
be observed:

.. code::

   Group  Comms     Nodes     Ranks     Ranks     Ranks     Status   Errors
   #      in group  per comm  per node  per comm  in group
   0      1         4         8         32        32        RUNNING  MISMATCH

The output indicates that there is an inconsistency in the information provided by different
communicator ranks. Additional information is printed underneath (in this case in the Warnings
section, indicating a potentially lower severity):

.. code::

   Warnings
   ========

   #0-0 (27a079b828ff1a75) MISMATCH
     Communicator ranks have different collective operation counts
     26 ranks have launched up to operation 6650
     6 ranks have launched up to operation 6649
       Rank 0 -- GPU 0 managed by process 483072 on node 172.16.64.210
       Rank 2 -- GPU 2 managed by process 483074 on node 172.16.64.210
       Rank 3 -- GPU 3 managed by process 483075 on node 172.16.64.210
       Rank 4 -- GPU 4 managed by process 483076 on node 172.16.64.210
       Rank 5 -- GPU 5 managed by process 483077 on node 172.16.64.210
       Rank 7 -- GPU 7 managed by process 483079 on node 172.16.64.210

Communicators are referred to using ``#<group>-<comm>`` identifiers, where ``<group>`` is the group
number from the summary output and ``<comm>`` is the communicator number within the group, both
starting with 0 (in this example there is only one (32-GPU) communicator so, unsurprisingly, the
identifier is ``#0-0``). The identifier is followed by the communicator hash (a value that can also
be found in NCCL's regular debug output) and by the rank information. RAS groups together the ranks
with the same relevant property (in this case, the count of issued collective operations). If a
group constitutes an outlier, RAS prints additional information about each group member. By default
this is done if the group size is at most 25% of the total *and* the group has no more than 10
members; enabling verbose output relaxes this to under 50% of the total and lifts the group size
limit.

The particular case above should not be a cause for concern, as long as the counts increase across
repeated queries. NCCL collectives, being optimized for speed, can easily outpace the RAS
collective queries, especially if the size of the collectives is fairly small. An application may
also exhibit work imbalance, with certain ranks routinely arriving at the collective operations
later than others -- experience with a particular workload is needed to determine what is normal
and what is not. However, if the output does not change across subsequent RAS queries, it may
indicate that the communicator is "stuck" for some reason, which could warrant an investigation.
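One way to apply this guidance programmatically is to take two status snapshots a short interval
apart and compare the reported operation counts. The sketch below is not an NCCL tool; it assumes
the same default RAS endpoint as the earlier example and parses the ``have launched up to
operation`` lines exactly as shown in the excerpt above, a human-readable format that, as noted
earlier, is expected to evolve.

.. code:: python

   import re
   import socket
   import time

   def ras_query(command="VERBOSE STATUS", host="localhost", port=28028):
       """Minimal RAS text query (same assumptions as the earlier sketch)."""
       with socket.create_connection((host, port), timeout=30.0) as sock:
           sock.sendall((command + "\n").encode())
           reply = bytearray()
           while chunk := sock.recv(4096):
               reply += chunk
       return reply.decode()

   OP_COUNT_RE = re.compile(r"launched up to operation (\d+)")

   def looks_stuck(interval=10.0):
       """Return True if the reported collective operation counts did not advance.

       Heuristic sketch: an empty or changing set of counts is treated as progress;
       identical counts across both snapshots may indicate a stuck communicator.
       """
       before = set(OP_COUNT_RE.findall(ras_query()))
       time.sleep(interval)
       after = set(OP_COUNT_RE.findall(ras_query()))
       return bool(before) and before == after

   if __name__ == "__main__":
       print("possibly stuck" if looks_stuck() else "progressing (or no counts reported)")

A single unchanged pair of snapshots is only a hint; repeating the check a few times before raising
an alarm helps avoid false positives during phases with little communication.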
Similar effects can sometimes be observed during communicator initialization or tear-down:

.. code::

   Group  Comms     Nodes     Ranks     Ranks     Ranks     Status    Errors
   #      in group  per comm  per node  per comm  in group
   0      1         4         1-2       32        32        FINALIZE  MISMATCH
   1      7         4         1         4         28        RUNNING   OK
   2      1         4         1         4         4         INIT      OK

   [...]

   #0-0 (9e17999afaa87dbb) MISMATCH
     Communicator ranks have different status
     26 ranks have status UNKNOWN
     4 ranks have status RUNNING
       Rank 0 -- GPU 0 managed by process 507285 on node 172.16.64.210
       Rank 8 -- GPU 0 managed by process 1598388 on node 172.16.64.212
       Rank 16 -- GPU 0 managed by process 3500071 on node 172.16.64.213
       Rank 24 -- GPU 0 managed by process 2405067 on node 172.16.64.222
     2 ranks have status FINALIZE
       Rank 4 -- GPU 4 managed by process 507289 on node 172.16.64.210
       Rank 20 -- GPU 4 managed by process 3500075 on node 172.16.64.213

The above snapshot depicts a transitional situation as the initial, 32-GPU communicator is being
replaced by eight 4-GPU communicators (one of which is still initializing, so it is listed as a
separate group, ``#2``, from the seven already initialized ones in group ``#1``). The 32-GPU
communicator (``#0-0``) is being torn down, with two ranks in the middle of ``ncclCommFinalize``,
four ranks that have *not* called ``ncclCommFinalize`` yet, and the remaining 26 ranks "unknown" --
meaning that they did not provide any information about that communicator when RAS was collecting
data, simply because their call to ``ncclCommFinalize`` has already completed and they are in fact
no longer members of that communicator. Again, as long as the situation resolves itself when the
query is repeated, it can be ignored.

Here's an excerpt from an invocation right after artificially creating a problem with one of the
job processes:

.. code::

   Communicators... (2.05s)
   =============

   Group  Comms     Nodes     Ranks     Ranks     Ranks     Status   Errors
   #      in group  per comm  per node  per comm  in group
   0      1         4         7-8       32        32        RUNNING  INCOMPLETE

   Errors
   ======

   INCOMPLETE
     Missing communicator data from 1 job process
       Process 3487984 on node 172.16.64.213 managing GPU 5

   #0-0 (cf264af53edbe986) INCOMPLETE
     Missing communicator data from 1 rank
     The missing rank: 21

   Warnings
   ========

   TIMEOUT
     Encountered 2 communication timeouts while gathering communicator data

In this case the summary takes a few seconds to generate because RAS waits for the data from the
process experiencing problems (the process is unresponsive -- it was stopped -- but RAS doesn't
know that yet). Repeated queries should be much faster because, once RAS determines that a process
is unresponsive, it reconfigures the RAS network to route around it.

RAS will attempt to reestablish communication with the unresponsive process; if it is unable to do
so for 60 seconds, it declares the process dead (permanently):

.. code::

   Errors
   ======

   DEAD
     1 job process is considered dead (unreachable via the RAS network)
       Process 3487984 on node 172.16.64.213 managing GPU 5

   #0-0 (cf264af53edbe986) INCOMPLETE
     Missing communicator data from 1 rank
     The missing rank: 21

RAS simply stops attempting to communicate with such processes over the RAS network, leaving it up
to the user to determine whether any additional action is warranted.

.. highlight:: shell
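For unattended monitoring, the documented ``ncclras`` client can be wrapped in a small polling loop
that surfaces such conditions as they appear. The sketch below is only an illustration, not a
supported tool: it assumes ``ncclras`` is in the ``PATH`` and simply scans the human-readable
output for the ``DEAD`` and ``INCOMPLETE`` markers shown in the excerpts above, whose exact wording
may change in future releases.

.. code:: python

   import subprocess
   import time

   def watch(poll_interval=30.0):
       """Run the ncclras client periodically and print statuses that report problems.

       Heuristic sketch: scans the human-readable output for the error markers shown
       in the excerpts on this page; the output format is expected to evolve.
       """
       while True:
           # -v: more verbose output in case of problems; -t 30: allow slow responses.
           result = subprocess.run(["ncclras", "-v", "-t", "30"],
                                   capture_output=True, text=True)
           markers = [m for m in ("DEAD", "INCOMPLETE") if m in result.stdout]
           if markers:
               print(f"RAS reported {', '.join(markers)}; full status follows:\n{result.stdout}")
           time.sleep(poll_interval)

   if __name__ == "__main__":
       watch()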