Connections, Quorum, and Cleanup
Connections
IMEX supports dynamically reconnecting to other IMEX daemons. Because IMEX manages references to handles allocated by CUDA jobs, it must determine whether the IMEX instance it has reconnected to is the same instance as before or a new one. If it is the same instance, nothing needs to be done. If it is a new instance, recovery and cleanup need to occur.
IMEX detects when a connection is lost and, by default, does not perform any cleanup until the connection is established again. However, the IMEX_NODE_DISCONNECTED_GRACE_TIME configuration option can set a timeout that triggers cleanup after the specified period.
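The reconnect and grace-time rules above can be modelled in a few lines. The following Python sketch is illustrative only and is not IMEX's implementation: `PeerState`, `on_reconnect`, `grace_expired`, and the idea of an opaque per-instance identifier are assumptions made for the example. Only the IMEX_NODE_DISCONNECTED_GRACE_TIME option comes from the text, and treating its value as a number of seconds is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerState:
    """Hypothetical bookkeeping for one remote IMEX daemon."""
    instance_id: str                          # assumed opaque ID that changes on restart
    disconnected_at: Optional[float] = None   # time the connection was lost, if any


def on_reconnect(peer: PeerState, new_instance_id: str) -> bool:
    """Return True if recovery and cleanup are needed after a reconnect."""
    peer.disconnected_at = None
    if new_instance_id == peer.instance_id:
        return False                 # same instance: nothing needs to be done
    peer.instance_id = new_instance_id
    return True                      # new instance: recovery and cleanup required


def grace_expired(peer: PeerState, now: float,
                  grace_time: Optional[float]) -> bool:
    """Return True if a disconnected peer has exceeded the grace time.

    grace_time=None models the default behaviour: no cleanup until the
    connection is established again.
    """
    if peer.disconnected_at is None or grace_time is None:
        return False
    return now - peer.disconnected_at >= grace_time
```

In this model, cleanup for a peer is triggered either when `on_reconnect` returns True or when `grace_expired` returns True, whichever happens first.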
Quorum
IMEX no longer requires a full quorum to complete initialization, so during normal operation it is ready for processing immediately after starting. However, if IMEX was not shut down cleanly, then on the next startup it waits for the nodes that had previously imported memory to reconnect. This wait, which relies on the /var/run/nvidia-imex/persist.dat persistence file, allows those nodes to clean up old references. The file contains a key generated from nodes_config and a list of the importing nodes, each recorded as Node #: ip address.
If there are no outstanding importers, the file contains only the nodes_config key.
Note: This recovery quorum behavior can be disabled by setting IMEX_WAIT_FOR_QUORUM=NONE. Also, if the nodes_config file has changed between restarts, the persist.dat file is discarded and the recovery quorum is skipped.
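As a rough illustration of how these conditions combine, the sketch below decides which nodes a starting daemon would wait for. It is a simplified model, not IMEX code: the function name, its parameters, and the representation of the persisted data as a key plus a list of addresses are all hypothetical, and any IMEX_WAIT_FOR_QUORUM value other than NONE is treated here simply as leaving the recovery quorum enabled.

```python
from typing import Iterable, List


def nodes_to_wait_for(clean_shutdown: bool,
                      wait_for_quorum: str,
                      persisted_key: str,
                      current_key: str,
                      persisted_importers: Iterable[str]) -> List[str]:
    """Return the importer addresses to wait for at startup (possibly none)."""
    if clean_shutdown:
        return []        # normal start: ready for processing immediately
    if wait_for_quorum == "NONE":
        return []        # recovery quorum disabled by configuration
    if persisted_key != current_key:
        return []        # nodes_config changed: persisted data is discarded
    # Otherwise wait for every node that still held imports.
    return list(persisted_importers)
```

With no outstanding importers the persisted list is empty, so the function returns an empty list and startup proceeds without waiting.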
Cleaning up
If a node detects that a peer node has restarted, or if IMEX_NODE_DISCONNECTED_GRACE_TIME is configured and the timeout has expired, it triggers a cleanup by notifying the driver that all jobs that had imported from the peer are invalid. If a CUDA application then tries to access memory imported from the peer, an error is triggered.
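The effect of a cleanup can be pictured with a small model of import tracking. This Python sketch is purely conceptual: `ImportTable` and its methods are hypothetical names, the real bookkeeping happens inside IMEX and the driver, and the `RuntimeError` merely stands in for the error a CUDA application would see.

```python
from typing import Dict, Set, Tuple


class ImportTable:
    """Hypothetical record of which jobs imported memory from which node."""

    def __init__(self) -> None:
        self._importers: Dict[str, Set[str]] = {}     # exporting node -> jobs
        self._invalid: Set[Tuple[str, str]] = set()   # (job, exporting node)

    def record_import(self, job: str, exporter: str) -> None:
        self._importers.setdefault(exporter, set()).add(job)

    def invalidate_node(self, exporter: str) -> Set[str]:
        """Mark every job's imports from `exporter` invalid; return those jobs."""
        jobs = self._importers.pop(exporter, set())
        self._invalid.update((job, exporter) for job in jobs)
        return jobs

    def access(self, job: str, exporter: str) -> None:
        """Raise if the job touches memory imported from an invalidated node."""
        if (job, exporter) in self._invalid:
            raise RuntimeError(
                f"imports from {exporter} are no longer valid for {job}")
```

Imports from other nodes are unaffected in this model: only accesses to memory that came from the invalidated node raise an error.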