IMEX Service Command Service#

The IMEX Service can also be configured with a command service enabled (refer to “IMEX Service CMD Utility Service Configuration” on page 25). Enabling this feature allows you to use the nvidia-imex-ctl tool, which is installed as part of the IMEX package. The tool queries the state of one or all IMEX instances, which gives you a convenient way to assess the status of the IMEX cluster without manually inspecting every log file.

The nvidia-imex-ctl Tool#

The nvidia-imex-ctl tool connects to the IMEX Service command service.

The built-in help output summarizes the usage:

nvidia-imex-ctl --help
Usage: nvidia-imex-ctl [-i <IP> <PORT>|-u <unix domain socket path>|-c <nvidia-imex config.cfg path>] <-n|-N|-q|-s|-d|-m> [-t <timeout>]
Where:
       -c <nvidia-imex config path>: Path to nvidia-imex configuration file.  Default value is /etc/nvidia-imex/config.cfg
NOTE:  If none of -i, -u, -c are provided, -c is assumed with default value.
       -n: Continuously monitor the entire IMEX domain status.  Requires config file (default or via -c)
       -N: Gets the full status of the entire IMEX domain.  Requires config file (default or via -c)
       -H: Show hostname alongside node IP, only works with -n/-N
       -t: Timeout (in milliseconds) for the -s command to terminate.  Unspecified means it will run until IMEX terminates.
       -j: Output of the -n/-N command will be in JSON, rather than formatted.
NOTE: only one of -i, -u, or -c can be specified.  If none are specified, -c is assumed with the default nvidia-imex config file location.
      also, only one command can be executed.

Connecting to the IMEX Services#

Depending on the IMEX Service configuration, nvidia-imex-ctl can establish connections using TCP/IP or UNIX domain sockets. The connection method is determined by the -i, -u, and -c parameters.

  • -i : Connects through TCP/IP to the provided address:port.

  • -u : Connects by using a UNIX domain socket.

  • -c : Reads the config file and connects by TCP/IP to the nodes that are configured in it.

To query the configured nodes, combine the -c parameter with the query (-q) parameter.
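The connection rules above can be sketched as a small Python helper that assembles an nvidia-imex-ctl argument list. This is a minimal sketch, not part of the IMEX package: the function name build_ctl_argv and its parameters are hypothetical, and it uses only the flags that appear in the help text. It assumes that -i takes the IP and port as two separate arguments (as the usage line shows) and that the option order used in this guide's own example (-H -n -c) is acceptable to the tool.

```python
from typing import List, Optional, Tuple

# Command flags listed in the usage line of the help text.
COMMANDS = {"-n", "-N", "-q", "-s", "-d", "-m"}


def build_ctl_argv(
    command: str,
    tcp: Optional[Tuple[str, int]] = None,   # (ip, port) for -i
    unix_socket: Optional[str] = None,       # socket path for -u
    config: Optional[str] = None,            # config file path for -c
    show_hostnames: bool = False,            # -H
    json_output: bool = False,               # -j
) -> List[str]:
    """Build an argument list for nvidia-imex-ctl (hypothetical helper)."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command flag: {command}")
    chosen = [x for x in (tcp, unix_socket, config) if x is not None]
    if len(chosen) > 1:
        raise ValueError("only one of -i, -u, or -c can be specified")
    domain_wide = command in {"-n", "-N"}
    if domain_wide and (tcp is not None or unix_socket is not None):
        raise ValueError("-n/-N require a config file (default or via -c)")
    if (show_hostnames or json_output) and not domain_wide:
        raise ValueError("-H and -j only apply to -n/-N")

    argv = ["nvidia-imex-ctl"]
    if show_hostnames:
        argv.append("-H")
    if json_output:
        argv.append("-j")
    argv.append(command)
    if tcp is not None:
        argv += ["-i", tcp[0], str(tcp[1])]
    elif unix_socket is not None:
        argv += ["-u", unix_socket]
    elif config is not None:
        argv += ["-c", config]
    # With no connection flag, the tool falls back to -c with its default path.
    return argv
```

For example, build_ctl_argv("-n", config="nodaemon.cfg", show_hostnames=True) yields the arguments for the monitoring invocation used later in this guide. Rejecting invalid combinations before the tool is launched keeps the "only one of -i, -u, or -c" rule in one place.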

Operations#

This section describes the operations that nvidia-imex-ctl supports.

Full Domain Monitoring and Querying#

The full domain monitoring and querying commands (-n and -N) connect nvidia-imex-ctl to each IMEX daemon in the domain, as listed in the configuration file (the default file or the one specified by the -c option). The tool retrieves, or subscribes to, status updates that include the status of each IMEX daemon (refer to “Operations” on page 34) and the interconnectivity state matrix between the IMEX daemons in the domain. This guide describes the subscription/monitoring command (-n). The output of the one-time query command (-N) is the same, except that it retrieves only a single snapshot and has no indicators for state changes.

Here is an example of running a 2-node NO GPU nvidia-imex domain in monitoring mode (-n) with -H enabled to show hostnames:

./nvidia-imex-ctl -H -n -c nodaemon.cfg
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
!A! - Authentication error, communication disabled.
C - Connected - Ready for operation
In monitoring mode, changes in state will be denoted by * *

3/18/2025 12:41:51.570
Nodes:
Node #0   * 10.127.59.221 *    - UNAVAILABLE          - Version:                 - Hostname: imex-mini-1
Node #1   - 10.127.59.222      - UNAVAILABLE          - Version:                 - Hostname: imex-mini-2

 Nodes From\To  0   1
       0        I   I
       1        I   I
Domain State: DOWN

3/18/2025 12:42:37.613
Nodes:
Node #0   * 10.127.59.221 *    - *READY*              - Version: NO_GPU          - Hostname: imex-mini-1
Node #1   - 10.127.59.222      - UNAVAILABLE          - Version:                 - Hostname: imex-mini-2

 Nodes From\To  0   1
       0       *C* *N*
       1        I   I
Domain State: DEGRADED

3/18/2025 12:42:41.938
Nodes:
Node #0   * 10.127.59.221 *    - READY                - Version: NO_GPU          - Hostname: imex-mini-1
Node #1   - 10.127.59.222      - *READY*              - Version: NO_GPU          - Hostname: imex-mini-2

 Nodes From\To  0   1
       0        C  *C*
       1       *C* *C*
Domain State: UP

Note: The ‘*’ characters surrounding the IP address in the node list indicate the node on which the nvidia-imex-ctl command is being run.
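When the monitoring output is captured to a file or pipe, the snapshots can be split and summarized with a short script. The following Python sketch is an illustration only: parse_snapshots is a hypothetical helper, and its regular expressions assume the text layout shown in the example above (a timestamp line, "Node #" lines, and a "Domain State:" line), which may differ between IMEX versions.

```python
import re
from typing import Dict, List

# Snapshot header, e.g. "3/18/2025 12:42:37.613".
TIMESTAMP_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2}\.\d+$")
# Node line: "* <ip> *" marks the local node, "*STATUS*" marks a changed status.
# The Version value may be empty, and the Hostname field is present only with -H.
NODE_RE = re.compile(
    r"^Node #(?P<index>\d+)\s+(?P<marker>[*-])\s+(?P<ip>\S+)(?:\s+\*)?"
    r"\s+-\s+(?P<status>\S+)\s+-\s+Version:(?:[ \t]+(?!-\s)(?P<version>\S+))?"
    r"(?:\s+-\s+Hostname:\s*(?P<hostname>\S+))?\s*$"
)
STATE_RE = re.compile(r"^Domain State:\s*(\S+)")


def parse_snapshots(text: str) -> List[Dict]:
    """Split captured -n/-N text output into one dict per timestamped snapshot."""
    snapshots: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            current = {"timestamp": line, "state": None, "nodes": []}
            snapshots.append(current)
        elif current is None:
            continue  # legend text printed before the first snapshot
        elif (m := NODE_RE.match(line)):
            status = m.group("status")
            current["nodes"].append({
                "index": int(m.group("index")),
                "ip": m.group("ip"),
                "local": m.group("marker") == "*",
                "status": status.strip("*"),
                "changed": status.startswith("*") and status.endswith("*"),
                "version": m.group("version") or "",
                "hostname": m.group("hostname"),
            })
        elif (m := STATE_RE.match(line)):
            current["state"] = m.group(1).strip("*")
    return snapshots
```

Applied to the three snapshots in the example above, the "state" fields are DOWN, DEGRADED, and UP, in that order. The connectivity matrix rows are deliberately ignored here so that the sketch stays focused on per-node status.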

Here is an example of domain query mode (-N), which retrieves the current state and then exits. The output is from an 8-node cluster running version 580.00 with -H enabled, where the hostnames of two nodes could not be retrieved:

Failed to look up hostname for Node #4 IP address: 10.76.188.218
Failed to look up hostname for Node #7 IP address: 10.76.191.240
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
!A! - Authentication error, communication disabled.
C - Connected - Ready for operation

3/18/2025 19:17:50.187
Nodes:
Node #0   * 10.76.179.115 *    - READY                - Version: 580.00          - Hostname: imex-compute0.nvidia.com
Node #1   - 10.76.183.30       - READY                - Version: 580.00          - Hostname: imex-compute1.nvidia.com
Node #2   - 10.76.184.132      - READY                - Version: 580.00          - Hostname: imex-compute2.nvidia.com
Node #3   - 10.76.186.255      - READY                - Version: 580.00          - Hostname: imex-compute3.nvidia.com
Node #4   - 10.76.188.218      - READY                - Version: 580.00          - Hostname: N/A
Node #5   - 10.76.189.233      - READY                - Version: 580.00          - Hostname: imex-compute5.nvidia.com
Node #6   - 10.76.190.23       - READY                - Version: 580.00          - Hostname: imex-compute6.nvidia.com
Node #7   - 10.76.191.240      - READY                - Version: 580.00          - Hostname: N/A

 Nodes From\To  0   1   2   3   4   5   6   7
       0        C   C   C   C   C   C   C   C
       1        C   C   C   C   C   C   C   C
       2        C   C   C   C   C   C   C   C
       3        C   C   C   C   C   C   C   C
       4        C   C   C   C   C   C   C   C
       5        C   C   C   C   C   C   C   C
       6        C   C   C   C   C   C   C   C
       7        C   C   C   C   C   C   C   C
Domain State: UP
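For larger domains, scanning the connectivity table by eye becomes impractical. The following Python sketch reads the table from captured text output and lists every link that is not in the Connected (C) state. The helper names parse_connectivity and unhealthy_links are hypothetical, and the parsing assumes the "Nodes From\To" header and row layout shown in the examples above.

```python
from typing import Dict, List, Tuple

Link = Tuple[int, int]  # (from_node, to_node)


def parse_connectivity(text: str) -> Dict[Link, str]:
    """Map each (from, to) pair to its state code from the connectivity table.

    If the text holds several snapshots, later tables overwrite earlier ones.
    """
    links: Dict[Link, str] = {}
    columns = None  # destination node numbers from the header row
    for raw in text.splitlines():
        tokens = raw.split()
        if tokens[:2] == ["Nodes", "From\\To"]:
            columns = [int(t) for t in tokens[2:]]
            continue
        if columns is None:
            continue
        # A table row is a source node number followed by one cell per column.
        if len(tokens) != len(columns) + 1 or not tokens[0].isdigit():
            columns = None  # first non-row line ends the table
            continue
        src = int(tokens[0])
        for dst, cell in zip(columns, tokens[1:]):
            # Drop the "*...*" change markers used in monitoring mode.
            links[(src, dst)] = cell.strip("*")
    return links


def unhealthy_links(links: Dict[Link, str]) -> List[Tuple[Link, str]]:
    """Return every link whose state is anything other than C (Connected)."""
    return sorted((pair, state) for pair, state in links.items() if state != "C")
```

For the fully connected 8-node output above, unhealthy_links(parse_connectivity(output)) returns an empty list. For a degraded domain, it returns the affected (from, to) pairs together with the legend code, such as I or N.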

When you include the -j option with -n or -N, the output is emitted as JSON instead of the formatted text shown above.
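JSON output is easier to consume from scripts than the formatted tables. The following Python sketch runs a one-time query with -N -j and decodes the result. It is an illustration under stated assumptions: the helper name query_domain_json is hypothetical, standard output is assumed to contain a single valid JSON document, and no field names are assumed because this guide does not describe the JSON schema.

```python
import json
import subprocess
from typing import Any, Callable


def query_domain_json(
    config: str = "/etc/nvidia-imex/config.cfg",  # default path from the help text
    run: Callable[..., Any] = subprocess.run,     # injectable for testing
) -> Any:
    """Run a one-time domain query with JSON output and return the decoded data."""
    argv = ["nvidia-imex-ctl", "-N", "-j", "-c", config]
    # check=True raises CalledProcessError if the tool exits with a nonzero status.
    result = run(argv, capture_output=True, text=True, check=True)
    # Assumption: stdout holds only JSON; json.loads raises ValueError otherwise.
    return json.loads(result.stdout)
```

Inspect the returned structure (for example, with json.dumps(data, indent=2)) before relying on specific keys, because the layout may vary between IMEX versions.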