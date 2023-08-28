Run IB Cluster Health Checks to diagnose the state of the fabric using ibdiagnet checks, SM files, and switch commands.

Warning: By default, this job template is configured to run against the ib_host_manager group of the IB Cluster Inventory.

The following instructions describe how to run this job template:

Go to Resources > Templates. Click the "Launch Template" button on "IB Cluster Health Checks".

The following variables are available for running IB Cluster Health Checks:

| Name | Description |
|------|-------------|
| check_max_failure_percentage | Max failure percentage for cluster health checks |
| cot_executable | Path to the installed cotclient tool |
| exclude_scope | List of node GUIDs and their ports to be excluded |
| device | Name of the RDMA device of the port used to connect to the fabric |
| routing_check | Specify for routing check |
| sm_configuration_file | Path for SM configuration file; supported only when the SM is running on the ib_host_manager |
| sm_unhealthy_ports_check | Specify for SM unhealthy ports check; supported only when the SM is running on the ib_host_manager |
| topology_type | Type of topology to discover |
| mlnxos_switch_hostname | Hostname expression that represents switches running MLNX-OS |
| mlnxos_switch_username | Username to authenticate against the target switches |
| mlnxos_switch_password | Password to authenticate against the target switches |

The following are variable definitions and default values for the health check:

| Name | Default | Type |
|------|---------|------|
| check_max_failure_percentage | 1 | Float |
| cot_executable | '/opt/nvidia/cot/client/bin/cotclient' | String |
| exclude_scope | NULL | List(String) |
| device | 'mlx5_0' | String |
| routing_check | True | Boolean |
| sm_configuration_file | '/etc/opensm/opensm.conf' | String |
| sm_unhealthy_ports_check | false | Boolean |
| topology_type | 'infiniband' | String |
| mlnxos_switch_hostname | NULL | String |
| mlnxos_switch_username | NULL | String |
| mlnxos_switch_password | NULL | String |
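As a rough illustration of how the defaults and types listed above fit together, the following Python sketch merges user overrides onto the documented defaults and rejects values of the wrong type. The helper name build_variables and the mapping of the Type column onto Python types are assumptions made for this example; they are not part of the product.

```python
# Defaults copied from the list above; None stands in for NULL.
DEFAULTS = {
    "check_max_failure_percentage": 1,
    "cot_executable": "/opt/nvidia/cot/client/bin/cotclient",
    "exclude_scope": None,
    "device": "mlx5_0",
    "routing_check": True,
    "sm_configuration_file": "/etc/opensm/opensm.conf",
    "sm_unhealthy_ports_check": False,
    "topology_type": "infiniband",
    "mlnxos_switch_hostname": None,
    "mlnxos_switch_username": None,
    "mlnxos_switch_password": None,
}

# Assumed mapping of the documented "Type" column onto Python types.
NUMBER = (int, float)
TYPES = {name: str for name in DEFAULTS}
TYPES.update({
    "check_max_failure_percentage": NUMBER,
    "exclude_scope": list,
    "routing_check": bool,
    "sm_unhealthy_ports_check": bool,
})


def build_variables(overrides):
    """Hypothetical helper: merge overrides onto the defaults and type-check them."""
    merged = dict(DEFAULTS)
    for name, value in overrides.items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown variable: {name}")
        expected = TYPES[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        bool_as_number = isinstance(value, bool) and expected is NUMBER
        if value is not None and (bool_as_number or not isinstance(value, expected)):
            raise TypeError(f"{name} has the wrong type: {value!r}")
        merged[name] = value
    return merged


variables = build_variables({"device": "mlx5_1", "sm_unhealthy_ports_check": True})
```

Variables that are not overridden keep their documented defaults, so the resulting dictionary always contains all eleven names.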

The following example shows how to exclude ports using the exclude_scope variable:

    exclude_scope: ['0x1234@1/3', '0x1235']

In this example, IB Cluster Health Check runs over the fabric except on ports 1 and 3 of node GUID 0x1234 and all ports of node GUID 0x1235.
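To make the entry syntax concrete, here is a minimal Python sketch that interprets exclude_scope entries the way the example above describes. The grammar (a GUID, optionally followed by "@" and a "/"-separated port list) is inferred from that single example, and the function names parse_exclude_scope and is_excluded are hypothetical; the actual tool may accept other forms.

```python
def parse_exclude_scope(entries):
    """Map each node GUID to its excluded ports, or to None for all ports."""
    excluded = {}
    for entry in entries:
        guid, separator, ports = entry.partition("@")
        if not separator:
            # No port list given: every port of this node is excluded.
            excluded[guid] = None
        else:
            excluded[guid] = sorted(int(port) for port in ports.split("/"))
    return excluded


def is_excluded(excluded, guid, port):
    """Return True if the given port of the given node is excluded."""
    if guid not in excluded:
        return False
    ports = excluded[guid]
    return ports is None or port in ports


scope = parse_exclude_scope(["0x1234@1/3", "0x1235"])
print(scope)  # {'0x1234': [1, 3], '0x1235': None}
```

With this scope, is_excluded(scope, "0x1234", 2) is False, while any port of 0x1235 is treated as excluded.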

The following example shows how to configure switch variables:

    mlnxos_switch_hostname: 'ib-switch-t[1-2],ib-switch-s1'
    mlnxos_switch_username: 'admin'
    mlnxos_switch_password: 'my_admin_password'
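The hostname expression above presumably expands to a list of individual switches. The Python sketch below shows one plausible reading, assuming that commas separate entries and that "[1-2]" denotes an inclusive numeric range; the function expand_hostnames is hypothetical, and the real expression syntax may support more than this.

```python
import re

# Matches a single inclusive numeric range such as "[1-2]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hostnames(expression):
    """Expand a comma-separated hostname expression into individual hostnames."""
    hosts = []
    for part in expression.split(","):
        part = part.strip()
        match = _RANGE.search(part)
        if match is None:
            # Plain hostname with no range to expand.
            hosts.append(part)
            continue
        start, end = int(match.group(1)), int(match.group(2))
        prefix, suffix = part[:match.start()], part[match.end():]
        for number in range(start, end + 1):
            hosts.append(f"{prefix}{number}{suffix}")
    return hosts


print(expand_hostnames("ib-switch-t[1-2],ib-switch-s1"))
# ['ib-switch-t1', 'ib-switch-t2', 'ib-switch-s1']
```

This sketch expands only the first range in each entry and does not handle comma-separated lists inside brackets.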