Known Issues
Internal Reference Number |
Issues |
3478803 |
Description: Retrieving topology information (sharp_cmd topology) fails when executed from the management host. |
Workaround: Run the command from a different host, or set the following environment variable: SHARP_ALLOW_SM_PORT=1 |
|
Keywords: SHARP topology API |
|
Discovered in Version: 3.5.0 |
|
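The environment-variable workaround for issue 3478803 can be sketched in Python as follows. This is a minimal illustration, not an official tool: the helper names build_topology_env and run_topology are hypothetical, and the sketch assumes sharp_cmd is on the PATH and that "sharp_cmd topology" needs no further arguments (any that are required can be passed through extra_args).

```python
import os
import shutil
import subprocess


def build_topology_env(base_env=None):
    """Return a copy of the environment with SHARP_ALLOW_SM_PORT=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SHARP_ALLOW_SM_PORT"] = "1"  # variable named in the workaround
    return env


def run_topology(extra_args=()):
    """Run `sharp_cmd topology` with the workaround variable set.

    Returns None when sharp_cmd is not found on PATH (assumption: the
    binary is installed under that name on the management host).
    """
    exe = shutil.which("sharp_cmd")
    if exe is None:
        return None
    return subprocess.run(
        [exe, "topology", *extra_args],
        env=build_topology_env(),
        check=False,  # let the caller inspect the return code
    )
```

Only the child process sees the variable, so the setting does not leak into other commands run from the same shell.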
3340353 |
Description: When a standby management host is reconfigured to operate as a compute host, it cannot run SHARP jobs until sharp_am is restarted. If a host runs the SM process, the master SM automatically detects it as a standby SM and reports it as a standby management host. Note that a restart is not required if ignore_sm_guids is set to FALSE. |
Workaround: N/A |
|
Keywords: Slave; compute host; ignore_sm_guids |
|
Discovered in Version: 3.3.0 |
|
3371820 |
Description: Congestion Control cannot be configured on the same SLs used by sharp_am. |
Workaround: N/A |
|
Keywords: Congestion control; SL |
|
Discovered in Version: 3.3.0 |
|
3438393 |
Description: When operating in the following configuration mode, resource limitation is ignored and no limit is set to any application: Dynamic trees allocation is used; Quasi Fat Tree (QFT)-oriented logic is used; and reservation_mode is on. |
Workaround: N/A |
|
Keywords: Dynamic trees allocation; QFT; resource limitation |
|
Discovered in Version: 3.3.0 |
|
3305335 |
Description: When running mpirun with multiple groups, the following error message might be received: [error] - AM QPAlloc confirm QP MAD response status 0x1c00. This message appears because multiple unserialized MAD requests are run in parallel. |
Workaround: Set the SHARP_COLL_SERIALIZE_MADS environment variable to TRUE when running mpirun. |
|
Keywords: mpirun; SHARP_COLL_SERIALIZE_MADS |
|
Discovered in Version: 3.2.0 |
|
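The workaround for issue 3305335 can be sketched as a small Python helper that builds the mpirun command line. This is an illustration under assumptions: it assumes an Open MPI based mpirun, where the -x option exports a variable to the launched processes, and the helper name mpirun_command and the application path ./my_app are hypothetical.

```python
def mpirun_command(app_argv, num_procs):
    """Build an mpirun argv that sets SHARP_COLL_SERIALIZE_MADS=TRUE.

    Assumes Open MPI's mpirun, where `-x NAME=value` exports the
    variable to every launched rank.
    """
    return [
        "mpirun",
        "-np", str(num_procs),
        "-x", "SHARP_COLL_SERIALIZE_MADS=TRUE",
        *app_argv,
    ]


if __name__ == "__main__":
    # ./my_app is a placeholder for the real MPI application.
    print(" ".join(mpirun_command(["./my_app"], num_procs=4)))
```

The resulting list can be passed to subprocess.run as is. With a launcher that does not support -x, the variable would have to be exported through that launcher's own mechanism.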
3225401 |
Description: The dynamic trees creation feature does not support the case in which all root switches go down and are restarted. If such a scenario takes place, sharp_am should be restarted once the root switches are up and running. |
Workaround: N/A |
|
Keywords: Aggregation Manager; sharp_am; dynamic trees |
|
Discovered in Version: 3.1.0 |
|
3237831 |
Description: SHARP does not support reassignment of LID values. In case LID reassignment is desired, make sure to stop all SHARP jobs, reassign LIDs via OpenSM, and restart sharp_am once the reassignment is done. |
Workaround: N/A |
|
Keywords: Aggregation Manager; OpenSM |
|
Discovered in Version: 3.1.0 |
|
3048427 |
Description: If a switch's split mode is modified (off/on), sharp_am does not handle the new number of supported ports unless it is restarted. |
Workaround: Restart sharp_am after changing a switch split mode definition. |
|
Keywords: Aggregation Manager; split mode |
|
Discovered in Release: 2.7.0 |
|
3051699 |
Description: Changing the configuration of SHARP switch ports using device_configuration_file does not take effect on disconnected split ports. If these ports are connected later, they will remain with their default configuration. |
Workaround: If the new configuration is desired for the split ports, make sure to restart the Aggregation Manager after connecting a split port to a host. |
|
Keywords: Aggregation Manager; split port |
|
Discovered in Release: 2.7.0 |
|
3051924 |
Description: Adding or replacing non-leaf switches is currently not supported by Aggregation Manager for Dragonfly+ topologies. |
Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes. |
|
Keywords: Fabric extension; Aggregation Manager; AM |
|
Discovered in Release: 2.7.0 |
|
- |
Description: In a multi-PKEY environment, UCX in SHARP can use only the default PKEY (the PKEY at index 0). |
Workaround: Use sockets for communication over a non-default PKEY. |
|
Keywords: Configuration, SMX, UCX, PKEY |
|
Discovered in Release: 2.4.3 |
|
1307124 |
Description: Begin Job requests with virtual ports might be rejected until the fabric virtualization info file is parsed. |
Workaround: Wait for AM to discover virtual ports before sending Begin Job requests. |
|
Keywords: Aggregation Manager, Socket Direct, Virtual Ports |
|
Discovered in Release: 1.5.3 |
|
1193629 |
Description: Configuring sharp_am as a daemon is not possible when installing from an RPM into a non-default location. |
Workaround: Configure the daemon manually. |
|
Keywords: Configuration |
|
Discovered in Release: 1.5.3 |
|
1307108 |
Description: Discovering a new Aggregation Node (AN) found on the shortest path between two ANs might invalidate the existing path. |
Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes. |
|
Keywords: Aggregation Manager, Aggregation Node |
|
Discovered in Release: 1.5.3 |
|
- |
Description: Aggregation Manager High Availability is currently not supported in HPCX/MLNX OFED packages. Therefore, only a single instance of Aggregation Manager can run in the IB fabric. |
Workaround: Use Aggregation Manager in UFM. |
|
Keywords: Aggregation Manager |
|
- |
Description: Aggregation Manager should run on the same host where the Master Subnet Manager (SM) is running. |
Workaround: N/A |
|
Keywords: Aggregation Manager |
|
- |
Description: With HPCX/MLNX OFED packages, upon Subnet Manager handover/failover, another instance of Aggregation Manager should be started on the host where the new Master SM is running. |
Workaround: Use Aggregation Manager in UFM. |
|
Keywords: Aggregation Manager |
|
- |
Description: Aggregation Manager should be started after completion of fabric configuration by the Subnet Manager. |
Workaround: N/A |
|
Keywords: Aggregation Manager |
|
- |
Description: Only Fat-Tree, Quasi-Fat-Tree, Hypercube and Dragonfly+ topologies are supported by the Aggregation Manager. |
Workaround: N/A |
|
Keywords: Fabric Topology |
|
- |
Description: Only IB fabrics where all compute nodes are connected to NVIDIA SHARP capable switches are supported by the Aggregation Manager. |
Workaround: Manually configure mapping between the compute port and the Aggregation Node. |
|
Keywords: Fabric Topology |
|
- |
Description: Upon changes in the configuration file beyond the parameters in 3.3, Aggregation Manager should be restarted to deploy the new configuration. |
Workaround: N/A |
|
Keywords: Configuration |
|
3686321 |
Description: When upgrading UFM from previous versions to UFM 6.15.x, the sharp_am persistent directory specified in the configuration file points to a path that does not exist. As a result, reservation and job information cannot be saved, so if sharp_am is restarted, it cannot retrieve the required information and return to its previous state. |
Workaround: Edit the file /opt/ufm/files/conf/sharp/sharp_am.cfg and set the persistent_dir parameter to the path /opt/ufm/files/conf/sharp/jobs. Make sure this path exists. |
|
Keywords: sharp_am, UFM, upgrade |
|
Discovered in Version: 3.5.0 |
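
The workaround for issue 3686321 above (pointing persistent_dir at an existing directory) can be sketched in Python as follows. This is a rough sketch, not a supported tool: it assumes sharp_am.cfg stores one parameter per line as "name value" or "name = value", and the helper names set_persistent_dir and apply_fix are hypothetical. The two default paths are the ones given in the workaround.

```python
import re
from pathlib import Path

# Matches an active (uncommented) persistent_dir line.
# Assumed layout: "persistent_dir <value>" or "persistent_dir = <value>".
_PARAM = re.compile(r"^(\s*persistent_dir)(\s*=\s*|\s+)\S.*$")


def set_persistent_dir(cfg_text, new_dir):
    """Return cfg_text with persistent_dir pointing at new_dir."""
    out, found = [], False
    for line in cfg_text.splitlines():
        m = _PARAM.match(line)
        if m:
            # Keep the original name and separator, swap only the value.
            out.append(m.group(1) + m.group(2) + new_dir)
            found = True
        else:
            out.append(line)
    if not found:
        # No active setting found: append one (assumed "name value" form).
        out.append("persistent_dir " + new_dir)
    return "\n".join(out) + "\n"


def apply_fix(cfg_path="/opt/ufm/files/conf/sharp/sharp_am.cfg",
              new_dir="/opt/ufm/files/conf/sharp/jobs"):
    """Create new_dir if it is missing, then rewrite the config file."""
    Path(new_dir).mkdir(parents=True, exist_ok=True)
    cfg = Path(cfg_path)
    cfg.write_text(set_persistent_dir(cfg.read_text(), new_dir))
```

Back up the configuration file before running anything like apply_fix, and check the rewritten file by hand, since the assumed line format may not match every installation.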