Message Passing Interface (MPI) provides virtual topology, synchronization, and communication functionality between a set of processes. With MPI you can run one process on several hosts.
- Windows MPI runs over the following protocols:
- Sockets (Ethernet or IPoIB)
- Network Direct (ND) over Ethernet and InfiniBand
- Install HPC Pack (build 4.0.3906.0).
- Validate traffic (ping) between all the MPI hosts.
- Every MPI client needs to run the smpd process, which opens the MPI channel.
- The MPI initiator server needs to run mpiexec. If the initiator is also a client, it should also run smpd.
Run the following command on each MPI client.
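The command itself is not shown in this section. As a minimal sketch, assuming the MS-MPI `smpd` daemon is on the PATH and using port 8677 purely as an illustrative value, it could be started in debug mode like this:

```powershell
# Start the MS-MPI process manager in debug mode, listening on port 8677.
# The port number is an assumed value; pass the same port to mpiexec with -p.
smpd -d -p 8677
```

Debug mode keeps smpd in the foreground, which makes connection problems visible while the cluster is being brought up.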
When running MPI over ND, install the ND provider on each MPI client.
- Run the following command on the MPI server.
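The server-side command is not shown in this section. A hedged sketch of what an MS-MPI `mpiexec` launch could look like follows; the port, host addresses, netmask, and application path are all hypothetical placeholders rather than values from the original setup:

```powershell
# Launch one process on each of two hosts through smpd listening on port 8677.
# MPICH_NETMASK restricts MPI traffic to the given subnet (address/mask).
# All addresses and the executable path below are illustrative assumptions.
mpiexec -p 8677 -hosts 2 192.0.2.11 192.0.2.12 `
    -env MPICH_NETMASK 192.0.2.0/255.255.255.0 `
    C:\mpi\app.exe
```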
Directing MSMPI Traffic
Directing MPI traffic to a specific QoS priority may be delayed due to the following limitations:
- The QoS PowerShell cmdlet matches NetworkDirect traffic only through NetDirectPortMatchCondition, which does not support a port range. Therefore, a single policy cannot direct NetworkDirect traffic on all ports (1-65535).
- The MSMPI directive that controls the port range (namely: MPICH_PORT_RANGE 3000,3030) does not work for ND, and MSMPI chooses a random port.
Running MSMPI on the Desired Priority
Set the default QoS policy to the desired priority. (Note: this priority should be lossless all the way through the switches*.)
- Set the SMB policy to a desired priority only if SMB traffic is running.
[Recommended] Direct ALL TCP/UDP traffic to a lossy priority by using the “IPProtocolMatchCondition”.
TCP is used for the MPI control channel (smpd), while UDP is used for other services such as Remote Desktop.
Arista switches forward the PCP bits (i.e., the 802.1p priority within the VLAN tag) from ingress to egress, so that any two end nodes in the fabric maintain the priority along the route.
In this case, the packet from the sender goes out with priority X and reaches the far end node with the same priority X.
The priority should be lossless in the switches.
To force MSMPI to work over ND and not over sockets, add the following to the mpiexec command:
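The options themselves are not shown in this section. Assuming the MS-MPI environment variables `MPICH_DISABLE_ND` and `MPICH_DISABLE_SOCK`, a sketch with hypothetical hosts and application path would be:

```powershell
# MPICH_DISABLE_ND 0   -> keep the Network Direct channel enabled
# MPICH_DISABLE_SOCK 1 -> disable the sockets channel, so MPI cannot fall back to it
# Host addresses and the executable path are illustrative assumptions.
mpiexec -hosts 2 192.0.2.11 192.0.2.12 `
    -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 `
    C:\mpi\app.exe
```

With sockets disabled, a job that cannot establish an ND connection fails instead of silently running over TCP, which makes ND misconfiguration easier to spot.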
Configure all the hosts in the cluster with identical PFC (see the PFC example below).
- Run the WHCK ND-based traffic tests to check PFC (ndrping, ndping, ndrpingpong, ndpingpong).
- During the ND tests, validate the PFC counters with “Mellanox Adapter QoS Counters” in perfmon.
- Install the same version of HPC Pack in the entire cluster.
- NOTE: A version mismatch in HPC Pack 2012 can cause MPI to hang.
- Validate the MPI base infrastructure with simple commands, such as “hostname”.
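Two of the validation steps above can be sketched from PowerShell. This assumes the counter set is registered under exactly the name quoted above and that smpd is already running on the listed hosts; the port and addresses are hypothetical:

```powershell
# List the counter paths exposed by the Mellanox QoS counter set,
# as an alternative to browsing for them in the perfmon GUI.
(Get-Counter -ListSet "Mellanox Adapter QoS Counters").Paths

# Basic infrastructure check: run "hostname" on two hosts through MPI.
# Each host should report its own name if smpd and mpiexec can reach each other.
mpiexec -p 8677 -hosts 2 192.0.2.11 192.0.2.12 hostname
```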
In the example below, ND and NDK traffic goes to priority 3, which is configured as no-drop in the switches. All TCP/UDP traffic is directed to priority 1.
Remove all previous settings.
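One possible way to clear earlier QoS configuration, assuming the built-in Windows NetQos cmdlets and an elevated PowerShell prompt:

```powershell
# Remove all user-defined traffic classes and QoS policies without prompting.
Remove-NetQosTrafficClass -Confirm:$false
Remove-NetQosPolicy -Confirm:$false
```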
Set the DCBX Willing parameter to false as Mellanox drivers do not support this feature.
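A sketch of this step with the Windows NetQos cmdlets (an assumption about tooling, since the original command is not shown here):

```powershell
# Stop the host from accepting DCB settings pushed by the switch over DCBX.
Set-NetQosDcbxSetting -Willing $false

# Show the current DCBX setting to confirm the change.
Get-NetQosDcbxSetting
```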
Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.
In this example we used TCP/UDP priority 1, ND/NDK priority 3.
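The policies described above could be created as follows. This is a sketch using `New-NetQosPolicy`; the policy names are arbitrary labels chosen for illustration, and the SMB policy is only relevant when SMB traffic is running:

```powershell
# Default policy: traffic not matched by any other policy, including
# MSMPI ND traffic on its random port, is tagged with priority 3.
New-NetQosPolicy -Name "DEFAULT" -Default -PriorityValue8021Action 3

# Optional: SMB Direct traffic (NetworkDirect port 445) on priority 3.
New-NetQosPolicy -Name "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3

# All TCP and UDP traffic on the lossy priority 1.
New-NetQosPolicy -Name "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
New-NetQosPolicy -Name "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
```

Using the default policy for ND works around the missing port-range support: instead of matching the ND ports, everything else is matched explicitly and moved away from the lossless priority.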
Enable PFC on priority 3.
Disable Priority Flow Control (PFC) for all priorities other than 3.
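The two PFC steps above, sketched with the Windows NetQos cmdlets (assumed tooling):

```powershell
# Make priority 3 lossless by enabling PFC on it.
Enable-NetQosFlowControl -Priority 3

# Disable PFC on every other 802.1p priority.
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

# List the per-priority PFC state to verify the result.
Get-NetQosFlowControl
```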
Enable QoS on the relevant interface.
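A sketch of this step, where "Ethernet 4" is a hypothetical adapter name to be replaced with the interface that carries the MPI traffic:

```powershell
# Enable QoS (DCB) on the adapter; the name is an illustrative assumption.
Enable-NetAdapterQos -Name "Ethernet 4"

# Display the adapter's QoS state, including the operational PFC settings.
Get-NetAdapterQos -Name "Ethernet 4"
```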
Running MPI Command Examples
Running the MPI Pallas test over ND.
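The original command line is not shown in this section. A hedged sketch follows, in which the port, the four host addresses, the netmask, and the benchmark path are all hypothetical:

```powershell
# Run the benchmark on four hosts over Network Direct only:
# ND stays enabled (MPICH_DISABLE_ND 0) and sockets are disabled (MPICH_DISABLE_SOCK 1).
mpiexec -p 8677 -hosts 4 192.0.2.11 192.0.2.12 192.0.2.13 192.0.2.14 `
    -env MPICH_NETMASK 192.0.2.0/255.255.255.0 `
    -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 `
    C:\mpi\pallas.exe
```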
Running the MPI Pallas test over Ethernet.
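For the sockets case, the same kind of sketch applies with the two channel switches inverted. As before, the port, host addresses, netmask, and benchmark path are hypothetical placeholders:

```powershell
# Run the benchmark on four hosts over sockets only:
# ND is disabled (MPICH_DISABLE_ND 1) and sockets stay enabled (MPICH_DISABLE_SOCK 0).
mpiexec -p 8677 -hosts 4 192.0.2.11 192.0.2.12 192.0.2.13 192.0.2.14 `
    -env MPICH_NETMASK 192.0.2.0/255.255.255.0 `
    -env MPICH_DISABLE_ND 1 -env MPICH_DISABLE_SOCK 0 `
    C:\mpi\pallas.exe
```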