Windows MPI (MS-MPI)
Message Passing Interface (MPI) provides virtual topology, synchronization, and communication functionality between a set of processes. With MPI you can run one process on several hosts.
Windows MPI runs over the following protocols:
Sockets (Ethernet or IPoIB)
Network Direct (ND) over Ethernet and InfiniBand
Install HPC (Build: 4.0.3906.0).
Validate traffic (ping) between all of the MPI hosts.
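For example, from the initiator host (a sketch using the host IPs from the command examples at the end of this section; substitute your own addresses):
> ping 11.21.147.101
> ping 11.21.147.51
> ping 11.11.145.101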
Every MPI client needs to run the smpd process, which opens the MPI channel.
The MPI initiator server needs to run mpiexec. If the initiator is also a client, it should also run smpd.
Run the following command on each MPI client:
start smpd -d -p <port>
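For example, to listen on the same smpd port used in the command examples at the end of this section:
start smpd -d -p 19020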
Install the ND provider on each MPI client when running MPI over ND.
Run the following command on the MPI server:
mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_NETMASK <network_ip/subnet> -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND <0/1> -env MPICH_DISABLE_SOCK <0/1> -affinity <process>
Directing MPI traffic to a specific QoS priority may be delayed due to the following:
The QoS PowerShell cmdlets do not support a port range for NetworkDirect traffic; only a single port can be matched with NetDirectPortMatchCondition. Therefore, NetworkDirect traffic cannot be directed to ports 1-65536.
The MS-MPI directive to control the port range (namely, MPICH_PORT_RANGE 3000,3030) does not work for ND, and MS-MPI chooses a random port.
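For reference, a sketch of how that directive is passed (via -env, like the other MPICH_* settings in this section); it affects the sockets channel only and, as noted above, is ignored for ND:
mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_PORT_RANGE 3000,3030 <process>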
To run MPI on the desired priority despite these limitations:
Set the default QoS policy to the desired priority. (Note: this priority should be lossless all the way through the switches; see the switch note below.)
Set the SMB policy to the desired priority only if SMB traffic is running.
[Recommended] Direct ALL TCP/UDP traffic to a lossy priority by using the "IPProtocolMatchCondition" (see the PFC example below).
Warning: TCP is used for the MPI control channel (smpd), while UDP is used for other services such as remote desktop.
Arista switches forward the PCP bits (i.e. the 802.1p priority within the VLAN tag) from ingress to egress, which allows any two end-nodes in the fabric to maintain the priority along the route.
In this case, a packet that leaves the sender with priority X reaches the far end-node with the same priority X.
The priority should be configured as lossless in the switches.
To force MS-MPI to work over ND and not over sockets, add the following to the mpiexec command:
-env MPICH_DISABLE_ND 0
-env MPICH_DISABLE_SOCK 1
Configure all the hosts in the cluster with identical PFC (see the PFC example below).
Run the WHCK ND-based traffic tests (ndrping, ndping, ndrpingpong, ndpingpong) to check PFC.
Validate the PFC counters while the ND tests are running, using the "Mellanox Adapter QoS Counters" set in perfmon.
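These counters can also be watched from PowerShell (a sketch; the exact counter names in the set depend on the installed driver version):
# List the counters available in the set
Get-Counter -ListSet "Mellanox Adapter QoS Counters" | Select-Object -ExpandProperty Paths
# Sample all counters in the set every 2 seconds, 10 times, while the ND tests run
Get-Counter -Counter (Get-Counter -ListSet "Mellanox Adapter QoS Counters").Paths -SampleInterval 2 -MaxSamples 10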
Install the same version of HPC Pack on the entire cluster.
NOTE: A version mismatch in HPC Pack 2012 can cause MPI to hang.
Validate the MPI base infrastructure with simple commands, such as "hostname".
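For example, a minimal sanity check (a sketch that reuses the smpd port and two of the host IPs from the command examples at the end of this section) runs hostname on two hosts:
> mpiexec.exe -p 19020 -hosts 2 11.11.146.101 11.21.147.101 hostname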
PFC Example
In the example below, ND and NDK traffic is mapped to priority 3, which is configured as no-drop (lossless) in the switches. ALL TCP/UDP traffic is directed to priority 1.
Install the Data Center Bridging (DCB) feature.
Install-WindowsFeature Data-Center-Bridging
Remove all previous settings.
Remove-NetQosTrafficClass
Remove-NetQosPolicy -Confirm:$False
Set the DCBX Willing parameter to false as Mellanox drivers do not support this feature.
Set-NetQosDcbxSetting -Willing 0
Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.
In this example, TCP/UDP traffic is tagged with priority 1, and ND/NDK traffic with priority 3.
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 3
New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
Enable PFC on priority 3.
Enable-NetQosFlowControl 3
Disable Priority Flow Control (PFC) for all other priorities (0-2 and 4-7).
Disable-NetQosFlowControl 0,1,2,4,5,6,7
Enable QoS on the relevant interface.
Enable-NetAdapterQos -Name <interface name>
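Optionally, the resulting configuration can be reviewed with the standard DCB cmdlets (a sketch):
Get-NetQosPolicy          # the SMB, DEFAULT, TCP and UDP policies created above
Get-NetQosFlowControl     # priority 3 should be Enabled, all other priorities Disabled
Get-NetAdapterQos         # per-adapter operational QoS/PFC state
Get-NetQosDcbxSetting     # Willing should be False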
Running MPI Command Examples
Running the MPI Pallas test over ND:
> mpiexec.exe -p 19020 -hosts 4 11.11.146.101 11.21.147.101 11.21.147.51 11.11.145.101 -env MPICH_NETMASK 11.0.0.0/255.0.0.0 -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 -affinity c:\test1.exe
Running the MPI Pallas test over ETH:
> mpiexec.exe -p 19020 -hosts 4 11.11.146.101 11.21.147.101 11.21.147.51 11.11.145.101 -env MPICH_NETMASK 11.0.0.0/255.0.0.0 -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 1 -env MPICH_DISABLE_SOCK 0 -affinity c:\test1.exe