Appendix: Windows MPI (MS-MPI)
Message Passing Interface (MPI) provides virtual topology, synchronization, and communication functionality between a set of processes.
With MPI, you can run a single application as a set of processes spread across several hosts.
Windows MPI runs over the following protocols:
Sockets (TCP)
Network Direct (ND)
Install HPC (Build: 4.0.3906.0).
Validate traffic (ping) between all the MPI hosts.
Every MPI client needs to run the smpd process, which opens the MPI channel.
The MPI initiator server needs to run mpiexec. If the initiator is also a client, it should also run smpd.
Run the following command on each MPI client:
start smpd -d -p <port>
Install the ND provider on each MPI client (for MPI over ND).
Run the following command on the MPI server:
mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_NETMASK <network_ip/subnet> -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND <0/1> -env MPICH_DISABLE_SOCK <0/1> -affinity <process>
Directing MPI traffic to a specific QoS priority may be delayed due to the following:
Except for NetDirectPortMatchCondition, the QoS PowerShell cmdlets for NetworkDirect traffic do not support port ranges. Therefore, NetworkDirect traffic cannot be directed to ports 1-65536.
The MSMPI directive to control the port range (namely: MPICH_PORT_RANGE 3000,3030) does not work for ND, so MSMPI chooses a random port.
Set the default QoS policy to the desired priority. (Note: this priority should be lossless all the way through the switches.)
Set the SMB policy to a desired priority only if SMB traffic is running.
[Recommended] Direct ALL TCP/UDP traffic to a lossy priority by using the "IPProtocolMatchCondition".
Warning: TCP is used for the MPI control channel (smpd), while UDP is used for other services such as remote desktop.
Arista switches forward the PCP bits (i.e., the 802.1p priority within the VLAN tag) from ingress to egress, enabling any two end-nodes in the fabric to maintain the priority along the route.
In this case the packet from the sender goes out with priority X and reaches the far end-node with the same priority X.
The priority should be lossless in the switches.
To force MSMPI to work over ND and not over sockets, add the following to the mpiexec command: -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1
Configure all the hosts in the cluster with identical PFC (see the PFC example below).
Run the WHCK ND-based traffic tests to check PFC (ndrping, ndping, ndrpingpong, ndpingpong).
Validate the PFC counters, during the run time of the ND tests, using the "Mellanox Adapter QoS Counters" in perfmon.
Install the same version of HPC Pack on the entire cluster. NOTE: A version mismatch in HPC Pack 2012 can cause MPI to hang.
Validate the MPI base infrastructure with simple commands, such as “hostname”.
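For example, a minimal end-to-end check (assuming smpd is already listening on the chosen port on both hosts; the port and host IPs are placeholders) could look like:

```
mpiexec.exe -p <smpd_port> -hosts 2 <host1_ip> <host2_ip> hostname
```

If the base infrastructure is healthy, each rank prints the name of the host it ran on.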
In the example below, ND and NDK traffic go to priority 3, which is configured as no-drop in the switches, while ALL TCP/UDP traffic is directed to priority 1.
Remove all previous settings:
Remove-NetQosTrafficClass
Remove-NetQosPolicy -Confirm:$False
Set the DCBX Willing parameter to false as Mellanox drivers do not support this feature.
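For instance, using the inbox DCB PowerShell module (cmdlet from the standard NetQos module; verify against your driver release notes):

```powershell
# Disable DCBX willing mode, since the driver does not support it
Set-NetQosDcbxSetting -Willing $false
```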
Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.
In this example we used TCP/UDP priority 1, ND/NDK priority 3.
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 3
New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
Enable PFC on priority 3.
Disable Priority Flow Control (PFC) for all priorities other than 3.
Enable QoS on the relevant interface.
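The PFC and interface steps above can be sketched with the standard DCB cmdlets; the adapter name "Ethernet 1" is a placeholder for your actual interface:

```powershell
# Enable PFC on priority 3 only
Enable-NetQosFlowControl -Priority 3
# Disable PFC on all other priorities
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7
# Enable QoS on the relevant interface (adapter name is an example)
Enable-NetAdapterQos -Name "Ethernet 1"
```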
Running MPI Command Examples
Running the MPI Pallas test over ND:
> mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_NETMASK <network_ip/subnet> -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 -affinity <process>
Running the MPI Pallas test over ETH:
> mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_NETMASK <network_ip/subnet> -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 1 -env MPICH_DISABLE_SOCK 0 -affinity <process>