NVIDIA WinOF-2 Documentation v3.0

Windows MPI (MS-MPI)

Message Passing Interface (MPI) provides virtual topology, synchronization, and communication functionality between a set of processes. MPI enables running one process on several hosts.

  • Windows MPI runs over the following protocols:

    • Sockets (Ethernet or IPoIB)

    • Network Direct (ND) over Ethernet and InfiniBand

  • Install HPC (Build: 4.0.3906.0).

  • Validate traffic (ping) between all MPI hosts (see the sketch after this list).

  • Every MPI client needs to run the smpd process, which opens the MPI channel.

  • The MPI initiator server needs to run mpiexec. If the initiator is also a client, it should also run smpd.
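
A minimal PowerShell sketch of the ping validation mentioned above; the host addresses are hypothetical placeholders, and a plain ping from each host works just as well.

    # Hypothetical list of MPI hosts - replace with the addresses of your cluster
    $mpiHosts = "11.11.146.101", "11.21.147.101"
    foreach ($h in $mpiHosts) {
        # Test-Connection sends an ICMP echo request; -Quiet returns $true/$false
        if (Test-Connection -ComputerName $h -Count 1 -Quiet) {
            Write-Output "$h is reachable"
        } else {
            Write-Output "$h is NOT reachable"
        }
    }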

  1. Run the following command on each MPI client.

    start smpd -d -p <port>

  2. In case of MPI over ND, install the ND provider on each MPI client.

  3. Run the following command on the MPI server.

    mpiexec.exe -p <smpd_port> -hosts <num_of_hosts> <hosts_ip_list> -env MPICH_NETMASK <network_ip/subnet> -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND <0/1> -env MPICH_DISABLE_SOCK <0/1> -affinity <process>

Directing MPI traffic to a specific QoS priority may be delayed due to the following:

  • Except for NetDirectPortMatchCondition, the QoS PowerShell cmdlets for NetworkDirect traffic do not support port ranges. Therefore, NetworkDirect traffic cannot be directed to ports 1-65536.

  • The MSMPI directive to control the port range (namely, MPICH_PORT_RANGE 3000,3030) does not work for ND, so MSMPI chooses a random port.

Set the default QoS policy to the desired priority (Note: this priority should be lossless all the way through the switches).

  1. Set the SMB policy to a desired priority only if SMB traffic is running.

  2. [Recommended] Direct ALL TCP/UDP traffic to a lossy priority by using the "IPProtocolMatchCondition" (see the sketch below the warning).

    Warning

    TCP is used for the MPI control channel (smpd), while UDP is used for other services such as remote desktop.
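
The commands below are a minimal sketch of steps 1-2 above; the same policies appear again in the PFC Example later in this section, and the priority values (3 as lossless, 1 as lossy) are examples only.

    # Default QoS policy and SMB Direct traffic (NetworkDirect port 445) tagged with the lossless priority
    New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 3
    New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
    # All TCP and UDP traffic directed to a lossy priority
    New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
    New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 1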

Arista switches forward the PCP bits (i.e. the 802.1p priority within the VLAN tag) from ingress to egress, enabling any two end-nodes in the fabric to maintain the priority along the route.

In this case the packet from the sender goes out with priority X and reaches the far end-node with the same priority X.

Warning

The priority should be lossless in the switches.

To force MSMPI to work over ND rather than over sockets, add the following to the mpiexec command:

-env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1

Configure all the hosts in the cluster with identical PFC (see the PFC example below).

  1. Run the WHCK ND-based traffic tests to check PFC (ndrping, ndping, ndrpingpong, ndpingpong).

  2. Validate the PFC counters during the runtime of the ND tests, using the “Mellanox Adapter QoS Counters” set in perfmon.

  3. Install the same version of HPC Pack across the entire cluster.

    Note

    A version mismatch in HPC Pack 2012 can cause MPI to hang.

  4. Validate the MPI base infrastructure with simple commands, such as “hostname” (see the example below).
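
A quick way to exercise the last step above is to launch a trivial command such as hostname through mpiexec; the port and host addresses below are illustrative only (the same values are used in the running examples at the end of this section).

    > mpiexec.exe -p 19020 -hosts 2 11.11.146.101 11.21.147.101 hostname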

PFC Example

In the example below, ND and NDK traffic is mapped to priority 3, which is configured as no-drop (lossless) in the switches, while ALL TCP/UDP traffic is directed to priority 1.

  • Install DCB (Data Center Bridging).

    Install-WindowsFeature Data-Center-Bridging

  • Remove all previous settings.

    Remove-NetQosTrafficClass
    Remove-NetQosPolicy -Confirm:$False

  • Set the DCBX Willing parameter to false as Mellanox drivers do not support this feature.

    Set-NetQosDcbxSetting -Willing 0

  • Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.
    In this example we used TCP/UDP priority 1, ND/NDK priority 3.

    New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
    New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 3
    New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
    New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 1

  • Enable PFC on priority 3.

    Enable-NetQosFlowControl -Priority 3

  • Disable Priority Flow Control (PFC) for all priorities except 3.

    Disable-NetQosFlowControl -Priority 0,1,2,4,5,6,7

  • Enable QoS on the relevant interface.

    Enable-NetAdapterQos -Name <interface name>
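
Once the settings above are applied, the resulting configuration can be reviewed with the standard NetQos cmdlets. This is an optional verification sketch; the adapter name is a hypothetical placeholder.

    Get-NetQosPolicy                       # lists the SMB/DEFAULT/TCP/UDP policies created above
    Get-NetQosFlowControl                  # priority 3 should be Enabled, all others Disabled
    Get-NetQosDcbxSetting                  # Willing should be False
    Get-NetAdapterQos -Name "Ethernet 2"   # hypothetical adapter name; shows the operational QoS state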

Running MPI Command Examples

  • Running the MPI Pallas test over ND.

    > mpiexec.exe -p 19020 -hosts 4 11.11.146.101 11.21.147.101 11.21.147.51 11.11.145.101 -env MPICH_NETMASK 11.0.0.0/255.0.0.0 -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 0 -env MPICH_DISABLE_SOCK 1 -affinity c:\test1.exe

  • Running the MPI Pallas test over Ethernet.

    > mpiexec.exe -p 19020 -hosts 4 11.11.146.101 11.21.147.101 11.21.147.51 11.11.145.101 -env MPICH_NETMASK 11.0.0.0/255.0.0.0 -env MPICH_ND_ZCOPY_THRESHOLD -1 -env MPICH_DISABLE_ND 1 -env MPICH_DISABLE_SOCK 0 -affinity c:\test1.exe

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.