




Created on Feb 10, 2020 by Boris Kovalev, Peter Rudenko

Introduction

This Reference Deployment Guide (RDG) demonstrates a multi-node cluster deployment procedure for RoCE/UCX-accelerated Apache Spark 2.4/3.0 over an NVIDIA end-to-end 100 Gb/s Ethernet solution.

The walk-through below covers the installation of a pre-built Spark 2.4.4 or Spark 3.0-preview2 standalone cluster of 15 physical nodes running Ubuntu 18.04.3 LTS, and includes a step-by-step procedure to prepare the network for RoCE traffic using NVIDIA recommended settings on both the host and switch sides.

The HDFS cluster includes 15 datanodes and 1 namenode server.


Overview

What is Apache Spark™?

Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

NVIDIA SparkUCX Plugin

Apache Spark™ replaces MapReduce

Apache Spark is a general purpose engine like MapReduce, but is designed to run much faster and with many more workloads.
One of the most interesting features of Spark is its efficient use of memory, whereas MapReduce works primarily with data stored on disk.

MapReduce, as implemented in Hadoop, is a popular and widely used engine, but it suffers from high latency. Furthermore, its batch-mode response time limits performance for many applications that process and analyze data.

Accelerating Spark Shuffle

Shuffling is the process of redistributing data across partitions (repartitioning) between stages of computation. Shuffle is a costly process that should be avoided when possible.

In Hadoop, shuffle writes intermediate files to disk, and these files are pulled by the next step/stage. With Spark shuffle, RDDs are kept in memory, which keeps data within reach; but when working with a cluster, network resources are required to fetch data blocks from remote workers, adding to the overall execution time.

Accelerating the network fetch for data blocks with RDMA (InfiniBand or RoCE) using SparkUCX plugin reduces the CPU usage and overall execution time.


SparkUCX Plugin

SparkUCX is a high-performance, scalable and efficient shuffle-manager plugin for Apache Spark. It utilizes RDMA (Remote Direct Memory Access) and other high-performance transports to reduce the CPU cycles needed for shuffle data transfers. It reduces memory usage by reusing memory for transfers instead of copying data multiple times down the traditional TCP stack.

The SparkUCX plugin is built to provide the best performance out of the box, and exposes multiple configuration options to further tune SparkUCX per job.

SparkRDMA and SparkUCX Comparison

SparkRDMA | SparkUCX
Based on the abandoned IBM DiSNI verbs package | Production-grade application based on the UCX high-level API, with dedicated R&D and a wide developer community
Supports IB/RoCE with RC only | Supports IB and RoCE with RC/DC/shared memory, and TCP as fallback
Not scalable: CQ and progress thread per connection | Scalable: CQ per executor
Communication progresses on a dedicated thread with high CPU consumption | Communications are initiated from application threads and progressed asynchronously by hardware
RDMA protocols are implemented in Java | Based on the standard UCX API and protocols, hiding the complexity of RDMA
Registers each data block with a different key | Registers all data as a single chunk
Showed improvement vs. worst-case TCP numbers | Showed improvement vs. best-case TCP numbers


Performance

TeraSort benchmark: 60% overall reduction and 30% reduction in total execution time. PageRank benchmark: 27% reduction in execution time.

Setup Overview

Before we start, make sure to familiarize yourself with the Apache Spark multi-node cluster architecture; see Overview - Spark 2.4.0 Documentation for more information.

Logical Design

Bill of Materials - BoM

In the distributed SparkUCX/HDFS configuration described in this guide, we are using the following hardware setup.


 

This document does not cover the servers' storage aspect. You should configure the servers with storage components suitable for your use case (data set size).

Physical Network

Connectivity

Network Configuration

In our reference design we will use a single port per server.

  • In the case of a single-port NIC, we wire the available port.
  • In the case of a dual-port NIC, we wire the 1st port to an Ethernet switch and leave the 2nd port unused.

We will cover the procedure later in the NVIDIA OFED Installation section.

Each server is connected to the SN2700 switch by a 100GbE copper cable.

The switch port connectivity in our case is as follows:

  • Port 1 – connected to the Namenode Server
  • Port 2 to 15 – connected to Worker Servers

Server names and network configuration are provided below.

Server Type | Server Name | Internal Network (100 GigE) | Management Network (1 GigE)
Node 01 (master) | clx-mld-41 | enp1f0: 31.31.31.41 | eno0: From DHCP (reserved)
Node 02 | clx-mld-42 | enp1f0: 31.31.31.42 | eno0: From DHCP (reserved)
Node 03 | clx-mld-43 | enp1f0: 31.31.31.43 | eno0: From DHCP (reserved)
Node 04 | clx-mld-44 | enp1f0: 31.31.31.44 | eno0: From DHCP (reserved)
Node 05 | clx-mld-45 | enp1f0: 31.31.31.45 | eno0: From DHCP (reserved)
Node 06 | clx-mld-46 | enp1f0: 31.31.31.46 | eno0: From DHCP (reserved)
Node 07 | clx-mld-47 | enp1f0: 31.31.31.47 | eno0: From DHCP (reserved)
Node 08 | clx-mld-48 | enp1f0: 31.31.31.48 | eno0: From DHCP (reserved)
Node 09 | clx-mld-49 | enp1f0: 31.31.31.49 | eno0: From DHCP (reserved)
Node 10 | clx-mld-50 | enp1f0: 31.31.31.50 | eno0: From DHCP (reserved)
Node 11 | clx-mld-51 | enp1f0: 31.31.31.51 | eno0: From DHCP (reserved)
Node 12 | clx-mld-52 | enp1f0: 31.31.31.52 | eno0: From DHCP (reserved)
Node 13 | clx-mld-53 | enp1f0: 31.31.31.53 | eno0: From DHCP (reserved)
Node 14 | clx-mld-54 | enp1f0: 31.31.31.54 | eno0: From DHCP (reserved)
Node 15 | clx-mld-55 | enp1f0: 31.31.31.55 | eno0: From DHCP (reserved)

Network Switch Configuration

If you are not familiar with NVIDIA switch software, follow the HowTo Get Started with NVIDIA Switches guide. For more information, please refer to the NVIDIA Onyx User Manual at https://docs.NVIDIA.com/category/onyx

First, update the switch OS to the latest Onyx software version. For more details, see HowTo Upgrade MLNX-OS Software on NVIDIA Switch Systems.

We will accelerate SparkUCX by using RoCE transport. 

In our deployment we will configure our network to be lossy. No additional configuration on the switch is needed.

For lossless configuration and for NVIDIA Onyx version 3.8.2004 and above run: 

Switch console
switch (config) #roce lossless

PFC and ECN, required to run RoCE on a lossless fabric, are configured by the "roce" command.


To see the RoCE configuration run:

Switch console
show roce


To monitor the RoCE counters run: 

Switch console
show interface ethernet counters roce

Master and Worker Servers Installation and Configuration

Prerequisites

Updating Ubuntu Software Packages on the Master and Worker Servers

To update/upgrade Ubuntu software packages, run the following commands:

Server Console
sudo apt-get update            # Fetches the list of available updates
sudo apt-get upgrade -y        # Strictly upgrades the current packages

Installing General Dependencies on the Master and Worker Servers

To install general dependencies, run the commands below or paste each line.

Server Console
sudo apt-get install git bc

Installing Java 8 (Recommended Oracle Java) on the Master and Worker Servers

Server Console
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
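
To confirm the installation on each node, check that the expected Java 8 runtime is reported:

Server Console
java -version        # expect a 1.8.x (Java 8) runtime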

Adding Entries to the Hosts File on the Master and Worker Servers

To edit the hosts file:

Server Console
sudo vim /etc/hosts

Now add entries for the namenoder (Master) and all Worker servers:

Server Console
127.0.0.1       localhost

127.0.1.1       clx-mld-42.local.domain      clx-mld-42

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
31.31.31.41  namenoder
31.31.31.42  clx-mld-42-r
31.31.31.43  clx-mld-43-r
31.31.31.44  clx-mld-44-r
31.31.31.45  clx-mld-45-r
31.31.31.46  clx-mld-46-r
31.31.31.47  clx-mld-47-r
31.31.31.48  clx-mld-48-r
31.31.31.49  clx-mld-49-r
31.31.31.50  clx-mld-50-r
31.31.31.51  clx-mld-51-r
31.31.31.52  clx-mld-52-r
31.31.31.53  clx-mld-53-r
31.31.31.54  clx-mld-54-r
31.31.31.55  clx-mld-55-r

Creating Network File System (NFS) Share

Install the NFS server on the Master server. Create the directory /share/sparkucx and export it to all Worker servers.
Install the NFS client on all Worker servers. Mount the /share/sparkucx export from the Master server to the local /share/sparkucx directory (the same path as on the Master).
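
The commands below are a minimal sketch of this NFS setup on Ubuntu 18.04; the export options and the 31.31.31.0/24 client range are assumptions that should be adapted to your environment.

Server Console
# On the Master server: install the NFS server and export the shared directory
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /share/sparkucx
echo "/share/sparkucx 31.31.31.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every Worker server: install the NFS client and mount the export at the same path
sudo apt-get install -y nfs-common
sudo mkdir -p /share/sparkucx
sudo mount namenoder:/share/sparkucx /share/sparkucx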

Configuring SSH

We will configure passwordless SSH access from Master to all Slave nodes.

  1. Install the OpenSSH server and client on both Master and Slave nodes

    Server Console
    sudo apt-get install openssh-server openssh-client
  2. On Master node - Generate Key Pairs

    Server Console
    ssh-keygen -t rsa -P ""
  3. Copy the content of the .ssh/id_rsa.pub file on the Master to the .ssh/authorized_keys file on all nodes (Master and Slave nodes), as shown in the sketch after this list.
  4. Check that you can access the Slave nodes from the Master:

    Server Console
    ssh clx-mld-41-r
    ssh clx-mld-42-r
    ssh clx-mld-43-r
    ...
    ssh clx-mld-55-r
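
A minimal sketch of step 3, assuming ssh-copy-id is available and password authentication is still enabled at this point (the hostnames follow the naming used in step 4 above):

Server Console
# Push the Master's public key to every node, including the Master itself
for i in $(seq 41 55); do
    ssh-copy-id -i ~/.ssh/id_rsa.pub clx-mld-$i-r
done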

Install Hadoop

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.

  • Download hadoop-2.7.7.tar.gz to the machine you will install it on.

    Server Console
    wget https://downloads.apache.org/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
  • Extract the compressed tar file

    Server Console
    cd /share/sparkucx/
    tar -zxvf hadoop-2.7.7.tar.gz

Downloading Apache Spark

  1. Go to Downloads | Apache Spark and download the Apache Spark™ package spark-2.4.4-bin-hadoop2.7.tgz to the /share/sparkucx shared folder.
  2. Choose a Spark release: 2.4.4 (Aug 30 2019) or 3.0.0-preview2 (Dec 23 2019).
  3. Choose a package type: Pre-built for Apache Hadoop 2.7.
  4. Download Spark: spark-2.4.4-bin-hadoop2.7.tgz
    https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  5. Verify this release using the 2.4.4 signatures, checksums and project release KEYS.

Cloning the HiBench Suite 7.0 Repository

To clone the latest HiBench repository, run the following command:

Server Console
cd /share/sparkucx
git clone https://github.com/intel-hadoop/HiBench.git

The preceding git clone command creates a subdirectory called “HiBench”. After cloning, you may optionally check out a specific branch (such as a release branch) by invoking the following commands:

Server Console
cd HiBench
git checkout master        # where master is the desired branch (by default)
cd <Path to NFS share>
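
HiBench's Spark workloads have to be compiled before they can be run. The sketch below assumes Maven is installed; the exact profile and version flags (-Psparkbench, -Dspark, -Dscala) should be verified against the build instructions of the HiBench release you checked out.

Server Console
cd /share/sparkucx/HiBench
# Build only the Spark benchmarks for Spark 2.4 / Scala 2.11 (flags per HiBench build docs)
mvn -Psparkbench -Dspark=2.4 -Dscala=2.11 clean package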

Installing MLNX_OFED for Ubuntu on the Master and Workers

This chapter describes how to install and test MLNX_OFED for Linux package on a single host machine with NVIDIA ConnectX®-5 adapter card installed. 
For more information click on NVIDIA OFED for Linux User Manual.

Downloading NVIDIA OFED

  1. Verify that the system has an NVIDIA network adapter (HCA/NIC) installed:

    Shell
    lspci -v | grep Mellanox

    The following example shows a system with an installed NVIDIA HCA:

  2. Download the ISO image according to your OS to your host.
    The image’s name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso.
    You can download it from: http://www.NVIDIA.com > Products > Software > InfiniBand/VPI Drivers > NVIDIA OFED Linux (MLNX_OFED) > Download.

  3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page:

    Shell
    md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

Installing NVIDIA OFED

MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

  • Discovers the currently installed kernel.
  • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack.
  • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel).
  • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware.

The installation script removes all previously installed NVIDIA OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

1. Log into the installation machine as root.

2. Copy the downloaded ISO to /root

3. Mount the ISO image on your machine:

Shell
mkdir /mnt/iso
mount -o loop /root/MLNX_OFED_LINUX-4.7-3.2.9.0-ubuntu18.04-x86_64.iso /mnt/iso
cd /mnt/iso

4. Run the installation script:

Shell
./mlnxofedinstall --all

5. Reboot after the installation finishes successfully:

Shell
# /etc/init.d/openibd restart
# reboot

ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX-5 VPI ports are initialized as InfiniBand ports.

6. Check that the port modes are Ethernet:

Shell
ibv_devinfo

7. If the output shows the ports in InfiniBand mode, you need to change the interface port type to Ethernet.

Change the mode to Ethernet using the mlxconfig tool after the driver is loaded.

* LINK_TYPE_P1=2 is the Ethernet mode

a. Start mst and see the port names:

Shell
mst start
mst status

b. Change the port mode to Ethernet:

Shell
mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2 
    Port 1 set to ETH mode
reboot

c. Query Ethernet devices and print information on them that is available for use from userspace:

Shell
ibv_devinfo

d. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the InfiniBand devices/ports:

Shell
ibdev2netdev

e. Configure the network interface:

Shell
ifconfig ens13f0 31.31.31.28 netmask 255.255.255.0

f. Insert the lines below into the /etc/network/interfaces file after the following lines:

Shell
vim /etc/network/interfaces

auto eno1
iface eno1 inet dhcp

The new lines:

Shell
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0

Example:

Shell
vim /etc/network/interfaces
 
auto eno1
iface eno1 inet dhcp
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0

g. Check the network configuration is set correctly:

Shell
vim /etc/network/interfaces
ifconfig -a
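
As a quick sanity check of the 100 GbE fabric, ping another node's high-speed address from the interface you just configured (the address below is an example taken from the connectivity table above):

Shell
ping -c 3 31.31.31.41      # replace with the 100 GbE address of any other node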

Lossless Fabric with L3 (DSCP) Configuration

This section provides a configuration example for NVIDIA devices installed with MLNX_OFED running RoCE over a lossless network, in DSCP-based QoS mode.

Sample for our environment:

Notations

<interface> refers to the parent interface (for example ens13f0)

<mlx-device> refers to mlx device (for example mlx5_0)

To find the two values above, run:

Shell
ibdev2netdev

<mst-device> refers to the MST device (for example /dev/mst/mt4121_pciconf0), found by running:

Shell
mst start
mst status
Configuration:
Shell
mlnx_qos -i ens13f0 --trust dscp
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
cma_roce_tos -d mlx5_0 -t 106
sysctl -w net.ipv4.tcp_ecn=1
mlnx_qos -i ens13f0 --pfc 0,0,0,1,0,0,0,0
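
To verify that the DSCP trust mode and PFC settings took effect, run mlnx_qos with only the interface argument; without set options it simply prints the current QoS configuration:

Shell
mlnx_qos -i ens13f0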

Validate MOFED

Check the "mofed" version and "uverbs":

Shell
ofed_info -s
MLNX_OFED_LINUX-4.7-3.2.9.0
ls /dev/infiniband/uverbs1

Run a bandwidth stress test between two servers.

Server

ib_write_bw -a -d mlx5_0 &

Client

ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits

This way you can run a bandwidth stress test over RoCE between the nodes.

Configuring the Environment

NOTE: The SparkUCX and Hadoop configuration steps are done from the Master node.

Installing Unified Communication X (UCX) on the Master and Worker Servers (from sources)

Server Console
cd /tmp
git clone -b v1.8.x git@github.com:openucx/ucx.git
cd ucx
mkdir build
./autogen.sh
cd build
../contrib/configure-devel --with-java --prefix=$PWD
make -j `nproc`
make install
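
To confirm the UCX build and that it detects the RDMA devices, the ucx_info utility installed into the build prefix can be used (paths below assume the /tmp/ucx/build prefix used above):

Server Console
/tmp/ucx/build/bin/ucx_info -v                   # print the UCX version that was built
/tmp/ucx/build/bin/ucx_info -d | grep -i mlx5    # list transports/devices; expect mlx5 entries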

Untar the Spark archive

Server Console
cd /share/sparkucx
tar -xzvf spark-2.4.4-bin-hadoop2.7.tgz

All the scripts, jars, and configuration files are available in the newly created directory “spark-2.4.4-bin-hadoop2.7”.

Unpack the SparkUCX jar file to the /tmp directory and copy it to the Spark directory:

spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar.zip

Server Console
cp /tmp/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
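
For Spark to actually load the plugin, the SparkUCX jar must be on the driver and executor classpaths and UcxShuffleManager must be selected as the shuffle manager. A minimal sketch of the corresponding spark-defaults.conf entries, mirroring the settings used in Appendix A (compression options are discussed in the Performance Tips section):

Server Console
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf

# Entries to add:
spark.driver.extraClassPath    /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
spark.executor.extraClassPath  /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
spark.shuffle.manager          org.apache.spark.shuffle.UcxShuffleManager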

Spark Configuration

  • Update your bash file, run:

    Server Console
    vim ~/.bashrc

    This opens your bash file in a text editor. Scroll to the bottom and add these lines:

    Server Console
    export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
    export SPARK_HOME=/share/sparkucx/spark-2.4.4-bin-hadoop2.7/


  • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.

    Server Console
    source ~/.bashrc
  • Check that the paths have been properly modified.

    Server Console
    echo $JAVA_HOME
    echo $SPARK_HOME
    echo $LD_LIBRARY_PATH
  • Edit spark-env.sh

Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/) and set the following parameters.
Create a copy of the spark-env.sh template and rename it. Add the master hostname, interface and spark_tmp directory.

Server Console
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)

#export SPARK_MASTER_HOST=spark1-r
export SPARK_MASTER_HOST=namenoder
export SPARK_LOCAL_IP=`/sbin/ip addr show enp1f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
export SPARK_LOCAL_DIRS=/data/spark-tmp

  • Add slaves
    Create a copy of the slaves configuration file template, rename it to slaves (in $SPARK_HOME/conf/) and add the following entries:
    $ vim /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf/slaves
    clx-mld-41-r
    clx-mld-42-r
    clx-mld-43-r
    ...
    clx-mld-55-r

Hadoop Configuration

  • Copy conf/slaves to hadoop-2.7.7/etc/hadoop/slaves

    Server Console
    cp /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf/slaves /share/sparkucx/hadoop-2.7.7/etc/hadoop/slaves
  • Create the data directory used by HDFS on all nodes

    Server Console
    sbin/slaves.sh mkdir /data/hadoop_tmp  # on NVMe disk
  • Edit current_config/core-site.xml. Add both property tags inside the <configuration> element.

    Server Console
    cd /share/sparkucx/hadoop-2.7.7
    vim current_config/core-site.xml
    
    
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
    -->
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                    <name>fs.default.name</name>
                    <value>hdfs://namenoder:9000</value>
            </property>
            <property>
                    <name>hadoop.tmp.dir</name>
                    <value>/data/hadoop_tmp</value>
            </property>
    </configuration>
  • Update your bash file, run:

    Server Console
    vim ~/.bashrc

    This opens your bash file in a text editor. Scroll to the bottom and add this line:

    Server Console
    export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
  • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.

    Server Console
    source ~/.bashrc
  • Check that the paths have been properly modified.

    Server Console
    echo $HADOOP_HOME
  • Edit current_config/hadoop-env.sh
    Now edit the configuration file current_config/hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) and set the following parameters:
    Add the HADOOP_HOME directory.

    Server Console
    vim current_config/hadoop-env.sh
    
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # Set Hadoop-specific environment variables here.
    # The only required environment variable is JAVA_HOME.  All others are
    # optional.  When running a distributed configuration it is best to
    # set JAVA_HOME in this file, so that it is correctly defined on
    # remote nodes.
    # The java implementation to use.
    export JAVA_HOME=${JAVA_HOME}
    export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

HDFS Configuration

  • Edit  current_config/hdfs-site.xml

    Server Console
    vim current_config/hdfs-site.xml
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                    <name>dfs.datanode.dns.interface</name>
                    <value>enp1f0</value>
            </property>
            <property>
                    <name>dfs.replication</name>
                    <value>1</value>
            </property>
            <property>
                    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
                    <value>false</value>
            </property>
            <property>
                    <name>dfs.permissions</name>
                    <value>false</value>
            </property>
            <property>
                    <name>dfs.datanode.data.dir</name>
                    <value>/data/hadoop_tmp</value>
            </property>
    </configuration>

YARN Configuration

We do not use YARN in our example but you can use it in your deployment.

  • Edit  current_config/yarn-site.xml 

    Server Console
    vim current_config/yarn-site.xml
    
    <?xml version="1.0"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    <configuration>
    <!-- Site specific YARN configuration properties -->
           <property>
                    <name>yarn.nodemanager.aux-services</name>
                    <value>mapreduce_shuffle</value>
            </property>
            <property>
                    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.resource-tracker.address</name>
                    <value>namenoder:8025</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.scheduler.address</name>
                    <value>namenoder:8030</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.admin.address</name>
                    <value>namenoder:8032</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.webapp.address</name>
                    <value>namenoder:8034</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.address</name>
                    <value>namenoder:8101</value>
            </property>
            <property>
                    <name>yarn.nodemanager.resource.memory-mb</name>
                    <value>40960</value>
            </property>
            <property>
                    <name>yarn.scheduler.maximum-allocation-mb</name>
                    <value>40960</value>
            </property>
            <property>
                    <name>yarn.scheduler.minimum-allocation-mb</name>
                    <value>2048</value>
            </property>
           <property>
                   <name>yarn.nodemanager.resource.cpu-vcores</name>
                   <value>20</value>
            </property>
            <property>
                    <name>yarn.nodemanager.disk-health-checker.enable</name>
                    <value>false</value>
            </property>
            <property>
                    <name>yarn.nodemanager.log-dirs</name>
                    <value>/tmp/yarn_nm/</value>
            </property>
            <property>
                    <name>yarn.log-aggregation-enable</name>
                    <value>true</value>
            </property>
    </configuration>
  • Edit current_config/yarn-env.sh. Specify the Hadoop home directory.

    Server Console
    vim current_config/yarn-env.sh
    ...
    export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
    RDMA_IP=`/usr/sbin/ip addr show ens13f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1` 
    export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
    ....
  • Copy  current_config/* to etc/hadoop/.

    Server Console
    cp current_config/* etc/hadoop/
  • Execute the following command on the NameNode host machine to format HDFS:

    Server Console
    bin/hdfs namenode -format

Start Spark Standalone Cluster

Run SparkUCX on Top of a Hadoop Cluster

To start the Spark services, run the following commands on the Master:

Server Console
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7
sbin/start-all.sh
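
The daemon lists in the next section also include the HDFS and YARN services (NameNode, DataNode, ResourceManager, NodeManager). If these are not already running, they can be started from the Hadoop directory with the standard Hadoop 2.7 scripts (YARN is optional in this setup):

Server Console
cd /share/sparkucx/hadoop-2.7.7
sbin/start-dfs.sh        # starts the NameNode and all DataNodes
sbin/start-yarn.sh       # optional: starts the ResourceManager and NodeManagers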

Check whether Services have been Started

To check daemons on Master, run:

Server Console
jps

33970 Jps
47928 ResourceManager
48121 NodeManager
47529 DataNode
47246 NameNode

 To check daemons on Slaves, run:

Server Console
jps

1846 NodeManager
16491 Jps
1659 DataNode


To check HDFS and YARN (optional) statuses, run:

Server Console
cd /share/sparkucx/hadoop-2.7.7
bin/hdfs dfsadmin -report | grep Name
Name: 31.31.31.43:50010 (clx-mld-43-r)
Name: 31.31.31.47:50010 (clx-mld-47-r)
Name: 31.31.31.48:50010 (clx-mld-48-r)
Name: 31.31.31.45:50010 (clx-mld-45-r)
Name: 31.31.31.42:50010 (clx-mld-42-r)
Name: 31.31.31.41:50010 (namenoder)
...
Name: 31.31.31.55:50010 (clx-mld-55-r)

bin/hdfs dfsadmin -report | grep Name -c

15

bin/yarn node -list

18/02/20 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at namenoder/31.31.31.41:8101

Total Nodes:15

         Node-Id      Node-State Node-Http-Address Number-of-Running-Containers

clx-mld-42-r:34873         RUNNING clx-mld-42-r:8042                            0

clx-mld-47-r:35045         RUNNING clx-mld-47-r:8042                            0

clx-mld-48-r:44996         RUNNING clx-mld-48-r:8042                            0

clx-mld-46-r:45432         RUNNING clx-mld-46-r:8042                            0

clx-mld-45-r:41307         RUNNING clx-mld-45-r:8042                            0

...

clx-mld-55-r:44311         RUNNING clx-mld-55-r:8042                            0

namenoder:41409         RUNNING    namenoder:8042                          0

Stop the Cluster

To stop the Spark services, run the following commands on the Master:

Server Console
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7
sbin/stop-all.sh

Performance Tuning for NVIDIA Adapters

It is recommended to run the mlnx_tune utility, which runs several system checks and reports any settings that may cause performance degradation.

Act on its findings as appropriate.

For more information about mlnx_tune, read this post: HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool

The command also displays the CPU cores used by the network driver. This information will be used for Spark performance tuning.

Server Console
sudo mlnx_tune

2017-08-16 14:47:17,023 INFO Collecting node information

2017-08-16 14:47:17,023 INFO Collecting OS information

2017-08-16 14:47:17,026 INFO Collecting CPU information

2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information

. . .

Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

. . .
2018-02-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_180416_144716.log


After running the mlnx_tune command, it is highly recommended to set the cpuList parameter.
Change the spark.conf file to use the NUMA cores associated with the NVIDIA device.

Server Console
Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]  
spark.shuffle.rdma.cpuList 0-15

More in-depth performance resources can be found in the NVIDIA Community post Performance Tuning for NVIDIA Adapters.

SparkUCX Performance Tips

  • Compression:
    Spark enables compression as the default runtime option. 
    Using compression results in smaller packets being sent between the nodes, but at the expense of higher CPU utilization to compress the data.
    Due to the high performance and low CPU overhead network properties of an RDMA network, it is recommended to disable compression when using SparkUCX. 
    In your spark.conf file set:

    Server Console
    spark.shuffle.compress           false
    spark.shuffle.spill.compress     false

    By disabling compression, you will be able to reclaim precious CPU cycles that were previously used for data compression/decompression, 
    and will also see additional performance benefits in the RDMA data transfer speeds.

  • Disk Media:
    In order to see the highest and most consistent performance results possible, it is recommended to use the highest performance disk media available. 
    Using a ramdrive or NVMe device for the spark-tmp and hadoop tmp files should be explored whenever possible.
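
    spark-env.sh above points SPARK_LOCAL_DIRS at /data/spark-tmp; a minimal sketch for creating that directory on every node, assuming /data is backed by your NVMe (or ramdrive) media and reusing the Hadoop slaves.sh helper used earlier:

    Server Console
    cd /share/sparkucx/hadoop-2.7.7
    sbin/slaves.sh mkdir -p /data/spark-tmp      # create the Spark scratch directory on all nodes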


Congratulations, you now have an RDMA-accelerated Spark cluster ready to work.

Now we will run the HiBench Suite benchmarks for TCP vs. RoCE comparison.

Appendix A: Running HiBench with SparkUCX

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations. 
It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc.

Environment

  • Instance type and Environment: See setup overview
  • OS: Ubuntu 18.04.3 LTS
  • Apache Hadoop: 2.7.4, HDFS (1 namenode, 14 datanodes)
  • Spark: 2.4.4 standalone 15 nodes
  • Benchmark : Setup HiBench
  • Test Date: Feb 2020

Benchmark Runs

Steps to reproduce the TeraSort result:

  1. Setup HiBench

  2. Configure Hadoop and Spark settings in HiBench conf directory.

  3. In HiBench/conf/hibench.conf set:

    Server Console
    hibench.scale.profile bigdata 
    # Mapper number in hadoop, partition number in Spark 
    hibench.default.map.parallelism 3000 
    # Reducer number in hadoop, shuffle partition number in Spark 
    hibench.default.shuffle.parallelism 15000
  4. Set in HiBench/conf/workloads/micro/terasort.conf:

    Server Console
    hibench.terasort.bigdata.datasize               1890000000
  5. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh


  6. Add to HiBench/conf/spark.conf:

    Server Console
    spark.driver.extraClassPath /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar:/share/sparkucx/spark-2.4.4-bin-hadoop2.7/build/lib/
    spark.executor.extraClassPath /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar:/share/sparkucx/spark-2.4.4-bin-hadoop2.7/build/lib/
    spark.shuffle.manager org.apache.spark.shuffle.UcxShuffleManager 
    spark.shuffle.compress false 
    spark.shuffle.spill.compress false
    spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.compat.spark_3_0.UcxLocalDiskShuffleDataIO
  7. Run HiBench/bin/workloads/micro/terasort/spark/run.sh

Steps to reproduce the PageRank result:

  1. In HiBench/conf/hibench.conf set:

    Server Console
    hibench.scale.profile bigdata 
    # Mapper number in hadoop, partition number in Spark 
    hibench.default.map.parallelism 3000
    # Reducer number in hadoop, shuffle partition number in Spark 
    hibench.default.shuffle.parallelism 15000
  2. Run HiBench/bin/workloads/micro/sort/prepare/prepare.sh and HiBench/bin/workloads/micro/sort/spark/run.sh

Benchmark Results


Done!

About the Authors

About Boris Kovalev

For the past several years, Boris Kovalev has worked as a solution architect at NVIDIA, responsible for complex machine learning and advanced VMware-based cloud research and design. Previously, he spent more than 15 years as a senior consultant and solution architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions, which are available on the NVIDIA Documents website.

About Peter Rudenko


Peter Rudenko is a software engineer on the High Performance Computing team, focusing on accelerating data-intensive applications, developing the UCX communication library and various big data solutions.


Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2022 NVIDIA Corporation & affiliates. All Rights Reserved.