
Introduction

This Reference Deployment Guide (RDG) demonstrates a multi-node cluster deployment procedure for RoCE-accelerated Apache Spark 2.2.0 over an NVIDIA end-to-end 100 Gb/s Ethernet solution.

This document describes the process of installing a pre-built Spark 2.2.0 standalone cluster of 17 physical nodes running Ubuntu 16.04.3 LTS.

The HDFS cluster includes 1 namenode server and 16 datanodes.

We will show how to prepare the network for RoCE traffic according to NVIDIA recommendations and provide all the steps required on both the host and switch sides.



Overview

What is Apache Spark™?

Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.


Apache Spark™ replaces MapReduce

MapReduce, as implemented in Hadoop, is a popular and widely used engine. Despite its popularity, MapReduce suffers from high latency, and its batch-mode response is painful for many applications that process and analyze data. 
Apache Spark is a general purpose engine like MapReduce, but is designed to run much faster and with many more workloads. 
One of the most interesting features of Spark is its efficient use of memory, while MapReduce has always worked primarily with data stored on disk.

Accelerating Spark Shuffle

Shuffling is the process of redistributing data across partitions (that is, re-partitioning) between stages of computation. It is a costly process that should be avoided when possible. 
In Hadoop, shuffle writes intermediate files to the disk. These files are pulled by the next step/stage. With Spark shuffle, datasets are kept in memory, keeping the data within reach. 
However, when working in a cluster, network resources are required for fetching data blocks, adding to the overall execution time. 
The SparkRDMA plugin accelerates the network fetch of data blocks using RDMA/RoCE technology, which reduces CPU usage and overall execution time.

SparkRDMA Plugin

SparkRDMA plugin is a high-performance, scalable and efficient ShuffleManager open-source plugin for Apache Spark.

It utilizes RDMA/RoCE (Remote Direct Memory Access / RDMA over Converged Ethernet) technology to reduce the CPU cycles needed for shuffle data transfers, and reduces memory usage by reusing memory for transfers rather than copying data multiple times as the traditional TCP stack does.
SparkRDMA plugin is built to provide the best performance out-of-the-box. Additionally, it provides multiple configuration options to further tune SparkRDMA on a per-job basis.

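For reference, a minimal spark-defaults.conf sketch that enables the plugin (the jar path is a placeholder; the full, step-by-step setup is shown in Appendix A):

spark.driver.extraClassPath   /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager         org.apache.spark.shuffle.rdma.RdmaShuffleManager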

Performance

The TeraSort benchmark shows a 1.53x overall reduction in execution time, and the Sort benchmark shows a 1.28x reduction in execution time.

Setup Overview

Before you start, make sure you are familiar with the Apache Spark multi-node cluster architecture; see Overview - Spark 2.2.0 Documentation for more information.

Logical Design

Bill of Materials - BOM

In the distributed Spark/HDFS configuration described in this guide, we use the following hardware specification.

 
 

This document does not cover the servers' storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).

Physical Network Connections


Network Configuration

In our reference deployment we use a single port per server. With a single-port NIC, we wire the available port; with a dual-port NIC, we wire the first port to the Ethernet switch and leave the second port unused.

We will cover the procedure later in the Installing NVIDIA OFED section.

Each server is connected to the SN2700 switch by a 100GbE copper cable.

The switch port connectivity in our case is as follows:

  • Port 1 – connected to the NameNode server
  • Ports 2-17 – connected to the Worker servers

The server names and network configuration are listed below.

Server type         Server name    Internal network - 100 GigE    Management network - 1 GigE
Node 01 (master)    clx-mld-41     enp1f0: 31.31.31.41            eno0: From DHCP (reserved)
Node 02             clx-mld-42     enp1f0: 31.31.31.42            eno0: From DHCP (reserved)
Node 03             clx-mld-43     enp1f0: 31.31.31.43            eno0: From DHCP (reserved)
Node 04             clx-mld-44     enp1f0: 31.31.31.44            eno0: From DHCP (reserved)
Node 05             clx-mld-45     enp1f0: 31.31.31.45            eno0: From DHCP (reserved)
Node 06             clx-mld-46     enp1f0: 31.31.31.46            eno0: From DHCP (reserved)
Node 07             clx-mld-47     enp1f0: 31.31.31.47            eno0: From DHCP (reserved)
Node 08             clx-mld-48     enp1f0: 31.31.31.48            eno0: From DHCP (reserved)
Node 09             clx-mld-49     enp1f0: 31.31.31.49            eno0: From DHCP (reserved)
Node 10             clx-mld-50     enp1f0: 31.31.31.50            eno0: From DHCP (reserved)
Node 11             clx-mld-51     enp1f0: 31.31.31.51            eno0: From DHCP (reserved)
Node 12             clx-mld-52     enp1f0: 31.31.31.52            eno0: From DHCP (reserved)
Node 13             clx-mld-53     enp1f0: 31.31.31.53            eno0: From DHCP (reserved)
Node 14             clx-mld-54     enp1f0: 31.31.31.54            eno0: From DHCP (reserved)
Node 15             clx-mld-55     enp1f0: 31.31.31.55            eno0: From DHCP (reserved)
Node 16             clx-mld-56     enp1f0: 31.31.31.56            eno0: From DHCP (reserved)
Node 17             clx-mld-57     enp1f0: 31.31.31.57            eno0: From DHCP (reserved)

Network Switch Configuration


Please start with the HowTo Get Started with NVIDIA switches guide if you are not familiar with NVIDIA switch software. For more information, please refer to the MLNX-OS User Manual located at support.mellanox.com or at www.mellanox.com -> Switches.

As a first step, update your switch to the latest ONYX OS software, following the HowTo Upgrade MLNX-OS Software on NVIDIA switch systems guide.


We will accelerate Spark by using the RDMA transport. 
There are several industry-standard network configurations for RoCE deployment.

You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment for our recommendations and instructions.

In our deployment we will configure the network to be lossless and will use DSCP on both the host and switch sides.

Below is our switch configuration, which you can use as a reference. You can copy/paste it to your switch, but please be aware that this is a clean switch configuration, and applying it may overwrite your existing configuration.


swx-mld-1-2 [standalone: master] > enable
swx-mld-1-2 [standalone: master] # configure terminal
swx-mld-1-2 [standalone: master] (config) # show running-config
##
## Running database "initial"
## Generated at 2018/03/10 09:38:38 +0000
## Hostname: swx-mld-1-2  
##
##
## Running-config temporary prefix mode setting
##                                          
no cli default prefix-modes enable          
##
## License keys
##          
   license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD
##
## Interface Ethernet buffer configuration
##
   traffic pool roce type lossless
   traffic pool roce memory percent 50.00
   traffic pool roce map switch-priority 3
##
## LLDP configuration
##
   lldp
##
## QoS switch configuration
##
   interface ethernet 1/1-1/32 qos trust L3
   interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500
##
## DCBX ETS configuration
##
   interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict
##
## Other IP configuration
##
   hostname swx-mld-1-2
##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********
##
## Network management configuration
##
# web proxy auth basic password ********
##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991
# (public-cert config omitted since private-key config is hidden)
##
## Persistent prefix mode setting
##
cli default prefix-modes enable


Master and Worker Servers Installation and Configuration

Prerequisites

Updating Ubuntu Software Packages on the Master and Worker Servers

To update/upgrade Ubuntu software packages, run the commands below.

sudo apt-get update            # Fetches the list of available updates

sudo apt-get upgrade -y        # Strictly upgrades the current packages


Installing General Dependencies on the Master and Worker Servers

To install general dependencies, run the commands below or paste each line.

sudo apt-get install git bc

Installing Java 8 (Recommended Oracle Java) on the Master and Worker Servers

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
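You can verify the Java installation afterwards:

java -version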

Adding Entries to the Hosts Files on the Master and Worker Servers

Edit the host files:

sudo vim /etc/hosts

Now add the entries for the NameNode (master) and Worker servers:

127.0.0.1       localhost

127.0.1.1       clx-mld-42.local.domain      clx-mld-42

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
31.31.31.41  namenoder
31.31.31.42  clx-mld-42-r
31.31.31.43  clx-mld-43-r
31.31.31.44  clx-mld-44-r
31.31.31.45  clx-mld-45-r
31.31.31.46  clx-mld-46-r
31.31.31.47  clx-mld-47-r
31.31.31.48  clx-mld-48-r
31.31.31.49  clx-mld-49-r
31.31.31.50  clx-mld-50-r
31.31.31.51  clx-mld-51-r
31.31.31.52  clx-mld-52-r
31.31.31.53  clx-mld-53-r
31.31.31.54  clx-mld-54-r
31.31.31.55  clx-mld-55-r
31.31.31.56  clx-mld-56-r
31.31.31.57  clx-mld-57-r


Required Software

Prior to installing and configuring the Apache Spark and SparkRDMA environment, the following software must be downloaded.

 

Creating Network File System (NFS) Share

Install the NFS server on the Master server. Create the directory /share/spark_rdma and export it to all Worker servers.
Install the NFS client on all Worker servers. Mount the /share/spark_rdma export from the Master server at the local /share/spark_rdma directory (the same path as on the Master).
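A minimal shell sketch of this NFS setup on Ubuntu 16.04 (the export subnet and mount options here are assumptions; adjust them to your environment):

# On the Master (NFS server):
sudo apt-get install nfs-kernel-server
sudo mkdir -p /share/spark_rdma
echo "/share/spark_rdma 31.31.31.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every Worker (NFS client):
sudo apt-get install nfs-common
sudo mkdir -p /share/spark_rdma
sudo mount namenoder:/share/spark_rdma /share/spark_rdma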

Configuring SSH

We will configure passwordless SSH access from Master to all Slave nodes.

  1. Install the OpenSSH server and client on the Master and all Slave nodes

    sudo apt-get install openssh-server openssh-client
  2. On Master node - Generate Key Pairs

    ssh-keygen -t rsa -P ""
  3. Copy the content of the .ssh/id_rsa.pub file on the Master to the .ssh/authorized_keys file on all nodes (Master and Slave nodes); see the sketch after this list
  4. Check that you can access the Slave nodes from the Master

    ssh clx-mld-41-r
    ssh clx-mld-42-r
    ssh clx-mld-43-r
    ...
    ssh clx-mld-57-r
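A sketch for step 3, assuming password authentication is still enabled and the host names match the /etc/hosts entries added earlier:

for host in clx-mld-{41..57}-r; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub $host
done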

Downloading Apache Spark

  1. Go to Downloads | Apache Spark and download the spark-2.2.0-bin-hadoop2.7.tgz package to the /share/spark_rdma shared folder.
  2. Choose a Spark release: 2.2.0 (Jul 11 2017)
  3. Choose a package type: Pre-built for Apache Hadoop 2.7 and later
  4. Download Spark: spark-2.2.0-bin-hadoop2.7.tgz
    https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
      
  5. Verify this release using the 2.2.0 signatures and checksums and the project release KEYS.
Note: Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support.
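A shell sketch of steps 4-5, assuming the Apache archive mirror for the 2.2.0 release; compare the checksum against the value published on the download page:

cd /share/spark_rdma
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
md5sum spark-2.2.0-bin-hadoop2.7.tgz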

Downloading NVIDIA SparkRDMA 2.0

Download the SparkRDMA Release Version 2.0 and save it in the  /share/spark_rdma shared folder.

SparkRDMA Release Version 2.0


All-new implementation of SparkRDMA, redesigned from the ground up to further increase scalability, robustness and most importantly - performance.
Among the new features and capabilities introduced in this release:

  • All-new Metadata (Map Output) fetching protocol - now allows scaling to the tens of thousands of partitions, with superior performance and recoverability
  • Software-level flow control in RdmaChannel - eliminates pause storms in the fabric
  • ODP (On-Demand Paging) support - improves memory efficiency

Attached are pre-built binaries. Please follow the README page for instructions.

Cloning the HiBench Suite 7.0 Repository

To clone the latest HiBench repository, run the following command:

cd /share/spark_rdma
git clone https://github.com/intel-hadoop/HiBench.git

The preceding git clone command creates a subdirectory called “HiBench”. After cloning, you may optionally build a specific branch (such as a release branch) by invoking the following commands:


cd HiBench
git checkout master        # where master is the desired branch (by default)

cd <Path to NFS share>


Installing MLNX_OFED for Ubuntu on the Master and Workers

This chapter describes how to install and test the MLNX_OFED for Linux package on a host machine with an NVIDIA ConnectX®-5 adapter card installed. 
For more information, see the NVIDIA OFED for Linux User Manual.


Downloading NVIDIA OFED

  1. Verify that the system has a NVIDIA network adapter (HCA/NIC) installed:

    lspci -v | grep Mellanox

    The following example shows a system with an installed NVIDIA HCA:
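    Illustrative output (the PCI address and model line will differ on your system):

    82:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]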
     

  2. Download the ISO image matching your OS to your host.
    The image name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso
    You can download it from: http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.

  3. Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page:

    md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso

Installing NVIDIA OFED

MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:

  • Discovers the currently installed kernel.
  • Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack.
  • Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel).
  • Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware.

The installation script removes all previously installed NVIDIA OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.

1. Log into the installation machine as root.

2. Copy the downloaded ISO to /root

3. Mount the ISO image on your machine:

mkdir /mnt/iso
mount -o loop /root/MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.iso /mnt/iso
cd /mnt/iso

4. Run the installation script:

./mlnxofedinstall --all

5. Reboot after the installation finishes successfully:

# /etc/init.d/openibd restart
# reboot

ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX-5 VPI ports are initialized as InfiniBand ports.

6. Check that the port modes are Ethernet:

ibv_devinfo
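Abridged, illustrative output; the field to check is link_layer, which reads either InfiniBand or Ethernet:

hca_id: mlx5_0
        ...
        port:   1
                state:          PORT_ACTIVE (4)
                link_layer:     InfiniBand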

7. If link_layer shows InfiniBand, you need to change the interface port type to Ethernet.

Change the mode to Ethernet using the mlxconfig script after the driver is loaded.

* LINK_TYPE_P1=2 selects Ethernet mode

a. Start mst and see ports names:

mst start
mst status

b. Change the port mode to Ethernet:

mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2 
    Port 1 set to ETH mode
reboot

c. Query Ethernet devices and print information on them that is available for use from userspace:

ibv_devinfo

d. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the InfiniBand devices/ports:

ibdev2netdev

e. Configure the network interface:

ifconfig ens13f0 31.31.31.28 netmask 255.255.255.0

f. Insert the lines below into the /etc/network/interfaces file after the following lines:

vim /etc/network/interfaces

auto eno1
iface eno1 inet dhcp

The new lines:

auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0

Example:

vim /etc/network/interfaces
 
auto eno1
iface eno1 inet dhcp
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0

g. Check that the network configuration is set correctly:

ifconfig -a

Lossless Fabric with L3 (DSCP) Configuration

This section provides a configuration example for NVIDIA devices installed with MLNX_OFED running RoCE over a lossless network, in DSCP-based QoS mode.

Sample for our environment:

Notations:

  • <interface> refers to the parent interface (for example, ens13f0)
  • <mlx-device> refers to the mlx device (for example, mlx5_0); find both by running ibdev2netdev
  • <mst-device> refers to the MST device (for example, /dev/mst/mt4121_pciconf0); find it by running mst start followed by mst status

Configuration:

mlnx_qos -i ens13f0 --trust dscp
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
cma_roce_tos -d mlx5_0 -t 106
sysctl -w net.ipv4.tcp_ecn=1
mlnx_qos -i ens13f0 --pfc 0,0,0,1,0,0,0,0
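To verify that the settings took effect, you can re-run mlnx_qos without set options and read back the traffic class (interface and device names as in the notations above):

mlnx_qos -i ens13f0
cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class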

Validate MOFED

Check the MLNX_OFED version and the uverbs devices:

ofed_info -s
MLNX_OFED_LINUX-4.5-1.0.1.0
ls /dev/infiniband/uverbs1

Run a bandwidth stress test over RoCE between two nodes.

Server

ib_write_bw -a -d mlx5_0 &

Client

ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits

In this way you can run a bandwidth stress test over RoCE between the nodes.

Configuring the Environment

NOTE: The Spark, SparkRDMA plugin, and Hadoop configuration steps are performed from the Master node.

Untar the Spark and SparkRDMA archives

cd /share/spark_rdma
tar -xzvf spark-2.2.0-bin-hadoop2.7.tgz
tar -xzvf spark-rdma-2.0.tgz

All the scripts, jars, and configuration files are available in the newly created directory "spark-2.2.0-bin-hadoop2.7".

Spark Configuration

  • Update your bash file.

    vim ~/.bashrc

    This will open your bash file in a text editor. Scroll to the bottom and add these lines:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
    export SPARK_HOME=/share/spark_rdma/spark-2.2.0-bin-hadoop2.7/


  • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.

    source ~/.bashrc
  • Check that the paths have been properly modified.

    echo $JAVA_HOME
    echo $SPARK_HOME
    echo $LD_LIBRARY_PATH
  • Edit spark-env.sh

Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/) and set the following parameters.
Create a copy of the spark-env.sh template and rename it. Add the master hostname, interface, and spark-tmp directory.

cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)

#export SPARK_MASTER_HOST=spark1-r
export SPARK_MASTER_HOST=namenoder
export SPARK_LOCAL_IP=`/sbin/ip addr show enp1f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
export SPARK_LOCAL_DIRS=/tmp/spark-tmp

    Add slaves:
    Create a copy of the slaves configuration file template, rename it to slaves (in $SPARK_HOME/conf/), and add the following entries:
    $ vim /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves
    clx-mld-41-r
    clx-mld-42-r
    clx-mld-43-r
    ...
    clx-mld-57-r
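Since SPARK_LOCAL_DIRS points to a node-local path, the directory must exist on every node. A sketch using Spark's sbin/slaves.sh helper, which runs a command on each host listed in conf/slaves:

cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
sbin/slaves.sh mkdir -p /tmp/spark-tmp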

Hadoop Configuration

  • Copy  conf/slaves to hadoop-2.7.4/etc/hadoop/slaves


    cp /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves /share/spark_rdma/hadoop-2.7.4/etc/hadoop/slaves
  • Create the data directory for the distributed file system on every node

    sbin/slaves.sh mkdir /data/hadoop_tmp  # preferably on an NVMe disk
  • Edit current_config/core-site.xml. Add both property blocks inside the <configuration> tags.


    cd /share/spark_rdma/hadoop-2.7.4
    vim current_config/core-site.xml
    
    
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. See accompanying LICENSE file.
    -->
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                    <name>fs.default.name</name>
                    <value>hdfs://namenoder:9000</value>
            </property>
            <property>
                    <name>hadoop.tmp.dir</name>
                    <value>/data/hadoop_tmp</value>
            </property>
    </configuration>
  • Update your bash file.

    vim ~/.bashrc

    This will open your bash file in a text editor. Scroll to the bottom and add this line:

    export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
  • Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.

    source ~/.bashrc
  • Check that the paths have been properly modified.

    echo $HADOOP_HOME
  • Edit current_config/hadoop-env.sh
    Now edit the configuration file current_config/hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) and set the following parameters:
    Add the HADOOP_HOME directory.

    vim current_config/hadoop-env.sh
    
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # Set Hadoop-specific environment variables here.
    # The only required environment variable is JAVA_HOME.  All others are
    # optional.  When running a distributed configuration it is best to
    # set JAVA_HOME in this file, so that it is correctly defined on
    # remote nodes.
    # The java implementation to use.
    export JAVA_HOME=${JAVA_HOME}
    export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
    export HADOOP_PREFIX=$HADOOP_HOME
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/


HDFS Configuration

  • Edit  current_config/hdfs-site.xml

    vim current_config/hdfs-site.xml
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                    <name>dfs.datanode.dns.interface</name>
                    <value>enp1f0</value>
            </property>
            <property>
                    <name>dfs.replication</name>
                    <value>1</value>
            </property>
            <property>
                    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
                    <value>false</value>
            </property>
            <property>
                    <name>dfs.permissions</name>
                    <value>false</value>
            </property>
            <property>
                    <name>dfs.datanode.data.dir</name>
                    <value>/data/hadoop_tmp</value>
            </property>
            </configuration>

YARN Configuration

NOTE: We do not use YARN in our example, but you can use it in your deployment.

  • Edit  current_config/yarn-site.xml 

    vim current_config/yarn-site.xml
    
    <?xml version="1.0"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
        http://www.apache.org/licenses/LICENSE-2.0
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    <configuration>
    <!-- Site specific YARN configuration properties -->
           <property>
                    <name>yarn.nodemanager.aux-services</name>
                    <value>mapreduce_shuffle</value>
            </property>
            <property>
                    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.resource-tracker.address</name>
                    <value>namenoder:8025</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.scheduler.address</name>
                    <value>namenoder:8030</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.admin.address</name>
                    <value>namenoder:8032</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.webapp.address</name>
                    <value>namenoder:8034</value>
            </property>
            <property>
                    <name>yarn.resourcemanager.address</name>
                    <value>namenoder:8101</value>
            </property>
            <property>
                    <name>yarn.nodemanager.resource.memory-mb</name>
                    <value>40960</value>
            </property>
            <property>
                    <name>yarn.scheduler.maximum-allocation-mb</name>
                    <value>40960</value>
            </property>
            <property>
                    <name>yarn.scheduler.minimum-allocation-mb</name>
                    <value>2048</value>
            </property>
           <property>
                   <name>yarn.nodemanager.resource.cpu-vcores</name>
                   <value>20</value>
            </property>
            <property>
                    <name>yarn.nodemanager.disk-health-checker.enable</name>
                    <value>false</value>
            </property>
            <property>
                    <name>yarn.nodemanager.log-dirs</name>
                    <value>/tmp/yarn_nm/</value>
            </property>
            <property>
                    <name>yarn.log-aggregation-enable</name>
                    <value>true</value>
            </property>
    </configuration>
  • Edit current_config/yarn-env.sh and specify the Hadoop home directory. 

    vim current_config/yarn-env.sh
    ...
    export HADOOP_HOME=/share/data/hadoop-2.7.4
    ....
  • Copy  current_config/* to etc/hadoop/.

    cp current_config/* etc/hadoop/
  • Execute the following command on the NameNode host machine to format HDFS:

    bin/hdfs namenode -format

Start Spark Standalone Cluster

Run Spark on Top of a Hadoop Cluster

 


To start the Spark services, run the following commands on the Master:

cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
sbin/start-all.sh
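The daemon lists below also include the HDFS and YARN processes; a sketch of starting them from the Hadoop home configured earlier (YARN is optional in this setup):

cd /share/spark_rdma/hadoop-2.7.4
sbin/start-dfs.sh
sbin/start-yarn.sh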

Check whether Services have been Started

Check daemons on Master

jps

33970 Jps
47928 ResourceManager
48121 NodeManager
47529 DataNode
47246 NameNode

 Check daemons on Slaves

jps

1846 NodeManager
16491 Jps
1659 DataNode


Check HDFS and YARN (optional) statuses


cd /share/spark_rdma/hadoop-2.7.4

bin/hdfs dfsadmin -report | grep Name

Name: 31.31.31.43:50010 (clx-mld-43-r)

Name: 31.31.31.47:50010 (clx-mld-47-r)

Name: 31.31.31.48:50010 (clx-mld-48-r)

Name: 31.31.31.45:50010 (clx-mld-45-r)

Name: 31.31.31.42:50010 (clx-mld-42-r)

Name: 31.31.31.41:50010 (namenoder)

...
Name: 31.31.31.57:50010 (clx-mld-57-r)


bin/hdfs dfsadmin -report | grep Name -c

8


bin/yarn node -list

18/02/20 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at namenoder/31.31.31.41:8101

Total Nodes:8

         Node-Id      Node-State Node-Http-Address Number-of-Running-Containers

clx-mld-42-r:34873         RUNNING clx-mld-42-r:8042                            0

clx-mld-47-r:35045         RUNNING clx-mld-47-r:8042                            0

clx-mld-48-r:44996         RUNNING clx-mld-48-r:8042                            0

clx-mld-46-r:45432         RUNNING clx-mld-46-r:8042                            0

clx-mld-45-r:41307         RUNNING clx-mld-45-r:8042                            0

...

clx-mld-57-r:44311         RUNNING clx-mld-57-r:8042                            0

namenoder:41409         RUNNING    namenoder:8042                          0


Stop the Cluster

To stop the Spark services, run the following commands on the Master:


cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7

sbin/stop-all.sh


Performance Tuning for NVIDIA Adapters

It is recommended to run the mlnx_tune utility, which runs several system checks and flags any settings that may cause performance degradation.

You are welcome to act on its findings.

For more information about mlnx_tune, see this post: HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool.

The command also displays the CPU cores used by the network driver. This information is used below for Spark performance tuning.


sudo mlnx_tune

2017-08-16 14:47:17,023 INFO Collecting node information

2017-08-16 14:47:17,023 INFO Collecting OS information

2017-08-16 14:47:17,026 INFO Collecting CPU information

2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information

. . .

Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

. . .
2018-02-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_180416_144716.log


After running the mlnx_tune command, it is highly recommended to set the cpuList parameter (described in the Configuration Properties section of the SparkRDMA plugin documentation).
Change the spark.conf file to use the NUMA cores associated with the NVIDIA device.

Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]  
spark.shuffle.rdma.cpuList 0-15

More in-depth performance resources can be found in the NVIDIA Community post Performance Tuning for NVIDIA Adapters.

SparkRDMA Performance Tips

  • Compression! Spark enables compression as the default runtime option. 
    Using compression results in smaller packets being sent between the nodes, at the expense of higher CPU utilization to compress the data. 
    Due to the high performance and low CPU overhead network properties of an RDMA network, it is recommended to disable compression when using SparkRDMA. 
    In your spark.conf file set:


    spark.shuffle.compress          false
    spark.shuffle.spill.compress    false


    By disabling compression, you will be able to reclaim precious CPU cycles that were previously used for data compression/decompression, 
    and will also see additional performance benefits in the RDMA data transfer speeds.

  • Disk Media! To achieve the highest and most consistent performance results, it is recommended to use the highest-performance disk media available. 
    Using a ramdrive or NVMe device for the spark-tmp and hadoop tmp files should be explored whenever possible; see the sketch below.
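A minimal sketch of backing the Spark temporary directory with a tmpfs ramdrive (the mount point matches the SPARK_LOCAL_DIRS set earlier; the size is an assumption to fit your RAM budget):

sudo mkdir -p /tmp/spark-tmp
sudo mount -t tmpfs -o size=64g tmpfs /tmp/spark-tmp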

Conclusion

Congratulations, you now have an RDMA-accelerated Spark cluster ready to work.

Now we will run the HiBench Suite benchmarks for TCP vs. RoCE comparison.

Appendix A: Running HiBench with SparkRDMA

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput, and system resource utilization. 
It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight, enhanced DFSIO, and more.

Environment

  • Instance type and environment: see Setup Overview
  • OS: Ubuntu 16.04.3 LTS
  • Apache Hadoop: 2.7.4, HDFS (1 namenode, 16 datanodes)
  • Spark: 2.2 standalone, 17 nodes
  • Benchmark: Setup HiBench
  • Test dates: March 2018 (TeraSort), April 2018 (Sort)

Benchmarks runs

Steps to reproduce the TeraSort result:

  1. Setup HiBench

  2. Configure Hadoop and Spark settings in HiBench conf directory.

  3. In HiBench/conf/hibench.conf set:

    hibench.scale.profile bigdata 
    # Mapper number in hadoop, partition number in Spark 
    hibench.default.map.parallelism 1000 
    # Reducer number in hadoop, shuffle partition number in Spark 
    hibench.default.shuffle.parallelism 700



  4. Set in HiBench/conf/workloads/micro/terasort.conf:

    hibench.terasort.bigdata.datasize               1890000000
  5. Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh


  6. Open HiBench/report/hibench.report:

    Type               Date       Time       Input_data_size   Duration(s)    Throughput(bytes/s)     Throughput/node 
    ScalaSparkTerasort 2018-03-26 19:13:52   189000000000      79.931         2364539415              2364539415
  7. Add to HiBench/conf/spark.conf:

    spark.driver.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar 
    spark.executor.extraClassPath /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar 
    spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager 
    spark.shuffle.compress false 
    spark.shuffle.spill.compress false
  8. Run HiBench/bin/workloads/micro/terasort/spark/run.sh

  9. Open HiBench/report/hibench.report:

    Type               Date       Time     Input_data_size     Duration(s)     Throughput(bytes/s)      Throughput/node 
    ScalaSparkTerasort 2018-03-26 19:13:52 189000000000        79.931          2364539415               2364539415 
    ScalaSparkTerasort 2018-03-26 19:17:13 189000000000        52.166          3623049495               3623049495


    Overall improvement: ~1.53x reduction in execution time (79.931 s → 52.166 s).


Steps to reproduce Scala sort result:

  1. In HiBench/conf/hibench.conf set:

    hibench.scale.profile bigdata 
    # Mapper number in hadoop, partition number in Spark 
    hibench.default.map.parallelism 1000
    # Reducer number in hadoop, shuffle partition number in Spark 
    hibench.default.shuffle.parallelism 7000
  2. Run HiBench/bin/workloads/micro/sort/prepare/prepare.sh and HiBench/bin/workloads/micro/sort/spark/run.sh

  3. Open HiBench/report/hibench.report:

    Type           Date       Time     Input_data_size    Duration(s)     Throughput(bytes/s)     Throughput/node 
    ScalaSparkSort 2018-04-03 19:13:24 307962225944       37.898          8126081216              8126081216 
    ScalaSparkSort 2018-04-03 21:16:56 307962098703       48.608          6335625796              6335625796


    Overall improvement: ~1.28x reduction in execution time (48.608 s vs. 37.898 s).


Done!

