Introduction
This Reference Deployment Guide (RDG) demonstrates a multi-node cluster deployment procedure for RoCE-accelerated Apache Spark 2.2.0 over an NVIDIA end-to-end 100 Gb/s Ethernet solution.
This document describes the process of installing a pre-built Spark 2.2.0 standalone cluster of 17 physical nodes running Ubuntu 16.04.3 LTS.
The HDFS cluster includes 1 namenode server and 16 datanodes.
We will show how to prepare the network for RoCE traffic according to NVIDIA recommendations, and will provide all steps required on the host and switch sides.
References
- Apache Spark™: http://spark.apache.org/
- Running Spark on YARN - Spark 2.2.0 Documentation
- Cluster Mode Overview - Spark 2.2.0 Documentation
- SparkRDMA - Apache Spark RDMA plugin
- Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark
- HiBench - the big data benchmark suite
- NVIDIA Onyx™ Advanced Ethernet Operating System
- NVIDIA OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
Overview
What is Apache Spark™?
Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
NVIDIA SparkRDMA Plugin
Apache Spark™ replaces MapReduce
MapReduce, as implemented in Hadoop, is a popular and widely used engine. In spite of its popularity, MapReduce suffers from high latency, and its batch-mode response is painful for many applications that process and analyze data.
Apache Spark is a general-purpose engine like MapReduce, but is designed to run much faster and to support many more workloads.
One of the most interesting features of Spark is its efficient use of memory, while MapReduce has always worked primarily with data stored on disk.
Accelerating Spark Shuffle
Shuffling is the process of redistributing data across partitions (that is, re-partitioning) between stages of computation. It is a costly process that should be avoided when possible.
In Hadoop, shuffle writes intermediate files to disk, and these files are pulled by the next step/stage. With Spark shuffle, datasets are kept in memory, which keeps the data within reach.
However, when working in a cluster, network resources are required for fetching data blocks, adding to the overall execution time.
The SparkRDMA plugin accelerates the network fetch of data blocks using RDMA/RoCE technology, which reduces CPU usage and overall execution time.
SparkRDMA Plugin
SparkRDMA plugin is a high-performance, scalable and efficient ShuffleManager open-source plugin for Apache Spark.
It utilizes RDMA/RoCE (Remote Direct Memory Access / RDMA over Converged Ethernet) technology to reduce the CPU cycles needed for shuffle data transfers, and it reduces memory usage by reusing memory for transfers rather than copying data multiple times, as the traditional TCP stack does.
SparkRDMA plugin is built to provide the best performance out-of-the-box. Additionally, it provides multiple configuration options to further tune SparkRDMA on a per-job basis.
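As a quick preview (the full, step-by-step enablement appears in Appendix A), below is a minimal spark.conf sketch for switching Spark's shuffle over to the plugin; the jar path placeholder is illustrative and must point to wherever you store the SparkRDMA jar:
```
# Minimal SparkRDMA enablement sketch (see Appendix A) -- jar path is a placeholder
spark.driver.extraClassPath    /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath  /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager          org.apache.spark.shuffle.rdma.RdmaShuffleManager
```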
Performance
In our tests, the TeraSort benchmark shows a 1.53x reduction in overall execution time, and the Sort benchmark shows a 1.28x reduction.
Setup Overview
Before you start, make sure you are familiar with the Apache Spark multi-node cluster architecture; see Cluster Mode Overview - Spark 2.2.0 Documentation for more information.
Logical Design
Bill of Materials - BOM
In the distributed Spark/HDFS configuration described in this guide, we use the following hardware specification.
This document does not cover the servers' storage aspect. You should configure the servers with the storage components appropriate to your use case (data set size).
Physical Network Connections
Network Configuration
In our reference we use a single port per server. For a single-port NIC, we wire the available port. For a dual-port NIC, we wire the first port to an Ethernet switch and leave the second port unused.
We will cover the procedure later in the Installing NVIDIA OFED section.
Each server is connected to the SN2700 switch by a 100GbE copper cable.
The switch port connectivity in our case is as follows:
- Port 1 – connected to the Namenode server
- Ports 2-17 – connected to the Worker servers
Server names and network configuration are provided below:
Server type | Server name | Internal network - 100 GigE | Management network - 1 GigE
---|---|---|---
Node 01 (master) | clx-mld-41 | enp1f0: 31.31.31.41 | eno0: From DHCP (reserved)
Node 02 | clx-mld-42 | enp1f0: 31.31.31.42 | eno0: From DHCP (reserved)
Node 03 | clx-mld-43 | enp1f0: 31.31.31.43 | eno0: From DHCP (reserved)
Node 04 | clx-mld-44 | enp1f0: 31.31.31.44 | eno0: From DHCP (reserved)
Node 05 | clx-mld-45 | enp1f0: 31.31.31.45 | eno0: From DHCP (reserved)
Node 06 | clx-mld-46 | enp1f0: 31.31.31.46 | eno0: From DHCP (reserved)
Node 07 | clx-mld-47 | enp1f0: 31.31.31.47 | eno0: From DHCP (reserved)
Node 08 | clx-mld-48 | enp1f0: 31.31.31.48 | eno0: From DHCP (reserved)
Node 09 | clx-mld-49 | enp1f0: 31.31.31.49 | eno0: From DHCP (reserved)
Node 10 | clx-mld-50 | enp1f0: 31.31.31.50 | eno0: From DHCP (reserved)
Node 11 | clx-mld-51 | enp1f0: 31.31.31.51 | eno0: From DHCP (reserved)
Node 12 | clx-mld-52 | enp1f0: 31.31.31.52 | eno0: From DHCP (reserved)
Node 13 | clx-mld-53 | enp1f0: 31.31.31.53 | eno0: From DHCP (reserved)
Node 14 | clx-mld-54 | enp1f0: 31.31.31.54 | eno0: From DHCP (reserved)
Node 15 | clx-mld-55 | enp1f0: 31.31.31.55 | eno0: From DHCP (reserved)
Node 16 | clx-mld-56 | enp1f0: 31.31.31.56 | eno0: From DHCP (reserved)
Node 17 | clx-mld-57 | enp1f0: 31.31.31.57 | eno0: From DHCP (reserved)
Network Switch Configuration
If you are not familiar with NVIDIA switch software, please start with the HowTo Get Started with NVIDIA Switches guide. For more information, refer to the MLNX-OS User Manual located at support.mellanox.com or www.mellanox.com -> Switches.
As a first step, update your switch OS to the latest ONYX OS software, using the guide HowTo Upgrade MLNX-OS Software on NVIDIA Switch Systems.
We will accelerate Spark by using RDMA transport.
There are several industry-standard network configurations for RoCE deployment.
You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment for our recommendations and instructions.
In our deployment we will configure our network to be lossless and will use DSCP on host and switch sides:
- For the switch side, configure your switch according to the Lossless RoCE Configuration for MLNX-OS Switches in DSCP-Based QoS Mode document.
- The host side will be covered later in the Installing MLNX_OFED for Ubuntu on the Master and Workers section.
Below is our switch configuration, which you can use as a reference. You can copy/paste it to your switch, but be aware that this is a clean switch configuration and applying it may corrupt your existing configuration.
```
swx-mld-1-2 [standalone: master] > enable
swx-mld-1-2 [standalone: master] # configure terminal
swx-mld-1-2 [standalone: master] (config) # show running-config
##
## Running database "initial"
## Generated at 2018/03/10 09:38:38 +0000
## Hostname: swx-mld-1-2
##

##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable

##
## License keys
##
license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD

##
## Interface Ethernet buffer configuration
##
traffic pool roce type lossless
traffic pool roce memory percent 50.00
traffic pool roce map switch-priority 3

##
## LLDP configuration
##
lldp

##
## QoS switch configuration
##
interface ethernet 1/1-1/32 qos trust L3
interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

##
## DCBX ETS configuration
##
interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

##
## Other IP configuration
##
hostname swx-mld-1-2

##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********

##
## Network management configuration
##
# web proxy auth basic password ********

##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991
# (public-cert config omitted since private-key config is hidden)

##
## Persistent prefix mode setting
##
cli default prefix-modes enable
```
Master and Worker Servers Installation and Configuration
Prerequisites
Updating Ubuntu Software Packages on the Master and Worker Servers
To update/upgrade Ubuntu software packages, run the commands below.
```
sudo apt-get update      # Fetches the list of available updates
sudo apt-get upgrade -y  # Strictly upgrades the current packages
```
Installing General Dependencies on the Master and Worker Servers
To install general dependencies, run the command below:
```
sudo apt-get install git bc
```
Installing Java 8 (Recommended Oracle Java) on the Master and Worker Servers
```
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
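To confirm the installation succeeded, you can check the reported version (a quick sanity check; the exact update number will vary):
```
java -version   # should report a 1.8.x Oracle build
```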
Adding Entries to Host files on the Master and Worker Servers
Edit the hosts file:
```
sudo vim /etc/hosts
```
Now add entries for the namenode (master) and worker servers:
```
127.0.0.1       localhost
127.0.1.1       clx-mld-42.local.domain clx-mld-42

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters

31.31.31.41     namenoder
31.31.31.42     clx-mld-42-r
31.31.31.43     clx-mld-43-r
31.31.31.44     clx-mld-44-r
31.31.31.45     clx-mld-45-r
31.31.31.46     clx-mld-46-r
31.31.31.47     clx-mld-47-r
31.31.31.48     clx-mld-48-r
31.31.31.49     clx-mld-49-r
31.31.31.50     clx-mld-50-r
31.31.31.51     clx-mld-51-r
31.31.31.52     clx-mld-52-r
31.31.31.53     clx-mld-53-r
31.31.31.54     clx-mld-54-r
31.31.31.55     clx-mld-55-r
31.31.31.56     clx-mld-56-r
31.31.31.57     clx-mld-57-r
```
Required Software
Prior to installing and configuring an Apache Spark and SparkRDMA environment, the following software must be downloaded.
Creating Network File System (NFS) Share
Install the NFS server on the Master server. Create the directory /share/spark_rdma and export it to all Worker servers.
Install the NFS client on all Worker servers. Mount /share/spark_rdma export from Master server to /share/spark_rdma local directory (the same path as on Master).
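Below is a minimal sketch of one way to set this up on Ubuntu 16.04, assuming the 31.31.31.0/24 internal network and the namenoder alias from the table above; adjust the export network and options to your environment:
```
# On the Master (NFS server)
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /share/spark_rdma
echo "/share/spark_rdma 31.31.31.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On each Worker (NFS client)
sudo apt-get install -y nfs-common
sudo mkdir -p /share/spark_rdma
sudo mount namenoder:/share/spark_rdma /share/spark_rdma
```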
Configuring SSH
We will configure passwordless SSH access from Master to all Slave nodes.
Install the OpenSSH server and client on both the Master and Slave nodes:
```
sudo apt-get install openssh-server openssh-client
```
On the Master node, generate a key pair:
```
ssh-keygen -t rsa -P ""
```
- Copy the content of the .ssh/id_rsa.pub file on the Master to the .ssh/authorized_keys file on all nodes (Master and Slave nodes), as sketched below.
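A minimal sketch of one way to distribute the key with ssh-copy-id, assuming the host aliases defined in /etc/hosts above:
```
# Run on the Master; appends ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each node
ssh-copy-id namenoder
for i in $(seq 42 57); do
    ssh-copy-id clx-mld-$i-r
done
```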
Check that you can access the Slave nodes from the Master:
```
ssh clx-mld-41-r
ssh clx-mld-42-r
ssh clx-mld-43-r
...
ssh clx-mld-57-r
```
Downloading Apache Spark
- Go to Downloads | Apache Spark and download the spark-2.2.0-bin-hadoop2.7.tgz package to the /share/spark_rdma shared folder:
  - Choose a Spark release: 2.2.0 (Jul 11 2017)
  - Choose a package type: Pre-built for Apache Hadoop 2.7 and later
  - Download Spark: spark-2.2.0-bin-hadoop2.7.tgz from https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
- Verify this release using the 2.2.0 signatures and checksums and the project release KEYS, as sketched below.
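A sketch of fetching and verifying the archive from the command line; the mirror below is the Apache archive, and the reference checksum must be taken from the Apache download page:
```
cd /share/spark_rdma
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
# Compare the result against the checksum published on the download page
md5sum spark-2.2.0-bin-hadoop2.7.tgz
```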
Downloading NVIDIA SparkRDMA 2.0
Download the SparkRDMA Release Version 2.0 and save it in the /share/spark_rdma shared folder.
All-new implementation of SparkRDMA, redesigned from the ground up to further increase scalability, robustness and most importantly - performance.
Among the new features and capabilities introduced in this release:
- All-new Metadata (Map Output) fetching protocol - now allows scaling to the tens of thousands of partitions, with superior performance and recoverability
- Software-level flow control in RdmaChannel - eliminates pause storms in the fabric
- ODP (On-Demand Paging) support - improves memory efficiency
Attached are pre-built binaries. Please follow the README page for instructions.
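A sketch of fetching the release into the shared folder; the exact asset URL below is an assumption (verify it on the SparkRDMA GitHub releases page), while the archive name matches what we untar later in this guide:
```
cd /share/spark_rdma
# Hypothetical asset URL -- verify on the SparkRDMA GitHub releases page
wget https://github.com/Mellanox/SparkRDMA/releases/download/v2.0/spark-rdma-2.0.tgz
```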
Cloning the HiBench Suite 7.0 Repository
To clone the latest HiBench repository, run the following commands:
```
cd /share/spark_rdma
git clone https://github.com/intel-hadoop/HiBench.git
```
The preceding git clone command creates a subdirectory called “HiBench”. After cloning, you may optionally build a specific branch (such as a release branch) by invoking the following commands:
```
cd HiBench
git checkout master   # where master is the desired branch (the default)
cd <Path to NFS share>
```
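Note that HiBench must be built before its workloads can run. Below is a hedged sketch of building the benchmarks for our Spark 2.2 / Scala 2.11 stack; the Maven flags follow the HiBench build documentation of that era, so verify them against the README in the cloned repository:
```
cd HiBench
# Build flags are an assumption based on HiBench docs -- check the project README
mvn -Dspark=2.2 -Dscala=2.11 clean package
```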
Installing MLNX_OFED for Ubuntu on the Master and Workers
This chapter describes how to install and test the MLNX_OFED for Linux package on a host machine with an NVIDIA ConnectX®-5 adapter card installed.
For more information, see the NVIDIA OFED for Linux User Manual.
Downloading NVIDIA OFED
Verify that the system has an NVIDIA network adapter (HCA/NIC) installed:
```
lspci -v | grep Mellanox
```
The following example shows a system with an installed NVIDIA HCA:
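Illustrative output for a ConnectX-5 adapter (device names and PCI addresses vary by system):
```
08:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
        Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
```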
- Download the ISO image according to your OS to your host.
The image’s name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso.
You can download it from: http://www.mellanox.com > Products > Software > InfiniBand/VPI Drivers > Mellanox OFED Linux (MLNX_OFED) > Download.
Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page:
```
md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
```
Installing NVIDIA OFED
MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:
- Discovers the currently installed kernel.
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack.
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel).
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware.
The installation script removes all previously installed NVIDIA OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
1. Log into the installation machine as root.
2. Copy the downloaded ISO to /root.
3. Mount the ISO image on your machine:
```
mkdir /mnt/iso
mount -o loop /root/MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.iso /mnt/iso
cd /mnt/iso
```
4. Run the installation script:
```
./mlnxofedinstall --all
```
5. Reboot after the installation has finished successfully:
```
# /etc/init.d/openibd restart
# reboot
```
ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX-5 VPI ports are initialized as InfiniBand ports.
6. Check that the port modes are Ethernet:
```
ibv_devinfo
```
7. If ibv_devinfo shows the ports in InfiniBand mode, you need to change the port type to Ethernet.
Change the mode to Ethernet using the mlxconfig script after the driver is loaded.
* LINK_TYPE_P1=2 is Ethernet mode
a. Start mst and list the port names:
```
mst start
mst status
```
b. Change the port mode to Ethernet:
```
mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2
# Expected: "Port 1 set to ETH mode"
reboot
```
c. Query the Ethernet devices and print the information about them that is available for use from userspace:
```
ibv_devinfo
```
d. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the InfiniBand devices/ports:
```
ibdev2netdev
```
e. Configure the network interface:
```
ifconfig ens13f0 31.31.31.28 netmask 255.255.255.0
```
f. Insert the lines below into the /etc/network/interfaces file after the following lines:
```
vim /etc/network/interfaces

auto eno1
iface eno1 inet dhcp
```
The new lines:
```
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0
```
Example:
```
vim /etc/network/interfaces

auto eno1
iface eno1 inet dhcp

auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0
```
g. Check that the network configuration is set correctly:
```
ifconfig -a
```
Lossless Fabric with L3 (DSCP) Configuration
This section provides a configuration example for NVIDIA devices installed with MLNX_OFED running RoCE over a lossless network, in DSCP-based QoS mode.
Sample for our environment:
Notations
- <interface> refers to the parent interface (for example, ens13f0) and <mlx-device> refers to the mlx device (for example, mlx5_0); find both by running:
```
ibdev2netdev
```
- <mst-device> refers to the MST device (for example, /dev/mst/mt4121_pciconf0); find it by running:
```
mst start
mst status
```
Configuration:
```
# Trust DSCP (L3) markings on the port
mlnx_qos -i ens13f0 --trust dscp
# Set the default traffic class (ToS 106 = DSCP 26) for RoCE traffic
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# Set the same ToS for RDMA-CM QPs
cma_roce_tos -d mlx5_0 -t 106
# Enable ECN for TCP traffic
sysctl -w net.ipv4.tcp_ecn=1
# Enable PFC on priority 3
mlnx_qos -i ens13f0 --pfc 0,0,0,1,0,0,0,0
```
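To sanity-check that the settings took effect, you can display the current QoS state (a verification sketch; not part of the original configuration steps):
```
# Shows the trust mode and the per-priority PFC state
mlnx_qos -i ens13f0
# Confirm ECN is enabled for TCP
sysctl net.ipv4.tcp_ecn
```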
Validate MOFED
Check the MLNX_OFED version and that the uverbs device is present:
```
ofed_info -s
MLNX_OFED_LINUX-4.5-1.0.1.0

ls /dev/infiniband/uverbs1
```
Run a bandwidth stress test over the RoCE fabric:

Side | Command
---|---
Server | ib_write_bw -a -d mlx5_0 &
Client | ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits

In this way you can run a bandwidth stress test over RoCE between any pair of servers.
Configuring the Environment
Untar the Spark and SparkRDMA archives:
```
cd /share/spark_rdma
tar -xzvf spark-2.2.0-bin-hadoop2.7.tgz
tar -xzvf spark-rdma-2.0.tgz
```
All the scripts, jars, and configuration files are available in the newly created "spark-2.2.0-bin-hadoop2.7" directory.
Spark Configuration
Update your bash file:
```
vim ~/.bashrc
```
This opens your bash file in a text editor. Scroll to the bottom and add these lines:
```
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export SPARK_HOME=/share/spark_rdma/spark-2.2.0-bin-hadoop2.7/
```
Once you save and close the text file, return to your original terminal and type this command to reload your .bashrc file:
```
source ~/.bashrc
```
Check that the paths have been properly modified:
```
echo $JAVA_HOME
echo $SPARK_HOME
echo $LD_LIBRARY_PATH
```
- Edit spark-env.sh
Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/) and set the following parameters.
Create a copy of the spark-env.sh template and rename it. Add the master hostname, the interface and the spark_tmp directory:
```
cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored. (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
#export SPARK_MASTER_HOST=spark1-r
export SPARK_MASTER_HOST=namenoder
export SPARK_LOCAL_IP=`/sbin/ip addr show enp1f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
export SPARK_LOCAL_DIRS=/tmp/spark-tmp
```
- Add slaves
Create a copy of the slaves template configuration file, rename it to slaves (in $SPARK_HOME/conf/) and add the following entries:
```
$ vim /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves
clx-mld-41-r
clx-mld-42-r
clx-mld-43-r
...
clx-mld-57-r
```
Hadoop Configuration
Copy conf/slaves to hadoop-2.7.4/etc/hadoop/slaves:
```
cp /share/spark_rdma/spark-2.2.0-bin-hadoop2.7/conf/slaves /share/spark_rdma/hadoop-2.7.4/etc/hadoop/slaves
```
Create the distributed file system data directory on all nodes:
```
sbin/slaves.sh mkdir /data/hadoop_tmp   # on an NVMe disk
```
Edit current_config/core-site.xml. Add both property tags inside the <configuration> tags:
```
cd /share/spark_rdma/hadoop-2.7.4
vim current_config/core-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://namenoder:9000</value>
 </property>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop_tmp</value>
 </property>
</configuration>
```
Update your bash file:
```
vim ~/.bashrc
```
This opens your bash file in a text editor. Scroll to the bottom and add this line:
```
export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
```
Once you save and close the text file, return to your original terminal and type this command to reload your .bashrc file:
```
source ~/.bashrc
```
Check that the path has been properly modified:
```
echo $HADOOP_HOME
```
Edit current_config/hadoop-env.sh
Now edit the configuration file current_config/hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) and set the following parameters. Add the HADOOP_HOME directory:
```
vim current_config/hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}
export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4/
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
```
HDFS Configuration
Edit current_config/hdfs-site.xml
```
vim current_config/hdfs-site.xml

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
  <name>dfs.datanode.dns.interface</name>
  <value>enp1f0</value>
 </property>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>
 <property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
 </property>
 <property>
  <name>dfs.permissions</name>
  <value>false</value>
 </property>
 <property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hadoop_tmp</value>
 </property>
</configuration>
```
YARN Configuration
NOTE: We do not use YARN in our example, but you can use it in your deployment.
Edit current_config/yarn-site.xml
```
vim current_config/yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
 <property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
 </property>
 <property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
 </property>
 <property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>namenoder:8025</value>
 </property>
 <property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>namenoder:8030</value>
 </property>
 <property>
  <name>yarn.resourcemanager.admin.address</name>
  <value>namenoder:8032</value>
 </property>
 <property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>namenoder:8034</value>
 </property>
 <property>
  <name>yarn.resourcemanager.address</name>
  <value>namenoder:8101</value>
 </property>
 <property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
 </property>
 <property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>40960</value>
 </property>
 <property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
 </property>
 <property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>20</value>
 </property>
 <property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>false</value>
 </property>
 <property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/tmp/yarn_nm/</value>
 </property>
 <property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
 </property>
</configuration>
```
Edit current_config/yarn-env.sh. Specify the Hadoop directory:
```
vim current_config/yarn-env.sh
...
export HADOOP_HOME=/share/spark_rdma/hadoop-2.7.4
...
```
Copy current_config/* to etc/hadoop/:
```
cp current_config/* etc/hadoop/
```
Execute the following command on the NameNode host machine to format HDFS:
```
bin/hdfs namenode -format
```
Start Spark Standalone Cluster
Run Spark on Top of a Hadoop Cluster
To start the Spark services, run the following commands on the Master:
```
cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
sbin/start-all.sh
```
Check Whether the Services Have Started
Check daemons on the Master:
```
jps
33970 Jps
47928 ResourceManager
48121 NodeManager
47529 DataNode
47246 NameNode
```
Check daemons on the Slaves:
```
jps
1846 NodeManager
16491 Jps
1659 DataNode
```
Check the HDFS and YARN (optional) statuses:
```
cd /share/spark_rdma/hadoop-2.7.4
bin/hdfs dfsadmin -report | grep Name
Name: 31.31.31.43:50010 (clx-mld-43-r)
Name: 31.31.31.47:50010 (clx-mld-47-r)
Name: 31.31.31.48:50010 (clx-mld-48-r)
Name: 31.31.31.45:50010 (clx-mld-45-r)
Name: 31.31.31.42:50010 (clx-mld-42-r)
Name: 31.31.31.41:50010 (namenoder)
...
Name: 31.31.31.57:50010 (clx-mld-57-r)

bin/hdfs dfsadmin -report | grep Name -c
8

bin/yarn node -list
18/02/20 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at namenoder/31.31.31.41:8101
Total Nodes:8
        Node-Id        Node-State   Node-Http-Address   Number-of-Running-Containers
clx-mld-42-r:34873     RUNNING      clx-mld-42-r:8042   0
clx-mld-47-r:35045     RUNNING      clx-mld-47-r:8042   0
clx-mld-48-r:44996     RUNNING      clx-mld-48-r:8042   0
clx-mld-46-r:45432     RUNNING      clx-mld-46-r:8042   0
clx-mld-45-r:41307     RUNNING      clx-mld-45-r:8042   0
...
clx-mld-57-r:44311     RUNNING      clx-mld-57-r:8042   0
namenoder:41409        RUNNING      namenoder:8042      0
```
Stop the Cluster
To stop the Spark services, run the following commands on the Master:
```
cd /share/spark_rdma/spark-2.2.0-bin-hadoop2.7
sbin/stop-all.sh
```
Performance Tuning for NVIDIA Adapters
It is recommended to run the mlnx_tune utility, which runs several system checks and flags any settings that may cause performance degradation.
You are welcome to act on its findings.
For more information about mlnx_tune, see this post: HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool.
The command also displays the CPU cores used by the network driver. This information will be used below for Spark performance tuning.
```
sudo mlnx_tune
2017-08-16 14:47:17,023 INFO Collecting node information
2017-08-16 14:47:17,023 INFO Collecting OS information
2017-08-16 14:47:17,026 INFO Collecting CPU information
2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information
. . .
Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
. . .
2018-02-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_180416_144716.log
```
After running the mlnx_tune command, it is highly recommended to set the cpuList parameter (described in the Configuration Properties section of the SparkRDMA plugin documentation).
Change the spark.conf file to use the NUMA cores associated with the NVIDIA device:
```
Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

spark.shuffle.rdma.cpuList 0-15
```
More in-depth performance resources can be found in the NVIDIA Community post: Performance Tuning for NVIDIA Adapters
SparkRDMA Performance Tips
- Compression! Spark enables compression as the default runtime option.
Using compression results in smaller packet sizes being sent between the nodes, but at the expense of higher CPU utilization to compress the data.
Due to the high performance and low CPU overhead of an RDMA network, it is recommended to disable compression when using SparkRDMA. In your spark.conf file, set:
```
spark.shuffle.compress        false
spark.shuffle.spill.compress  false
```
By disabling compression, you will reclaim precious CPU cycles that were previously used for data compression/decompression, and you will also see additional performance benefits in the RDMA data transfer speeds.
- Disk Media! In order to see the highest and most consistent performance results possible, it is recommended to use the highest-performance disk media available.
Using a ramdrive or NVMe device for the spark-tmp and hadoop tmp files should be explored whenever possible; a sketch follows below.
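A minimal sketch of backing the Spark temporary directory with a RAM-based tmpfs; the mount point matches SPARK_LOCAL_DIRS from spark-env.sh above, while the size is an assumption you should adjust to your shuffle data volume:
```
# Back /tmp/spark-tmp with RAM (size is an assumption; contents are lost on reboot)
sudo mkdir -p /tmp/spark-tmp
sudo mount -t tmpfs -o size=64G tmpfs /tmp/spark-tmp
```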
Conclusion
Congratulations, you now have an RDMA-accelerated Spark cluster ready for work.
Next, we will run the HiBench suite benchmarks for a TCP vs. RoCE comparison.
Appendix A: Running HiBench with SparkRDMA
HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization.
It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO.
Environment
- Instance type and environment: see Setup Overview
- OS: Ubuntu 16.04.3 LTS
- Apache Hadoop: 2.7.4, HDFS (1 namenode, 16 datanodes)
- Spark: 2.2 standalone, 17 nodes
- Benchmark: Setup HiBench
- Test dates: March 2018 (TeraSort), April 2018 (Sort)
Benchmark Runs
Steps to reproduce the TeraSort result:
Configure the Hadoop and Spark settings in the HiBench conf directory.
In HiBench/conf/hibench.conf set:
```
hibench.scale.profile                bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism      1000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism  700
```
Set in HiBench/conf/workloads/micro/terasort.conf:
```
hibench.terasort.bigdata.datasize 1890000000
```
Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh
Open HiBench/report/hibench.report:
```
Type                Date        Time      Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort  2018-03-26  19:13:52  189000000000     79.931       2364539415           2364539415
```
Add to HiBench/conf/spark.conf:
```
spark.driver.extraClassPath    /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.executor.extraClassPath  /PATH/TO/spark-rdma-2.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar
spark.shuffle.manager          org.apache.spark.shuffle.rdma.RdmaShuffleManager
spark.shuffle.compress         false
spark.shuffle.spill.compress   false
```
- Run HiBench/bin/workloads/micro/terasort/spark/run.sh
Open HiBench/report/hibench.report:
```
Type                Date        Time      Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
ScalaSparkTerasort  2018-03-26  19:13:52  189000000000     79.931       2364539415           2364539415
ScalaSparkTerasort  2018-03-26  19:17:13  189000000000     52.166       3623049495           3623049495
```
Overall improvement: from 79.931 s (TCP baseline) to 52.166 s (SparkRDMA), a ~1.53x reduction in execution time.
Steps to reproduce the Scala Sort result:
In HiBench/conf/hibench.conf set:
```
hibench.scale.profile                bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism      1000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism  7000
```
Run HiBench/bin/workloads/micro/sort/prepare/prepare.sh and HiBench/bin/workloads/micro/sort/spark/run.sh
Open HiBench/report/hibench.report:
```
Type            Date        Time      Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
ScalaSparkSort  2018-04-03  19:13:24  307962225944     37.898       8126081216           8126081216
ScalaSparkSort  2018-04-03  21:16:56  307962098703     48.608       6335625796           6335625796
```
Overall improvement: 48.608 s vs. 37.898 s, a ~1.28x reduction in execution time.
Done!