Created on Feb 10, 2020 by Boris Kovalev, Peter Rudenko
Introduction
This Reference Deployment Guide (RDG) demonstrates a multi-node cluster deployment procedure for RoCE/UCX-accelerated Apache Spark 2.4/3.0 over an NVIDIA end-to-end 100 Gb/s Ethernet solution.
The walkthrough below covers the installation of a pre-built Spark 2.4.4 / Spark 3.0-preview2 standalone cluster of 15 physical nodes running Ubuntu 18.04.3 LTS, including a step-by-step procedure to prepare the network for RoCE traffic using NVIDIA recommended settings on both the host and switch sides.
The HDFS cluster includes 15 datanodes and 1 namenode server.
References
- Apache Spark: http://spark.apache.org/
- Running Spark on YARN - Spark 2.4.x Documentation
- Cluster Mode Overview - Spark 2.4.0 Documentation
- SparkUCX Shuffle Plugin
- HiBench - the big data benchmark suite
- NVIDIA Onyx™ Advanced Ethernet Operating System
- NVIDIA OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
Overview
What is Apache Spark™?
Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
NVIDIA SparkUCX Plugin
Apache Spark™ replaces MapReduce
Apache Spark is a general purpose engine like MapReduce, but is designed to run much faster and with many more workloads.
One of the most interesting features of Spark is its efficient use of memory, whereas MapReduce works primarily with data stored on disk.
MapReduce, as implemented in Hadoop, is a popular and widely used engine, but it suffers from high latency; furthermore, its batch-mode response time curbs performance for many applications that process and analyze data.
Accelerating Spark Shuffle
Shuffling is the process of redistributing data across partitions (repartitioning) between stages of computation. Shuffle is a costly process that should be avoided when possible.
In Hadoop, shuffle writes intermediate files to disk, and these files are pulled by the next step/stage. With Spark shuffle, RDDs are kept in memory so data remains within reach, but when working with a cluster, network resources are required to fetch data blocks from remote workers, which adds to the overall execution time.
Accelerating the network fetch of data blocks with RDMA (InfiniBand or RoCE) using the SparkUCX plugin reduces CPU usage and overall execution time.
SparkUCX Plugin
SparkUCX is a high-performance, scalable and efficient Shuffle-Manager plugin for Apache Spark. It utilizes RDMA (Remote Direct Memory Access) and other high-performance transports to reduce the CPU cycles needed for shuffle data transfers. It reduces memory usage by reusing memory for transfers instead of copying data multiple times down the traditional TCP stack.
The SparkUCX plugin is built to provide the best performance out of the box, and exposes multiple configuration options to further tune SparkUCX per job.
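As a quick illustration, a minimal sketch of the Spark properties that enable the plugin; the jar path below is only a placeholder, and the full configuration used in this guide appears in Appendix A:
spark.driver.extraClassPath /path/to/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
spark.executor.extraClassPath /path/to/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
spark.shuffle.manager org.apache.spark.shuffle.UcxShuffleManager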
SparkRDMA and SparkUCX Comparison
SparkRDMA | SparkUCX
---|---
Based on the abandoned IBM DiSNI verbs package | Production-grade application based on the UCX high-level API, with dedicated R&D and a wide developer community
Supports IB/RoCE with RC only | Supports IB and RoCE with RC/DC/shared memory, and TCP as a fallback
Not scalable: CQ and progress thread per connection | Scalable: CQ per executor
Communication progresses on a dedicated thread with high CPU consumption | Communications are initiated from application threads and progressed asynchronously by hardware
RDMA protocols are implemented in Java | Based on the standard UCX API and protocols, hiding the complexity of RDMA
Registers each data block with a different key | Registers all data as a single chunk
Showed improvement vs. worst-case TCP numbers | Shows improvement vs. best-case TCP numbers
Performance
TeraSort benchmark: 60% reduction overall and 30% reduction in total execution time. PageRank benchmark: 27% reduction in execution time.
Setup Overview
Before we start, make sure to familiarize yourself with the Apache Spark multi-node cluster architecture; see Cluster Mode Overview - Spark 2.4.0 Documentation for more info.
Logical Design
Bill of Materials - BoM
In the distributed SparkUCX/HDFS configuration described in this guide, we are using the following hardware setup.
Physical Network
Connectivity
Network Configuration
In our reference design we will use a single port per server.
- In the case of a single-port NIC, we will wire the available port.
- In the case of a dual-port NIC, we will wire the 1st port to the Ethernet switch and leave the 2nd port unused.
We will cover the procedure later in the NVIDIA OFED Installation section.
Each server is connected to the SN2700 switch by a 100GbE copper cable.
The switch port connectivity in our case is as follows:
- Port 1 – connected to the Namenode Server
- Port 2 to 15 – connected to Worker Servers
Server names and network configuration are provided below:
Server Type | Server Name | Internal Network - 100 GigE | Management Network - 1 GigE
---|---|---|---
Node 01 (master) | clx-mld-41 | enp1f0: 31.31.31.41 | eno0: From DHCP (reserved)
Node 02 | clx-mld-42 | enp1f0: 31.31.31.42 | eno0: From DHCP (reserved)
Node 03 | clx-mld-43 | enp1f0: 31.31.31.43 | eno0: From DHCP (reserved)
Node 04 | clx-mld-44 | enp1f0: 31.31.31.44 | eno0: From DHCP (reserved)
Node 05 | clx-mld-45 | enp1f0: 31.31.31.45 | eno0: From DHCP (reserved)
Node 06 | clx-mld-46 | enp1f0: 31.31.31.46 | eno0: From DHCP (reserved)
Node 07 | clx-mld-47 | enp1f0: 31.31.31.47 | eno0: From DHCP (reserved)
Node 08 | clx-mld-48 | enp1f0: 31.31.31.48 | eno0: From DHCP (reserved)
Node 09 | clx-mld-49 | enp1f0: 31.31.31.49 | eno0: From DHCP (reserved)
Node 10 | clx-mld-50 | enp1f0: 31.31.31.50 | eno0: From DHCP (reserved)
Node 11 | clx-mld-51 | enp1f0: 31.31.31.51 | eno0: From DHCP (reserved)
Node 12 | clx-mld-52 | enp1f0: 31.31.31.52 | eno0: From DHCP (reserved)
Node 13 | clx-mld-53 | enp1f0: 31.31.31.53 | eno0: From DHCP (reserved)
Node 14 | clx-mld-54 | enp1f0: 31.31.31.54 | eno0: From DHCP (reserved)
Node 15 | clx-mld-55 | enp1f0: 31.31.31.55 | eno0: From DHCP (reserved)
Network Switch Configuration
If you are not familiar with NVIDIA switch software, follow the HowTo Get Started with NVIDIA switches guide. For more information, please refer to the NVIDIA Onyx User Manual at https://docs.NVIDIA.com/category/onyx.
First, update the switch OS to the latest Onyx software version. For more details, see HowTo Upgrade MLNX-OS Software on NVIDIA switch systems.
We will accelerate SparkUCX by using RoCE transport.
In our deployment we will configure our network to be lossy. No additional configuration on the switch is needed.
For lossless configuration and for NVIDIA Onyx version 3.8.2004 and above run:
switch (config) #roce lossless
The settings required to run RoCE on a lossless fabric (PFC+ECN) are configured by the "roce" command.
To see the RoCE configuration run:
show roce
To monitor the RoCE counters run:
show interface ethernet counters roce
Master and Worker Servers Installation and Configuration
Prerequisites
Updating Ubuntu Software Packages on the Master and Worker Servers
To update/upgrade Ubuntu software packages, run the following commands:
sudo apt-get update # Fetches the list of available updates
sudo apt-get upgrade -y # Strictly upgrades the current packages
Installing General Dependencies on the Master and Worker Servers
To install general dependencies, run the commands below or paste each line.
sudo apt-get install git bc
Installing Java 8 (Recommended Oracle Java) on the Master and Worker Servers
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Adding Entries to Host files on the Master and Worker Servers
To edit the host files:
sudo vim /etc/hosts
Now add entries for the namenoder (master) and the worker servers:
127.0.0.1 localhost
127.0.1.1 clx-mld-42.local.domain clx-mld-42
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
31.31.31.41 namenoder
31.31.31.42 clx-mld-42-r
31.31.31.43 clx-mld-43-r
31.31.31.44 clx-mld-44-r
31.31.31.45 clx-mld-45-r
31.31.31.46 clx-mld-46-r
31.31.31.47 clx-mld-47-r
31.31.31.48 clx-mld-48-r
31.31.31.49 clx-mld-49-r
31.31.31.50 clx-mld-50-r
31.31.31.51 clx-mld-51-r
31.31.31.52 clx-mld-52-r
31.31.31.53 clx-mld-53-r
31.31.31.54 clx-mld-54-r
31.31.31.55 clx-mld-55-r
Creating Network File System (NFS) Share
Install the NFS server on the Master server. Create the directory /share/sparkucx and export it to all Worker servers.
Install the NFS client on all Worker servers. Mount the /share/sparkucx export from the Master server to the local /share/sparkucx directory (the same path as on the Master).
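A minimal sketch of both steps, assuming the standard Ubuntu NFS packages (nfs-kernel-server on the Master, nfs-common on the Workers) and that the export is opened to the 31.31.31.0/24 internal subnet:
On the Master (namenoder):
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /share/sparkucx
echo "/share/sparkucx 31.31.31.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra
On each Worker:
sudo apt-get install -y nfs-common
sudo mkdir -p /share/sparkucx
sudo mount namenoder:/share/sparkucx /share/sparkucx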
Configuring SSH
We will configure passwordless SSH access from Master to all Slave nodes.
Install the OpenSSH server and client on both the Master and Slave nodes:
sudo apt-get install openssh-server openssh-client
On Master node - Generate Key Pairs
ssh-keygen -t rsa -P ""
- Copy the content of the .ssh/id_rsa.pub file on the Master to the .ssh/authorized_keys file on all nodes (Master and Slave nodes), for example with ssh-copy-id as sketched below.
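A minimal sketch of distributing the key with ssh-copy-id, assuming the hostname pattern from the table above and that password authentication is still enabled at this point:
for i in $(seq 41 55); do
  ssh-copy-id -i ~/.ssh/id_rsa.pub clx-mld-${i}-r
done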
Check that you can access the Slave nodes from the Master:
ssh clx-mld-41-r
ssh clx-mld-42-r
ssh clx-mld-43-r
...
ssh clx-mld-55-r
Install Hadoop
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Download hadoop-2.7.7.tar.gz to the machine to install it on.
wget https://downloads.apache.org/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
Extract the compressed tar file
cd /share/sparkucx/
tar -zxvf hadoop-2.7.7.tar.gz
Downloading Apache Spark
- Go to Downloads | Apache Spark and download spark-2.4.4-bin-hadoop2.7.tgz to the /share/sparkucx shared folder.
- Choose a Spark release: 2.4.4 (Aug 30, 2019) or 3.0.0-preview2 (Dec 23, 2019)
- Choose a package type: Pre-built for Apache Hadoop 2.7
- Download Spark: spark-2.4.4-bin-hadoop2.7.tgz
https://www.apache.org/dyn/closer.lua/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
Verify this release using the 2.4.4 signatures, checksums and project release KEYS.
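The release can also be fetched directly into the shared folder from the command line; a sketch using the Apache archive mirror (the closer.lua link above resolves to a mirror selection page, so the archive URL is used here as an assumption):
cd /share/sparkucx
wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz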
Cloning the HiBench Suite 7.0 Repository
To clone the latest HiBench repository, run the following command:
cd /share/sparkucx
git clone https://github.com/intel-hadoop/HiBench.git
The preceding git clone command creates a subdirectory called "HiBench". After cloning, you may optionally check out a specific branch (such as a release branch) by invoking the following commands:
cd HiBench
git checkout master # where master is the desired branch (by default)
cd <Path to NFS share>
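The Spark workloads need to be built before they can be run; a sketch using Maven, assuming the sparkbench profile and the Spark 2.4 / Scala 2.11 properties described in the HiBench build documentation (adjust the properties for Spark 3.0 if needed):
cd /share/sparkucx/HiBench
mvn -Psparkbench -Dspark=2.4 -Dscala=2.11 clean package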
Installing MLNX_OFED for Ubuntu on the Master and Workers
This chapter describes how to install and test the MLNX_OFED for Linux package on a single host machine with an NVIDIA ConnectX®-5 adapter card installed.
For more information, see the NVIDIA OFED for Linux User Manual.
Downloading NVIDIA OFED
Verify that the system has an NVIDIA network adapter (HCA/NIC) installed:
lspci -v | grep Mellanox
The following example shows a system with an installed NVIDIA HCA:
- Download the ISO image matching your OS to your host.
The image’s name has the format MLNX_OFED_LINUX-<ver>-<OS label><CPUarch>.iso.
You can download it from: http://www.NVIDIA.com > Products > Software > InfiniBand/VPI Drivers > NVIDIA OFED Linux (MLNX_OFED) > Download.
Use the MD5SUM utility to confirm the downloaded file’s integrity. Run the following command and compare the result to the value provided on the download page:
md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
Installing NVIDIA OFED
MLNX_OFED is installed by running the mlnxofedinstall script. The installation script performs the following:
- Discovers the currently installed kernel.
- Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack.
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel).
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware.
The installation script removes all previously installed NVIDIA OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
1. Log into the installation machine as root.
2. Copy the downloaded ISO to /root
3. Mount the ISO image on your machine:
mkdir /mnt/iso
mount -o loop /root/MLNX_OFED_LINUX-4.7-3.2.9.0-ubuntu18.04-x86_64.iso /mnt/iso
cd /mnt/iso
4. Run the installation script:
./mlnxofedinstall --all
5. Reboot after the installation finishes successfully:
# /etc/init.d/openibd restart
# reboot
ConnectX®-5 ports can be individually configured to work as InfiniBand or Ethernet ports. By default, both ConnectX-5 VPI ports are initialized as InfiniBand ports.
6. Check that the port modes are Ethernet:
ibv_devinfo
7. If the ports are reported in InfiniBand mode, you need to change the interface port type to Ethernet.
Change the port type to Ethernet mode using the mlxconfig tool after the driver is loaded.
* LINK_TYPE_P1=2 is Ethernet mode
a. Start mst and see the device names:
mst start
mst status
b. Change the port mode to Ethernet:
mlxconfig -d /dev/mst/mt4121_pciconf0 s LINK_TYPE_P1=2   # Port 1 set to ETH mode
reboot
c. Query Ethernet devices and print information on them that is available for use from userspace:
ibv_devinfo
d. Run the ibdev2netdev utility to see all the associations between the Ethernet devices and the InfiniBand devices/ports:
ibdev2netdev
e. Configure the network interface:
ifconfig ens13f0 31.31.31.28 netmask 255.255.255.0
f. Insert the lines below into the /etc/network/interfaces file after the following lines:
vim /etc/network/interfaces
auto eno1
iface eno1 inet dhcp
The new lines:
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0
Example:
vim /etc/network/interfaces
auto eno1
iface eno1 inet dhcp
auto ens13f0
iface ens13f0 inet static
address 31.31.31.28
netmask 255.255.255.0
g. Check that the network configuration is set correctly:
ifconfig -a
Lossless Fabric with L3 (DSCP) Configuration
This section provides a configuration example for NVIDIA devices installed with MLNX_OFED running RoCE over a lossless network, in DSCP-based QoS mode.
Sample for our environment:
Notations
<interface> refers to the parent interface (for example ens13f0)
<mlx-device> refers to mlx device (for example mlx5_0)
To find these, run:
ibdev2netdev
<mst-device> refers to the MST device (for example /dev/mst/mt4121_pciconf0), found by running:
mst start
mst status
Configuration:
mlnx_qos -i ens13f0 --trust dscp
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
cma_roce_tos -d mlx5_0 -t 106
sysctl -w net.ipv4.tcp_ecn=1
mlnx_qos -i ens13f0 --pfc 0,0,0,1,0,0,0,0
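To verify that the trust mode, traffic class and PFC settings took effect, the same tools can be queried; a sketch using the interface and device names from this example:
mlnx_qos -i ens13f0
cat /sys/class/infiniband/mlx5_0/tc/1/traffic_class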
Validate MOFED
Check the "mofed" version and "uverbs":
ofed_info -s MLNX_OFED_LINUX-4.7-3.2.9.0 ls /dev/infiniband/uverbs1
Run bandwidth stress over InfiniBand in container.
Server | ib_write_bw -a -d mlx5_0 & |
---|---|
Client | ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits |
This way you can run a bandwidth stress test over RoCE between the nodes.
Configuring the Environment
Installing Unified Communication X (UCX) on the Master and Worker Servers (from sources)
cd /tmp
git clone -b v1.8.x git@github.com:openucx/ucx.git
cd ucx
mkdir build
./autogen.sh
cd build
../contrib/configure-devel --with-java --prefix=$PWD
make -j `nproc`
make install
Untar the Spark archive:
cd /share/sparkucx
tar -xzvf spark-2.4.4-bin-hadoop2.7.tgz
All the scripts, jars, and configuration files are available in the newly created directory “spark-2.4.4-bin-hadoop2.7”
Unpack the SparkUCX jar file to the /tmp directory and copy it to the Spark directory:
spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar.zip
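Assuming the zip archive above was downloaded to /tmp, a sketch of the unpack step that produces the jar referenced by the copy command below:
cd /tmp
unzip spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar.zip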
cp /tmp/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-2.4-jar-with-dependencies.jar
Spark Configuration
Update your bash file, run:
vim ~/.bashrc
This will open your bash file in a text editor; scroll to the bottom and add these lines:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
export SPARK_HOME=/share/sparkucx/spark-2.4.4-bin-hadoop2.7/
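Since UCX was built from source with --prefix pointing at its build directory, its libraries are not on the default search path; a sketch of an additional .bashrc line, assuming the /tmp/ucx/build prefix from the UCX installation step above:
export LD_LIBRARY_PATH=/tmp/ucx/build/lib:$LD_LIBRARY_PATH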
Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.
source ~/.bashrc
Check that the paths have been properly modified.
echo $JAVA_HOME
echo $SPARK_HOME
echo $LD_LIBRARY_PATH
- Edit spark-env.sh
Now edit the configuration file spark-env.sh (in $SPARK_HOME/conf/) and set the following parameters:
Create a copy of the template of spark-env.sh and rename it. Add a master hostname, interface and spark_tmp directory.
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $vuhuong)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
#export SPARK_MASTER_HOST=spark1-r
export SPARK_MASTER_HOST=namenoder
export SPARK_LOCAL_IP=`/sbin/ip addr show enp1f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
export SPARK_LOCAL_DIRS=/data/spark-tmp

Add slaves
Create a copy of the template of the slaves configuration file, rename it to slaves (in $SPARK_HOME/conf/) and add the following entries:
$ vim /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf/slaves
clx-mld-41-r
clx-mld-42-r
clx-mld-43-r
...
clx-mld-55-r
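spark-env.sh points SPARK_LOCAL_DIRS at /data/spark-tmp, which must exist on every node; a sketch that creates it on all hosts listed in conf/slaves, assuming the directory lives on the local NVMe disk:
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7
sbin/slaves.sh mkdir -p /data/spark-tmp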
Hadoop Configuration
Copy conf/slaves to hadoop-2.7.7/etc/hadoop/slaves:
cp /share/sparkucx/spark-2.4.4-bin-hadoop2.7/conf/slaves /share/sparkucx/hadoop-2.7.7/etc/hadoop/slaves
Create the local directory used by the distributed file system on all nodes:
sbin/slaves.sh mkdir /data/hadoop_tmp # on NVMe disk
Edit current_config/core-site.xml. Add both property tags inside the <configuration> tags.
cd /share/sparkucx/hadoop-2.7.7
vim current_config/core-site.xml

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenoder:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop_tmp</value>
  </property>
</configuration>
Update your bash file, run:
vim ~/.bashrc
This will open your bash file in a text editor; scroll to the bottom and add this line:
export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file.
source ~/.bashrc
Check that the paths have been properly modified.
echo $HADOOP_HOME
Edit current_config/hadoop-env.sh
Now edit the configuration file current_config/hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) and set the following parameters:
Add a HADOOP_HOME directory.
vim current_config/hadoop-env.sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}
export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
HDFS Configuration
Edit current_config/hdfs-site.xml
vim current_config/hdfs-site.xml

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.datanode.dns.interface</name>
    <value>enp1f0</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hadoop_tmp</value>
  </property>
</configuration>
YARN Configuration
We do not use YARN in our example but you can use it in your deployment.
Edit current_config/yarn-site.xml
vim current_config/yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>namenoder:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>namenoder:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>namenoder:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>namenoder:8034</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>namenoder:8101</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>40960</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>40960</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.enable</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/yarn_nm/</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
Edit current_config/yarn-env.sh. Specify the Hadoop home directory.
vim current_config/yarn-env.sh
...
export HADOOP_HOME=/share/sparkucx/hadoop-2.7.7/
RDMA_IP=`/usr/sbin/ip addr show ens13f0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1`
export YARN_NODEMANAGER_OPTS="-Dyarn.nodemanager.hostname=$RDMA_IP"
...
Copy current_config/* to etc/hadoop/.
cp current_config/* etc/hadoop/
Execute the following command on the NameNode host machine to format HDFS:
bin/hdfs namenode -format
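The jps output in the next section shows HDFS (and optionally YARN) daemons running alongside Spark; these are started with the standard Hadoop scripts. A sketch, run on the Master from the Hadoop directory:
cd /share/sparkucx/hadoop-2.7.7
sbin/start-dfs.sh
sbin/start-yarn.sh   # optional, only if you use YARN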
Start Spark Standalone Cluster
Run SparkUCX on Top of a Hadoop Cluster
To start the Spark services, run the following commands on the Master:
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7
sbin/start-all.sh
Check whether Services have been Started
To check daemons on Master, run:
jps
33970 Jps
47928 ResourceManager
48121 NodeManager
47529 DataNode
47246 NameNode
To check daemons on Slaves, run:
jps
1846 NodeManager
16491 Jps
1659 DataNode
To check HDFS and YARN (optional) statuses, run:
cd /share/sparkucx/hadoop-2.7.7
bin/hdfs dfsadmin -report | grep Name
Name: 31.31.31.43:50010 (clx-mld-43-r)
Name: 31.31.31.47:50010 (clx-mld-47-r)
Name: 31.31.31.48:50010 (clx-mld-48-r)
Name: 31.31.31.45:50010 (clx-mld-45-r)
Name: 31.31.31.42:50010 (clx-mld-42-r)
Name: 31.31.31.41:50010 (namenoder)
...
Name: 31.31.31.55:50010 (clx-mld-53-r)

bin/hdfs dfsadmin -report | grep Name -c
15

bin/yarn node -list
18/02/20 16:56:31 INFO client.RMProxy: Connecting to ResourceManager at namenoder/31.31.31.41:8101
Total Nodes:15
Node-Id              Node-State  Node-Http-Address  Number-of-Running-Containers
clx-mld-42-r:34873   RUNNING     clx-mld-42-r:8042  0
clx-mld-47-r:35045   RUNNING     clx-mld-47-r:8042  0
clx-mld-48-r:44996   RUNNING     clx-mld-48-r:8042  0
clx-mld-46-r:45432   RUNNING     clx-mld-46-r:8042  0
clx-mld-45-r:41307   RUNNING     clx-mld-45-r:8042  0
...
clx-mld-55-r:44311   RUNNING     clx-mld-55-r:8042  0
namenoder:41409      RUNNING     namenoder:8042     0
Stop the Cluster
To stop the Spark services, run the following commands on the Master:
cd /share/sparkucx/spark-2.4.4-bin-hadoop2.7
sbin/stop-all.sh
Performance Tuning for NVIDIA Adapters
It is recommended to run the mlnx_tune utility, which performs several system checks and flags any settings that may cause performance degradation.
Act on its findings as appropriate.
For more information about the mlnx_tune you can read this post: HowTo Tune Your Linux Server for Best Performance Using the mlnx_tune Tool
The command also displays the CPU cores used by the network driver. This information will be used for Spark performance tuning.
sudo mlnx_tune
2017-08-16 14:47:17,023 INFO Collecting node information
2017-08-16 14:47:17,023 INFO Collecting OS information
2017-08-16 14:47:17,026 INFO Collecting CPU information
2017-08-16 14:47:17,104 INFO Collecting IRQ Balancer information
. . .
Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
. . .
2018-02-16 14:47:18,777 INFO System info file: /tmp/mlnx_tune_180416_144716.log
After running the mlnx_tune command, it is highly recommended to set the cpuList parameter.
Change the spark.conf file to use the NUMA cores associated with the NVIDIA device.
Local CPUs list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

spark.shuffle.rdma.cpuList 0-15
More in-depth performance resources can be found in the NVIDIA Community post: Performance Tuning for NVIDIA Adapters
SparkUCX Performance Tips
Compression:
Spark enables compression as the default runtime option.
Using compression will result in smaller packet sizes to be sent between the nodes, but at the expense of having higher CPU utilization in order to compress the data.
Due to the high performance and low CPU overhead network properties of an RDMA network, it is recommended to disable compression when using SparkUCX.
In your spark.conf file set:
spark.shuffle.compress false
spark.shuffle.spill.compress false
By disabling compression, you will be able to reclaim precious CPU cycles that were previously used for data compression/decompression, and will also see additional performance benefits in the RDMA data transfer speeds.
Disk Media:
In order to see the highest and most consistent performance results possible, it is recommended to use the highest performance disk media available.
Using a ramdrive or NVMe device for the spark-tmp and hadoop tmp files should be explored whenever possible.
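A sketch of backing the Spark temporary directory with a ramdrive (tmpfs), assuming enough free RAM on each node; the size below is only illustrative:
sudo mkdir -p /data/spark-tmp
sudo mount -t tmpfs -o size=64g tmpfs /data/spark-tmp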
Congratulations, you now have an RDMA-accelerated Spark cluster ready to work.
Now we will run the HiBench Suite benchmarks for TCP vs. RoCE comparison.
Appendix A: Running HiBench with SparkUCX
HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilizations.
It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc.
Environment
- Instance type and Environment: See setup overview
- OS: Ubuntu 18.04.3 LTS
- Apache Hadoop: 2.7.7, HDFS (1 namenode, 14 datanodes)
- Spark: 2.4.4 standalone 15 nodes
- Benchmark : Setup HiBench
- Test Date: Feb 2020
Benchmark runs
Steps to reproduce the TeraSort result:
Configure the Hadoop and Spark settings in the HiBench conf directory.
In HiBench/conf/hibench.conf set:
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism 3000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism 15000
Set in HiBench/conf/workloads/micro/terasort.conf:
hibench.terasort.bigdata.datasize 1890000000
Run HiBench/bin/workloads/micro/terasort/prepare/prepare.sh and HiBench/bin/workloads/micro/terasort/spark/run.sh
Add to HiBench/conf/spark.conf:
spark.driver.extraClassPath /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar:/share/sparkucx/spark-2.4.4-bin-hadoop2.7/build/lib/
spark.executor.extraClassPath /share/sparkucx/spark-2.4.4-bin-hadoop2.7/spark-ucx-1.0-for-spark-SPARK_VERSION-jar-with-dependencies.jar:/share/sparkucx/spark-2.4.4-bin-hadoop2.7/build/lib/
spark.shuffle.manager org.apache.spark.shuffle.UcxShuffleManager
spark.shuffle.compress false
spark.shuffle.spill.compress false
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.compat.spark_3_0.UcxLocalDiskShuffleDataIO
Run HiBench/bin/workloads/micro/terasort/spark/run.sh
Steps to reproduce the PageRank/Sort result:
In HiBench/conf/hibench.conf set:
hibench.scale.profile bigdata
# Mapper number in hadoop, partition number in Spark
hibench.default.map.parallelism 3000
# Reducer number in hadoop, shuffle partition number in Spark
hibench.default.shuffle.parallelism 15000
Run HiBench/bin/workloads/micro/sort/prepare/prepare.sh and HiBench/bin/workloads/micro/sort/spark/run.sh
Benchmark Results
Done!
About the Authors
About Boris Kovalev
For the past several years, Boris Kovalev has worked as a solution architect at NVIDIA, responsible for complex machine learning and advanced VMware-based cloud research and design. Previously, he spent more than 15 years as a senior consultant and solution architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions, which are available at the NVIDIA Documents website.
About Peter Rudenko
Peter Rudenko is a software engineer on the High Performance Computing team, focusing on accelerating data-intensive applications, developing the UCX communication library and various big data solutions.
Related Documents