The following Reference Deployment Guide (RDG) demonstrates the process of running an Apache Spark 3.0 workload with the RAPIDS Accelerator for Apache Spark over 25Gb/s RoCE Ethernet.
The deployment is provisioned on top of a network-accelerated, GPU-enabled Kubernetes cluster over an NVIDIA Mellanox end-to-end 25Gb/s Ethernet solution.
In this document we will go through the following processes:
- How to deploy a K8s cluster with Kubespray over bare metal nodes running Ubuntu 18.04.
- How to prepare the network for RoCE traffic using NVIDIA recommended settings on both the host and switch sides.
- How to deploy and run a RAPIDS-accelerated Apache Spark 3.0 cluster over NVIDIA accelerated infrastructure.
Abbreviation and Acronym List
|AOC||Active Optical Cable|
|CNI||Container Network Interface|
|DAC||Direct Attach Copper cable|
|DHCP||Dynamic Host Configuration Protocol|
|DNS||Domain Name System|
|GPU||Graphics Processing Unit|
|MPI||Message Passing Interface|
|NFD||Node Feature Discovery|
|NGC||NVIDIA GPU Cloud|
|RDG||Reference Deployment Guide|
|RDMA||Remote Direct Memory Access|
|RoCE||RDMA over Converged Ethernet|
|SR-IOV||Single Root Input Output Virtualization|
|UCX||Unified Communication X|
|VLAN||Virtual Local Area Network|
- NVIDIA T4 GPU
- NVIDIA Mellanox OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
- CUDA Toolkit
- What is Kubernetes?
- NVIDIA GPU Operator
- SR-IOV Network device plugin for Kubernetes
- SR-IOV CNI plugin
- Apache Spark 3.0
- Running Spark on Kubernetes
- RAPIDS - Open GPU Data Science
- TPC-H benchmark
Key Components and Technologies
- NVIDIA® T4 GPU
The NVIDIA® T4 GPU is based on the NVIDIA Turing™ architecture and packaged in an energy-efficient 70-watt small PCIe form factor. T4 is optimized for mainstream computing environments, and features multi-precision Turing Tensor Cores and RT Cores. Combined with accelerated containerized software stacks from NGC, T4 delivers revolutionary performance at scale to accelerate cloud workloads, such as high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics.
- NVIDIA Cumulus Linux
Cumulus Linux is the only open network OS that enables building affordable network fabrics and operating them similarly to the world’s largest data center operators, unlocking web-scale networking for businesses of all sizes.
- NVIDIA Mellanox ConnectX-5 Ethernet Network Interface Cards
NVIDIA Mellanox ConnectX-5 NICs enable the highest performance and efficiency for hyper-scale data centers, public and private clouds, storage, machine learning, deep learning, artificial intelligence, big data and telco platforms and applications.
- NVIDIA Mellanox Spectrum® Open Ethernet Switches
The NVIDIA Mellanox Spectrum® switch family provides the most efficient network solution for the ever-increasing performance demands of Data Center applications. The Spectrum product family includes a broad portfolio of Top-of-Rack (TOR) and aggregation switches that range from 16 to 128 physical ports, with Ethernet data rates of 1GbE, 10GbE, 25GbE, 40GbE, 50GbE, 100GbE and 200GbE per port. Spectrum Ethernet switches are ideal to build cost-effective and scalable data center network fabrics that can scale from a few nodes to tens-of-thousands of nodes.
- NVIDIA Mellanox LinkX® Ethernet Cables and Transceivers
NVIDIA Mellanox LinkX cables and transceivers make 100Gb/s deployments as easy and as universal as 10Gb/s links. Because Mellanox offers one of the industry's broadest portfolios of 10, 25, 40, 50, 100 and 200Gb/s Direct Attach Copper cables (DACs), copper splitter cables, Active Optical Cables (AOCs) and transceivers, every data center reach from 0.5m to 10km is supported. To maximize system performance, Mellanox tests every product in an end-to-end environment, ensuring a Bit Error Rate of less than 1e-15. A BER of 1e-15 is 1000x better than many competitors.
- Kubernetes
Kubernetes (K8s) is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
- Kubespray (from Kubernetes.io)
Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:
- A highly available cluster
- Composable attributes
- Support for most popular Linux distributions
- NVIDIA GPU Operator
The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring and others.
- Multus CNI
Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface, Multus delegates CNI calls to secondary CNI plugins such as Calico, SR-IOV, etc.
- SR-IOV Network Device Plugin
A Kubernetes device plugin for discovering and advertising SR-IOV virtual functions (VFs) available on a Kubernetes host. The plugin was enhanced by the NVIDIA Mellanox R&D team to enable RDMA applications in Kubernetes.
- RDMA
RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory, or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.
- SR-IOV CNI Plugin
The SR-IOV CNI plugin works with SR-IOV network device plugin for VF allocation in Kubernetes. It enables the configuration and usage of SR-IOV VF networks in containers and orchestration systems like Kubernetes.
- Apache Spark™
Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
- RAPIDS
The RAPIDS suite of open source software libraries and APIs provides the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA® based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar dataframe API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.
- RAPIDS Accelerator for Apache Spark
The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework.
The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.
Documentation on the current release can be found here.
- Unified Communication X
UCX utilizes high-speed networks for inter-node communication, and shared memory mechanisms for efficient intra-node communication.
- GPUDirect RDMA
GPUDirect (GDR) RDMA provides a direct P2P (Peer-to-Peer) data path between the GPU Memory directly to and from NVIDIA Mellanox HCA devices, which reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.
All K8s Worker Node servers have the same hardware specification (see the BoM for details).
- Host OS
Ubuntu Server 18.04 operating system installed on all servers with OpenSSH server packages.
- Switch OS
NVIDIA Cumulus Linux 4.1.1
- Management Network
DHCP and DNS services are part of the IT infrastructure. The installation and configuration of these components are not covered in this guide.
- Experience with Apache Spark
Make sure you are familiar with the Apache Spark multi-node cluster architecture. See Overview - Spark 3.0 Documentation for more information.
Solution Logical Design
There are multiple ways to deploy Spark and the Spark RAPIDS plugin: Standalone mode, Local mode (for development and testing only, not production), on a YARN cluster (see this link for more details), or on a Kubernetes cluster, which is the method used in this deployment guide.
The logical design includes the following layers:
- Two separate networking layers:
- Management network
- High-speed RoCE Network
- One Compute layer:
- K8s Master Node and Deployment Node
- 3 x K8s Worker Nodes with two NVIDIA T4 GPUs and one Mellanox ConnectX-5 adapter.
Bill of Materials (BoM)
This document covers a single Kubernetes controller deployment scenario. For a highly available cluster deployment, refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
The following hardware setup is utilized in the distributed Spark/K8s configuration described in this guide:
The above table does not contain Kubernetes Management network connectivity components.
The first port of each NVIDIA Mellanox network card on each Worker Node is wired to NVIDIA Mellanox switch using 25Gb/s DAC cables:
This table does not contain Kubernetes Management network connectivity components.
Below are the server names with their relevant network configurations:
|Server Type|Server Name|25 GigE Network (VLAN 111)|Management Network|
|Master Node|node1||eno0: DHCP|
|Depl./Driver Node|depl-node||eno0: DHCP|
|High-speed switch|swx-mld-s01|none|mgmt0: From DHCP|
ens1f0 interfaces do not require any additional configuration.
Network Switch Configuration for RoCE Transport
NVIDIA Cumulus Linux Network OS
RoCE transport is utilized to accelerate Spark networking through the UCX library. To achieve the best possible results, we will configure our network to be lossless.
Run the following commands to configure a lossless network on NVIDIA Cumulus Linux version 4.1.1 and above:
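A minimal sketch, assuming the Worker Nodes are wired to switch ports swp1-swp3; the storage-optimized NCLU helper (available in Cumulus Linux 4.1.0 and later) applies the recommended lossless RoCE buffer, PFC and ECN settings:

```
net add interface swp1-3 storage-optimized pfc
net pending
net commit
```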
Add VLAN 111 to ports 1-3 on NVIDIA Cumulus Linux Network OS by running the following commands:
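For example, assuming ports swp1-swp3 carry tagged VLAN 111 traffic on the default bridge, the NCLU commands would look like this:

```
net add bridge bridge ports swp1-3
net add bridge bridge vids 111
net add interface swp1-3 bridge vids 111
net commit
```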
- Ubuntu 18.04 system.
- Access to a terminal or command line.
- Sudo user or root permissions.
Host OS Prerequisites
Make sure the Ubuntu Server 18.04 operating system is installed on all servers with the OpenSSH server packages, and create a non-root user account with passwordless sudo privileges.
Update the Ubuntu software packages and install the latest HWE kernel by running the below commands:
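For example (a standard Ubuntu 18.04 update plus the HWE kernel package):

```
sudo apt-get update
sudo apt-get -y dist-upgrade
sudo apt-get install -y --install-recommends linux-generic-hwe-18.04
sudo reboot
```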
Non-root User Account Prerequisites
In this solution we added the following line to the end of the /etc/sudoers file:
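Assuming the deployment user is named user, the line looks like this:

```
user ALL=(ALL) NOPASSWD:ALL
```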
Install the general dependencies on the Deployment Server by running the commands below, or paste each line into the terminal:
Install the Java 8 software packages:
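For example, using the OpenJDK 8 packages from the Ubuntu repositories:

```
sudo apt-get install -y openjdk-8-jdk
```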
Install the general dependencies on the Worker Servers by running the commands below or paste each line into the terminal:
To install LLDP service on the Worker Servers, run the commands below or paste each line into the terminal:
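A typical installation uses the lldpd package, for example:

```
sudo apt-get install -y lldpd
sudo systemctl enable --now lldpd
```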
- To create a Network File System (NFS) share, install an NFS server on a new or existing server in the environment. For this solution, Node 2 (the first Worker Node) is used as the NFS server. Follow the procedure detailed in this guide to install the NFS server.
SR-IOV Configuration (on Worker Nodes only)
Verify that you are using SR-IOV supported server platform and review the BIOS settings in the hardware documentation to enable support for SR-IOV networking.
- Enable Virtualization (SR-IOV) in the BIOS.
Enable SR-IOV in the NIC firmware by executing the following commands:
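A minimal sketch using the mlxconfig tool from MLNX_OFED; the MST device path below is an example and may differ on your system, and the VF count matches the num_vf role variable used later in this guide:

```
sudo mst start
sudo mlxconfig -d /dev/mst/mt4121_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8
# A server reboot (or firmware reset) is required for the new settings to take effect
sudo reboot
```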
All Worker Nodes must have the same configuration and the same PCIe card placement.
K8s Cluster Deployment and Configuration
The Kubernetes cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node (K8s Master Node).
Configuring SSH Private Key and SSH Passwordless Login
Log in to the Deployment Node as the deployment user (in this case - user) and create an SSH key for configuring passwordless authentication by running the following command:
Copy your SSH public key (the counterpart of your private key, such as
~/.ssh/id_rsa) to all nodes in your deployment by running the following command:
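For example (the hostnames and user name below are illustrative; adjust them to your inventory):

```
ssh-keygen -t rsa
for host in node1 node2 node3 node4; do ssh-copy-id user@$host; done
```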
Configuring Kubespray
Install dependencies to run Kubespray with Ansible from the Deployment Node:
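A minimal sketch, assuming Kubespray v2.13.3 is used as in this guide:

```
sudo apt-get install -y python3-pip
cd ~
wget https://github.com/kubernetes-sigs/kubespray/archive/v2.13.3.tar.gz
tar -zxf v2.13.3.tar.gz
cd kubespray-2.13.3
sudo pip3 install -r requirements.txt
```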
The default folder for subsequent commands is ~/kubespray-2.13.3.
Create a new cluster configuration:
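For example, using Kubespray's inventory builder (the IP addresses below are placeholders for your nodes' management IPs):

```
cp -rfp inventory/sample inventory/mycluster
declare -a IPS=(192.168.1.11 192.168.1.12 192.168.1.13 192.168.1.14)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
```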
Review and change the host configuration file - inventory/mycluster/hosts.yaml. Below is an example output from this solution:
Customizing variables for K8s cluster installation
Review and change cluster installation parameters under inventory/mycluster/group_vars:
In inventory/mycluster/group_vars/all.yml, uncomment the following line so that metrics can collect data about cluster resource usage.
In "inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml" enable Multus installation by setting the variable kube_network_plugin_multus to true, and specify the Docker version by adding the variable"docker_version: 19.03" to avoid any related issues.
The default Kubernetes CNI can be changed by setting the desired kube_network_plugin value (default: calico) parameter in inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml.
It will install Multus and Calico and configure Multus to use Calico as the primary network plugin.
Deploy K8s cluster with Kubespray Ansible Playbook
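Run the Kubespray cluster playbook from the kubespray-2.13.3 folder, for example:

```
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```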
An example of a successful completion of the playbooks looks like this:
K8s Deployment Verification
The Kubernetes cluster deployment verification must be done from the K8s Master Node.
Copy the Kubernetes cluster configuration files from the root folder to the user's home folder, or run the commands as the root user on the K8s Master Node.
Execute the following commands to copy the Kubernetes cluster configuration files to a non-root user on the Master Node:
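For example:

```
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```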
Verify that the Kubernetes cluster is installed properly. Execute the following commands:
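For example:

```
kubectl get nodes -o wide
kubectl get pods --all-namespaces
```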
NVIDIA GPU Operator Installation for K8s cluster
The preferred method to deploy the device plugin is as a daemonset using helm. Install Helm from the official installer script:
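For example, using the Helm 3 installer script:

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```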
Add the NVIDIA repository:
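A sketch, assuming the GPU Operator Helm repository URL that was in use at the time of writing:

```
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```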
The NVIDIA GPU Operator uses hostNetwork by default. This default must be modified, as it is not suitable for this solution.
Deploy the device plugin:
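A minimal sketch; add any values overrides your environment needs (for example, to adjust the hostNetwork default mentioned above):

```
helm install --wait --generate-name nvidia/gpu-operator
```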
To run a sample GPU application: https://github.com/NVIDIA/gpu-operator#running-a-sample-gpu-application
For GPU monitoring: https://github.com/NVIDIA/gpu-operator#gpu-monitoring
SR-IOV Components Installation for K8s Cluster
The RoCE enhancement for the K8s cluster enables containerized SR-IOV virtual functions in Pod deployments for additional RoCE-enabled NICs.
During the installation process, a role will use the Kubespray inventory/mycluster/hosts.yaml file for Kubernetes components deployment and provisioning.
The RoCE components are configured by a separate role in the installation. The RoCE role will install the following components:
- Virtual Function activation
- Node Feature Discovery v0.6.0
- The latest Multus CNI for attaching multiple network interfaces to the Pod
- Specific configuration of the universal SR-IOV device plugin
- Universal SR-IOV CNI
- Specific network provisioning with "NetworkAttachmentDefinition"
- DHCP CNI for providing IP addresses for SR-IOV based NICs in the Pod deployment from the existing infrastructure
RoCE role deployment is validated using Ubuntu 18.04 OS and Kubespray v2.13.3
Copy the RoCE role installation package (provided in the Appendix) to the Kubespray folder:
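For example (the archive name comes from the Appendix; the destination is the Kubespray folder used earlier):

```
tar -xf roce.tar -C ~/kubespray-2.13.3/
```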
Customize Role Variables
Set the variables for the RoCE role in the yml file - roles/roce_sriov/vars/main.yml.
Set the following parameters:
- pf_name: "ens3f0"
- num_vf to 8
- hugePages to false
- install_mofed to true
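With these values, roles/roce_sriov/vars/main.yml would look roughly like this (a sketch; the exact structure of the file may differ):

```
pf_name: "ens3f0"
num_vf: 8
hugePages: false
install_mofed: true
```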
Run the playbook from the Kubespray deployment folder using the following command:
The following is an example of a successful playbook execution:
Role Installation Summary:
After the installation with the default role variables, your K8s cluster will have the following:
- Node Feature Discovery for Kubernetes installed
- The required number of VFs activated and configured for each specified Mellanox NIC
- A ConfigMap for the "SRIOV NETWORK DEVICE PLUGIN" configured and ready for creating resources
- DaemonSets with the "SRIOV NETWORK DEVICE PLUGIN" and "SRIOV CNI" installed
- Multus meta CNI updated to the latest version
- A DaemonSet with "whereabouts-cni" installed to provide IP address management for SR-IOV based NICs
- One network-attachment-definition, SRIOV111
Role Deployment Verification
The role deployment verification must be done from the K8s Master node. Execute the following commands to initiate the verification process:
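A few illustrative checks (resource and label names are examples and depend on the role defaults):

```
kubectl get nodes --show-labels | grep feature.node.kubernetes.io
kubectl get network-attachment-definitions.k8s.cni.cncf.io --all-namespaces
kubectl -n kube-system get pods | grep -E 'sriov|multus|whereabouts|node-feature'
```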
Lossless Fabric with L3 (DSCP) Configuration
Before starting the below process, make sure you are familiar with this configuration example for NVIDIA Mellanox devices installed with MLNX_OFED running RoCE over a lossless network in DSCP-based QoS mode.
For this configuration, make sure to know your network interface name (for example ens3f0) and its parent NVIDIA Mellanox device (for example mlx5_0). To get this information, run the ibdev2netdev command:
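Illustrative output (the device-to-interface mapping will differ per host):

```
$ ibdev2netdev
mlx5_0 port 1 ==> ens3f0 (Up)
mlx5_1 port 1 ==> ens3f1 (Down)
```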
Installing Mellanox GPUDirect RDMA
The following software is required to install and run GPUDirect RDMA:
- NVIDIA compatible driver. (Contact NVIDIA support for more information)
- MLNX_OFED (latest).
Copy the NVIDIA driver file /run/nvidia/driver/usr/src/nvidia-440.64.00/kernel/nvidia.ko to the /lib/modules/5.4.0-42-generic/updates/dkms/ directory.
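For example (a depmod run is added here to refresh the module dependency map):

```
sudo cp /run/nvidia/driver/usr/src/nvidia-440.64.00/kernel/nvidia.ko /lib/modules/5.4.0-42-generic/updates/dkms/
sudo depmod -a
```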
Download the latest nv_peer_memory:
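For example, from the Mellanox GitHub repository:

```
git clone https://github.com/Mellanox/nv_peer_memory.git
cd nv_peer_memory
```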
Build source packages (src.rpm for RPM based OS and tarball for DEB based OS) and run the build_module.sh script:
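A sketch of the DEB-based flow (package version numbers will vary):

```
./build_module.sh
cd /tmp
tar xzf /tmp/nvidia-peer-memory_*.orig.tar.gz
cd nvidia-peer-memory-*
dpkg-buildpackage -us -uc
```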
Install the packages:
(e.g. dpkg -i nv-peer-memory_1.0-9_all.deb nv-peer-memory-dkms_1.0-9_all.deb )
After a successful installation:
- nv_peer_mem.ko kernel module will be installed.
- A service file, /etc/init.d/nv_peer_mem, will be added to control the kernel module via start/stop/status.
- A configuration file, /etc/infiniband/nv_peer_mem.conf, will be added to control whether the kernel module is loaded on boot (default value is YES).
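To confirm that the module is loaded and the service is running, you can use, for example:

```
sudo service nv_peer_mem status
lsmod | grep nv_peer_mem
```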
Installing OFED Performance Tests
Perftest is a collection of tests written over uverbs intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing. The collection contains a set of bandwidth and latency benchmarks such as:
- Send - ib_send_bw and ib_send_lat
- RDMA Read - ib_read_bw and ib_read_lat
- RDMA Write - ib_write_bw and ib_write_lat
- RDMA Atomic - ib_atomic_bw and ib_atomic_lat
- Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat.
To install perftest, run the following commands:
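For example, from the Ubuntu repositories (the package is also bundled with MLNX_OFED):

```
sudo apt-get install -y perftest
```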
Run an RDMA Write (ib_write_bw) bandwidth stress benchmark over RoCE:
Server side: ib_write_bw -a -d mlx5_0 &
Client side: ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits
Distributed Spark Deployment
Installing Apache Spark Driver Node
Download Apache Spark (on the Deployment Node) from Downloads | Apache Spark by selecting the Spark release and the spark-3.0.0-bin-hadoop3.2.tgz package type, and copy the file to the /opt folder.
Optional: You can download Spark by running the following commands:
cd /opt ; wget http://mirror.cogentco.com/pub/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
To set up Apache Spark, extract the compressed file using the tar command:
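For example:

```
cd /opt
sudo tar -xzf spark-3.0.0-bin-hadoop3.2.tgz
```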
Create a symbolic link to the spark-3.0.0-bin-hadoop3.2 folder and set your user's permissions on the spark-3.0.0-bin-hadoop3.2 and spark folders:
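For example (assuming your user is named user):

```
sudo ln -s /opt/spark-3.0.0-bin-hadoop3.2 /opt/spark
sudo chown -R user:user /opt/spark-3.0.0-bin-hadoop3.2 /opt/spark
```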
To Configure Spark environment variables (bashrc), edit the .bashrc shell configuration file:
Define the Spark environment variables by adding the following content to the end of the file:
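A typical set of entries (a sketch; adjust the paths if your layout differs):

```
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```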
Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file:
Check that the path has been properly modified:
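For example:

```
source ~/.bashrc
echo $SPARK_HOME
echo $PATH
```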
Run a sample Spark job in local mode to ensure Spark is working:
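For example, the bundled SparkPi example in local mode:

```
$SPARK_HOME/bin/spark-submit --master local[*] \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 100
```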
Install the Spark Rapids plugin jars, create /opt/sparkRapidsPlugin on the Deployment Node and set the permissions for your user to the directory:
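For example (assuming your user is named user):

```
sudo mkdir -p /opt/sparkRapidsPlugin
sudo chown -R user:user /opt/sparkRapidsPlugin
```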
Download the two jars from https://github.com/NVIDIA/spark-rapids/blob/branch-0.2/docs/version/stable-release.md.
Place them into the /opt/sparkRapidsPlugin directory. Make sure to pull the cudf jar built for the CUDA version you are running (10.2).
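It is convenient to export the jar locations as environment variables for later spark-submit commands; the jar file names below are examples for the 0.2.0 release with CUDA 10.2 and should match the files you actually downloaded:

```
export SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
export SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.15-cuda10-2.jar
export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.2.0.jar
```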
Prepare K8s Cluster to run Apache Spark Executor
Run following steps on the Deployment Node:
To run Apache Spark over a K8s cluster and use NFS as storage for executing the benchmarks, we need to enable the IPC_LOCK capability for memory registration and add the nfs-volume configuration.
Create /opt/executor-template directory:
Define an executor template by creating the /opt/executor-template/executor-template.yaml file, which contains the following configurations. It is important to change the NFS server IP to an IP relevant to your environment.
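A minimal sketch of such a template; the container name, NFS server IP, export path and mount path below are placeholders for your environment (Spark picks the base container from the template):

```
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-executor
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]      # required for RDMA memory registration
      volumeMounts:
        - name: nfs-volume
          mountPath: /data
  volumes:
    - name: nfs-volume
      nfs:
        server: 192.168.111.2    # replace with your NFS server IP
        path: /data/nfs          # replace with your NFS export path
```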
Install and run Docker service on the Deployment Node.
You can skip steps 2-6 and use my Docker image from Docker Hub.
Create /opt/spark/docker directory on the Deployment Node:
On the Deployment Node, create the /opt/spark/docker/getSRIOVResources.sh file to get the resource information about NVIDIA Mellanox SR-IOV device:
Create a Dockerfile with the following:
Build a docker image from the Dockerfile and push to the Docker Hub registry.
Pull the Docker image from the Docker Hub registry on all Worker Nodes.
You can use my precompiled docker image: bkovalev/spark3rapids
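For example, to pull the precompiled image on each Worker Node:

```
docker pull bkovalev/spark3rapids
```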
Prepare and Run TPC-H benchmarks
Prepare TPC-H dataset
To prepare the TPC-H application, download the generated dataset and run the following commands on the Worker Node or on a server with access to the NFS share:
Examples to run TPC-H application
To run the TPC-H application with RAPIDS, you need the following jar file:
Put the jar file into the /opt/sparkRapidsPlugin directory on the Deployment Node together with Spark Rapids Plugin jar files.
Below are examples of how to run the TPC-H application:
For TCP without GPU/RAPIDS/UCX, we use the tpch-tcp-only.sh file with the following configs when running Spark on K8s.
For TCP with GPU/RAPIDS and without UCX, we use the tpch-rapids-tcp-noucx.sh file with the following configs when running Spark on K8s.
For TCP with GPU/RAPIDS/UCX, we use the tpch-rapids-tcp-ucx.sh file with the following configs when running Spark on K8s.
For RDMA/RC with GPU/RAPIDS/UCX without GPUDirect, we use the tpch-rapids-rc-ucx-nogdr.sh file with the following configs when running Spark on K8s.
For RDMA/RC with GPU/RAPIDS/UCX/GPUDirect, we use the tpch-rapids-rc-ucx-gdr.sh file with the following configs when running Spark on K8s.
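The full scripts are not reproduced here; the heavily abridged sketch below only illustrates the kind of options such a script passes to spark-submit for the GPU/RAPIDS cases. The API server address, instance counts and values are placeholders, and the exact per-mode shuffle/UCX settings should be taken from the RAPIDS Accelerator documentation:

```
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<K8S_API_SERVER>:6443 \
  --deploy-mode client \
  --conf spark.executor.instances=3 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
  --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
  --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} \
  <application jar and arguments>
```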
Archive of the RoCE role for deployment with Kubespray - roce.tar.
- Fewer large input files are better than many small ones. You may not have control over this but it is worth knowing.
- Larger input sizes spark.sql.files.maxPartitionBytes=512m are generally better as long as things fit into the GPU.
- The GPU does better with larger data chunks as long as they fit into memory. When using the default spark.sql.shuffle.partitions=200 it may be beneficial to make this smaller. Base this on the amount of data the task is reading. Start with 512MB / task.
Out of GPU Memory.
GPU out-of-memory conditions can show up in multiple ways. You may see an explicit out-of-memory error, or it may simply manifest as a crash. Generally this means your partition size is too big; go back to the Configuration section for the partition size and/or the number of partitions. Possibly reduce the number of concurrent GPU tasks to 1. The Spark UI may give you a hint at the size of the data. Look at either the input data or the shuffle data size for the stage that failed.
RAPIDS Accelerator for Apache Spark Tuning Guide
Tuning a Spark job configuration settings from the defaults can improve the job performance, and this remains true for jobs leveraging the RAPIDS Accelerator plugin for Apache Spark.
This document provides guidelines on how to tune a Spark job’s configuration settings for improved performance when using the RAPIDS Accelerator plugin.
Since the plugin runs without any API changes, the easiest way to see what is running on the GPU is to look at the "SQL" tab in the Spark Web UI. The SQL tab only shows up after you have actually executed a query. Go to the SQL tab in the UI, click on the query you are interested in, and it shows a DAG picture with details. You can also scroll down and expand the "Details" section to see the text representation.
If you want to look at the Spark plan via the code you can use the explain() function call. For example: query.explain() will print the physical plan from Spark and you can see what nodes were replaced with GPU calls.
For now, the best way to debug is how you would normally do it on Spark. Look at the UI and log files to see what failed. If you got a seg fault from the GPU find the hs_err_pid.log file. To make sure your hs_err_pid.log file goes into the YARN application log, you may add in the config: --conf spark.executor.extraJavaOptions="-XX:ErrorFile=<LOG_DIR>/hs_err_pid_%p.log"
If you want to see why an operation didn’t run on the GPU, you can turn on the configuration: --conf spark.rapids.sql.explain=NOT_ON_GPU. A log message will then be output to the Driver log as to why a Spark operation isn’t able to run on the GPU.
About the Authors
Boris Kovalev has worked for the past several years as a Solutions Architect, focusing on NVIDIA Networking/Mellanox technology, and is responsible for complex machine learning, Big Data and advanced VMware-based cloud research and design. Boris previously spent more than 20 years as a senior consultant and solutions architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions which are available at the Mellanox Documents website.
Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.
Peter Rudenko is a software engineer on the NVIDIA High Performance Computing (HPC) team who focuses on accelerating data intensive applications, developing the UCX communication X library, and other big data solutions.