



Created on Aug 16, 2020


Introduction

The following Reference Deployment Guide (RDG) demonstrates the process of running Apache Spark 3.0 workload with RAPIDS Accelerator for Apache Spark and 25Gb/s Ethernet RoCE. 

The deployment will be provisioned on top of Network accelerated and GPU enabled Kubernetes cluster over NVIDIA Mellanox end-to-end 25 Gb/s Ethernet solution.

In this document we will go through the following processes:

  1. How to deploy K8s cluster with Kubespray over bare metal nodes running Ubuntu 18.04.
  2. How to prepare the network for RoCE traffic using NVIDIA recommended settings on both host and switch sides.
  3. How to deploy and run RAPIDS accelerated Apache Spark 3.0 cluster over NVIDIA accelerated infrastructure.

Abbreviation and Acronym List

AOC - Active Optical Cable
CNI - Container Network Interface
DAC - Direct Attach Copper cable
DHCP - Dynamic Host Configuration Protocol
DNS - Domain Name System
DP - Device Plugin
GDR - GPUDirect
GPU - Graphics Processing Unit
HWE - Hardware Enablement
K8s - Kubernetes
MPI - Message Passing Interface
NFD - Node Feature Discovery
NGC - NVIDIA GPU Cloud
PF - Physical Function
RDG - Reference Deployment Guide
RDMA - Remote Direct Memory Access
RoCE - RDMA over Converged Ethernet
SR-IOV - Single Root Input Output Virtualization
UCX - Unified Communication X
VF - Virtual Function
VLAN - Virtual Local Area Network


References

Key Components and Technologies

  • NVIDIA® T4 GPU
    The NVIDIA® T4 GPU is based on the NVIDIA Turing architecture and packaged in an energy-efficient 70-watt small PCIe form factor. T4 is optimized for mainstream computing environments, and features multi-precision Turing Tensor Cores and RT Cores. Combined with accelerated containerized software stacks from NGC, T4 delivers revolutionary performance at scale to accelerate cloud workloads, such as high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics.
  • NVIDIA Cumulus Linux
    Cumulus Linux is the only open network OS that enables building affordable network fabrics and operating them similarly to the world’s largest data center operators, unlocking web-scale networking for businesses of all sizes.
  • NVIDIA Mellanox ConnectX-5 Ethernet Network Interface Cards 
    NVIDIA Mellanox ConnectX-5 NICs enable the highest performance and efficiency for hyper-scale data centers, public and private clouds, storage, Machine Learning, Deep Learning, Artificial Intelligence, Big Data and Telco platforms and applications.
  • NVIDIA Mellanox Spectrum® Open Ethernet Switches
    The NVIDIA Mellanox Spectrum® switch family provides the most efficient network solution for the ever-increasing performance demands of Data Center applications. The Spectrum product family includes a broad portfolio of Top-of-Rack (TOR) and aggregation switches that range from 16 to 128 physical ports, with Ethernet data rates of 1GbE, 10GbE, 25GbE, 40GbE, 50GbE, 100GbE and 200GbE per port. Spectrum Ethernet switches are ideal to build cost-effective and scalable data center network fabrics that can scale from a few nodes to tens-of-thousands of nodes.
  • NVIDIA Mellanox LinkX® Ethernet Cables and Transceivers
    NVIDIA Mellanox LinkX cables and transceivers make 100Gb/s deployments as easy and as universal as 10Gb/s links. Because Mellanox offers one of the industry's broadest portfolios of 10, 25, 40, 50, 100 and 200Gb/s Direct Attach Copper cables (DACs), Copper Splitter cables, Active Optical Cables (AOCs) and Transceivers, every data center reach from 0.5m to 10km is supported. To maximize system performance, Mellanox tests every product in an end-to-end environment, ensuring a Bit Error Rate of less than 1e-15. A BER of 1e-15 is 1000x better than many competitors.
  • Kubernetes
    Kubernetes (K8s) is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
  • Kubespray (from Kubernetes.io)
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:
    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions
  • NVIDIA GPU Operator
    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring and others.
  • Multus
    Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface Multus delegates CNI calls to secondary CNI plugins such as Calico, SR-IOV, etc.
  • SR-IOV Network Device Plugin 
    A Kubernetes device plugin for discovering and advertising SR-IOV virtual functions (VFs) available on a Kubernetes host. The plugin was enhanced by the NVIDIA Mellanox R&D team to support RDMA applications in Kubernetes.
  • RDMA 
    RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer
  • SR-IOV CNI Plugin 
    The SR-IOV CNI plugin works with the SR-IOV Network Device Plugin for VF allocation in Kubernetes. It enables the configuration and usage of SR-IOV VF networks in containers and orchestration systems like Kubernetes.
  • Apache Spark™
    Apache Spark™ is an open-source, fast and general engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
  • RAPIDS
    The RAPIDS suite of open source software libraries and APIs provide the ability to execute end-to-end data science and analytics pipelines entirely on GPUs. Licensed under Apache 2.0, RAPIDS is incubated by NVIDIA® based on extensive hardware and data science experience. RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
    RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar dataframe API that integrates with a variety of machine learning algorithms for end-to-end pipeline accelerations without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.
Figure: The RAPIDS suite.
  • RAPIDS Accelerator for Apache Spark
    The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. 
    The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities (an illustrative configuration sketch appears at the end of this section).
    Documentation on the current release can be found here.
  • Unified Communication X

Unified Communication X (UCX) provides an optimized communication layer for Message Passing Interface (MPI), PGAS/OpenSHMEM libraries and RPC/data-centric applications.

UCX utilizes high-speed networks for inter-node communication, and shared memory mechanisms for efficient intra-node communication.


  • GPUDirect RDMA
    GPUDirect (GDR) RDMA provides a direct P2P (Peer-to-Peer) data path between the GPU Memory directly to and from NVIDIA Mellanox HCA devices, which reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.

Figure: GPUDirect.
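The following is a minimal, illustrative sketch of how the RAPIDS Accelerator and its UCX-based shuffle are typically enabled through Spark configuration. The jar paths, GPU amounts and the RapidsShuffleManager class name are assumptions that depend on the plugin release in use; refer to the RAPIDS Accelerator documentation for the exact property values, and treat this as an orientation example rather than the configuration used later in this guide.

Deployment Node Console
# Assumed jar names/paths; the shuffle manager class name is release-specific
$SPARK_HOME/bin/spark-submit \
  --jars /opt/sparkRapidsPlugin/cudf.jar,/opt/sparkRapidsPlugin/rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark300.RapidsShuffleManager \
  <application jar or script>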

Solution Overview

Prerequisites

  • Hardware
    All K8s Worker Node servers have the same hardware specification (see the BoM for details).
  • Host OS
    Ubuntu Server 18.04 operating system installed on all servers with OpenSSH server packages.
  • Switch OS
    NVIDIA Cumulus Linux 4.1.1
  • Management Network 
    DHCP and DNS services are part of the IT infrastructure. The installation and configuration of these components are not covered in this guide.
  • Experience with Apache Spark
    Make sure you are familiar with the Apache Spark multi-node cluster architecture. See Overview - Spark 3.0 Documentation for more info.

Solution Logical Design

There are multiple ways to deploy Spark and the RAPIDS Accelerator plugin: Standalone mode, Local mode (for development and testing only, not production), on a YARN cluster (see this link for more details), and on a Kubernetes cluster, which is the method used in this deployment guide.

The logical design includes the following layers:

  • Two separate networking layers: 
    1. Management network
    2. High-speed RoCE Network 
  • One compute layer: 
    1. K8s Master Node and Deployment Node 
    2. 3 x K8s Worker Nodes, each with two NVIDIA T4 GPUs and one NVIDIA Mellanox ConnectX-5 adapter. 
Figure: Apache Spark with RAPIDS Accelerator on Kubernetes (logical design).

Bill of Materials (BoM)

This document covers a single Kubernetes controller deployment scenario. For a highly available cluster deployment, refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md

The following hardware setup is utilized in the distributed Spark/K8s configuration described in this guide:

Figure: Apache Spark with RAPIDS Accelerator on Kubernetes BoM.

Note

The above table does not contain Kubernetes Management network connectivity components.

Deployment

Physical Network

Connectivity

The first port of each NVIDIA Mellanox network card on each Worker Node is wired to the NVIDIA Mellanox switch using 25Gb/s DAC cables:

Figure: Apache Spark with RAPIDS Accelerator on Kubernetes connectivity.

Note

This table does not contain Kubernetes Management network connectivity components.

Network Configuration

Below are the server names with their relevant network configurations: 


Server/Switch type    Server/Switch name    High-speed network (25 GigE, VLAN 111)    Management network (1 GigE)
Master Node           node1                 -                                         eno0: DHCP
Worker Node           node2                 ens1f0: none                              eno0: DHCP
Worker Node           node3                 ens1f0: none                              eno0: DHCP
Worker Node           node4                 ens1f0: none                              eno0: DHCP
Depl./Driver Node     depl-node             -                                         eno0: DHCP
High-speed switch     swx-mld-s01           none                                      mgmt0: DHCP
Info

ens1f0 interfaces do not require any additional configuration.

Network Switch Configuration for RoCE Transport

NVIDIA Cumulus Linux Network OS

RoCE transport is utilized to accelerate Spark networking through the UCX library. To achieve the best possible results, we will configure the network to be lossless.

Run the following commands to configure a lossless network on NVIDIA Cumulus Linux version 4.1.1 and above:

Code Block
languagetext
themeFadeToGrey
titleSwitch console
net add interface swp1-32 storage-optimized pfc
net commit

Add VLAN 111 to the relevant switch ports on the NVIDIA Cumulus Linux Network OS by running the following commands:

Code Block
languagetext
themeFadeToGrey
titleSwitch console
net add interface swp1-6 bridge trunk vlans 111
net commit
net show bridge vlan

Deployment Guide

Nodes Configuration

General Prerequisites:

  • Ubuntu 18.04 system.
  • Access to a terminal or command line.
  • Sudo user or root permissions.

Host OS Prerequisites 

Make sure the Ubuntu Server 18.04 operating system is installed on all servers with the OpenSSH server packages, and create a non-root user account with passwordless sudo privileges.

Update the Ubuntu software packages and install the latest HWE kernel by running the commands below:

Code Block
languagetext
themeFadeToGrey
titleServer console
# apt-get update
# apt-get upgrade -y
# apt-get -y install linux-image-generic-hwe-18.04
# reboot

Non-root User Account Prerequisites 

In this solution, we added the following lines to the end of /etc/sudoers:

Code Block
languagetext
themeFadeToGrey
titleServer Console
#includedir /etc/sudoers.d


#K8s cluster deployment user with sudo privileges without password
user ALL=(ALL) NOPASSWD:ALL
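As an alternative to editing /etc/sudoers directly, the same result can be achieved with a drop-in file under /etc/sudoers.d. This is a minimal sketch, assuming the deployment user is named user:

Server Console
$ echo 'user ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/user
$ sudo chmod 0440 /etc/sudoers.d/user
$ sudo visudo -c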

Installation Process

  1. Install the general dependencies on the deployment server by running the commands below or pasting each line into the terminal:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    > sudo apt-get install git wget scala maven make gcc openssh-server openssh-client -y
  2. Install the Java 8 software packages:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    ### The webupd8team Oracle Java 8 PPA has been discontinued; installing OpenJDK 8 instead
    > sudo apt-get update
    > sudo apt-get install openjdk-8-jdk -y
  3. Install the general dependencies on the Worker Servers by running the commands below or pasting each line into the terminal:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    > sudo apt-get install git wget dkms make gcc
  4. To install LLDP service on the Worker Servers, run the commands below or paste each line into the terminal:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    # sudo apt-get install lldpd -y
    # sudo service lldpd start
    # sudo systemctl enable lldpd
  5. To create a Network File System (NFS) share, install an NFS server on a new or existing server in the environment. For this solution, Node 2 (the first Worker Node) is used as the NFS server. Follow the procedure detailed in this guide to install the NFS server; a minimal example is also sketched below.
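The following is a minimal sketch of the NFS server setup on Node 2, assuming /data is the exported directory and the management and high-speed subnets used in this guide; adjust the path and subnets to your environment:

Worker Node 2 Console
# apt-get install -y nfs-kernel-server
# mkdir -p /data
# chown nobody:nogroup /data
# echo '/data 192.168.1.0/24(rw,sync,no_subtree_check) 192.168.111.0/24(rw,sync,no_subtree_check)' >> /etc/exports
# exportfs -ra
# systemctl enable --now nfs-kernel-server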

SR-IOV Configuration (on Worker Nodes only)

Verify that your server platform supports SR-IOV and review the BIOS settings in the hardware documentation to enable support for SR-IOV networking.

  1. Enable Virtualization (SR-IOV) in the BIOS.
  2. Enable SR-IOV in the NIC firmware by executing the following commands:

    Code Block
    themeMidnight
    sudo apt-get install mstflint
    ### query the configuration of all Mellanox NICs on the server ###
    sudo mstconfig q
    ### enable SR-IOV with 8 VFs ###
    sudo mstconfig -d /sys/bus/pci/devices/0000:13:00.0/config set SRIOV_EN=1 NUM_OF_VFS=8
    sudo reboot
Note

All Worker Nodes must have the same configuration and the same PCIe card placement.
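After the reboot, you can optionally confirm that SR-IOV is enabled in the firmware and that the adapter can expose VFs. This is a hedged check; the PCIe address and the interface name ens3f0 are examples from this setup, and the VFs themselves only appear after they are activated later by the RoCE role:

Worker Node Console
# sudo mstconfig -d /sys/bus/pci/devices/0000:13:00.0/config q | grep -E 'SRIOV_EN|NUM_OF_VFS'
# cat /sys/class/net/ens3f0/device/sriov_totalvfs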

K8s Cluster Deployment and Configuration

The Kubernetes cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node (K8s Master Node).

Configuring SSH Private Key and SSH Passwordless Login

Log in to the Deployment Node as the deployment user (in this case - user) and create an SSH key pair for configuring passwordless authentication by running the following command:

Code Block
languagetext
themeMidnight
$ ssh-keygen

Copy your SSH public key, such as ~/.ssh/id_rsa.pub, to all nodes in your deployment by running the following command:

Code Block
languagetext
themeMidnight
$ ssh-copy-id -i <filename> user@nodename
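For example, to copy the key to all nodes used in this guide (the IP addresses below come from the inventory created in the next section):

Deployment Node Console
$ for node in 192.168.1.6 192.168.1.71 192.168.1.72 192.168.1.73; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub user@${node}
  done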

Configuring Kubespray

Install dependencies to run Kubespray with Ansible from the deployment Node:
Code Block
languagetext
themeMidnight
$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.13.3.tar.gz
$ tar -zxf v2.13.3.tar.gz
$ cd kubespray-2.13.3
$ sudo pip3 install -r requirements.txt

The default folder for subsequent commands is ~/kubespray-2.13.3.

Create a new cluster configuration:

Code Block
languagetext
themeMidnight
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.1.6 192.168.1.71 192.168.1.72 192.168.1.73)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

Review and change the host configuration file - inventory/mycluster/hosts.yaml. Below is an example output from this solution:

Code Block
languageyml
themeMidnight
titleinventory/mycluster/hosts.yaml
collapsetrue
all:
  hosts:
    node1:
      ansible_host: 192.168.1.6
      ip: 192.168.1.6
      access_ip: 192.168.1.6
    node2:
      ansible_host: 192.168.1.71
      ip: 192.168.1.71
      access_ip: 192.168.1.71
    node3:
      ansible_host: 192.168.1.72
      ip: 192.168.1.72
      access_ip: 192.168.1.72
    node4:
      ansible_host: 192.168.1.73
      ip: 192.168.1.73
      access_ip: 192.168.1.73
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Customizing variables for K8s cluster installation

Review and change cluster installation parameters under inventory/mycluster/group_vars:

Code Block
languagetext
themeMidnight
$ cat inventory/mycluster/group_vars/all/all.yml
$ cat inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

In "inventory/mycluster/group_vars/all/all.yml", uncomment the following line so that metrics can be collected about cluster resource usage:

Code Block
languagetext
themeMidnight
vim inventory/mycluster/group_vars/all/all.yml

## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255


In "inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml" enable Multus installation by setting the variable kube_network_plugin_multus to trueand specify the Docker version by adding the variable"docker_version: 19.03" to avoid any related issues.

Code Block
languagetext
themeMidnight
vim inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

## Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: true

## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
container_manager: docker
docker_version: 19.03
Info
The Kubespray version in this deployment suffers from an inconsistency when installing Docker components (see issue details here: https://github.com/kubernetes-sigs/kubespray/issues/6160).
Info

The default Kubernetes CNI can be changed by setting the desired kube_network_plugin parameter (default: calico) in inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml.

This deployment installs Multus and Calico and configures Multus to use Calico as the primary network plugin.


Deploy K8s cluster with Kubespray Ansible Playbook

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml 
Info
This step may take a while to complete.

An example of a successful completion of the playbook looks like the following:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
collapsetrue
PLAY RECAP ***************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
node1                      : ok=617  changed=101  unreachable=0    failed=0   
node2                      : ok=453  changed=58   unreachable=0    failed=0   
node3                      : ok=410  changed=53   unreachable=0    failed=0   
node4                      : ok=410  changed=53   unreachable=0    failed=0   


Monday 16 April 2020  17:48:14 +0300 (0:00:00.265)       0:13:49.321 ********** 
=============================================================================== 
kubernetes/master : kubeadm | Initialize first master ------------------------------------------------------------------------------------ 55.94s
kubernetes/kubeadm : Join to cluster ----------------------------------------------------------------------------------------------------- 37.65s
kubernetes/master : Master | wait for kube-scheduler ------------------------------------------------------------------------------------- 21.97s
download : download_container | Download image if required ------------------------------------------------------------------------------- 21.34s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------ 14.85s
kubernetes/preinstall : Update package management cache (APT) ---------------------------------------------------------------------------- 12.49s
download : download_file | Download item ------------------------------------------------------------------------------------------------- 11.45s
etcd : Install | Copy etcdctl binary from docker container ------------------------------------------------------------------------------- 10.57s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 9.37s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------- 9.18s
etcd : wait for etcd up ------------------------------------------------------------------------------------------------------------------- 8.78s
etcd : Configure | Check if etcd cluster is healthy --------------------------------------------------------------------------------------- 8.62s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 8.24s
kubernetes-apps/network_plugin/multus : Multus | Start resources -------------------------------------------------------------------------- 7.32s
download : download_container | Download image if required -------------------------------------------------------------------------------- 6.61s
policy_controller/calico : Start of Calico kube controllers ------------------------------------------------------------------------------- 4.92s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 4.76s
kubernetes-apps/cluster_roles : Apply workaround to allow all nodes with cert O=system:nodes to register ---------------------------------- 4.56s
download : download_container | Download image if required -------------------------------------------------------------------------------- 4.48s
download : download | Download files / images --------------------------------------------------------------------------------------------- 4.28s

K8s Deployment Verification

The Kubernetes cluster deployment verification must be done from the K8s Master Node.

Copy the Kubernetes cluster configuration files from the root home folder to the user home folder, or run the verification commands as the root user on the K8s Master Node.

Execute the following commands to copy the Kubernetes cluster configuration files to the non-root user on the Master Node:

Code Block
languagetext
themeFadeToGrey
titleK8s Master Node Console
user@node1:$ sudo su -
root@node1:~# cp -r .kube/ /home/user/
root@node1:~# chown -R `id -u user`:`id -g user` /home/user/.kube/
root@node1:~# exit


Verify that the Kubernetes cluster is installed properly. Execute the following commands:

Code Block
languagetext
themeFadeToGrey
titleK8s Master Node Console
user@node1:$ kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    master   14d   v1.17.9   192.168.1.6    <none>        Ubuntu 18.04.4 LTS   5.4.0-42-generic   docker://18.9.7
node2   Ready    <none>   14d   v1.17.9   192.168.1.71   <none>        Ubuntu 18.04.4 LTS   5.4.0-42-generic   docker://18.9.7
node3   Ready    <none>   14d   v1.17.9   192.168.1.72   <none>        Ubuntu 18.04.4 LTS   5.4.0-42-generic   docker://18.9.7
node4   Ready    <none>   14d   v1.17.9   192.168.1.73   <none>        Ubuntu 18.04.4 LTS   5.4.0-42-generic   docker://18.9.7

     
user@node1:~$ kubectl get pod -n kube-system -o wide
NAME                                          READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-8567d5fdd6-zfch2      1/1     Running   0          12d   192.168.1.71   node2   <none>           <none>
calico-node-fzv78                             1/1     Running   1          14d   192.168.1.6    node1   <none>           <none>
calico-node-gsplr                             1/1     Running   2          14d   192.168.1.71   node2   <none>           <none>
calico-node-n6mgq                             1/1     Running   4          14d   192.168.1.73   node4   <none>           <none>
calico-node-pdcft                             1/1     Running   5          14d   192.168.1.72   node3   <none>           <none>
coredns-76798d84dd-2tt2f                      1/1     Running   0          12d   10.233.92.60   node3   <none>           <none>
coredns-76798d84dd-7sndq                      1/1     Running   0          14d   10.233.90.1    node1   <none>           <none>
dns-autoscaler-85f898cd5c-5ldrz               1/1     Running   0          14d   10.233.90.2    node1   <none>           <none>
kube-apiserver-node1                          1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
kube-controller-manager-node1                 1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
kube-multus-ds-amd64-7s445                    1/1     Running   1          14d   192.168.1.72   node3   <none>           <none>
kube-multus-ds-amd64-8g7br                    1/1     Running   1          14d   192.168.1.71   node2   <none>           <none>
kube-multus-ds-amd64-dncpc                    1/1     Running   2          14d   192.168.1.73   node4   <none>           <none>
kube-multus-ds-amd64-h2n76                    1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
kube-proxy-f7zgz                              1/1     Running   1          14d   192.168.1.71   node2   <none>           <none>
kube-proxy-ml4s4                              1/1     Running   3          14d   192.168.1.73   node4   <none>           <none>
kube-proxy-mlk7c                              1/1     Running   4          14d   192.168.1.72   node3   <none>           <none>
kube-proxy-pqc8m                              1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
kube-scheduler-node1                          1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
kubernetes-dashboard-77475cf576-rsl5h         1/1     Running   0          12d   10.233.92.62   node3   <none>           <none>
kubernetes-metrics-scraper-747b4fd5cd-tjwl2   1/1     Running   0          12d   10.233.96.51   node2   <none>           <none>
nginx-proxy-node2                             1/1     Running   1          14d   192.168.1.71   node2   <none>           <none>
nginx-proxy-node3                             1/1     Running   4          14d   192.168.1.72   node3   <none>           <none>
nginx-proxy-node4                             1/1     Running   3          14d   192.168.1.73   node4   <none>           <none>
nodelocaldns-bpqvr                            1/1     Running   3          14d   192.168.1.73   node4   <none>           <none>
nodelocaldns-f85lh                            1/1     Running   3          14d   192.168.1.72   node3   <none>           <none>
nodelocaldns-jsknr                            1/1     Running   0          14d   192.168.1.6    node1   <none>           <none>
nodelocaldns-t9dts                            1/1     Running   1          14d   192.168.1.71   node2   <none>           <none>

NVIDIA GPU Operator Installation for K8s cluster 

  1. The preferred method to deploy the device plugin is as a daemonset using Helm. Install Helm from the official installer script:

    Code Block
    languagetext
    themeFadeToGrey
    titleK8s Master Node Console
    $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
    $ chmod 700 get_helm.sh
    $ ./get_helm.sh
  2. Add the NVIDIA repository:

    Code Block
    languagetext
    themeFadeToGrey
    titleK8s Master Node Console
    $ helm repo add nvidia https://nvidia.github.io/gpu-operator 
    $ helm repo update
  3. The NVIDIA GPU Operator uses hostNetwork by default. These defaults must be modified if they are not suitable for your solution. 
    Deploy the device plugin:

    Code Block
    languagetext
    themeFadeToGrey
    titleK8s Master Node Console
    # helm install --wait --generate-name nvidia/gpu-operator
    Code Block
    languagetext
    themeFadeToGrey
    titleK8s Master Node Console
    # helm ls
    NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
    gpu-operator-1596563035 default         1               2020-10-30 20:43:59.550042604 +0300 IDT deployed        gpu-operator-1.3.0      1.3.0
    Code Block
    languagetext
    themeFadeToGrey
    titleK8s Master Node Console
    collapsetrue
    # kubectl get pods -A
    NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
    default                  gpu-operator-1596563035-node-feature-discovery-master-6468bznxc   1/1     Running     0          15d
    default                  gpu-operator-1596563035-node-feature-discovery-worker-jdz5f       1/1     Running     33         15d
    default                  gpu-operator-1596563035-node-feature-discovery-worker-lwqks       1/1     Running     13         15d
    default                  gpu-operator-1596563035-node-feature-discovery-worker-sb7sv       1/1     Running     16         15d
    default                  gpu-operator-1596563035-node-feature-discovery-worker-xv5xb       1/1     Running     8          15d
    default                  gpu-operator-74c97448d9-pmwjh                                     1/1     Running     1          15d
    gpu-operator-resources   nvidia-container-toolkit-daemonset-hlgp9                          1/1     Running     3          15d
    gpu-operator-resources   nvidia-container-toolkit-daemonset-wgxnx                          1/1     Running     2          15d
    gpu-operator-resources   nvidia-container-toolkit-daemonset-zxqxj                          1/1     Running     1          15d
    gpu-operator-resources   nvidia-dcgm-exporter-dlzjv                                        1/1     Running     8          15d
    gpu-operator-resources   nvidia-dcgm-exporter-hbd4t                                        1/1     Running     6          15d
    gpu-operator-resources   nvidia-dcgm-exporter-hhjcb                                        1/1     Running     13         15d
    gpu-operator-resources   nvidia-device-plugin-daemonset-jnprj                              1/1     Running     13         15d
    gpu-operator-resources   nvidia-device-plugin-daemonset-rcj9n                              1/1     Running     8          15d
    gpu-operator-resources   nvidia-device-plugin-daemonset-slkpj                              1/1     Running     0          13d
    gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          13d
    gpu-operator-resources   nvidia-driver-daemonset-8b92r                                     1/1     Running     1          15d
    gpu-operator-resources   nvidia-driver-daemonset-lkwdk                                     1/1     Running     8          15d
    gpu-operator-resources   nvidia-driver-daemonset-mcdf2                                     1/1     Running     18         15d
    gpu-operator-resources   nvidia-driver-validation                                          0/1     Completed   0          13d
    kube-system              calico-kube-controllers-8567d5fdd6-zfch2                          1/1     Running     0          13d
    kube-system              calico-node-fzv78                                                 1/1     Running     1          15d
    kube-system              calico-node-gsplr                                                 1/1     Running     2          15d
    kube-system              calico-node-n6mgq                                                 1/1     Running     4          15d
    kube-system              calico-node-pdcft                                                 1/1     Running     5          15d
    kube-system              coredns-76798d84dd-2tt2f                                          1/1     Running     0          13d
    kube-system              coredns-76798d84dd-7sndq                                          1/1     Running     0          15d
    kube-system              dns-autoscaler-85f898cd5c-5ldrz                                   1/1     Running     0          15d
    kube-system              kube-apiserver-node1                                              1/1     Running     0          15d
    kube-system              kube-controller-manager-node1                                     1/1     Running     0          15d
    kube-system              kube-multus-ds-amd64-7s445                                        1/1     Running     1          15d
    kube-system              kube-multus-ds-amd64-8g7br                                        1/1     Running     1          15d
    kube-system              kube-multus-ds-amd64-dncpc                                        1/1     Running     2          15d
    kube-system              kube-multus-ds-amd64-h2n76                                        1/1     Running     0          15d
    kube-system              kube-proxy-f7zgz                                                  1/1     Running     1          15d
    kube-system              kube-proxy-ml4s4                                                  1/1     Running     3          15d
    kube-system              kube-proxy-mlk7c                                                  1/1     Running     4          15d
    kube-system              kube-proxy-pqc8m                                                  1/1     Running     0          15d
    kube-system              kube-scheduler-node1                                              1/1     Running     0          15d
    kube-system              kubernetes-dashboard-77475cf576-rsl5h                             1/1     Running     0          13d
    kube-system              kubernetes-metrics-scraper-747b4fd5cd-tjwl2                       1/1     Running     0          13d
    kube-system              nginx-proxy-node2                                                 1/1     Running     1          15d
    kube-system              nginx-proxy-node3                                                 1/1     Running     4          15d
    kube-system              nginx-proxy-node4                                                 1/1     Running     3          15d
    kube-system              nodelocaldns-bpqvr                                                1/1     Running     3          15d
    kube-system              nodelocaldns-f85lh                                                1/1     Running     3          15d
    kube-system              nodelocaldns-jsknr                                                1/1     Running     0          15d
    kube-system              nodelocaldns-t9dts                                                1/1     Running     1          15d

To run a sample GPU application, see: https://github.com/NVIDIA/gpu-operator#running-a-sample-gpu-application

For GPU monitoring, see: https://github.com/NVIDIA/gpu-operator#gpu-monitoring
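As a quick sanity check, a minimal test Pod similar to the sample GPU application linked above can be used to verify that GPUs are schedulable. This is a sketch; the Pod name and the CUDA image tag are assumptions:

K8s Master Node Console
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-test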

SR-IOV Components Installation for K8s Cluster 

The RoCE enhancement for the K8s cluster enables containerizing SR-IOV virtual functions in Pod deployments for additional RoCE-enabled NICs. 
During the installation process, a role uses the Kubespray inventory/mycluster/hosts.yaml file for Kubernetes component deployment and provisioning.

The RoCE components are configured by a separate role in the installation. The RoCE role will install the following components:

  1. Virtual Function activation
  2. Node Feature Discovery v0.6.0
  3. The latest Multus CNI for attaching multiple network interfaces to the Pod
  4. Specific configuration of the universal SR-IOV device plugin
  5. Universal SR-IOV CNI
  6. Specific network provisioning with "NetworkAttachmentDefinition"
  7. DHCP CNI for providing IP addresses from the existing infrastructure for SR-IOV based NICs in Pod deployments

Warning

RoCE role deployment is validated using Ubuntu 18.04 OS and Kubespray v2.13.3

Prerequisites

Copy the RoCE role installation package (provided in the Appendix) to the Kubespray folder: 

Code Block
languagetext
themeFadeToGrey
titleK8s Master Node Console
$ cd ~/kubespray-2.13.3
$ tar -xf roce.tar

Customize Role Variables

Set the variables for the RoCE role in the YAML file roles/roce_sriov/vars/main.yml.

Set the following parameters:

  1. sriov_resources:
    - pf_name: "ens3f0"
    vendor: "15b3"
    dev_id: "1018"
    vlan_id: 111
    resourcePrefix: "mellanox.com"
    res_name: "sriov_111"
    network_name: "sriov111"
    mtu: 1500
    cidr: "192.168.111.0/24"
  2. num_vf to 8
  3. hugePages to false
  4. install_mofed to true
Code Block
languageyml
themeFadeToGrey
titlevars/main.yml
---
# vars file for roce_sriov
# Physical adapter names must be connected to the RoCE backend fabric
# Virtual function device ID. Default - "MT28908 Family [ConnectX-6 Virtual Function]"
# Detailed information about all Mellanox device IDs can be found at https://devicehunt.com/view/type/pci/vendor/15B3.
# Supported values 
#    101c - MT28908 Family [ConnectX-6 Virtual Function]
#    101a - MT28800 Family [ConnectX-5 Ex Virtual Function]
#    1018 - MT27800 Family [ConnectX-5 Virtual Function]   
#    1016 - MT27710 Family [ConnectX-4 Lx Virtual Function]
#    1014 - MT27700 Family [ConnectX-4 Virtual Function]
sriov_resources:
  - pf_name: "ens3f0"
    vendor: "15b3"
    dev_id: "1018"
    vlan_id: 111
    resourcePrefix: "mellanox.com"
    res_name: "sriov_111"
    network_name: "sriov111"
    mtu: 1500
    cidr: "192.168.111.0/24"

# CNI spec version
cniVersion: "0.4.0"

# Number of virtual functions to activate
num_vf: 8

# HugePages
# Activated only on Worker Nodes with Mellanox NICs
# Only the hugepages_2048KB mode is supported
# The num_huge parameter sets the number of hugepages for each NUMA node. For a node with two NUMA nodes - 16384

hugePages: false
num_huge: 8192

# DHCP server settings

# If set to TRUE, a DHCP server must be installed in your network to provide IP address assignment for the corresponding VLANs,
# and the DHCP plugin will be used for IPAM

# If set to FALSE, https://github.com/openshift/whereabouts-cni will be installed as the IPAM plugin
# The cidr parameter from sriov_resources is used for IP address assignment
# Supported ONLY with K8s v1.16 and above

dhcp_server: false


# Use the Ubuntu LTS enablement (also called HWE or Hardware Enablement) kernel for the host OS
HWE_kernel: true

# New MOFED installation
# If install_mofed is FALSE, the inbox kernel driver will be used

install_mofed: false
mlnx_ofed_package: "mlnx-ofed-kernel-only"
mlnx_ofed_version: "latest"
upstream_libs: true


# Install Kubeflow MPI-Operator - https://github.com/kubeflow/mpi-operator

kubeflow_mpi_operator: false
mpi_operator_dep: "https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1/mpi-operator.yaml"

# K8s SR-IOV daemonset's
# before K8s 1.16
#multus_ds: "https://raw.githubusercontent.com/intel/multus-cni/master/images/multus-daemonset-pre-1.16.yml"
#sriov_dp_ds: "https://raw.githubusercontent.com/intel/sriov-network-device-plugin/master/deployments/k8s-v1.10-v1.15/sriovdp-daemonset.yaml"
#sriov_cni_ds: "https://raw.githubusercontent.com/intel/sriov-cni/master/images/k8s-v1.10-v1.15/sriov-cni-daemonset.yaml"
#dhcp_cni_ds: "https://raw.githubusercontent.com/Mellanox/dhcp-cni/master/dhcp-cni-ds.yaml"

# from K8s 1.16
multus_ds: "https://raw.githubusercontent.com/intel/multus-cni/master/images/multus-daemonset.yml"
sriov_dp_ds: "https://raw.githubusercontent.com/intel/sriov-network-device-plugin/master/deployments/k8s-v1.16/sriovdp-daemonset.yaml"
sriov_cni_ds: "https://raw.githubusercontent.com/intel/sriov-cni/master/images/k8s-v1.16/sriov-cni-daemonset.yaml"
dhcp_cni_ds: "https://raw.githubusercontent.com/Mellanox/dhcp-cni/master/k8s-1.16/dhcp-cni-ds.yaml"
nfd_release: "https://github.com/kubernetes-sigs/node-feature-discovery/archive/v0.6.0.tar.gz"
wh_cni_ds: "https://raw.githubusercontent.com/openshift/whereabouts-cni/master/doc/daemonset-install.yaml"
wh_iptools_ds: "https://raw.githubusercontent.com/openshift/whereabouts-cni/master/doc/whereabouts.cni.cncf.io_ippools.yaml"

Role Execution

Run the playbook from the Kubespray deployment folder using the following command:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
$ ansible-playbook -i inventory/mycluster/hosts.yaml  --become --become-user=root roce.yaml
Info
This step may take a while to complete.

The following is an example of a successful playbook execution:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
PLAY RECAP ***************************************************************************************************************************************
node1                      : ok=47   changed=24   unreachable=0    failed=0    skipped=18   rescued=0    ignored=0   
node2                      : ok=32   changed=13   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
node3                      : ok=32   changed=13   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
node4                      : ok=32   changed=13   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   

Sunday 26 July 2020  15:24:42 +0300 (0:00:00.668)       0:01:07.966 *********** 
=============================================================================== 
roce_sriov : Update additional packages --------------------------------------------------------------------------------------------------- 8.54s
roce_sriov : Set Hugepages ---------------------------------------------------------------------------------------------------------------- 4.92s
roce_sriov : Set Hugepages ---------------------------------------------------------------------------------------------------------------- 4.89s
roce_sriov : Install Universal SR-IOV device plugin --------------------------------------------------------------------------------------- 4.47s
roce_sriov : Install WH CNI --------------------------------------------------------------------------------------------------------------- 3.00s
roce_sriov : Update cache ----------------------------------------------------------------------------------------------------------------- 2.60s
roce_sriov : Extract NFD daemonset's ------------------------------------------------------------------------------------------------------ 2.43s
roce_sriov : Create netattdef ------------------------------------------------------------------------------------------------------------- 2.28s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------- 2.24s
roce_sriov : Install WH IPPOOL CNI -------------------------------------------------------------------------------------------------------- 2.20s
roce_sriov : Install Openshift pip module ------------------------------------------------------------------------------------------------- 2.09s
roce_sriov : Install SR-IOV CNI ----------------------------------------------------------------------------------------------------------- 1.99s
roce_sriov : Remove old Multus DS for amd64 ----------------------------------------------------------------------------------------------- 1.87s
roce_sriov : Remove DHCP CNI -------------------------------------------------------------------------------------------------------------- 1.77s
roce_sriov : Create netattdef ------------------------------------------------------------------------------------------------------------- 1.33s
roce_sriov : Multus update ---------------------------------------------------------------------------------------------------------------- 1.28s
roce_sriov : Install HWE kernel. It takes a while. ---------------------------------------------------------------------------------------- 1.27s
roce_sriov : create configmap ------------------------------------------------------------------------------------------------------------- 1.25s
roce_sriov : Install aptitude ------------------------------------------------------------------------------------------------------------- 1.24s
roce_sriov : Remove Multus DS for ppc64le ------------------------------------------------------------------------------------------------- 0.67s

Role Installation Summary:

After the installation with the default variable parameters, your K8s cluster will have the following:

  1. Node Feature Discovery for Kubernetes installed
  2. The required number of VFs activated and configured for each specified Mellanox NIC
  3. A configmap for the SR-IOV Network Device Plugin configured and ready for creating resources
  4. DaemonSets with the SR-IOV Network Device Plugin and the SR-IOV CNI installed
  5. The Multus meta CNI updated to the latest version
  6. A DaemonSet with whereabouts-cni installed to provide IP address management for SR-IOV based NICs
  7. One network-attachment-definition, sriov111

Role Deployment Verification

The role deployment verification must be done from the K8s Master node. Execute the following commands to initiate the verification process:

Code Block
languagetext
themeFadeToGrey
titleK8s Master Node Console
collapsetrue
root@node1:~# kubectl get pod -n kube-system -o wide | egrep "multus|cni|device|whereabouts"
kube-multus-ds-amd64-7s445                    1/1     Running   1          15d   192.168.1.72   node3   <none>           <none>
kube-multus-ds-amd64-8g7br                    1/1     Running   1          15d   192.168.1.71   node2   <none>           <none>
kube-multus-ds-amd64-dncpc                    1/1     Running   2          15d   192.168.1.73   node4   <none>           <none>
kube-multus-ds-amd64-h2n76                    1/1     Running   0          15d   192.168.1.6    node1   <none>           <none>
kube-sriov-cni-ds-amd64-g7rgp                 1/1     Running   1          15d   192.168.1.72   node3   <none>           <none>
kube-sriov-cni-ds-amd64-vzl9s                 1/1     Running   1          15d   192.168.1.71   node2   <none>           <none>
kube-sriov-cni-ds-amd64-zdvmh                 1/1     Running   2          15d   192.168.1.73   node4   <none>           <none>
kube-sriov-device-plugin-amd64-6qpwr          1/1     Running   1          15d   192.168.1.71   node2   <none>           <none>
kube-sriov-device-plugin-amd64-8lvdt          1/1     Running   1          15d   192.168.1.72   node3   <none>           <none>
kube-sriov-device-plugin-amd64-cjhjx          1/1     Running   2          15d   192.168.1.73   node4   <none>           <none>
whereabouts-9f4r5                             1/1     Running   0          15d   192.168.1.6    node1   <none>           <none>
whereabouts-dbxzz                             1/1     Running   1          15d   192.168.1.72   node3   <none>           <none>
whereabouts-fcsxr                             1/1     Running   2          15d   192.168.1.73   node4   <none>           <none>
whereabouts-qd8xm                             1/1     Running   1          15d   192.168.1.71   node2   <none>           <none>
Code Block
languagetext
themeFadeToGrey
titleWorker Node resources
root@node1:~# kubectl describe nodes node2
...
Capacity:
  cpu:                     32
  ephemeral-storage:       229700940Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  mellanox.com/sriov_111:  8
  memory:                  197746780Ki
  nvidia.com/gpu:          2
  pods:                    110
Allocatable:
  cpu:                     31900m
  ephemeral-storage:       211692385954
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  mellanox.com/sriov_111:  8
  memory:                  197394380Ki
  nvidia.com/gpu:          2
  pods:                    110
...
Code Block
languagetext
themeFadeToGrey
titlenetwork-attachment-definitions
collapsetrue
root@node1:~# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME       AGE
sriov111   10m

user@node1:~# kubectl get network-attachment-definitions.k8s.cni.cncf.io sriov111 -o yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: mellanox.com/sriov_111
  creationTimestamp: "2020-08-04T18:17:32Z"
  generation: 1
  name: sriov111
  namespace: default
  resourceVersion: "9404"
  selfLink: /apis/k8s.cni.cncf.io/v1/namespaces/default/network-attachment-definitions/sriov111
  uid: 07a5e65b-f42f-41fd-b8e9-59c196e77056
spec:
  config: |-
    {
        "cniVersion": "0.4.0",
        "name": "sriov111",
        "plugins": [
            {
                "ipam": {
                    "datastore": "kubernetes",
                    "kubernetes": {
                        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
                    },
                    "log_file": "/tmp/whereabouts.log",
                    "log_level": "debug",
                    "range": "192.168.111.0/24",
                    "type": "whereabouts"
                },
                "spoofChk": "off",
                "type": "sriov",
                "vlan": 111
            },
            {
                "mtu": 1500,
                "type": "tuning"
            }
        ]
    }
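To confirm that a Pod can consume both a GPU and an SR-IOV VF from the resources advertised above, a minimal test Pod can be created. This is a sketch; the Pod name and image are assumptions, while the resource name mellanox.com/sriov_111 and the network sriov111 come from the role configuration:

K8s Master Node Console
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: sriov-gpu-test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov111
spec:
  restartPolicy: OnFailure
  containers:
  - name: test
    image: ubuntu:18.04
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        nvidia.com/gpu: 1
        mellanox.com/sriov_111: 1
EOF
$ kubectl exec -it sriov-gpu-test -- ip addr show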

Lossless Fabric with L3 (DSCP) Configuration

Note

Before starting the below process, make sure you are familiar with this configuration example for NVIDIA Mellanox devices installed with MLNX_OFED running RoCE over a lossless network in DSCP-based QoS mode.

For this configuration, make sure to know your network interface name (for example ens3f0) and its parent NVIDIA Mellanox device (for example mlx5_0). To get this information, run the ibdev2netdev command:

Code Block
languagetext
themeFadeToGrey
titleShell
# ibdev2netdev -v | grep ens3f0
mlx5_0 port 1 ==> ens3f0 (Up)
mlx5_1 port 1 ==> ens3f1 (Down)
Configuration:
Code Block
languagetext
themeFadeToGrey
titleShell
# mlnx_qos -i ens3f0 --trust dscp
# echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
# cma_roce_tos -d mlx5_0 -t 106
# sysctl -w net.ipv4.tcp_ecn=1
# mlnx_qos -i ens3f0 --pfc 0,0,0,1,0,0,0,0
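Note that these settings do not persist across reboots. One hedged way to reapply them at boot is to wrap them in a small script and call it from a systemd unit or rc.local. This sketch assumes the same interface and device names as above:

Server Console
# cat > /usr/local/sbin/roce-qos.sh << 'EOF'
#!/bin/bash
# Reapply lossless RoCE QoS settings for ens3f0 / mlx5_0
mlnx_qos -i ens3f0 --trust dscp
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class
cma_roce_tos -d mlx5_0 -t 106
sysctl -w net.ipv4.tcp_ecn=1
mlnx_qos -i ens3f0 --pfc 0,0,0,1,0,0,0,0
EOF
# chmod +x /usr/local/sbin/roce-qos.sh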

Installing Mellanox GPUDirect RDMA

The following software is required to install and run GPUDirect RDMA:

  • NVIDIA compatible driver (contact NVIDIA support for more information)
  • MLNX_OFED (latest)


  1. Copy the NVIDIA driver module file /run/nvidia/driver/usr/src/nvidia-440.64.00/kernel/nvidia.ko to the /lib/modules/5.4.0-42-generic/updates/dkms/ directory.

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    # cp /run/nvidia/driver/usr/src/nvidia-440.64.00/kernel/nvidia.ko /lib/modules/5.4.0-42-generic/updates/dkms/
  2. Download the latest nv_peer_memory:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    # cd
    # git clone https://github.com/Mellanox/nv_peer_memory
  3. Build the source packages (src.rpm for RPM-based OS and a tarball for DEB-based OS) by running the build_module.sh script:

    Code Block
    languagetext
    themeEclipse
    titleNode cli
    # cd ~/nv_peer_memory
    # ./build_module.sh
    
    
    Building source rpm for nvidia_peer_memory...
    Building debian tarball for nvidia-peer-memory...
    
    
    Built: /tmp/nvidia_peer_memory-1.0-9.src.rpm
    Built: /tmp/nvidia-peer-memory_1.0.orig.tar.gz
  4. Install the packages:

    Code Block
    languagetext
    themeEclipse
    titleNode cli
    # cd /tmp 
    # tar xzf /tmp/nvidia-peer-memory_1.0.orig.tar.gz
    # cd nvidia-peer-memory-1.0
    # dpkg-buildpackage -us -uc
    # dpkg -i <path to generated deb files>

    (e.g. dpkg -i nv-peer-memory_1.0-9_all.deb nv-peer-memory-dkms_1.0-9_all.deb )

After a successful installation:

  1. The nv_peer_mem.ko kernel module is installed.
  2. A service file, /etc/init.d/nv_peer_mem, is added to control the kernel module (start/stop/status).
  3. The /etc/infiniband/nv_peer_mem.conf configuration file controls whether the kernel module is loaded on boot (default value is YES).
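You can verify that the module is loaded and that the service is running with the following commands (assuming the default module and service names listed above):

Server Console
# lsmod | grep nv_peer_mem
# service nv_peer_mem status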


Note
We recommend that the NIC and the GPU be physically placed on the same PCIe switch to achieve better performance.

Figure: Data transfer with and without GPUDirect RDMA.

Installing OFED Performance Tests

The perftest package is a collection of tests written over uverbs, intended for use as a performance micro-benchmark. The tests may be used for HW or SW tuning as well as for functional testing. The collection contains a set of bandwidth and latency benchmarks such as:

  • Send - ib_send_bw and ib_send_lat
  • RDMA Read - ib_read_bw and ib_read_lat
  • RDMA Write - ib_write_bw and ib_write_lat
  • RDMA Atomic - ib_atomic_bw and ib_atomic_lat
  • Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat.

To install perftools, run the following commands:

Code Block
languagetext
themeFadeToGrey
titleServer Console
cd
git clone https://github.com/linux-rdma/perftest
cd perftest/
make clean
./autogen.sh
./configure --prefix=/usr --libdir=/usr/lib64 --sysconfdir=/etc CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make

Run an RDMA Write (ib_write_bw) bandwidth stress benchmark over RoCE:

Server

ib_write_bw -a -d mlx5_0 &

Client

ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits
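Because perftest was built with CUDA support (CUDA_H_PATH above), the same benchmark can also target GPU memory to exercise GPUDirect RDMA. The --use_cuda flag and GPU index below are assumptions based on recent perftest releases; check ib_write_bw --help for the exact option on your build:

Server
ib_write_bw -a -d mlx5_0 --use_cuda=0 &

Client
ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits --use_cuda=0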


Distributed Spark Deployment

Installing Apache Spark Driver Node

  1. On the Deployment Node, download Apache Spark from Downloads | Apache Spark by selecting the Spark release and package type (the spark-3.0.0-bin-hadoop3.2.tgz file), and copy the file to the /opt folder.
    Figure: Apache Spark download page.

    Info
    Optional: You can download Spark by running the following commands:
    cd /opt ; wget http://mirror.cogentco.com/pub/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
  2. To set up Apache Spark, extract the compressed file using the tar command:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    tar -xvf spark-3.0.0-bin-hadoop3.2.tgz
  3. Make a symbolic link to the spark-3.0.0-bin-hadoop3.2 folder and set your user as the owner of the spark-3.0.0-bin-hadoop3.2 and spark folders:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    sudo ln -s spark-3.0.0-bin-hadoop3.2/ spark
    sudo chown -R user spark
    sudo chown -R user spark-3.0.0-bin-hadoop3.2/
  4. To configure the Spark environment variables, edit the .bashrc shell configuration file:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    sudo vim .bashrc
  5. Define the Spark environment variables by adding the following content to the end of the file:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    #Spark Related Options
    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    export PYSPARK_PYTHON=/usr/bin/python3
  6. Once you save and close the text file, you can return to your original terminal and type this command to reload your .bashrc file:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    $ source ~/.bashrc
  7. Check that the variables have been set properly:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    echo $SPARK_HOME
    echo $PATH
    echo $PYSPARK_PYTHON
  8. Run a sample Spark job in local mode to ensure Spark is working:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    $SPARK_HOME/bin/run-example SparkPi 10
  9. To install the Spark RAPIDS plugin jars, create the /opt/sparkRapidsPlugin directory on the Deployment Node and set your user as the owner of the directory:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    sudo mkdir -p /opt/sparkRapidsPlugin
    sudo chown -R kuser /opt/sparkRapidsPlugin
  10. Download the two jars (the RAPIDS Accelerator for Apache Spark plugin jar and the cuDF jar) listed in https://github.com/NVIDIA/spark-rapids/blob/branch-0.2/docs/version/stable-release.md and place them into the /opt/sparkRapidsPlugin directory. Make sure to pull the cuDF jar built for the CUDA version you are running (10.2 in this guide), for example:
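
    The following wget commands are a sketch that matches the versions used in the Dockerfile later in this guide; adjust the jar versions to the stable release you are deploying:

    Code Block
    languagetext
    themeFadeToGrey
    titleServer Console
    cd /opt/sparkRapidsPlugin
    # RAPIDS Accelerator for Apache Spark plugin jar (Scala 2.12, v0.1.0)
    wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar
    # cuDF jar built for CUDA 10.2
    wget https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-2.jar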


Prepare K8s Cluster to run Apache Spark Executor

Prerequisites 

Run following steps on the Deployment Node:

  1. To run Apache Spark executors over the K8s cluster and use NFS storage for the benchmarks, the executor pods need the IPC_LOCK capability (required for RDMA memory registration) and an nfs-volume configuration.

    Create /opt/executor-template directory:

    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    # mkdir /opt/executor-template

    Define an executor-template by creating /opt/executor-template/executor-template.yaml file which contains the following configurations. 

    Note
    It is important to change the NFS server IP to an IP relevant to your environment.
    Code Block
    languageyml
    themeFadeToGrey
    titleDeployment Node Console
    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements.  See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License.  You may obtain a copy of the License at
    #
    #    http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        template-label-key: executor-template-label-value
    spec:
      containers:
      - name: test-executor-container
        image: will-be-overwritten
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        volumeMounts:
            - name: nfs-volume
              mountPath: /data
      volumes:
        - name: nfs-volume
          nfs:
            # URL for the NFS server
            server: 192.168.1.71
            path: /data
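
    To confirm that the NFS export referenced in the template is reachable from the cluster nodes, you can run a quick check (a sketch, assuming the nfs-common package is installed on the node):

    Code Block
    languagetext
    themeFadeToGrey
    titleWorker Node Console
    # Should list /data among the exports of the NFS server used in the template
    showmount -e 192.168.1.71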
    
  2. Install and run Docker service on the Deployment Node.

    Info
    titleinfo
    You can skip steps 2-6 and use my prebuilt docker image from Docker Hub (bkovalev/spark3rapids; see step 7).
    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    # apt install docker.io
    # service docker start
  3. Create /opt/spark/docker directory on the Deployment Node:

    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    # mkdir /opt/spark/docker
  4. On the Deployment Node, create the /opt/spark/docker/getSRIOVResources.sh file to get resource information about the NVIDIA Mellanox SR-IOV device:

    Code Block
    languagebash
    themeFadeToGrey
    titleDeployment Node Console
    #!/usr/bin/env bash
    
    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements. See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License. You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    # This script is a basic example script to get resource information about Mellanox SR-IOV devices.
    # It assumes the InfiniBand drivers and K8s SR-IOV components are properly installed.
    # It is not guaranteed to work on all setups so please test and customize as needed
    # for your environment. It can be passed into SPARK via the config
    # spark.{driver/executor}.resource.RES-NAME.discoveryScript to allow the driver or executor to discover
    # the SR-IOV device it was allocated. It assumes you are running within an isolated container where the
    # SR-IOV device are allocated exclusively to that driver or executor.
    # It outputs a JSON formatted string that is expected by the
    # spark.{driver/executor}.resource.sriov_111.discoveryScript config.
    #
    # Execution: ./getSRIOVResources.sh
    # Example output: {"name": "sriov_111", "addresses":["0000:05:02.4"]}
    
    
    addr_sriov="$(env | grep -i sriov_111 | awk -F "=" '{print $2}')"
    echo {\"name\": \"sriov_111\", \"addresses\":[\"${addr_sriov}\"]}
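
    You can test the script locally by exporting an environment variable that mimics the one the SR-IOV device plugin injects into the executor container. This is only a sketch - the variable name below is hypothetical and your device plugin may use a different prefix:

    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    chmod +x /opt/spark/docker/getSRIOVResources.sh
    # Hypothetical environment variable carrying the allocated VF PCI address
    export PCIDEVICE_MELLANOX_COM_SRIOV_111=0000:05:02.4
    /opt/spark/docker/getSRIOVResources.sh
    # Expected output: {"name": "sriov_111", "addresses":["0000:05:02.4"]}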
  5. Create a Dockerfile with the following:

    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    collapsetrue
    #
    # Licensed to the Apache Software Foundation (ASF) under one or more
    # contributor license agreements. See the NOTICE file distributed with
    # this work for additional information regarding copyright ownership.
    # The ASF licenses this file to You under the Apache License, Version 2.0
    # (the "License"); you may not use this file except in compliance with
    # the License. You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    ARG CUDA_VER=10.2
    
    FROM nvidia/cuda:${CUDA_VER}-devel-ubuntu18.04
    
    # Declared after FROM so it stays in scope for the final USER instruction
    ARG spark_uid=185
    
    # Set MOFED version, OS version and platform
    ENV MOFED_VER 5.0-2.1.8.0
    ENV OS_VER ubuntu18.04
    ENV PLATFORM x86_64
    ENV DEBIAN_FRONTEND=noninteractive
    
    # Set Unified Communication X {UCX) version, OS, MOFED and CUDA versions
    ENV UCX_VER v1.8.1
    ENV OS_VER ubuntu18.04
    ENV MOFED mofed5.0
    ENV CUDA 10.2
    
    # Set Spark, Hadoop, cuDF and RAPIDS versions
    ENV SPARK_VER 3.0.0
    ENV HADOOP_VER 3.2
    ENV CUDF_VER 0.14
    ENV RAPIDS_VER 0.1.0
    ENV RAPIDS 4
    ENV SPARK_RAPIDS spark_2.12
    ENV CUDA_RAPIDS cuda10-2
    
    # Install dependencies
    RUN apt-get update && \
    apt install -y git wget apt-utils scala libnuma1 udev libudev1 libcap2 dpatch libnl-3-200 gfortran automake lsof ethtool chrpath libmnl0 pkg-config m4 libnl-route-3-200 autoconf debhelper swig bison libltdl-dev kmod tcl libnl-route-3-dev pciutils tk autotools-dev flex libnl-3-dev graphviz libgfortran4 iproute2 iputils-ping
    RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
    RUN dpkg-reconfigure --frontend noninteractive tzdata
    
    # Install java dependencies
    RUN apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre
    ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
    ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin
    
    RUN mkdir -p /opt/spark/python
    # TODO: Investigate running both pip and pip3 via virtualenvs
    RUN apt install -y python python-pip && \
    apt install -y python3 python3-pip && \
    # Install numpy and pandas for xgboost python api tests
    pip install --upgrade pip && \
    pip3 install --upgrade pip && \
    pip install numpy pandas && \
    pip3 install numpy pandas && \
    # We remove ensurepip since it adds no functionality since pip is
    # installed on the image and it just takes up 1.6MB on the image
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # You may install with python3 packages by using pip3.6
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*
    
    # MOFED install
    ENV OFED_FQN MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}
    RUN wget --quiet http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/${OFED_FQN}.tgz && \
    tar -xf ${OFED_FQN}.tgz && \
    ${OFED_FQN}/mlnxofedinstall --user-space-only --without-fw-update -q
    RUN rm -rf ${OFED_FQN} && \
    rm -rf *.tgz
    
    # UCX install
    RUN wget https://github.com/openucx/ucx/releases/download/${UCX_VER}/ucx-${UCX_VER}-${OS_VER}-${MOFED}-cuda${CUDA}.deb && \
    dpkg -i ucx-${UCX_VER}-${OS_VER}-${MOFED}-cuda${CUDA}.deb
    
    # Before building the docker image, first build and make a Spark distribution following
    # the instructions in http://spark.apache.org/docs/latest/building-spark.html.
    # If this docker file is being used in the context of building your images from a Spark
    # distribution, the docker build command should be invoked from the top level directory
    # of the Spark distribution. E.g.:
    # docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
    
    RUN set -ex && \
    ln -s /lib /lib64 && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/jars && \
    mkdir -p /opt/tpch && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    mkdir -p /opt/sparkRapidsPlugin && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd
    
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/jars /opt/spark/jars
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/bin /opt/spark/bin
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/sbin /opt/spark/sbin
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/examples /opt/spark/examples
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/kubernetes/tests /opt/spark/tests
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/data /opt/spark/data
    
    # Download RAPIDS Spark, cuDF Packages and a get GPU resources script
    RUN cd /opt/sparkRapidsPlugin && \
    wget https://repo1.maven.org/maven2/com/nvidia/rapids-${RAPIDS}-${SPARK_RAPIDS}/${RAPIDS_VER}/rapids-${RAPIDS}-${SPARK_RAPIDS}-${RAPIDS_VER}.jar && \
    wget https://repo1.maven.org/maven2/ai/rapids/cudf/${CUDF_VER}/cudf-${CUDF_VER}-${CUDA_RAPIDS}.jar && \
    wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
    
    COPY getSRIOVResources.sh /opt/sparkRapidsPlugin/getSRIOVResources.sh
    
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/python/pyspark /opt/spark/python/pyspark
    COPY spark-${SPARK_VER}-bin-hadoop${HADOOP_VER}/python/lib /opt/spark/python/lib
    
    ENV SPARK_HOME /opt/spark
    WORKDIR /opt/spark/work-dir
    RUN chmod g+w /opt/spark/work-dir
    
    ENV TINI_VERSION v0.18.0
    ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini
    RUN chmod +rx /usr/bin/tini
    RUN chmod -R 777 /opt/sparkRapidsPlugin
    
    ENTRYPOINT [ "/opt/entrypoint.sh" ]
    
    # Specify the User that the actual main process will run as
    USER ${spark_uid}
    
  6. Build a docker image from the Dockerfile and push to the Docker Hub registry.

    Code Block
    languagetext
    themeFadeToGrey
    titleDeployment Node Console
    # docker login <your DockerHub credentials>
    # docker build -t <DockerHub repository>/<Docker image name> . 
    # docker push <DockerHub repository>/<Docker image name>
  7. Pull the Docker image from the Docker Hub registry on all Workers Nodes.

    Code Block
    languagetext
    themeFadeToGrey
    titleWorker Node Console
    # docker pull <DockerHub repository>/<Docker image name>

    You can use my precompiled docker image: bkovalev/spark3rapids
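
Before submitting jobs, it is worth verifying that the Worker Nodes expose both the GPU and the SR-IOV resources to K8s. The following check is a sketch; it assumes the SR-IOV resource is published as mellanox.com/sriov_111, matching the resource name and vendor used in the spark-submit examples below:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
# Allocatable GPU and SR-IOV resources should be listed for each Worker Node
kubectl describe nodes | grep -E 'nvidia.com/gpu|mellanox.com/sriov_111'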

Prepare and Run TPC-H benchmarks

Prepare TPC-H dataset

To prepare the TPC-H dataset, generate it with the dbgen tool by running the following commands on the Worker Node or on a server with access to the NFS share:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
# mkdir /data/tpc
# mkdir /data/db
# sudo git clone https://github.com/databricks/tpch-dbgen.git
# cd tpch-dbgen
# sudo make
# sudo ./dbgen -s 70
# sudo hdfs dfs -put *.tbl
# cp *.tbl /data/db/
# ll /data/db/

Examples to run TPC-H application

To run the TPC-H application with RAPIDS, you need the following jar file (attached to this page):

rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar

Put the jar file into the /opt/sparkRapidsPlugin directory on the Deployment Node, together with the Spark RAPIDS plugin jar files.

Below are several examples of how to run the TPC-H application:

Example 1:

For a run over TCP without GPU/RAPIDS/UCX, we are using the tpch-tcp-only.sh file with the following configuration when running Spark on K8s.

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
collapsetrue
./spark-submit \
       --master k8s://https://192.168.1.6:6443 \
       --driver-memory 1G \
       --conf spark.executor.instances=6 \
       --conf spark.executor.memory=30G \
       --conf spark.executor.cores=8 \
       --conf spark.task.cpus=4 \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=300 \
       --conf spark.locality.wait=0s \
       --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
       --conf spark.executor.resource.sriov_111.discoveryScript=/opt/sparkRapidsPlugin/getSRIOVResources.sh \
       --conf spark.executor.resource.sriov_111.amount=1 \
       --conf spark.executor.resource.sriov_111.vendor=mellanox.com \
       --conf spark.kubernetes.executor.annotation.k8s.v1.cni.cncf.io/networks=sriov111 \
       --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
       --class com.nvidia.spark.rapids.tests.tpch.CSV /opt/sparkRapidsPlugin/rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar file:///data/db file:///data/result 22
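
While a job is running, you can watch the executor pods being scheduled on the Worker Nodes. This is a quick sketch; spark-role=executor is the label that Spark on K8s assigns to executor pods:

Code Block
languagetext
themeFadeToGrey
titleDeployment Node Console
# Lists the Spark executor pods and the Worker Nodes they were scheduled on
kubectl get pods -l spark-role=executor -o wide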

Example 2:

For TCP with GPU/RAPIDS but without UCX, we are using the tpch-rapids-tcp-noucx.sh file with the following configuration when running Spark on K8s.

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
collapsetrue
./spark-submit \
       --master k8s://https://192.168.1.6:6443 \
       --driver-memory 1G \
       --conf spark.executor.instances=6 \
       --conf spark.executor.memory=30G \
       --conf spark.executor.cores=8 \
       --conf spark.task.cpus=4 \
       --conf spark.rapids.memory.pinnedPool.size=2G \
       --conf spark.task.resource.gpu.amount=0.25 \
       --conf spark.rapids.sql.concurrentGpuTasks=1 \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=300 \
       --conf spark.locality.wait=0s \
       --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/* \
       --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/*:/usr/lib/:/data/jar/* \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
       --conf spark.executor.resource.gpu.amount=1 \
       --conf spark.executor.resource.gpu.vendor=nvidia.com \
       --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
       --conf spark.executor.resource.sriov_111.discoveryScript=/opt/sparkRapidsPlugin/getSRIOVResources.sh \
       --conf spark.executor.resource.sriov_111.amount=1 \
       --conf spark.executor.resource.sriov_111.vendor=mellanox.com \
       --conf spark.kubernetes.executor.annotation.k8s.v1.cni.cncf.io/networks=sriov111 \
       --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,tcp \
       --conf spark.executorEnv.UCX_NET_DEVICES=net1 \
       --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64 \
       --conf spark.executorEnv.CUDA_HOME=/usr/local/cuda \
       --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
       --class com.nvidia.spark.rapids.tests.tpch.CSV /opt/sparkRapidsPlugin/rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar file:///data/db file:///data/result 22

Example 3:

For TCP with GPU/RAPIDS/UCX, we are using the tpch-rapids-tcp-ucx.sh file with the following configuration when running Spark on K8s.

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
collapsetrue
./spark-submit \
       --master k8s://https://192.168.1.6:6443 \
       --driver-memory 1G \
       --conf spark.executor.instances=6 \
       --conf spark.executor.memory=30G \
       --conf spark.executor.cores=8 \
       --conf spark.task.cpus=4 \
       --conf spark.rapids.memory.pinnedPool.size=2G \
       --conf spark.task.resource.gpu.amount=0.25 \
       --conf spark.rapids.sql.concurrentGpuTasks=1 \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=300 \
       --conf spark.locality.wait=0s \
       --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/* \
       --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/*:/usr/lib/:/data/jar/* \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf spark.shuffle.manager=com.nvidia.spark.RapidsShuffleManager \
       --conf spark.rapids.shuffle.transport.enabled=true \
       --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
       --conf spark.executor.resource.gpu.amount=1 \
       --conf spark.executor.resource.gpu.vendor=nvidia.com \
       --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
       --conf spark.executor.resource.sriov_111.discoveryScript=/opt/sparkRapidsPlugin/getSRIOVResources.sh \
       --conf spark.executor.resource.sriov_111.amount=1 \
       --conf spark.executor.resource.sriov_111.vendor=mellanox.com \
       --conf spark.kubernetes.executor.annotation.k8s.v1.cni.cncf.io/networks=sriov111 \
       --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,tcp \
       --conf spark.executorEnv.UCX_NET_DEVICES=net1 \
       --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64 \
       --conf spark.executorEnv.CUDA_HOME=/usr/local/cuda \
       --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
       --class com.nvidia.spark.rapids.tests.tpch.CSV /opt/sparkRapidsPlugin/rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar file:///data/db file:///data/result 22

Example 4:

For RDMA/RC with GPU/RAPIDS/UCX but without GPUDirect, we are using the tpch-rapids-rc-ucx-nogdr.sh file with the following configuration when running Spark on K8s.

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
collapsetrue
./spark-submit \
       --master k8s://https://192.168.1.6:6443 \
       --driver-memory 1G \
       --conf spark.executor.instances=6 \
       --conf spark.executor.memory=30G \
       --conf spark.executor.cores=8 \
       --conf spark.task.cpus=4 \
       --conf spark.rapids.memory.pinnedPool.size=2G \
       --conf spark.task.resource.gpu.amount=0.25 \
       --conf spark.rapids.sql.concurrentGpuTasks=1 \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=300 \
       --conf spark.locality.wait=0s \
       --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/* \
       --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/*:/usr/lib/:/data/jar/* \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf spark.shuffle.manager=com.nvidia.spark.RapidsShuffleManager \
       --conf spark.rapids.shuffle.transport.enabled=true \
       --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
       --conf spark.executor.resource.gpu.amount=1 \
       --conf spark.executor.resource.gpu.vendor=nvidia.com \
       --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
       --conf spark.executor.resource.sriov_111.discoveryScript=/opt/sparkRapidsPlugin/getSRIOVResources.sh \
       --conf spark.executor.resource.sriov_111.amount=1 \
       --conf spark.executor.resource.sriov_111.vendor=mellanox.com \
       --conf spark.kubernetes.executor.annotation.k8s.v1.cni.cncf.io/networks=sriov111 \
       --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc \
       --conf spark.executorEnv.UCX_NET_DEVICES=net1 \
       --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64 \
       --conf spark.executorEnv.CUDA_HOME=/usr/local/cuda \
       --conf spark.executorEnv.UCX_RNDV_SCHEME=get_zcopy \
       --conf spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA=no \
       --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
       --class com.nvidia.spark.rapids.tests.tpch.CSV /opt/sparkRapidsPlugin/rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar file:///data/db file:///data/result 22

Example 5:

For RDMA/RC with GPU/RAPIDS/UCX and GPUDirect, we are using the tpch-rapids-rc-ucx-gdr.sh file with the following configuration when running Spark on K8s.

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
collapsetrue
./spark-submit \
       --master k8s://https://192.168.1.6:6443 \
       --driver-memory 1G \
       --conf spark.executor.instances=6 \
       --conf spark.executor.memory=30G \
       --conf spark.executor.cores=8 \
       --conf spark.task.cpus=4 \
       --conf spark.rapids.memory.pinnedPool.size=2G \
       --conf spark.task.resource.gpu.amount=0.25 \
       --conf spark.rapids.sql.concurrentGpuTasks=1 \
       --conf spark.sql.files.maxPartitionBytes=512m \
       --conf spark.sql.shuffle.partitions=300 \
       --conf spark.locality.wait=0s \
       --conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/* \
       --conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/*:/usr/lib/:/data/jar/* \
       --conf spark.plugins=com.nvidia.spark.SQLPlugin \
       --conf spark.shuffle.manager=com.nvidia.spark.RapidsShuffleManager \
       --conf spark.rapids.shuffle.transport.enabled=true \
       --conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
       --conf spark.executor.resource.gpu.amount=1 \
       --conf spark.executor.resource.gpu.vendor=nvidia.com \
       --conf spark.kubernetes.executor.podTemplateFile=/opt/executor-template/executor-template.yaml \
       --conf spark.executor.resource.sriov_111.discoveryScript=/opt/sparkRapidsPlugin/getSRIOVResources.sh \
       --conf spark.executor.resource.sriov_111.amount=1 \
       --conf spark.executor.resource.sriov_111.vendor=mellanox.com \
       --conf spark.kubernetes.executor.annotation.k8s.v1.cni.cncf.io/networks=sriov111 \
       --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc \
       --conf spark.executorEnv.UCX_NET_DEVICES=net1 \
       --conf spark.executorEnv.LD_LIBRARY_PATH=/usr/local/cuda/lib64 \
       --conf spark.executorEnv.CUDA_HOME=/usr/local/cuda \
       --conf spark.executorEnv.UCX_RNDV_SCHEME=get_zcopy \
       --conf spark.kubernetes.container.image=bkovalev/spark3rapids \
       --class com.nvidia.spark.rapids.tests.tpch.CSV /opt/sparkRapidsPlugin/rapids-4-spark-integration-tests_2.12-0.1-SNAPSHOT.jar file:///data/db file:///data/result 22

Done!

Appendix A

Archive of the RoCE role for deployment with Kubespray - roce.tar.

Appendix B

General Recommendations

  1. Fewer large input files are better than many small ones. You may not have control over this, but it is worth knowing.
  2. Larger input sizes (spark.sql.files.maxPartitionBytes=512m) are generally better, as long as the data fits into GPU memory.
  3. The GPU does better with larger data chunks, as long as they fit into memory. When using the default spark.sql.shuffle.partitions=200, it may be beneficial to make this smaller. Base it on the amount of data each task is reading; start with about 512 MB per task.
  4. Out of GPU memory.

    GPU out-of-memory conditions can show up in multiple ways: you may see an explicit out-of-memory error, or the executor may simply crash. Generally this means your partition size is too big; go back to the Configuration section and adjust the partition size and/or the number of partitions, and possibly reduce the number of concurrent GPU tasks to 1. The Spark UI may give you a hint about the size of the data: look at either the input data or the shuffle data size for the stage that failed.
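
As an illustration, the following spark-submit options (a sketch; the values are only a starting point, not a recommendation for every workload) reduce GPU memory pressure by shrinking the input partitions, increasing the number of shuffle partitions, and limiting concurrent GPU work:

Code Block
languagebash
themeFadeToGrey
titleDeployment Node Console
# Add or adjust these options in the spark-submit examples above when hitting GPU out-of-memory errors
--conf spark.sql.files.maxPartitionBytes=256m \
--conf spark.sql.shuffle.partitions=600 \
--conf spark.rapids.sql.concurrentGpuTasks=1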

RAPIDS Accelerator for Apache Spark Tuning Guide

Tuning a Spark job's configuration settings away from the defaults can improve job performance, and this remains true for jobs leveraging the RAPIDS Accelerator plugin for Apache Spark.
This section provides guidelines on how to tune a Spark job's configuration settings for improved performance when using the RAPIDS Accelerator plugin.

Monitoring

Since the plugin runs without any API changes, the easiest way to see what is running on the GPU is to look at the "SQL" tab in the Spark Web UI. The SQL tab only shows up after you have actually executed a query. Go to the SQL tab in the UI, click on the query you are interested in, and it shows a DAG picture with details. You can also scroll down and expand the "Details" section to see the text representation.

If you want to look at the Spark plan via code, you can use the explain() function call. For example, query.explain() will print the physical plan from Spark, and you can see which nodes were replaced with GPU calls.

Debugging

For now, the best way to debug is the same way you would normally debug on Spark: look at the UI and log files to see what failed. If you get a segmentation fault from the GPU, find the hs_err_pid.log file. To make sure your hs_err_pid.log file goes into the YARN application log, you can add the config: --conf spark.executor.extraJavaOptions="-XX:ErrorFile=<LOG_DIR>/hs_err_pid_%p.log"

If you want to see why an operation didn’t run on the GPU, you can turn on the configuration: --conf spark.rapids.sql.explain=NOT_ON_GPU. A log message will then be output to the Driver log as to why a Spark operation isn’t able to run on the GPU.

About the Authors


Boris Kovalev

Vitaliy Razinkov

Peter Rudenko






