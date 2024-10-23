Follow AWS EMR document “Using the NVIDIA Spark-RAPIDS Accelerator for Spark”. Below is an example.

Launch an EMR Cluster using AWS Console (GUI)

Go to the AWS Management Console and select the EMR service from the “Analytics” section. Choose the region you want to launch your cluster in, for example US West (Oregon), using the dropdown menu in the top right corner. Click “Create cluster”, which opens a detailed cluster configuration page.

Step 1: EMR Release and Application Bundle Selection

Enter a custom “Cluster name” for your cluster. Select emr-7.1.0 for the release and pick “Custom” for the “Application bundle”. Uncheck all the software options, and then check Hadoop 3.3.6, Spark 3.5.0, Hive 3.1.3, and JupyterEnterpriseGateway 2.6.0. Optionally, pick an Amazon Linux release or configure a “Custom AMI.”

Step 2: Hardware

Keep the default “Primary” node instance type of m5.xlarge. Change the “Core” node “Instance type” to g4dn.xlarge, g4dn.2xlarge, or p3.2xlarge.

“Task” nodes are optional. They can run Spark executors, but they don’t run the HDFS Data Node service. Click “Remove instance group” if you only want “Core” nodes, which run both the Data Node and Spark executors. If you add “Task” nodes, make sure their instance type matches the one you selected for “Core.”

Under “Cluster scaling and provisioning option,” verify that the instance count for the “Core” instance group is at least 1.

Under “Networking,” select the desired VPC and subnet, or create a new VPC and subnet for the cluster. Optionally, set custom security groups in the “EC2 security groups” section, and confirm that the security group chosen for the “Primary” node allows SSH access. If it doesn’t yet, follow these instructions to allow inbound SSH traffic.

Step 3: General Cluster Settings

Add a custom bootstrap action under “Bootstrap Actions” to allow cgroup permissions to YARN on your cluster. This process varies between EMR 7.x and EMR 6.x. An example bootstrap script is as follows:

For AWS EMR 7.x

    #!/bin/bash
    set -ex

    sudo mkdir -p /spark-rapids-cgroup/devices
    sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
    sudo chmod a+rwx -R /spark-rapids-cgroup

For AWS EMR 6.x

    #!/bin/bash

    set -ex

    sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
    sudo chmod a+rwx -R /sys/fs/cgroup/devices
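Because the commands differ by release, one option is to keep a single bootstrap script that takes the EMR release label as an argument. The sketch below is an illustration, not something taken from the EMR document: the function names and the convention of passing the release label (such as emr-7.1.0) as the first bootstrap-action argument are assumptions. The commands inside each branch are the ones from the two example scripts above.

```shell
#!/bin/bash
# Hypothetical combined bootstrap script. It assumes the EMR release label
# (for example emr-7.1.0) is passed as the first bootstrap-action argument.

# Map a release label such as "emr-7.1.0" to its major version ("7").
emr_major_version() {
  local label="${1#emr-}"   # strip the leading "emr-"
  echo "${label%%.*}"       # keep everything before the first dot
}

# Run the cgroup permission commands that match the release.
grant_cgroup_permissions() {
  case "$(emr_major_version "$1")" in
    7)
      sudo mkdir -p /spark-rapids-cgroup/devices
      sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
      sudo chmod a+rwx -R /spark-rapids-cgroup
      ;;
    6)
      sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
      sudo chmod a+rwx -R /sys/fs/cgroup/devices
      ;;
    *)
      echo "unsupported EMR release: $1" >&2
      return 1
      ;;
  esac
}

# Only act when a release label was supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  set -ex
  grant_cgroup_permissions "$1"
fi
```

With this layout the same file can serve both EMR 7.x and EMR 6.x clusters, and any other release label fails the bootstrap action with a clear message instead of silently skipping the cgroup setup.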

Step 4: Edit Software Configuration

In the “Software settings” field, copy and paste the configuration from the EMR document into the textbox provided under “Enter configuration”. Alternatively, select “Load JSON from Amazon S3” and point to a JSON file in your own S3 bucket. Be sure to use the configuration that corresponds to your EMR version. For clusters with 2x g4dn.2xlarge GPU instances as worker nodes, we recommend the following default settings:

For AWS EMR 7.x

    [
      {
        "Classification": "spark",
        "Properties": {
          "enableSparkRapids": "true"
        }
      },
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
          "yarn.resource-types": "yarn.io/gpu",
          "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
          "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
          "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
          "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/spark-rapids-cgroup",
          "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
          "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
        }
      },
      {
        "Classification": "container-executor",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "gpu",
            "Properties": {
              "module.enabled": "true"
            }
          },
          {
            "Classification": "cgroups",
            "Properties": {
              "root": "/spark-rapids-cgroup",
              "yarn-hierarchy": "yarn"
            }
          }
        ]
      },
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.plugins": "com.nvidia.spark.SQLPlugin",
          "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",
          "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
          "spark.executor.extraLibraryPath": "/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
          "spark.rapids.sql.concurrentGpuTasks": "2",
          "spark.executor.resource.gpu.amount": "1",
          "spark.executor.cores": "8",
          "spark.task.cpus": "1",
          "spark.task.resource.gpu.amount": "0.125",
          "spark.rapids.memory.pinnedPool.size": "2G",
          "spark.executor.memoryOverhead": "2G",
          "spark.sql.files.maxPartitionBytes": "256m",
          "spark.sql.adaptive.enabled": "false"
        }
      },
      {
        "Classification": "capacity-scheduler",
        "Properties": {
          "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
      }
    ]

For AWS EMR 6.x

    [
      {
        "Classification": "spark",
        "Properties": {
          "enableSparkRapids": "true"
        }
      },
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.nodemanager.resource-plugins": "yarn.io/gpu",
          "yarn.resource-types": "yarn.io/gpu",
          "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices": "auto",
          "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables": "/usr/bin",
          "yarn.nodemanager.linux-container-executor.cgroups.mount": "true",
          "yarn.nodemanager.linux-container-executor.cgroups.mount-path": "/sys/fs/cgroup",
          "yarn.nodemanager.linux-container-executor.cgroups.hierarchy": "yarn",
          "yarn.nodemanager.container-executor.class": "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
        }
      },
      {
        "Classification": "container-executor",
        "Properties": {},
        "Configurations": [
          {
            "Classification": "gpu",
            "Properties": {
              "module.enabled": "true"
            }
          },
          {
            "Classification": "cgroups",
            "Properties": {
              "root": "/sys/fs/cgroup",
              "yarn-hierarchy": "yarn"
            }
          }
        ]
      },
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.plugins": "com.nvidia.spark.SQLPlugin",
          "spark.executor.resource.gpu.discoveryScript": "/usr/lib/spark/scripts/gpu/getGpusResources.sh",
          "spark.submit.pyFiles": "/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
          "spark.executor.extraLibraryPath": "/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
          "spark.rapids.sql.concurrentGpuTasks": "2",
          "spark.executor.resource.gpu.amount": "1",
          "spark.executor.cores": "8",
          "spark.task.cpus": "1",
          "spark.task.resource.gpu.amount": "0.125",
          "spark.rapids.memory.pinnedPool.size": "2G",
          "spark.executor.memoryOverhead": "2G",
          "spark.sql.files.maxPartitionBytes": "256m",
          "spark.sql.adaptive.enabled": "false"
        }
      },
      {
        "Classification": "capacity-scheduler",
        "Properties": {
          "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
      }
    ]

Adjust the settings as appropriate for your cluster, for example by setting the number of cores to match the node type. Set spark.task.resource.gpu.amount to 1/(number of cores per executor), which allows multiple tasks to run in parallel on the GPU. For example, for clusters with 2x g4dn.12xlarge as core nodes, use the following:

    "spark.executor.cores": "12",
    "spark.task.resource.gpu.amount": "0.0833",

More configuration details can be found in the configuration documentation.
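The 1/(number of cores per executor) rule can be checked with a few lines of shell. This is a minimal sketch under stated assumptions: the helper names executor_cores and task_gpu_amount are hypothetical, each executor is assumed to own exactly one GPU and an even share of the node's vCPUs, and the instance figures (48 vCPUs and 4 GPUs for g4dn.12xlarge) come from the AWS instance specifications rather than from this guide. The fraction is truncated to four decimals rather than rounded, on the assumption that a value slightly above 1/cores may leave one core without a GPU share.

```shell
#!/bin/bash

# Cores per executor, assuming one executor per GPU and an even split of vCPUs.
executor_cores() {
  local vcpus="$1" gpus="$2"
  echo $(( vcpus / gpus ))
}

# 1 / cores, truncated (not rounded) to four decimal places.
task_gpu_amount() {
  awk -v cores="$1" 'BEGIN { printf "%.4f\n", int(10000 / cores) / 10000 }'
}

# g4dn.12xlarge: 48 vCPUs and 4 GPUs per node (assumed from AWS instance specs).
cores="$(executor_cores 48 4)"
echo "\"spark.executor.cores\": \"${cores}\","
echo "\"spark.task.resource.gpu.amount\": \"$(task_gpu_amount "$cores")\","
```

For the g4dn.12xlarge case this prints the same two values as the example above (12 and 0.0833). Calling task_gpu_amount 8 gives 0.1250, which is numerically the same as the 0.125 used in the default settings for g4dn.2xlarge.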

Step 5: Security

Select an existing “EC2 key pair” that will be used to authenticate SSH access to the cluster’s nodes. If you don’t have access to an EC2 key pair, follow these instructions to create an EC2 key pair.
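If you prefer the command line to the console for this step, a key pair can also be created with the AWS CLI. The following is a rough sketch, not part of the EMR document: the key name my-emr-key and both function names are hypothetical, the AWS CLI is assumed to be installed and configured for the same region as the cluster, and the primary node's public DNS name has to be taken from the cluster details once it is running.

```shell
#!/bin/bash
# Hypothetical key name; replace it with your own.
KEY_NAME="my-emr-key"

# Create an EC2 key pair with the AWS CLI and save the private key locally.
create_emr_key_pair() {
  local key_name="$1"
  aws ec2 create-key-pair \
    --key-name "$key_name" \
    --query 'KeyMaterial' \
    --output text > "${key_name}.pem"
  # SSH refuses private keys that other users can read.
  chmod 400 "${key_name}.pem"
}

# Build the SSH command for the primary node; EMR nodes use the "hadoop" user.
emr_ssh_command() {
  local key_name="$1" primary_dns="$2"
  echo "ssh -i ${key_name}.pem hadoop@${primary_dns}"
}

# Example (not run here, requires configured AWS credentials):
#   create_emr_key_pair "$KEY_NAME"
#   emr_ssh_command "$KEY_NAME" <primary-node-public-dns>
```

The SSH login only succeeds if the “Primary” node's security group allows inbound SSH traffic, as described in Step 2.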