Amazon EMR Deployment User Guide#
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. The NVIDIA RAPIDS Accelerator for Apache Spark is available on EMR, allowing users to accelerate data processing and machine learning workloads using GPUs. This integration is fully supported by AWS and enables users to run their Spark workloads with optimized performance and efficiency.
Overview of Steps#
This guide provides step-by-step instructions for getting started with the RAPIDS Accelerator on an AWS EMR cluster. We start with an Ubuntu-based OS running the AWS CLI to connect to and interact with an EMR cluster and an S3 bucket. Once the cluster is created, we can submit a GPU-accelerated Spark job/application via Spark Submit. Instructions for using EMR Studio with GPU-accelerated instances are also provided.
Prerequisites#
Prior to getting started with the RAPIDS Accelerator for Apache Spark on AWS EMR, ensure you have the following prerequisites:
Ubuntu OS with internet access to AWS
NGC Account with NGC Catalog Access
AWS Cloud Tools installed and configured with an AWS key pair
The supported components for using the NVIDIA RAPIDS Accelerator with EMR are listed in the table below.
EMR | Spark | RAPIDS Accelerator jar | cuDF jar | xgboost4j-spark jar
---|---|---|---|---
6.10 | 3.3.1 | rapids-4-spark_2.12-22.12.0-amzn-0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar
For more details on supported applications, see the EMR release notes. For more information on AWS EMR, see the AWS documentation.
Connectivity#
A simple way to validate access to EMR is to list the EMR clusters via
aws emr list-clusters
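If the call fails, verify your credentials and region configuration; aws sts get-caller-identity is a quick way to confirm which AWS identity the CLI is using:
aws sts get-caller-identity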
Create and Launch AWS EMR with GPU Nodes#
The following steps are based on the AWS EMR document Using the NVIDIA RAPIDS Accelerator for Spark. This document provides a basic configuration to get users started.
The my-configurations.json file installs the spark-rapids plugin on your cluster, configures YARN to use GPUs, configures Spark to use RAPIDS, and configures the YARN capacity scheduler. Create a local file called my-configurations.json with the following default configurations:
[
  {
    "Classification":"spark",
    "Properties":{
      "enableSparkRapids":"true"
    }
  },
  {
    "Classification":"yarn-site",
    "Properties":{
      "yarn.nodemanager.resource-plugins":"yarn.io/gpu",
      "yarn.resource-types":"yarn.io/gpu",
      "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
      "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
      "yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
      "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
      "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
      "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
    }
  },
  {
    "Classification":"container-executor",
    "Properties":{
    },
    "Configurations":[
      {
        "Classification":"gpu",
        "Properties":{
          "module.enabled":"true"
        }
      },
      {
        "Classification":"cgroups",
        "Properties":{
          "root":"/sys/fs/cgroup",
          "yarn-hierarchy":"yarn"
        }
      }
    ]
  },
  {
    "Classification":"spark-defaults",
    "Properties":{
      "spark.plugins":"com.nvidia.spark.SQLPlugin",
      "spark.sql.sources.useV1SourceList":"",
      "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
      "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
      "spark.executor.resource.gpu.amount":"1",
      "spark.rapids.sql.concurrentGpuTasks":"2",
      "spark.sql.files.maxPartitionBytes":"256m",
      "spark.sql.adaptive.enabled":"true"
    }
  },
  {
    "Classification":"capacity-scheduler",
    "Properties":{
      "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
    }
  }
]
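Before using the file, it can be worth confirming it is valid JSON, for example with Python's built-in json.tool:
python3 -m json.tool my-configurations.json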
Adaptive Query Execution (AQE) is enabled by default as of Spark 3.2.0; however, the RAPIDS Accelerator does not fully support AQE until EMR 6.10+. The setting can be toggled via spark.sql.adaptive.enabled. For lower EMR versions, this setting should be set to false, as shown in the sketch below. More advanced configuration details can be found in the configuration documentation.
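For example, on EMR releases below 6.10, AQE can be disabled through the spark-defaults classification in my-configurations.json (a minimal sketch showing only the relevant classification):
{
  "Classification":"spark-defaults",
  "Properties":{
    "spark.sql.adaptive.enabled":"false"
  }
}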
The my-bootstrap-action.sh script opens cgroup permissions so that YARN can use the GPUs on your cluster. Create this file locally with the following content:
#!/bin/bash
set -ex
sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
sudo chmod a+rwx -R /sys/fs/cgroup/devices
Upload the script to an S3 bucket and export the file path, replacing <my-bucket> with your bucket name in the commands below.
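For example, the script can be uploaded with the AWS CLI:
aws s3 cp my-bootstrap-action.sh s3://<my-bucket>/my-bootstrap-action.sh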
export BOOTSTRAP_ACTION_SCRIPT=s3://<my-bucket>/my-bootstrap-action.sh
This script will be used when launching the EMR cluster.
Create a Key pair#
Prior to launching a cluster, a Key Pair is required at cluster creation to allow for SSH access to the cluster. If you have not already created a Key Pair, follow instructions on https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html.
Update KEYNAME in the export statement below with your personal key pair (*.pem file). The key pair is needed to securely access the EMR cluster.
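If you prefer the CLI, a key pair can also be created and saved locally (my-key-pair is an example name matching the export below):
aws ec2 create-key-pair --key-name my-key-pair --query 'KeyMaterial' --output text > ~/.ssh/my-key-pair.pem
chmod 400 ~/.ssh/my-key-pair.pem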
Launch an EMR Cluster using AWS CLI#
You can use the AWS CLI to launch a cluster with one master node (m4.4xlarge) and two g4dn.2xlarge worker nodes:
export KEYNAME=my-key-pair
export CLUSTER_NAME=spark-rapids-cluster
export EMR_LABEL=emr-6.10.0
export MASTER_INSTANCE_TYPE=m4.4xlarge
export NUM_WORKERS=2
export WORKER_INSTANCE_TYPE=g4dn.2xlarge
export CONFIG_JSON_LOCATION=./my-configurations.json
Create the EMR cluster:
aws emr create-cluster \
--name $CLUSTER_NAME \
--release-label $EMR_LABEL \
--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \
--service-role EMR_DefaultRole \
--ec2-attributes KeyName=$KEYNAME,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$MASTER_INSTANCE_TYPE \
InstanceGroupType=CORE,InstanceCount=$NUM_WORKERS,InstanceType=$WORKER_INSTANCE_TYPE \
--configurations file://$CONFIG_JSON_LOCATION \
--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=$BOOTSTRAP_ACTION_SCRIPT
Output example:
{
    "ClusterId": "j-GEXQ0J54HTI0",
    "ClusterArn": "arn:aws:elasticmapreduce:us-west-2:123456789:cluster/j-GEXQ0J54HTI0"
}
Save the ClusterId for checking the cluster status in subsequent sections.
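For convenience, the ID can be exported as a shell variable (CLUSTER_ID is just an illustrative name) and substituted for <cluster_id> in later commands:
export CLUSTER_ID=j-GEXQ0J54HTI0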
Create EMR Cluster with Specified Subnet (Optional)#
EMR Studio provides a Jupyter notebook environment that can be connected to EMR clusters. A Virtual Private Cloud (VPC) is required to secure the connection. This can be enforced by including the SubnetId argument within the ec2-attributes option when running aws emr create-cluster:
--ec2-attributes KeyName=$KEYNAME,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=<gpu_enabled_subnet> \
To determine which VPC SubnetId (subnet-*) to use, open the VPC configuration page in the AWS console. From the left navigation menu, select “Virtual private cloud” >> “Your VPCs”, then select an existing VPC or create a new one. Use the SubnetId for us-west-2b or another Availability Zone where GPUs are available. The subnet's Availability Zone can be determined from the Resource map.
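The subnets and their Availability Zones can also be listed with the AWS CLI (optionally filtered to a specific VPC with --filters Name=vpc-id,Values=<vpc_id>):
aws ec2 describe-subnets --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table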
Note
GPU cluster creation can take ~15 minutes or more.
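Rather than polling the cluster status manually, the AWS CLI provides a waiter that blocks until the cluster is up (it returns once the cluster reaches a running or waiting state):
aws emr wait cluster-running --cluster-id <cluster_id>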
Validation#
Check Cluster status:
$ aws emr describe-cluster --cluster-id <cluster_id>
Below is an example output:
{
    "Cluster": {
        "Id": "j-1I7SZJ8DRXCWY",
        "Name": "sample-cluster",
        "Status": {
            "State": "WAITING",
            "StateChangeReason": {
                "Message": "Cluster ready to run steps."
            },
            "Timeline": {
                "CreationDateTime": "2023-03-17T17:29:10.912000+00:00",
                "ReadyDateTime": "2023-03-17T17:42:06.963000+00:00"
            }
        },
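To check just the cluster state without the full JSON output, a --query filter can be applied:
aws emr describe-cluster --cluster-id <cluster_id> --query 'Cluster.Status.State' --output text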
Connect to the EMR cluster master with the following command, modifying it to include your <cluster_id> and key pair:
aws emr ssh --cluster-id <cluster_id> --key-pair-file ~/.ssh/my-key-pair.pem
Running an example join operation using Spark Shell#
On the master node, start the Spark shell by typing:
spark-shell
Run the following Scala code in Spark Shell:
1val data = 1 to 10000
2val df1 = sc.parallelize(data).toDF()
3val df2 = sc.parallelize(data).toDF()
4val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value")
5out.count()
6out.explain()
If the RAPIDS Accelerator is active, the physical plan printed by out.explain() should include GPU operators (prefixed with Gpu, for example GpuShuffledHashJoin) in place of their CPU counterparts. Exit the spark-shell and the SSH session before continuing to the next step.
Spark Submit Jobs to an EMR Cluster Accelerated by GPUs#
EMR jobs can be initiated by logging into the master node and submitting applications via spark-submit, just like with Apache Spark. By default, YARN operates in client mode. All applicable files referenced in spark-submit should be accessible on the master node.
export SPARK_HOME=/usr/lib/spark
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
$SPARK_HOME/examples/jars/spark-examples.jar \
1000
The bottom section of the console output comes from the application exiting. The results of interest are:
23/03/16 20:44:04 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 34.622154 s
Pi is roughly 3.1416558714165586
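As an alternative to SSHing into the master node, the same application can be submitted remotely as an EMR step via the AWS CLI. The sketch below assumes the example jar path used above; adjust the arguments for your own application:
aws emr add-steps --cluster-id <cluster_id> \
--steps Type=Spark,Name=SparkPi,ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,1000]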
Spark History Server UI#
The Spark history server UI listens on port 18080 and is accessible via port forwarding. To view the Spark logs, SSH into the master node with port forwarding for port 18080 enabled.
The master node public IP can be found via the AWS web console. Select your EMR cluster, then click on the “Instances” tab. Select the link under the ID column that starts with “ig-” for the Instance Group corresponding to the master node. The public IP address of the EC2 instance corresponding to the master node is shown in the “Public IP address” column (you may need to scroll to the right).
Use this IP address to update <master_node_public_ip> in the command below:
ssh -i ~/.ssh/my-key-pair.pem hadoop@<master_node_public_ip> -L 18080:localhost:18080
Then open a browser window and navigate to localhost:18080. Keep the SSH session open to maintain the port forwarding. The Spark event logs can be downloaded via the web UI.
Cluster Cleanup#
The cluster can be terminated in the AWS console by searching for EMR, selecting your cluster, and clicking the Terminate button on the upper right.
Alternatively, the cluster can be terminated via the CLI with:
aws emr terminate-clusters --cluster-ids <cluster_id>
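The terminate call is asynchronous; to block until the cluster is fully terminated, use the corresponding waiter:
aws emr wait cluster-terminated --cluster-id <cluster_id>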
If you started an EMR Studio session, the session can be stopped through the EMR Studio webpage. Select “Workspaces” from the left navigation panel, click on a Workspace that’s active, press the Actions button on the upper right, then select “Stop”.