User Guide (23.12.2)

Spark3 GPU Configuration Guide on Yarn 3.2.1

The following files should be configured to enable GPU scheduling on Yarn 3.2.1 and later.

GPU resource discovery script - /usr/lib/spark/scripts/gpu/getGpusResources.sh:

mkdir -p /usr/lib/spark/scripts/gpu/
cd /usr/lib/spark/scripts/gpu/
wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
chmod a+rwx -R /usr/lib/spark/scripts/gpu/
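
For orientation, Spark expects the discovery script to print a single JSON object that names the resource and lists its addresses. The sketch below is not the upstream script; it only illustrates that output format, assuming GPU indices arrive one per line (as `nvidia-smi --query-gpu=index --format=csv,noheader` prints them). The function name `format_gpu_resources` is hypothetical.

```shell
# Hypothetical helper illustrating the JSON shape Spark expects from a
# GPU discovery script. Reads one GPU index per line on stdin.
format_gpu_resources() {
  # Join the lines with commas, then turn each comma into "," so that
  # every index ends up as its own quoted JSON string.
  addrs=$(paste -sd, - | sed 's/,/","/g')
  printf '{"name": "gpu", "addresses":["%s"]}\n' "$addrs"
}

# Example with two GPUs (indices 0 and 1); no GPU is needed to run it.
printf '0\n1\n' | format_gpu_resources
# prints {"name": "gpu", "addresses":["0","1"]}
```

On a GPU node, piping the `nvidia-smi` command above into the function should produce the same shape with the node's real indices.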

Spark config - /etc/spark/conf/spark-defaults.conf:

spark.rapids.sql.concurrentGpuTasks=2
spark.executor.resource.gpu.amount=1
spark.executor.cores=8
spark.task.cpus=1
spark.task.resource.gpu.amount=0.125
spark.rapids.memory.pinnedPool.size=2G
spark.executor.memoryOverhead=2G
spark.plugins=com.nvidia.spark.SQLPlugin
spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
# this must match the location of the discovery script
spark.executor.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh
spark.sql.files.maxPartitionBytes=512m
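
The task-level GPU amount in this configuration is consistent with dividing the single executor GPU by the number of executor cores: with `spark.executor.cores=8`, each task requests 1/8 = 0.125 of a GPU, so all eight task slots can be scheduled, while `spark.rapids.sql.concurrentGpuTasks=2` limits how many of them use the GPU at the same time. The following sketch reproduces that arithmetic; the helper name `gpu_amount_per_task` is hypothetical, and it assumes one GPU per executor.

```shell
# Hypothetical helper: GPU share to request per task so that every task
# slot of an executor with one GPU can be scheduled.
gpu_amount_per_task() {
  awk -v cores="$1" 'BEGIN { printf "%g\n", 1 / cores }'
}

# With spark.executor.cores=8, as in the configuration above:
gpu_amount_per_task 8
# prints 0.125
```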

Yarn Scheduler config - /etc/hadoop/conf/capacity-scheduler.xml:

<configuration>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>
</configuration>

Yarn config - /etc/hadoop/conf/yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
    <value>auto</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
    <value>/usr/bin</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
    <value>/sys/fs/cgroup</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>yarn</value>
  </property>
</configuration>

/etc/hadoop/conf/container-executor.cfg - with user yarn as the service account:

yarn.nodemanager.linux-container-executor.group=yarn
#--Original container-executor.cfg Content--
[gpu]
module.enabled=true
[cgroups]
root=/sys/fs/cgroup
yarn-hierarchy=yarn

The NodeManager local directories need to be shared with all users. Run the following in bash:

chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
chmod a+rwx -R /sys/fs/cgroup/devices
local_dirs=$(bdconfig get_property_value \
  --configuration_file /etc/hadoop/conf/yarn-site.xml \
  --name yarn.nodemanager.local-dirs 2>/dev/null)
mod_local_dirs=${local_dirs//\,/ }
chmod a+rwx -R ${mod_local_dirs}
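
The `bdconfig` tool may not be installed on every distribution. As a rough alternative, and only a sketch, the property lookup and the comma-to-space conversion can be approximated with standard tools. The helper names `get_hadoop_property` and `list_local_dirs` are hypothetical, and the sketch assumes each `<value>` element directly follows its `<name>` element and contains no nested markup.

```shell
# Hypothetical helper: print the <value> of a named property from a
# Hadoop-style XML file. Assumes <value> directly follows <name> and
# that the value contains no nested markup.
get_hadoop_property() {
  file=$1
  # Escape dots so they match literally in the sed pattern.
  name_re=$(printf '%s' "$2" | sed 's/\./\\./g')
  # Join the file into one line so name and value may sit on separate lines.
  tr -d '\n' < "$file" |
    sed -n "s|.*<name>${name_re}</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}

# Turn the comma-separated directory list into space-separated paths,
# the same result as the ${local_dirs//\,/ } substitution used above.
list_local_dirs() {
  get_hadoop_property "$1" yarn.nodemanager.local-dirs | tr ',' ' '
}
```

For example, `list_local_dirs /etc/hadoop/conf/yarn-site.xml` would print the directories to pass to `chmod`, provided the property is set explicitly in that file.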

Finally, restart the NodeManager and ResourceManager services:

On all workers:

sudo systemctl restart hadoop-yarn-nodemanager.service

On all masters:

sudo systemctl restart hadoop-yarn-resourcemanager.service

Note

If cgroup is mounted on tmpfs and a node is rebooted, the cgroup directory permissions are reverted. Check the cgroup documentation for your platform for more details.

Below is one example of how this can be handled:

Update the cgroup permissions:

chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
chmod a+rwx -R /sys/fs/cgroup/devices
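
To tell whether a reboot has reverted the permissions, a read-only check along the following lines may help. This is a sketch under stated assumptions: the helper names `is_world_rwx` and `check_cgroup_dirs` are hypothetical, and only the top-level directory is inspected, not its contents.

```shell
# Hypothetical helper: succeed if "others" have read, write, and
# execute/search permission on the given path, judged from ls output.
is_world_rwx() {
  # Characters 8-10 of the mode string are the permissions for "others";
  # "rwt" is rwx plus the sticky bit.
  case $(ls -ld -- "$1" 2>/dev/null | cut -c8-10) in
    rwx|rwt) return 0 ;;
    *) return 1 ;;
  esac
}

# Report, without changing anything, which directories would need the
# chmod above re-applied. A missing path is also reported.
check_cgroup_dirs() {
  for dir in "$@"; do
    if is_world_rwx "$dir"; then
      echo "ok: $dir"
    else
      echo "needs chmod: $dir"
    fi
  done
}
```

For example, `check_cgroup_dirs /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/devices` covers the two directories used in this guide.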

Alternatively, the operation can be run at boot by a systemd service:

Create the mountCgroup service and script:

sudo bash -c "cat >/etc/systemd/system/mountCgroup.service" <<EOF
[Unit]
Description=startup
[Service]
ExecStart=/etc/mountCgroup.sh
Type=oneshot
[Install]
WantedBy=multi-user.target
EOF

sudo bash -c "cat >/etc/mountCgroup.sh" <<EOF
#!/bin/sh
chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
chmod a+rwx -R /sys/fs/cgroup/devices
EOF

sudo chmod 644 /etc/systemd/system/mountCgroup.service
sudo chmod 655 /etc/mountCgroup.sh

Then enable and start the mountCgroup service:

systemctl enable mountCgroup.service
systemctl start mountCgroup.service

© Copyright 2023-2024, NVIDIA. Last updated on Feb 6, 2024.