Date: 2026-03-05

Spark3 GPU Configuration Guide on Yarn 3.2.1

The following files should be configured to enable GPU scheduling on Yarn 3.2.1 and later.

GPU resource discovery script - /usr/lib/spark/scripts/gpu/getGpusResources.sh:

    mkdir -p /usr/lib/spark/scripts/gpu/
    cd /usr/lib/spark/scripts/gpu/
    wget https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
    chmod a+rwx -R /usr/lib/spark/scripts/gpu/
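Spark expects the discovery script to print a single JSON object containing a resource name and a list of addresses, for example {"name": "gpu", "addresses":["0","1"]}. As a quick sanity check after downloading the script, something like the following sketch could be used. The helper name check_gpu_discovery_output is hypothetical, and the pattern match is deliberately loose; it is not a full JSON validation.

```shell
# Hypothetical helper: reads discovery-script output on stdin and checks that
# it looks like {"name": "gpu", "addresses":["0", ...]}.
check_gpu_discovery_output() {
    _out=$(cat)
    if printf '%s' "$_out" | grep -Eq '"name"[[:space:]]*:[[:space:]]*"gpu"' &&
       printf '%s' "$_out" | grep -Eq '"addresses"[[:space:]]*:[[:space:]]*\[[[:space:]]*"[0-9]+"'; then
        echo "ok"
    else
        echo "unexpected output: $_out" >&2
        return 1
    fi
}

# Example with canned output. On a GPU node, pipe the real script instead:
#   /usr/lib/spark/scripts/gpu/getGpusResources.sh | check_gpu_discovery_output
echo '{"name": "gpu", "addresses":["0","1"]}' | check_gpu_discovery_output
```

If the check fails on a worker, it is usually worth confirming that nvidia-smi is on the PATH for the user running the script.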

Spark config - /etc/spark/conf/spark-defaults.conf:

    spark.rapids.sql.concurrentGpuTasks=2
    spark.executor.resource.gpu.amount=1
    spark.executor.cores=8
    spark.task.cpus=1
    spark.task.resource.gpu.amount=0.125
    spark.rapids.memory.pinnedPool.size=2G
    spark.executor.memoryOverhead=2G
    spark.plugins=com.nvidia.spark.SQLPlugin
    spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true'
    # This must match the location of the discovery script
    spark.executor.resource.gpu.discoveryScript=/usr/lib/spark/scripts/gpu/getGpusResources.sh
    spark.sql.files.maxPartitionBytes=512m
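With these values, a task GPU amount of 0.125 is one eighth of a GPU, so an executor with 8 cores and 1 GPU can schedule 8 tasks at once, while spark.rapids.sql.concurrentGpuTasks limits how many of those tasks use the GPU at the same time. The arithmetic can be sketched as follows. The helper name tasks_per_executor is hypothetical, and the formula assumes the number of task slots is the smaller of the CPU-based and GPU-based limits.

```shell
# Hypothetical helper: concurrent task slots per executor, taken as the
# smaller of the CPU-based and GPU-based limits.
#   $1 = spark.executor.cores
#   $2 = spark.task.cpus
#   $3 = spark.executor.resource.gpu.amount
#   $4 = spark.task.resource.gpu.amount
tasks_per_executor() {
    awk -v cores="$1" -v tcpus="$2" -v gpus="$3" -v tgpu="$4" 'BEGIN {
        # The small epsilon guards against floating-point round-down.
        by_cpu = int(cores / tcpus + 1e-9)
        by_gpu = int(gpus / tgpu + 1e-9)
        print (by_cpu < by_gpu ? by_cpu : by_gpu)
    }'
}

tasks_per_executor 8 1 1 0.125   # values from the config above, prints 8
```

Raising spark.task.resource.gpu.amount to 0.25 in this sketch would cap the executor at 4 tasks even though 8 cores are available.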

Yarn Scheduler config - /etc/hadoop/conf/capacity-scheduler.xml:

    <configuration>
      <property>
        <name>yarn.scheduler.capacity.resource-calculator</name>
        <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
      </property>
    </configuration>

Yarn config - /etc/hadoop/conf/yarn-site.xml:

    <configuration>
      <property>
        <name>yarn.nodemanager.resource-plugins</name>
        <value>yarn.io/gpu</value>
      </property>
      <property>
        <name>yarn.resource-types</name>
        <value>yarn.io/gpu</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices</name>
        <value>auto</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
        <value>/usr/bin</value>
      </property>
      <property>
        <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
        <value>/sys/fs/cgroup</value>
      </property>
      <property>
        <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
        <value>yarn</value>
      </property>
      <property>
        <name>yarn.nodemanager.container-executor.class</name>
        <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
      </property>
      <property>
        <name>yarn.nodemanager.linux-container-executor.group</name>
        <value>yarn</value>
      </property>
    </configuration>
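Because yarn-site.xml usually already contains many unrelated properties, it can be easy to miss one of the entries above when merging. A minimal sketch of a check is shown below. The helper name missing_yarn_gpu_properties is hypothetical, and the match is a simple text search on name elements rather than real XML parsing (dots in the property names are treated as regex wildcards).

```shell
# Hypothetical helper: print each expected property name that does not appear
# inside a <name> element of the file given as $1.
missing_yarn_gpu_properties() {
    for prop in \
        yarn.nodemanager.resource-plugins \
        yarn.resource-types \
        yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices \
        yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables \
        yarn.nodemanager.linux-container-executor.cgroups.mount \
        yarn.nodemanager.linux-container-executor.cgroups.mount-path \
        yarn.nodemanager.linux-container-executor.cgroups.hierarchy \
        yarn.nodemanager.container-executor.class \
        yarn.nodemanager.linux-container-executor.group
    do
        grep -Eq "<name>[[:space:]]*${prop}[[:space:]]*</name>" "$1" || echo "$prop"
    done
}

# On a node: missing_yarn_gpu_properties /etc/hadoop/conf/yarn-site.xml
```

An empty result means every listed name was found; it does not confirm that the values are correct.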

Container executor config - /etc/hadoop/conf/container-executor.cfg (with the user yarn as the service account):

    yarn.nodemanager.linux-container-executor.group=yarn

    #--Original container-executor.cfg Content--

    [gpu]
    module.enabled=true
    [cgroups]
    root=/sys/fs/cgroup
    yarn-hierarchy=yarn

The node manager local dirs need to be shared with all users. Run the following in bash:

    chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
    chmod a+rwx -R /sys/fs/cgroup/devices
    local_dirs=$(bdconfig get_property_value \
        --configuration_file /etc/hadoop/conf/yarn-site.xml \
        --name yarn.nodemanager.local-dirs 2>/dev/null)
    mod_local_dirs=${local_dirs//\,/ }
    chmod a+rwx -R ${mod_local_dirs}
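The yarn.nodemanager.local-dirs value is a comma-separated list, so it has to be split into individual paths before being passed to chmod. The sketch below shows that splitting step on its own, using IFS-based word splitting instead of the bash-only substitution. The helper name split_local_dirs and the sample paths are hypothetical.

```shell
# Hypothetical helper: print each directory from a comma-separated
# yarn.nodemanager.local-dirs value on its own line.
# The body runs in a subshell so the IFS and globbing changes do not leak.
split_local_dirs() (
    IFS=','
    set -f    # disable globbing while the value is split
    for dir in $1; do
        if [ -n "$dir" ]; then
            printf '%s\n' "$dir"
        fi
    done
)

# Sample value for illustration only.
split_local_dirs "/hadoop/yarn/nm-local-dir,/mnt/1/hadoop/yarn/nm-local-dir"
```

Handling one path per line also keeps directories whose names contain spaces intact, which the unquoted expansion would split.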

Finally, restart the node manager and resource manager services:

On all workers:

sudo systemctl restart hadoop-yarn-nodemanager.service

On all masters:

sudo systemctl restart hadoop-yarn-resourcemanager.service

Note: If cgroup is mounted on tmpfs and a node is rebooted, the cgroup directory permissions are reverted. Check the cgroup documentation for your platform for more details.
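To find out whether a node is affected, the filesystem type of the cgroup mount point can be inspected. The following is a minimal sketch that reads a file in /proc/mounts format. The helper name mount_fstype is hypothetical, and the optional second argument exists only so the function can be tried against a saved copy of the mount table.

```shell
# Hypothetical helper: print the filesystem type of mount point $1, reading
# a file in /proc/mounts format (device, mount point, fstype, options, ...).
# If the mount point is listed more than once, the last entry wins.
mount_fstype() {
    awk -v mp="$1" '$2 == mp { fs = $3 } END { if (fs != "") print fs }' \
        "${2:-/proc/mounts}"
}

# On a node: mount_fstype /sys/fs/cgroup
```

A result of tmpfs indicates that the permission changes will not survive a reboot.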

Below is one example of how this can be handled:

Update the cgroup permissions:

    chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
    chmod a+rwx -R /sys/fs/cgroup/devices

Alternatively, the operation can be added to a systemd service:

Create the mountCgroup service and script:

    sudo bash -c "cat >/etc/systemd/system/mountCgroup.service" <<EOF
    [Unit]
    Description=startup
    [Service]
    ExecStart=/etc/mountCgroup.sh
    Type=oneshot
    [Install]
    WantedBy=multi-user.target
    EOF

    sudo bash -c "cat >/etc/mountCgroup.sh" <<EOF
    #!/bin/sh
    chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
    chmod a+rwx -R /sys/fs/cgroup/devices
    EOF

    sudo chmod 644 /etc/systemd/system/mountCgroup.service
    sudo chmod 755 /etc/mountCgroup.sh

Then start the mountCgroup service: