RAPIDS Accelerator for Apache Spark Configuration

The following is the list of options that rapids-plugin-4-spark supports.

On startup use: --conf [conf key]=[conf value]. For example:

${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-<SPARK_RAPIDS_VERSION>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.concurrentGpuTasks=2

At runtime use: spark.conf.set("[conf key]", [conf value]). For example:

scala> spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 2)

All configs can be set on startup, but some configs, especially those for shuffle, will not work if they are set at runtime. Check the “Applicable at” column to see when a config can be set. “Startup” means the config is valid only on startup; “Runtime” means it is valid both on startup and at runtime.
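As a minimal sketch of the startup case, the snippet below turns a list of key=value pairs into --conf arguments for the launch command. The keys are taken from this page and are marked “Startup” in the table below, apart from spark.plugins, which is the launch setting used in the example above; the values are illustrative assumptions rather than tuning recommendations, and the command is printed instead of executed.

```shell
# Illustrative startup-only settings (values are assumptions, not recommendations).
startup_confs="spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.filecache.enabled=true
spark.rapids.memory.pinnedPool.size=2147483648"

# Turn each key=value pair into a --conf argument.
conf_args=""
for kv in $startup_confs; do
  conf_args="$conf_args --conf $kv"
done

# Print the resulting command rather than running it, so no Spark install is needed.
echo "spark-shell$conf_args"
```

Collecting the startup-only keys in one place makes it harder to forget one and then try to set it with spark.conf.set after the session has started, where it would not take effect.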

*General Configuration*
Name	Description	Default Value	Applicable at
spark.rapids.cloudSchemes	Comma separated list of additional URI schemes that are to be considered cloud based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs, cosn. Cloud based stores are generally totally separate from the executors and likely have a higher I/O read cost. Cloud filesystems also often get better throughput when multiple readers run in parallel. This is used with spark.rapids.sql.format.parquet.reader.type	None	Runtime
spark.rapids.filecache.enabled	Controls whether the caching of input files is enabled. When enabled, input data is cached to the same local directories configured for the Spark application. The cache will use up to half the available space by default. To set an absolute cache size limit, see the spark.rapids.filecache.maxBytes configuration setting. Currently only data from Parquet files is cached.	FALSE	Startup
spark.rapids.memory.gpu.maxAllocFraction	The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve.	1	Startup
spark.rapids.memory.gpu.minAllocFraction	The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction.	0.25	Startup
spark.rapids.memory.host.spillStorageSize	Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools.	-1	Startup
spark.rapids.memory.pinnedPool.size	The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool.	0	Startup
spark.rapids.sql.batchSizeBytes	Set the target number of bytes for a GPU batch. Split sizes for input data are covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column.	1073741824	Runtime
spark.rapids.sql.concurrentGpuTasks	Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors.	2	Runtime
spark.rapids.sql.enabled	Enable (true) or disable (false) SQL operations on the GPU	TRUE	Runtime
spark.rapids.sql.explain	Explain why some parts of a query were or were not placed on a GPU. Possible values are ALL (print everything), NONE (print nothing), and NOT_ON_GPU (print only the parts of a query that did not go on the GPU).	NOT_ON_GPU	Runtime
spark.rapids.sql.metrics.level	GPU plans can produce many more metrics than CPU plans do. In very large queries this can sometimes exceed the max result size limit for the driver. Supported values are DEBUG, which enables all supported metrics and typically only needs to be enabled when debugging the plugin; MODERATE, which should output enough metrics to understand how long each part of the query is taking and how much data is going to each part of the query; and ESSENTIAL, which disables most metrics except those that Apache Spark CPU plans also report, or their equivalents.	MODERATE	Runtime
spark.rapids.sql.multiThreadedRead.numThreads	The maximum number of threads on each executor to use for reading small files in parallel. This cannot be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED readers; see spark.rapids.sql.format.parquet.reader.type, spark.rapids.sql.format.orc.reader.type, or spark.rapids.sql.format.avro.reader.type for a discussion of reader types. If it is not set explicitly and spark.executor.cores is set, the plugin will try to assign the value `max(MULTITHREAD_READ_NUM_THREADS_DEFAULT, spark.executor.cores)`, where MULTITHREAD_READ_NUM_THREADS_DEFAULT = 20.	20	Startup
spark.rapids.sql.reader.batchSizeBytes	Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch.	2147483647	Runtime
spark.rapids.sql.reader.batchSizeRows	Soft limit on the maximum number of rows the reader will read per batch. The ORC and Parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the CSV reader.	2147483647	Runtime
spark.rapids.sql.shuffle.spillThreads	Number of threads used to spill shuffle data to disk in the background.	6	Runtime
spark.rapids.sql.udfCompiler.enabled	When set to true, Scala UDFs will be considered for compilation as Catalyst expressions	FALSE	Runtime
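
Two rows in the table above describe derived values or constraints that are easier to see as arithmetic. The sketch below is not part of the plugin; it only mirrors the stated rules. Both helper names, default_read_threads and check_alloc_fractions, are hypothetical, and the fraction values passed in the usage lines are illustrative assumptions.

```shell
# Hypothetical helper: the value the table describes for
# spark.rapids.sql.multiThreadedRead.numThreads when it is not set explicitly
# and spark.executor.cores is set, i.e. max(20, spark.executor.cores).
default_read_threads() {
  if [ "$1" -gt 20 ]; then
    echo "$1"
  else
    echo 20
  fi
}

# Hypothetical helper: succeeds only if
# minAllocFraction <= allocFraction <= maxAllocFraction.
check_alloc_fractions() {
  awk -v min="$1" -v alloc="$2" -v max="$3" \
    'BEGIN { exit !((min + 0) <= (alloc + 0) && (alloc + 0) <= (max + 0)) }'
}

default_read_threads 8     # prints 20
default_read_threads 32    # prints 32

# Arguments: minAllocFraction, allocFraction, maxAllocFraction (illustrative values).
if check_alloc_fractions 0.25 0.9 1; then
  echo "fractions are consistent"
fi
```

A check like this can run in a launch script before the fractions are passed as --conf arguments, since all three memory fraction settings apply only at startup.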

For more advanced configs, please refer to the RAPIDS Accelerator for Apache Spark Advanced Configuration page.