RAPIDS Accelerator for Apache Spark Configuration

The following is the list of options that rapids-plugin-4-spark supports.

On startup use: --conf [conf key]=[conf value]. For example:

${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-<SPARK_RAPIDS_VERSION>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.concurrentGpuTasks=2

At runtime use: spark.conf.set("[conf key]", [conf value]). For example:

scala> spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 2)

All configs can be set on startup, but some configs, especially those for shuffle, will not work if they are set at runtime. Check the “Applicable at” column to see when each config can be set. “Startup” means the config is valid only on startup; “Runtime” means it is valid on both startup and runtime.
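
As a minimal sketch of the distinction above, the helper below assembles a launch command that passes two Startup-only settings on the command line, since they cannot be changed later with spark.conf.set. The function name print_launch_command and the chosen values are illustrative assumptions, not recommended settings. The command is printed rather than executed so it can be inspected first.

```shell
# Hypothetical helper: prints (does not run) a spark-shell launch command.
# Startup-only configs have to be supplied here, before the application starts.
print_launch_command() {
  jar="$1"   # path to the RAPIDS Accelerator jar, supplied by the caller
  printf '%s\n' \
    "\${SPARK_HOME}/bin/spark-shell --jars ${jar} \\" \
    "  --conf spark.plugins=com.nvidia.spark.SQLPlugin \\" \
    "  --conf spark.rapids.filecache.enabled=true \\" \
    "  --conf spark.rapids.memory.pinnedPool.size=2147483648"
}

# The jar name keeps the version placeholder used earlier on this page.
print_launch_command "rapids-4-spark_2.12-<SPARK_RAPIDS_VERSION>.jar"
```

Settings marked “Runtime” could be added to the same command or changed later from the Spark session.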

General Configuration

| Name | Description | Default Value | Applicable at |
|---|---|---|---|
| spark.rapids.cloudSchemes | Comma-separated list of additional URI schemes that are to be considered cloud-based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs, cosn. Cloud-based stores are generally totally separate from the executors and likely have a higher I/O read cost. Cloud filesystems also often get better throughput when multiple readers run in parallel. This is used with spark.rapids.sql.format.parquet.reader.type. | None | Runtime |
| spark.rapids.filecache.enabled | Controls whether the caching of input files is enabled. When enabled, input data is cached to the same local directories configured for the Spark application. The cache will use up to half the available space by default. To set an absolute cache size limit, see the spark.rapids.filecache.maxBytes configuration setting. Currently only data from Parquet files is cached. | FALSE | Startup |
| spark.rapids.memory.gpu.maxAllocFraction | The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve. | 1 | Startup |
| spark.rapids.memory.gpu.minAllocFraction | The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction. | 0.25 | Startup |
| spark.rapids.memory.host.spillStorageSize | Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of pinned and pageable memory pools. | -1 | Startup |
| spark.rapids.memory.pinnedPool.size | The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool. | 0 | Startup |
| spark.rapids.sql.batchSizeBytes | Set the target number of bytes for a GPU batch. Split sizes for input data are covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column. | 1073741824 | Runtime |
| spark.rapids.sql.concurrentGpuTasks | Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. | 2 | Runtime |
| spark.rapids.sql.enabled | Enable (true) or disable (false) SQL operations on the GPU. | TRUE | Runtime |
| spark.rapids.sql.explain | Explain why some parts of a query were or were not placed on a GPU. Possible values are ALL: print everything; NONE: print nothing; NOT_ON_GPU: print only parts of a query that did not go on the GPU. | NOT_ON_GPU | Runtime |
| spark.rapids.sql.metrics.level | GPU plans can produce a lot more metrics than CPU plans do. In very large queries this can sometimes result in going over the max result size limit for the driver. Supported values include DEBUG, which enables all supported metrics and typically only needs to be enabled when debugging the plugin; MODERATE, which should output enough metrics to understand how long each part of the query is taking and how much data is going to each part of the query; and ESSENTIAL, which disables most metrics except those that Apache Spark CPU plans also report, or their equivalents. | MODERATE | Runtime |
| spark.rapids.sql.multiThreadedRead.numThreads | The maximum number of threads on each executor to use for reading small files in parallel. This cannot be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED readers; see spark.rapids.sql.format.parquet.reader.type, spark.rapids.sql.format.orc.reader.type, or spark.rapids.sql.format.avro.reader.type for a discussion of reader types. If it is not set explicitly and spark.executor.cores is set, the plugin will try to assign a value of max(MULTITHREAD_READ_NUM_THREADS_DEFAULT, spark.executor.cores), where MULTITHREAD_READ_NUM_THREADS_DEFAULT = 20. | 20 | Startup |
| spark.rapids.sql.reader.batchSizeBytes | Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch. | 2147483647 | Runtime |
| spark.rapids.sql.reader.batchSizeRows | Soft limit on the maximum number of rows the reader will read per batch. The ORC and Parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the CSV reader. | 2147483647 | Runtime |
| spark.rapids.sql.shuffle.spillThreads | Number of threads used to spill shuffle data to disk in the background. | 6 | Runtime |
| spark.rapids.sql.udfCompiler.enabled | When set to true, Scala UDFs will be considered for compilation as Catalyst expressions. | FALSE | Runtime |
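
Two numeric rules stated in the table lend themselves to a quick pre-flight check before launching a job. The sketch below is only an illustration of those rules, not the plugin's own logic: the function names default_read_threads and alloc_fractions_ok are hypothetical, and the allocFraction value in the usage lines is an example value rather than a documented default.

```shell
# Thread count the description gives for an unset
# spark.rapids.sql.multiThreadedRead.numThreads: max(20, spark.executor.cores).
default_read_threads() {
  cores="$1"
  if [ "$cores" -gt 20 ]; then
    echo "$cores"
  else
    echo 20
  fi
}

# Exit status 0 when minAllocFraction <= allocFraction <= maxAllocFraction,
# the ordering the two fraction settings in the table require.
# awk handles the comparison because POSIX sh has no floating-point arithmetic.
alloc_fractions_ok() {
  awk -v lo="$1" -v mid="$2" -v hi="$3" \
    'BEGIN { exit !(lo + 0 <= mid + 0 && mid + 0 <= hi + 0) }'
}

default_read_threads 32   # prints 32
default_read_threads 8    # prints 20

# 0.25 and 1 are the table's defaults; 0.9 is an example allocFraction.
if alloc_fractions_ok 0.25 0.9 1; then
  echo "allocation fractions are consistent"
fi
```

Running a check like this in a launch script surfaces an inconsistent combination before the executors start, which matters because these memory settings are Startup-only.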

For more advanced configs, please refer to the RAPIDS Accelerator for Apache Spark Advanced Configuration page.

© Copyright 2023, NVIDIA. Last updated on Sep 18, 2023.