User Guide (24.06)

Overview

The RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing via the RAPIDS libraries.

As data workloads continue to grow to support traditional analytics as well as large-scale ML pipelines, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.
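
As a rough sketch of how the accelerated shuffle might be configured from PySpark (the RapidsShuffleManager class name is Spark-version-specific, so the spark350 suffix below is an assumption for Spark 3.5.0 and must be adjusted to your Spark and plugin versions; UCX mode also assumes the UCX libraries are installed on every node):

from pyspark.sql import SparkSession

# Sketch only: the shuffle manager class suffix (spark350) must match your Spark
# version, and UCX mode assumes UCX is installed on all nodes of the cluster.
spark = (
    SparkSession.builder
    .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark350.RapidsShuffleManager")
    .config("spark.rapids.shuffle.mode", "UCX")
    .getOrCreate()
)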

The RAPIDS Accelerator for Apache Spark reaps the benefits of GPU performance while reducing infrastructure costs. The chart below compares the performance and cost of an ETL job on the Fannie Mae Mortgage dataset (~200 GB), as shown in our demo; costs are based on cloud T4 GPU instance market prices.

[Figure: perf-cost.png — ETL performance and cost comparison]

Please refer to the spark-rapids-examples repo for details of this example job.

Run your existing Apache Spark applications with no code changes. Launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar and enable a configuration setting:

spark.conf.set('spark.rapids.sql.enabled','true')
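
As an illustrative sketch, a PySpark session can load the plugin at startup; the jar path and 24.06.0 version below are assumptions and should be replaced with the jar that matches your environment:

from pyspark.sql import SparkSession

# Sketch only: adjust the plugin jar path and version for your environment.
spark = (
    SparkSession.builder
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-24.06.0.jar")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # load the RAPIDS Accelerator
    .config("spark.rapids.sql.enabled", "true")              # enable GPU SQL acceleration
    .getOrCreate()
)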

The following is an example of a physical plan with operators running on the GPU:

== Physical Plan ==
GpuColumnarToRow false
+- GpuProject [cast(c_customer_sk#0 as string) AS c_customer_sk#40]
   +- GpuFileGpuScan parquet [c_customer_sk#0] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int>
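
A plan like the one above can be printed from PySpark with DataFrame.explain(); for example, assuming Parquet data at /tmp/customer with an integer c_customer_sk column as in the plan shown:

# Uses the GPU-enabled `spark` session created in the earlier example.
df = spark.read.parquet("/tmp/customer") \
    .selectExpr("cast(c_customer_sk as string) AS c_customer_sk")
df.explain()  # GPU operators such as GpuProject appear when the plugin is active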

Learn more about how to get started.

A single pipeline, from ingest to data preparation to model training:

[Figure: spark3cluster.png]
