User Guide (24.06)

Overview

The RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing via the RAPIDS libraries.

As data workloads continue to grow to support traditional analytics as well as large-scale ML pipelines, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.
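
As a rough sketch of how the accelerated shuffle might be configured from PySpark (the RapidsShuffleManager class name is Spark-version-specific, so the spark350 suffix below is an assumption for Spark 3.5.0 and must be adjusted to your Spark and plugin versions; UCX mode also assumes the UCX libraries are installed on every node):

from pyspark.sql import SparkSession

# Sketch only: the shuffle manager class suffix (spark350) must match your Spark
# version, and UCX mode assumes UCX is installed on all nodes of the cluster.
spark = (
    SparkSession.builder
    .config("spark.shuffle.manager", "com.nvidia.spark.rapids.spark350.RapidsShuffleManager")
    .config("spark.rapids.shuffle.mode", "UCX")
    .getOrCreate()
)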

The RAPIDS Accelerator for Apache Spark reaps the benefits of GPU performance while reducing infrastructure costs. The chart below compares the performance and cost of an ETL job on the Fannie Mae Mortgage dataset (~200 GB), as shown in our demo; costs are based on cloud T4 GPU instance market prices.

[Figure: perf-cost.png — ETL performance and cost comparison]

Please refer to the spark-rapids-examples repo for details of this example job.

Run your existing Apache Spark applications with no code changes. Launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar and enable a configuration setting:

spark.conf.set('spark.rapids.sql.enabled','true')
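
As an illustrative sketch, a PySpark session can load the plugin at startup; the jar path and 24.06.0 version below are assumptions and should be replaced with the jar that matches your environment:

from pyspark.sql import SparkSession

# Sketch only: adjust the plugin jar path and version for your environment.
spark = (
    SparkSession.builder
    .config("spark.jars", "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-24.06.0.jar")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # load the RAPIDS Accelerator
    .config("spark.rapids.sql.enabled", "true")              # enable GPU SQL acceleration
    .getOrCreate()
)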

The following is an example of a physical plan with operators running on the GPU:

== Physical Plan ==
GpuColumnarToRow false
+- GpuProject [cast(c_customer_sk#0 as string) AS c_customer_sk#40]
   +- GpuFileGpuScan parquet [c_customer_sk#0] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/customer], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c_customer_sk:int>
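
A plan like the one above can be printed from PySpark with DataFrame.explain(); for example, assuming Parquet data at /tmp/customer with an integer c_customer_sk column as in the plan shown:

# Uses the GPU-enabled `spark` session created in the earlier example.
df = spark.read.parquet("/tmp/customer") \
    .selectExpr("cast(c_customer_sk as string) AS c_customer_sk")
df.explain()  # GPU operators such as GpuProject appear when the plugin is active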

Learn more about how to get started.

A single pipeline, from ingest to data preparation to model training:

[Figure: spark3cluster.png]
