Examples#
Please visit the spark-rapids-examples repo for ETL, ML/DL, and UDF-related examples using the RAPIDS Accelerator for Apache Spark. It includes Scala and Python source code and related notebooks for the different examples.
Benchmarks#
Please visit the spark-rapids-benchmarks repo for Spark-related benchmark sets and utilities using the RAPIDS Accelerator for Apache Spark.
Profiling Tool#
CLI Samples#
This section shows sample Profiling CLI commands, assuming the following inputs:
- CLUSTER_NAME: The GPU cluster name on the CSP (Dataproc, Databricks, or EMR)
- PROP_FILE: Path to a GPU cluster property file. The path can be on a local filesystem, HDFS, S3, ABFS, or GCS. The file can be formatted according to the gcloud specs (DATAPROC_PROP) or the EMR specs (EMR_PROP)
- EVENTLOG: Path to the Spark event logs without the scheme part. The logs can be stored on a local filesystem, HDFS, S3, ABFS, or GCS; the matching scheme (for example gs:// or file://) is prepended in the command itself
The following table shows sample CLI commands along with the expected functionality and the platform on which the analysis is based.
| CMD | Platform | Auto-Tuner | Comments |
|---|---|---|---|
| `spark_rapids profiling --cluster $CLUSTER_NAME --eventlogs gs://$EVENTLOG` | Dataproc | ☑️ | Auto-Tuner recommendations are based on an accelerated Dataproc cluster because EVENTLOG is stored on GCS |
| `spark_rapids profiling --cluster $DATAPROC_PROP --eventlogs file://$EVENTLOG` | Dataproc | ☑️ | Auto-Tuner recommendations are based on an accelerated Dataproc cluster |
| `spark_rapids profiling --eventlogs file://$EVENTLOG` | On-prem | | Recommendations cannot be generated without the cluster argument when EVENTLOG is stored on a local filesystem |
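
As a minimal sketch of how the inputs above fit together, the snippet below defines the two variables used by the first table row and assembles the command as a string. The cluster name and bucket path are hypothetical placeholders, not values from this guide; only the `--cluster` and `--eventlogs` options shown in the table are used. The command is printed rather than executed so it can be reviewed first.

```shell
# Hypothetical values for illustration only; substitute your own.
CLUSTER_NAME="my-gpu-cluster"          # assumed Dataproc cluster name
EVENTLOG="my-bucket/spark-eventlogs"   # assumed event log path, scheme omitted

# Assemble the command from the first table row; the gs:// scheme is added here.
PROFILE_CMD="spark_rapids profiling --cluster $CLUSTER_NAME --eventlogs gs://$EVENTLOG"

# Print the command for review before running it.
echo "$PROFILE_CMD"

# To run it (requires the spark_rapids CLI to be installed), uncomment:
# eval "$PROFILE_CMD"
```

Keeping the scheme out of `EVENTLOG` means the same variable could be reused with `file://` for a local run, as in the other rows of the table.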