Examples

Visit the spark-rapids-examples repo for ETL, ML/DL, and UDF examples that use the RAPIDS Accelerator for Apache Spark. It includes Scala and Python source code and related notebooks for the different examples.

Visit the spark-rapids-benchmarks repo for Spark benchmark sets and utilities that use the RAPIDS Accelerator for Apache Spark.

CLI Samples

This section shows sample Qualification CLI commands, assuming the following inputs:

  • CLUSTER_NAME: The cluster name on the CSP (Dataproc, Databricks, or EMR)

  • PROP_FILE: Path to a cluster property file. The file can reside on a local filesystem, HDFS, S3, ABFS, or GCS. It can be formatted according to the gcloud specs (DATAPROC_PROP) or the EMR specs (EMR_PROP)

  • EVENTLOG: Path to the Spark event logs, without the scheme part. The logs can reside on a local filesystem, HDFS, S3, ABFS, or GCS.

The following table shows sample CLI commands along with the expected functionality and the platform on which the analysis is based.

List of arguments and options for the Qualification CLI command

| CMD | Platform | Cost Savings | Speedups | Comments |
|---|---|---|---|---|
| spark_rapids qualification --cluster $CLUSTER_NAME --eventlogs gs://$EVENTLOG | Dataproc | ☑️ | ☑️ | Cost savings are calculated based on the Dataproc cluster because EVENTLOG is stored on GCS. |
| spark_rapids qualification --cluster $DATAPROC_PROP --eventlogs file://$EVENTLOG | Dataproc | ☑️ | ☑️ | The cluster argument points to a property file matching the Dataproc specs. |
| spark_rapids qualification --eventlogs file://$EVENTLOG | On-prem | | ☑️ | Cost savings cannot be generated without a cluster argument when EVENTLOG is stored on a local filesystem. |
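The way the inputs above combine into a command line can be sketched with a small shell helper. This is only an illustration, not part of the tool: `build_qual_cmd` is a hypothetical function name, the bucket, path, and cluster values are made up, and the helper prints the command instead of running it so that it can be inspected first.

```shell
# build_qual_cmd SCHEME EVENTLOG [CLUSTER]
# Hypothetical helper: prints a Qualification CLI command line without running it.
#   SCHEME   - storage scheme without "://", for example gs or file
#   EVENTLOG - event log path without the scheme part
#   CLUSTER  - optional cluster name or property file path
build_qual_cmd() {
  scheme=$1
  eventlog=$2
  cluster=${3:-}
  if [ -n "$cluster" ]; then
    printf 'spark_rapids qualification --cluster %s --eventlogs %s://%s\n' \
      "$cluster" "$scheme" "$eventlog"
  else
    # No cluster argument: matches the last row of the table.
    printf 'spark_rapids qualification --eventlogs %s://%s\n' \
      "$scheme" "$eventlog"
  fi
}

# Example values below are assumptions, not real resources.
build_qual_cmd gs my-bucket/eventlogs my-cluster
build_qual_cmd file /tmp/eventlogs
```

Because EVENTLOG excludes the scheme, a local absolute path such as `/tmp/eventlogs` ends up as `file:///tmp/eventlogs` once the `file://` prefix is added.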

Java CMD Samples

  • Process the 10 newest logs and report only the top 3 in the output:

    java ${QUALIFICATION_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 10-newest -n 3 /eventlogDir

  • Process the logs from the last 100 days:

    java ${QUALIFICATION_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -s 100d /eventlogDir

  • Process only the newest log for each application name:

    java ${QUALIFICATION_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain -f 1-newest-per-app-name /eventlogDir

  • Parse ML functions from the eventlog:

    java ${QUALIFICATION_HEAP} \
      -cp ~/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/*:$HADOOP_CONF_DIR/ \
      com.nvidia.spark.rapids.tool.qualification.QualificationMain --ml-functions /eventlogDir
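The Java samples above differ only in the options passed to QualificationMain, so the shared part can be factored out. The sketch below assumes that QUALIFICATION_HEAP holds a JVM heap option; the `-Xmx8g` value is a guess for illustration, `qual_java_cmd` and `TOOLS_JAR` are hypothetical names, and `<version>` is left unfilled exactly as in the samples. The function prints the command rather than executing it.

```shell
# Assumed values, for illustration only.
QUALIFICATION_HEAP="-Xmx8g"   # JVM heap option; the size is an assumption
TOOLS_JAR="$HOME/rapids-4-spark-tools_2.12-<version>.jar"   # <version> left as in the samples

# qual_java_cmd OPTIONS... EVENTLOG_PATH
# Hypothetical helper: prints the java command for the Qualification tool.
# SPARK_HOME and HADOOP_CONF_DIR are read from the environment at call time.
qual_java_cmd() {
  printf 'java %s -cp %s:%s/jars/*:%s/ com.nvidia.spark.rapids.tool.qualification.QualificationMain %s\n' \
    "$QUALIFICATION_HEAP" "$TOOLS_JAR" "$SPARK_HOME" "$HADOOP_CONF_DIR" "$*"
}

# The four samples above, expressed through the helper.
qual_java_cmd -f 10-newest -n 3 /eventlogDir
qual_java_cmd -s 100d /eventlogDir
qual_java_cmd -f 1-newest-per-app-name /eventlogDir
qual_java_cmd --ml-functions /eventlogDir
```

Printing first makes it easy to check the classpath before committing to a long run over a large event log directory.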

CLI Samples

This section shows sample Profiling CLI commands, assuming the following inputs:

  • CLUSTER_NAME: The GPU cluster name on the CSP (Dataproc, Databricks, or EMR)

  • PROP_FILE: Path to a GPU cluster property file. The file can reside on a local filesystem, HDFS, S3, ABFS, or GCS. It can be formatted according to the gcloud specs (DATAPROC_PROP) or the EMR specs (EMR_PROP)

  • EVENTLOG: Path to the Spark event logs, without the scheme part. The logs can reside on a local filesystem, HDFS, S3, ABFS, or GCS.

The following table shows sample CLI commands along with the expected functionality and the platform on which the analysis is based.

List of arguments and options for the Profiling CLI command

| CMD | Platform | Auto-Tuner | Comments |
|---|---|---|---|
| spark_rapids profiling --cluster $CLUSTER_NAME --eventlogs gs://$EVENTLOG | Dataproc | ☑️ | Auto-Tuner recommendations are based on the accelerated Dataproc cluster because EVENTLOG is stored on GCS. |
| spark_rapids profiling --cluster $DATAPROC_PROP --eventlogs file://$EVENTLOG | Dataproc | ☑️ | Auto-Tuner recommendations are based on the accelerated Dataproc cluster because the cluster argument points to a property file matching the Dataproc specs. |
| spark_rapids profiling --eventlogs file://$EVENTLOG | On-prem | | The recommendations cannot be generated without a cluster argument when EVENTLOG is stored on a local filesystem. |

Java CMD Samples

Collection Modes

Examples of running the Profiling tool in different collection modes:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain [options] \
  <eventlogs | eventlog directories ...>

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain --combined \
  <eventlogs | eventlog directories ...>

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain --compare \
  <eventlogs | eventlog directories ...>
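To compare the three invocations side by side, a short loop can print one ProfileMain command per mode flag. This is a sketch under stated assumptions: `profile_cmd`, `TOOLS_JAR`, and `EVENTLOG_DIR` are hypothetical names, `/eventlogDir` is a placeholder path, and `<version>` is left unfilled as in the samples. The commands are printed, not executed.

```shell
# Assumed locations, for illustration only.
TOOLS_JAR="rapids-4-spark-tools_2.12-<version>.jar"   # <version> left as in the samples
EVENTLOG_DIR="/eventlogDir"                           # placeholder event log directory

# profile_cmd MODE_FLAG
# Hypothetical helper: prints the ProfileMain command for one mode.
# Pass an empty string to omit the mode flag.
profile_cmd() {
  mode_flag=$1
  # ${mode_flag:+...} adds the flag and a trailing space only when it is non-empty.
  printf 'java -cp %s:%s/jars/* com.nvidia.spark.rapids.tool.profiling.ProfileMain %s%s\n' \
    "$TOOLS_JAR" "$SPARK_HOME" "${mode_flag:+$mode_flag }" "$EVENTLOG_DIR"
}

# One command with no mode flag, then one each for --combined and --compare.
for flag in "" --combined --compare; do
  profile_cmd "$flag"
done
```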


© Copyright 2024, NVIDIA. Last updated on Apr 23, 2024.