User Guide (24.04.01)

Output Details

The default output location is the current directory. The output location can be changed using the --output-directory option. The output goes into a sub-directory named “rapids_4_spark_profile” inside that output location.

The default Collection Mode processes event logs individually and outputs files for each application under a directory named rapids_4_spark_profile/{APP_ID}. It creates a summary text file named profile.log.

Separate files are generated under the same sub-directory when using the options to generate query visualizations or print the SQL plans. If the --csv option is specified, a CSV file is also created for each table, for each application, in the corresponding sub-directory (see Profiling Tool Options).
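
For example, a minimal invocation sketch, assuming the tools jar is run directly; the jar version and all paths here are placeholders (see Profiling Tool Options for the full option list):

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --output-directory /tmp/profile-output \
  --csv \
  /path/to/eventlog
# With these options the summary is written to
# /tmp/profile-output/rapids_4_spark_profile/{APP_ID}/profile.log,
# along with one CSV file per table for the application.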

Note
  • There is a 100-character limit for each output column. If a column's value exceeds this limit, it is truncated and suffixed with ... in that column.

  • ResourceProfile IDs are parsed for event logs from Spark 3.1 or later. A ResourceProfile allows the user to specify executor and task requirements for an RDD that are applied during a stage. This allows the user to change the resource requirements between stages.

A. Collect Information or Compare Information

Compare Mode is used when the --compare argument is specified and multiple event logs are given as input. For example, given multiple Spark event logs, the tool can compare environments, executors, and RAPIDS-related Spark parameters such as durations, versions, and gpuMode.
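
For example, a hedged sketch of a Compare Mode invocation (jar version and event-log paths are placeholders); a run like this produces outputs such as the tables below:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --compare \
  /path/to/eventlog1 /path/to/eventlog2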

  • Application information


+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+
|appIndex|appName    |appId                  |sparkUser|startTime    |endTime      |duration|durationStr|sparkVersion|pluginEnabled|
+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+
|1       |Spark shell|app-20210329165943-0103|user1    |1617037182848|1617037490515|307667  |5.1 min    |3.0.1       |false        |
|2       |Spark shell|app-20210329170243-0018|user1    |1617037362324|1617038578035|1215711 |20 min     |3.0.1       |true         |
+--------+-----------+-----------------------+---------+-------------+-------------+--------+-----------+------------+-------------+


  • Application log path mapping

  • Data Source information: the details of this output differ depending on whether a Spark Data Source V1 or Data Source V2 reader is used. Data Source V2 truncates the schema, so if you see ..., the full schema is not available.


+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+
|appIndex|sqlID|format |location                                                                                                                   |pushedFilters    |schema                                                                                       |
+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+
|1       |0    |Text   |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/resources/trucks-comments.csv]|[]               |value:string                                                                                 |
|1       |1    |csv    |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/re...               |PushedFilters: []|_c0:string                                                                                   |
|1       |2    |parquet|Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotscolumnsout]                                |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
|1       |3    |parquet|Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotscolumnsout]                                |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
|1       |4    |orc    |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/logscolumsout.orc]                             |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
|1       |5    |orc    |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/logscolumsout.orc]                             |PushedFilters: []|loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
|1       |6    |json   |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json]                         |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...|
|1       |7    |json   |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json]                         |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...|
|1       |8    |json   |Location: InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/lotsofcolumnsout.json]                         |PushedFilters: []|adj_remaining_months_to_maturity:double,asset_recovery_costs:double,credit_enhancement_pro...|
|1       |9    |JDBC   |unknown                                                                                                                    |unknown          |                                                                                             |
+--------+-----+-------+---------------------------------------------------------------------------------------------------------------------------+-----------------+---------------------------------------------------------------------------------------------+


  • Executors information


+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+
|appIndex|resourceProfileId|numExecutors|executorCores|maxMem     |maxOnHeapMem|maxOffHeapMem|executorMemory|numGpusPerExecutor|executorOffHeap|taskCpu|taskGpu|
+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+
|1       |0                |1           |4            |11264537395|11264537395 |0            |20480         |1                 |0              |1      |0.0    |
|1       |1                |2           |2            |3247335014 |3247335014  |0            |6144          |2                 |0              |2      |2.0    |
+--------+-----------------+------------+-------------+-----------+------------+-------------+--------------+------------------+---------------+-------+-------+


  • Job, stage, and SQL ID information: this is not available in Compare Mode yet.


+--------+-----+---------+-----+-------------+-------------+
|appIndex|jobID|stageIds |sqlID|startTime    |endTime      |
+--------+-----+---------+-----+-------------+-------------+
|1       |0    |[0]      |null |1622846402778|1622846410240|
|1       |1    |[1,2,3,4]|0    |1622846431114|1622846441591|
+--------+-----+---------+-----+-------------+-------------+


  • SQL to stage information: the list is sorted by stage duration. Note that not all SQL nodes have a mapping to a stage ID, so some nodes might be missing.


SQL to Stage Information:
+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|appIndex|sqlID|jobID|stageId|stageAttemptId|Stage Duration|SQL Nodes(IDs)                                                                                                                                                     |
+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1       |0    |1    |1      |0             |8174          |Exchange(9),WholeStageCodegen (1)(10),Scan(13)                                                                                                                     |
|1       |0    |1    |2      |0             |8154          |Exchange(16),WholeStageCodegen (3)(17),Scan(20)                                                                                                                    |
|1       |0    |1    |3      |0             |2148          |Exchange(2),HashAggregate(4),SortMergeJoin(6),WholeStageCodegen (5)(3),Sort(8),WholeStageCodegen (2)(7),Exchange(9),Sort(15),WholeStageCodegen (4)(14),Exchange(16)|
|1       |0    |1    |4      |0             |126           |HashAggregate(1),WholeStageCodegen (6)(0),Exchange(2)                                                                                                              |
+--------+-----+-----+-------+--------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+


  • RAPIDS-related parameters


Compare Rapids Properties which are set explicitly:
+-------------------------------------------+----------+----------+
|propertyName                               |appIndex_1|appIndex_2|
+-------------------------------------------+----------+----------+
|spark.rapids.memory.pinnedPool.size        |null      |2g        |
|spark.rapids.sql.castFloatToDecimal.enabled|null      |true      |
|spark.rapids.sql.concurrentGpuTasks        |null      |2         |
|spark.rapids.sql.enabled                   |false     |true      |
|spark.rapids.sql.explain                   |null      |NOT_ON_GPU|
|spark.rapids.sql.incompatibleOps.enabled   |null      |true      |
+-------------------------------------------+----------+----------+


  • Spark Properties

  • Rapids Accelerator jar


+--------+------------------------------------------------------------+
|appIndex|Rapids4Spark jars                                           |
+--------+------------------------------------------------------------+
|1       |spark://10.10.10.10:43445/jars/rapids-4-spark_2.12-0.5.0.jar|
|2       |spark://10.10.10.11:41319/jars/rapids-4-spark_2.12-0.5.0.jar|
+--------+------------------------------------------------------------+


  • SQL Plan Metrics: lists all the SQL metrics for each SQL plan node in each SQL query of the application. These are also called accumulables in Spark. Note that not all SQL nodes have a mapping to a stage ID.


+--------+-----+------+--------------------+-------------+-----------------------+-------------+----------+--------+
|appIndex|sqlID|nodeID|nodeName            |accumulatorId|name                   |max_value    |metricType|stageIds|
+--------+-----+------+--------------------+-------------+-----------------------+-------------+----------+--------+
|1       |0    |1     |GpuColumnarExchange |111          |output rows            |1111111111   |sum       |4,3     |
|1       |0    |1     |GpuColumnarExchange |112          |output columnar batches|222222       |sum       |4,3     |
|1       |0    |1     |GpuColumnarExchange |113          |data size              |333333333333 |size      |4,3     |
|1       |0    |1     |GpuColumnarExchange |114          |shuffle bytes written  |444444444444 |size      |4,3     |
|1       |0    |1     |GpuColumnarExchange |115          |shuffle records written|555555       |sum       |4,3     |
|1       |0    |1     |GpuColumnarExchange |116          |shuffle write time     |666666666666 |nsTiming  |4,3     |


  • WholeStageCodeGen to node mappings (only applies to CPU plans)


+--------+-----+------+---------------------+-------------------+------------+
|appIndex|sqlID|nodeID|SQL Node             |Child Node         |Child NodeID|
+--------+-----+------+---------------------+-------------------+------------+
|1       |0    |0     |WholeStageCodegen (6)|HashAggregate      |1           |
|1       |0    |3     |WholeStageCodegen (5)|HashAggregate      |4           |
|1       |0    |3     |WholeStageCodegen (5)|Project            |5           |
|1       |0    |3     |WholeStageCodegen (5)|SortMergeJoin      |6           |
|1       |0    |7     |WholeStageCodegen (2)|Sort               |8           |


  • IO Metrics

  • Matching SQL IDs Across Applications (Compare Mode): There is one column per application and one row per SQL ID. The SQL IDs are matched primarily on the structure of the SQL query run, and then on the order in which they were run. Be aware that this is truly the structure of the query: two queries that do similar things but on different data are likely to match as the same. An effort is also made to match between CPU plans and GPU plans, so in most cases the same query run on the CPU and on the GPU will match.


+-----------------------+-----------------------+
|app-20210329165943-0103|app-20210329170243-0018|
+-----------------------+-----------------------+
|0                      |0                      |
|1                      |1                      |
|2                      |2                      |
|3                      |3                      |
|4                      |4                      |
+-----------------------+-----------------------+


  • Matching Stage IDs Across Applications (Compare Mode): There is one column per application and one row per stage ID. If a SQL query matches between applications (see Matching SQL IDs Across Applications), then an attempt is made to match the stages within that query to each other. This has the same issues as associating stages when generating a DOT graph. It can be especially helpful when comparing large queries where Spark happened to assign the stage IDs slightly differently, or where there is a different number of stages because of slight differences in the plan. This is a best effort and is not guaranteed to match up all stages in a plan.

  • Optionally: SQL Plan for each SQL query

  • Optionally: Generates DOT graphs for each SQL query

  • Optionally: Generates timeline graph for application

B. Analysis

  • Job + Stage level aggregated task metrics


+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|appIndex|ID     |numTasks|Duration|diskBytesSpilled_sum|duration_sum|duration_max|duration_min|duration_avg|executorCPUTime_sum|executorDeserializeCPUTime_sum|executorDeserializeTime_sum|executorRunTime_sum|input_bytesRead_sum|input_recordsRead_sum|jvmGCTime_sum|memoryBytesSpilled_sum|output_bytesWritten_sum|output_recordsWritten_sum|peakExecutionMemory_max|resultSerializationTime_sum|resultSize_max|sr_fetchWaitTime_sum|sr_localBlocksFetched_sum|sr_localBytesRead_sum|sr_remoteBlocksFetched_sum|sr_remoteBytesRead_sum|sr_remoteBytesReadToDisk_sum|sr_totalBytesRead_sum|sw_bytesWritten_sum|sw_recordsWritten_sum|sw_writeTime_sum|
+--------+-------+--------+--------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|1       |job_0  |3333    |222222  |0                   |11111111    |111111      |111         |1111.1      |6666666            |55555                         |55555                      |55555555           |222222222222       |22222222222          |111111       |0                     |0                      |0                        |222222222              |1                          |11111         |11111               |99999                    |22222222222          |2222221                   |222222222222          |0                           |222222222222         |222222222222       |5555555              |444444          |


  • SQL level aggregated task metrics


+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|appIndex|appID                         |sqlID|description         |numTasks|Duration|executorCPUTime|executorRunTime|executorCPURatio|diskBytesSpilled_sum|duration_sum|duration_max|duration_min|duration_avg|executorCPUTime_sum|executorDeserializeCPUTime_sum|executorDeserializeTime_sum|executorRunTime_sum|input_bytesRead_sum|input_recordsRead_sum|jvmGCTime_sum|memoryBytesSpilled_sum|output_bytesWritten_sum|output_recordsWritten_sum|peakExecutionMemory_max|resultSerializationTime_sum|resultSize_max|sr_fetchWaitTime_sum|sr_localBlocksFetched_sum|sr_localBytesRead_sum|sr_remoteBlocksFetched_sum|sr_remoteBytesRead_sum|sr_remoteBytesReadToDisk_sum|sr_totalBytesRead_sum|sw_bytesWritten_sum|sw_recordsWritten_sum|sw_writeTime_sum|
+--------+------------------------------+-----+--------------------+--------+--------+---------------+---------------+----------------+--------------------+------------+------------+------------+------------+-------------------+------------------------------+---------------------------+-------------------+-------------------+---------------------+-------------+----------------------+-----------------------+-------------------------+-----------------------+---------------------------+--------------+--------------------+-------------------------+---------------------+--------------------------+----------------------+----------------------------+---------------------+-------------------+---------------------+----------------+
|1       |application_1111111111111_0001|0    |show at <console>:11|1111    |222222  |6666666        |55555555       |55.55           |0                   |13333333    |111111      |999         |3333.3      |6666666            |55555                         |66666                      |11111111           |111111111111       |11111111111          |111111       |0                     |0                      |0                        |888888888              |8                          |11111         |11111               |99999                    |11111111111          |2222222                   |222222222222          |0                           |222222222222         |444444444444       |5555555              |444444          |


  • SQL duration, application duration, whether it contains a Dataset or RDD operation, potential problems, and executor CPU time percent


+--------+-------------------+-----+------------+--------------------------+------------+---------------------------+-------------------------+
|appIndex|App ID             |sqlID|SQL Duration|Contains Dataset or RDD Op|App Duration|Potential Problems         |Executor CPU Time Percent|
+--------+-------------------+-----+------------+--------------------------+------------+---------------------------+-------------------------+
|1       |local-1626104300434|0    |1260        |false                     |131104      |NESTED COMPLEX TYPE        |92.65                    |
|1       |local-1626104300434|1    |259         |false                     |131104      |NESTED COMPLEX TYPE        |76.79                    |


  • Shuffle Skew Check:

    Reports tasks where the task's shuffle read size is greater than 3 times the average stage-level shuffle read size (for example, in the output below a task read 2222.22 MB while the stage average was 111.11 MB).


    +--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
    |appIndex|stageId|stageAttemptId|taskId|attempt|taskDurationSec|avgDurationSec|taskShuffleReadMB|avgShuffleReadMB|taskPeakMemoryMB|successful|reason                                                                                              |
    +--------+-------+--------------+------+-------+---------------+--------------+-----------------+----------------+----------------+----------+----------------------------------------------------------------------------------------------------+
    |1       |2      |0             |2222  |0      |111.11         |7.7           |2222.22          |111.11          |0.01            |false     |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa|
    |1       |2      |0             |2224  |1      |222.22         |8.8           |3333.33          |111.11          |0.01            |false     |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /dddd/xxxxxxx/ccccc/bbbbbbbbb/aaaaaaa|

C. Health Check

  • List failed tasks, stages and jobs

Example of failed tasks


+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
|appIndex|stageId|stageAttemptId|taskId|attempt|failureReason                                                                                       |
+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
|3       |4      |0             |2842  |0      |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /home/jenkins/agent/workspace/jenkins|
|3       |4      |0             |2858  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(22000),None,false,true,None)|
|3       |4      |0             |2884  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(21148),None,false,true,None)|
|3       |4      |0             |2908  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(20420),None,false,true,None)|
|3       |4      |0             |3410  |1      |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /home/jenkins/agent/workspace/jenkins|
|4       |1      |0             |1948  |1      |TaskKilled(another attempt succeeded,List(AccumulableInfo(290,None,Some(1107),None,false,true,None),|
+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+


Example of failed stages


+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+
|appIndex|stageId|attemptId|name                                 |numTasks|failureReason                                      |
+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+
|3       |4      |0        |attachTree at Spark300Shims.scala:624|1000    |Job 0 cancelled as part of cancellation of all jobs|
+--------+-------+---------+-------------------------------------+--------+---------------------------------------------------+


Example of failed jobs


+--------+-----+---------+------------------------------------------------------------------------+
|appIndex|jobID|jobResult|failureReason                                                           |
+--------+-----+---------+------------------------------------------------------------------------+
|3       |0    |JobFailed|java.lang.Exception: Job 0 cancelled as part of cancellation of all j...|
+--------+-----+---------+------------------------------------------------------------------------+

  • Removed BlockManagers and Executors

  • SQL Plan HealthCheck

Example showing possibly unsupported query plan nodes; the $Lambda keyword indicates use of the Dataset API.


+--------+-----+------+--------+------------------------------------------------------------------------------------------+
|appIndex|sqlID|nodeID|nodeName|nodeDescription                                                                           |
+--------+-----+------+--------+------------------------------------------------------------------------------------------+
|3       |1    |8     |Filter  |Filter $line21.$read$iw$iw$iw$iw$iw$iw$iw$iw$Lambda$4578/0x00000008019f1840@4b63e04c.apply|
+--------+-----+------+--------+------------------------------------------------------------------------------------------+

D. Recommended Configuration

The Auto-Tuner output has two main sections:

  1. Spark Properties: A list of Apache Spark configurations to tune the performance of the app. The list is the result of a diff between the existing app configurations and the recommended ones. Therefore, if a recommendation matches the existing app configuration, it will not show up in the list.

  2. Comments: A list of messages to highlight properties that were missing in the app configurations, or the cause of failure to generate the recommendations.
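
The Auto-Tuner derives its recommendations from a worker description supplied as a YAML file. Below is a minimal sketch, assuming the --auto-tuner and --worker-info options described in Profiling Tool Options; the field values are placeholders and the exact schema may differ by tools version:

cat > worker-info.yaml <<'EOF'
system:
  numCores: 32       # CPU cores per worker node (placeholder)
  memory: 212992MiB  # host memory per worker node (placeholder)
  numWorkers: 5
gpu:
  count: 4
  memory: 15109MiB   # memory per GPU (placeholder)
  name: T4
EOF

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --auto-tuner --worker-info worker-info.yaml \
  /path/to/eventlog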

Examples

Example of a successful run with missing softwareProperties


Spark Properties:
--conf spark.executor.cores=16
--conf spark.executor.instances=8
--conf spark.executor.memory=32768m
--conf spark.executor.memoryOverhead=7372m
--conf spark.rapids.memory.pinnedPool.size=4096m
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.sql.files.maxPartitionBytes=512m
--conf spark.sql.shuffle.partitions=200
--conf spark.task.resource.gpu.amount=0.0625

Comments:
- 'spark.executor.instances' was not set.
- 'spark.executor.cores' was not set.
- 'spark.task.resource.gpu.amount' was not set.
- 'spark.rapids.sql.concurrentGpuTasks' was not set.
- 'spark.executor.memory' was not set.
- 'spark.rapids.memory.pinnedPool.size' was not set.
- 'spark.executor.memoryOverhead' was not set.
- 'spark.sql.files.maxPartitionBytes' was not set.
- 'spark.sql.shuffle.partitions' was not set.
- 'spark.sql.adaptive.enabled' should be enabled for better performance.


Example of a successful run with defined softwareProperties. Only two recommendations did not match the existing app configurations.


Spark Properties:
--conf spark.executor.instances=8
--conf spark.sql.shuffle.partitions=200

Comments:
- 'spark.sql.shuffle.partitions' was not set.


Example showing the output when loading the worker info has failed.


Cannot recommend properties. See Comments.

Comments:
- java.io.FileNotFoundException: File worker-info.yaml does not exist
- 'spark.executor.memory' should be set to at least 2GB/core.
- 'spark.executor.instances' should be set to (gpuCount * numWorkers).
- 'spark.task.resource.gpu.amount' should be set to Max(1, (numCores / gpuCount)).
- 'spark.rapids.sql.concurrentGpuTasks' should be set to Min(4, (gpuMemory / 7.5G)).
- 'spark.rapids.memory.pinnedPool.size' should be set to 2048m.
- 'spark.rapids.sql.enabled' should be true to enable SQL operations on the GPU.
- 'spark.sql.adaptive.enabled' should be enabled for better performance.


Print SQL Plans

Option: --print-plans

Prints the SQL plan as a text string to a file named planDescriptions.log.
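
A hedged invocation sketch, following the same pattern as the earlier examples (jar version and event-log path are placeholders); planDescriptions.log is written to the same output sub-directory as profile.log:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --print-plans /path/to/eventlog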

Generate DOT graph for each SQL

Option: --generate-dot
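
A hedged invocation sketch (jar version and event-log path are placeholders); a run like this prints a confirmation such as the line below:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --generate-dot /path/to/eventlog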


Generated DOT graphs for app app-20210507103057-0000 to /path/. in 17 second(s)

A DOT file will be generated for each query in the application. Once the DOT files are generated, you can install Graphviz and convert a DOT file to a PDF or SVG graph using the commands below:


dot -Tpdf ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.pdf


dot -Tsvg ./app-20210507103057-0000-query-0/0.dot > app-20210507103057-0000.svg
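
To convert every generated DOT file in one pass, a small shell loop works; the app-<ID>-query-<N> directory layout is taken from the example above:

for f in ./app-20210507103057-0000-query-*/*.dot; do
  dot -Tsvg "$f" > "${f%.dot}.svg"  # one SVG next to each DOT file
done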


The PDF/SVG file contains the SQL plan graph with metrics. The SVG file acts a little more like the Spark UI and includes extra information for nodes when hovering over them with the mouse.

As part of this, an effort is made to associate parts of the graph with the Spark stage they belong to. This is not 100% accurate. Some parts of the plan, like TakeOrderedAndProject, may be part of multiple stages, and only one of those stages will be selected. Exchanges are purposely left out of the sections associated with a stage because they cover at least two stages and possibly more. In other cases we may not be able to determine which stage something was part of; in those cases it is marked as UNKNOWN STAGE. This is because we rely on metrics to link a node to a stage; if a stage has no metrics, for example if the query crashed early, we cannot establish that link.

Generate Application Timeline

Option: --generate-timeline

The output of this is an SVG file named timeline.svg. Most web browsers can display this file. It is a timeline view similar to Apache Spark's event timeline.
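
A hedged invocation sketch (jar version and event-log path are placeholders); timeline.svg is written to the application's output sub-directory:

java -cp rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  --generate-timeline /path/to/eventlog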

This displays several data sections:

  1. Tasks: this shows all tasks in the application divided by executor. Please note that this tries to pack the tasks in the graph. It does not represent actual scheduling on CPU cores. The tasks are labeled with the time it took for them to run. There is a breakdown of some metrics per task in the lower half of the task block with different colors used to designate different metrics.

  1. Yellow is the deserialization time for the task as reported by Spark. This works for both CPU and GPU tasks.

  2. White is the read time for a task. This is a combination of the “buffer time” GPU SQL metric and the shuffle read time as reported by Spark. The shuffle read time works for both CPU and GPU tasks, but “buffer time” is only reported for GPU-accelerated file reads.

  3. Red is the semaphore wait time. This is the amount of time a task spent waiting to get access to the GPU. When processing logs generated by versions of the RAPIDS accelerator prior to 23.04 this would only show up on GPU tasks when DEBUG metrics are enabled. For logs generated with 23.04 and above it is always on. It does not apply to CPU tasks, as they don’t go through the Semaphore.

  4. Green is the “op time” SQL metric along with a few other metrics that also indicate the amount of time the GPU was being used to process data. This is GPU specific.

  5. Blue is the write time for a task. This is the “write time” SQL metric used when writing out results as files using GPU acceleration, or the shuffle write time as reported by Spark. The shuffle metrics work for both CPU and GPU tasks, but the “write time” metric is GPU specific.

  6. Anything else is time that is not accounted for by these metrics. Typically, this is time spent on the CPU, but could also include semaphore wait time as DEBUG metrics are not on by default.

  2. STAGES: this shows the stage times reported by Spark. It starts when the stage was scheduled and ends when Spark considered the stage done.

  3. STAGE RANGES: this shows the time from the start of the first task to the end of the last task. Often a stage is scheduled but there are not enough resources in the cluster to run it. This helps to show how long it takes for a task to start running after it is scheduled and, in many cases, how long it took to run all of the tasks in the stage. This is not always exact because Spark can intermix tasks from different stages.

  4. JOBS: this shows the time range reported by Spark from when a job was scheduled to when it completed.

  5. SQL: this shows the time range reported by Spark from when a SQL statement was scheduled to when it completed.

Tasks and stages are all color-coordinated to help identify which tasks are associated with a given stage. Jobs and SQL are not color-coordinated.
