Glossary
- ABFS
- Auto-Tuner
- cuDF
- CSP
- DBFS
- ETL
- GCS
- MIG
- RDD
- RDMA
- SparkPlan
- UCX
- UDF
Azure Blob File System (ABFS) is the scheme identifier for Azure Data Lake Storage Gen2.
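As an illustration of how the scheme appears in practice, here is a minimal sketch that splits an ABFS URI into its parts. It assumes the commonly documented form `abfs[s]://<file_system>@<account_host>/<path>`. The helper name `parse_abfs_uri` and the sample container and account names are hypothetical and are not part of any Azure or Spark API.

```python
from urllib.parse import urlsplit


def parse_abfs_uri(uri):
    """Split an ABFS URI into file system (container), account host, and path."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri!r}")
    # The file system name sits in the user-info position, before the '@'.
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected <file_system>@<account_host> in {uri!r}")
    return {
        "secure": parts.scheme == "abfss",  # abfss is the TLS variant of the scheme
        "file_system": parts.username,
        "account_host": parts.hostname,
        "path": parts.path,
    }


# Hypothetical container and storage account names, for illustration only.
info = parse_abfs_uri(
    "abfss://mycontainer@myaccount.dfs.core.windows.net/data/events.parquet"
)
print(info)
```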
The Auto-Tuner module optimizes Apache Spark applications by recommending a set of configurations that improve the performance of the RAPIDS Accelerator.
cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that is familiar to data engineers and data scientists, so they can accelerate their workflows without going into the details of CUDA programming.
A Cloud Service Provider (CSP) is a company that provides on-demand, scalable computing resources such as computing power, data storage, or applications. Examples of CSPs include Google Cloud, Microsoft Azure, Amazon Web Services (AWS), and Oracle.
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage API calls.
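Extract, Transform, Load (ETL) is the process of extracting data from one or more sources, transforming it into the required shape, and loading it into a target system.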
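The three ETL stages can be sketched with the Python standard library alone. This is a toy pipeline under stated assumptions, not a Spark or RAPIDS workflow: the inline CSV data, the `readings` table, and the function names `extract`, `transform`, and `load` are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from files or a service.
RAW_CSV = """id,city,temp_f
1,Austin,95
2,Oslo,50
3,Lima,68
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types and convert Fahrenheit to Celsius."""
    return [
        (int(r["id"]), r["city"], round((float(r["temp_f"]) - 32) * 5 / 9, 1))
        for r in rows
    ]


def load(records, conn):
    """Load: write the transformed records into a SQLite table."""
    conn.execute(
        "CREATE TABLE readings (id INTEGER PRIMARY KEY, city TEXT, temp_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT city, temp_c FROM readings ORDER BY id").fetchall()
print(loaded)
```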
Google Cloud Storage (GCS) is a service for storing objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format.
Multi-Instance GPU (MIG) is a feature of NVIDIA H100, A100, and A30 Tensor Core GPUs that partitions a single GPU into multiple instances. MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.
A resilient distributed dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Users may choose to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. RDDs also recover automatically from node failures.
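The three properties in this definition (partitioning, persistence, and recovery) can be illustrated with a toy model in plain Python. This is not the Spark API: `ToyRDD`, `lose_partition`, and the thread pool that stands in for cluster nodes are hypothetical simplifications. The idea it borrows is that each partition remembers how it was computed, so a lost partition can be rebuilt on demand.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Toy stand-in for an RDD: partitions plus a recipe to rebuild each one."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition  # lineage: how to (re)build partition i
        self.num_partitions = num_partitions
        self._cache = None                 # becomes a dict once persist() is called

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Partition i holds every num_partitions-th element, starting at index i.
        return cls(lambda i: data[i::num_partitions], num_partitions)

    def map(self, fn):
        parent = self
        return ToyRDD(
            lambda i: [fn(x) for x in parent._partition(i)], self.num_partitions
        )

    def persist(self):
        self._cache = {}
        return self

    def _partition(self, i):
        if self._cache is None:
            return self._compute(i)
        if i not in self._cache:           # first use, or recovery after a loss
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure by dropping one cached partition."""
        if self._cache is not None:
            self._cache.pop(i, None)

    def collect(self):
        # Threads stand in for cluster nodes working on partitions in parallel.
        with ThreadPoolExecutor(max_workers=self.num_partitions) as pool:
            parts = list(pool.map(self._partition, range(self.num_partitions)))
        return [x for part in parts for x in part]


squares = ToyRDD.parallelize(range(10), 3).map(lambda x: x * x).persist()
first = sorted(squares.collect())
squares.lose_partition(1)                  # the cached data for partition 1 is gone
second = sorted(squares.collect())         # partition 1 is recomputed from lineage
print(first == second)
```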
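Remote direct memory access (RDMA) lets one machine read or write the memory of another machine directly over the network, without involving the remote machine's operating system or CPU in the data transfer.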
SparkPlan is an extension of the QueryPlan abstraction for physical operators that can be executed (to generate RDD[InternalRow] that Spark can execute).
Unified Communication X (UCX) is an optimized point-to-point communication framework.
User-Defined Functions (UDFs) are user-programmable routines that act on one row (see the Spark UDFs documentation).
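The row-at-a-time behavior described above can be sketched without Spark. The helper `with_column` and the sample rows are hypothetical; in PySpark the function would instead be wrapped with `pyspark.sql.functions.udf` and applied to DataFrame columns. The point of the sketch is only that the user's function is called once per row and sees that row's values alone.

```python
def with_column(rows, name, udf, *input_cols):
    """Return new rows with column `name` computed by calling `udf` once per row."""
    return [{**row, name: udf(*(row[c] for c in input_cols))} for row in rows]


def full_name(first, last):
    """User-defined logic: normalize whitespace and casing, then join the parts."""
    return f"{first.strip().title()} {last.strip().title()}"


# Hypothetical input rows, represented as plain dicts.
rows = [
    {"first": " ada", "last": "lovelace "},
    {"first": "alan", "last": "TURING"},
]
result = with_column(rows, "name", full_name, "first", "last")
print([r["name"] for r in result])
```

Because the function runs once per row, the engine generally cannot see inside it or vectorize it, which is one reason row-based UDFs tend to be slower than built-in columnar operations.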