Glossary
- ABFS
Azure Blob File System (ABFS) is the scheme identifier for Azure Data Lake Storage Gen2.
- Auto-Tuner
The Auto-Tuner module optimizes Apache Spark applications by recommending a set of configurations that improve performance with the RAPIDS Accelerator.
- cuDF
cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers and data scientists, so they can accelerate their workflows without going into the details of CUDA programming.
- CSP
A Cloud Service Provider (CSP) offers on-demand, scalable computing resources such as computing power, data storage, or applications. Examples of CSPs include Google Cloud, Microsoft Azure, Amazon Web Services (AWS), and Oracle.
- DBFS
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage API calls.
- ETL
Extract, Transform, Load
- GCS
Google Cloud Storage (GCS) is a service for storing objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format.
- MIG
Multi-Instance GPU (MIG) expands the performance and value of NVIDIA H100, A100, and A30 Tensor Core GPUs. MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.
- RDD
A resilient distributed dataset (RDD) is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. Users may choose to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. RDDs also recover automatically from node failures.
- RDMA
Remote direct memory access
- SparkPlan
SparkPlan is an extension of the QueryPlan abstraction for physical operators that can be executed (to generate RDD[InternalRow] that Spark can execute).
- UCX
Unified Communication X (UCX) is an optimized point-to-point communication framework.
- UDF
User-Defined Functions (UDFs) are user-programmable routines that act on one row (see the Spark UDFs documentation).
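
To make the "one row" wording in the UDF entry concrete, here is a minimal plain-Python sketch. It does not use the Spark API: `Row`, `price_with_tax`, and `apply_udf` are hypothetical names chosen for illustration, and the list of dictionaries only stands in for a DataFrame. The point it shows is that a UDF is called once per input row and returns one value for that row.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a row: a mapping from column name to value.
Row = Dict[str, Any]


def price_with_tax(row: Row) -> float:
    """Row-level routine: sees exactly one row and returns one value."""
    return row["price"] * (1 + row["tax_rate"])


def apply_udf(rows: List[Row], udf: Callable[[Row], Any], out_col: str) -> List[Row]:
    """Call the UDF once per row and attach its result as a new column.

    The input rows are left unmodified; each output row is a copy.
    """
    return [{**row, out_col: udf(row)} for row in rows]


rows = [
    {"item": "a", "price": 10.0, "tax_rate": 0.5},
    {"item": "b", "price": 4.0, "tax_rate": 0.25},
]
result = apply_udf(rows, price_with_tax, "total")
```

Because the routine only ever sees a single row, it cannot compute values that depend on other rows, such as a sum or an average over a column; that is the job of an aggregate function rather than a row-level UDF.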