Dask Cluster Functions#

nemo_curator.get_client(
cluster_type='cpu',
scheduler_address=None,
scheduler_file=None,
n_workers=4,
threads_per_worker=1,
nvlink_only=False,
protocol='tcp',
rmm_pool_size='1024M',
enable_spilling=True,
set_torch_to_use_rmm=False,
rmm_async=True,
rmm_maximum_pool_size=None,
rmm_managed_memory=False,
rmm_release_threshold=None,
**cluster_kwargs,
) → dask.distributed.Client#

Initializes or connects to a Dask cluster. The Dask cluster can be CPU-based or GPU-based (if GPUs are available). The initialization ensures maximum memory efficiency for the GPU by:

  1. Ensuring the PyTorch memory pool is the same as the RAPIDS memory pool (if set_torch_to_use_rmm is True).

  2. Enabling spilling for cuDF (if enable_spilling is True).
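
A minimal usage sketch based on the signature above; the worker and thread counts are placeholders rather than recommendations, and closing the client uses the standard dask.distributed.Client API.

from nemo_curator import get_client

# Start a local CPU-based cluster and return a client connected to it.
client = get_client(cluster_type="cpu", n_workers=8, threads_per_worker=1)

# On a machine with GPUs, a GPU-based cluster starts one worker per GPU
# listed in CUDA_VISIBLE_DEVICES.
# client = get_client(cluster_type="gpu", enable_spilling=True)

# ... run curation steps on the cluster ...

client.close()  # close the client when finished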

Parameters:
  • cluster_type – If scheduler_address and scheduler_file are None, sets up a local (single-node) cluster of the specified type, either “cpu” or “gpu”. Defaults to “cpu”. Many options in get_client apply only to CPU-based or only to GPU-based clusters, so check each parameter’s description.

  • scheduler_address – Address of an existing Dask cluster to connect to, such as the string ‘127.0.0.1:8786’ or a cluster object like LocalCluster() (as shown in the example below). If specified, all other arguments are ignored and the client connects to the existing cluster; any other configuration should be done when setting up that cluster.

  • scheduler_file – Path to a file with scheduler information, if available. If specified, all other arguments are ignored and the client connects to the existing cluster; any other configuration should be done when setting up that cluster.

  • n_workers – For CPU-based clusters only. The number of workers to start. Defaults to os.cpu_count(). For GPU-based clusters, the number of workers is locked to the number of GPUs in CUDA_VISIBLE_DEVICES.

  • threads_per_worker – For CPU-based clusters only. The number of threads per worker. Defaults to 1. Before increasing it, ensure that your functions frequently release the GIL.

  • nvlink_only – For GPU-based clusters only. Whether or not to use NVLink for communication.

  • protocol – For GPU-based clusters only. Protocol to use for communication. “tcp” or “ucx”.

  • rmm_pool_size – For GPU-based clusters only. RMM pool size to initialize each worker with. Can be an integer (bytes), float (fraction of total device memory), string (like “5GB” or “5000M”), or None to disable RMM pools.

  • enable_spilling – For GPU-based clusters only. Enables automatic spilling (and “unspilling”) of buffers from device to host to enable out-of-memory computation, i.e., computing on objects that occupy more memory than is available on the GPU.

  • set_torch_to_use_rmm – For GPU-based clusters only. Sets up the PyTorch memory pool to be the same as the RAPIDS memory pool. This helps avoid OOM errors when using both PyTorch and RAPIDS on the same GPU.

  • rmm_async – For GPU-based clusters only. Initializes each worker with RAPIDS Memory Manager (RMM) (see RMM documentation for more information: https://docs.rapids.ai/api/rmm/stable/) and sets it to use RMM’s asynchronous allocator. Warning: The asynchronous allocator requires CUDA Toolkit 11.2 or newer. It is also incompatible with RMM pools and managed memory. Trying to enable both will result in an exception.

  • rmm_maximum_pool_size – For GPU-based clusters only. When rmm_pool_size is set, this argument indicates the maximum pool size. Can be an integer (bytes), float (fraction of total device memory), string (like “5GB” or “5000M”), or None. By default, the total available memory on the GPU is used. rmm_pool_size must be specified to use an RMM pool and to set the maximum pool size. Note: When paired with rmm_async, the maximum size cannot be guaranteed due to fragmentation. Note: This size is a per-worker configuration, and not cluster-wide.

  • rmm_managed_memory – For GPU-based clusters only. Initialize each worker with RMM and set it to use managed memory. If disabled, RMM may still be used by specifying rmm_pool_size. Warning: Managed memory is currently incompatible with NVLink. Trying to enable both will result in an exception.

  • rmm_release_threshold – For GPU-based clusters only. When rmm_async is True and the pool size grows beyond this value, unused memory held by the pool will be released at the next synchronization point. Can be an integer (bytes), float (fraction of total device memory), string (like “5GB” or “5000M”), or None. By default, this feature is disabled. Note: This size is a per-worker configuration, and not cluster-wide.

  • cluster_kwargs – Additional keyword arguments for the LocalCluster or LocalCUDACluster configuration. See API documentation https://docs.dask.org/en/stable/deploying-python.html#distributed.deploy.local.LocalCluster for all LocalCluster parameters, or https://docs.rapids.ai/api/dask-cuda/nightly/api/ for all LocalCUDACluster parameters.

Returns:

A Dask client object.
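
The calls below are independent, illustrative sketches built from the parameters documented above: the scheduler address, pool sizes, and memory limit are placeholder values, and rmm_async is disabled in the GPU example because the asynchronous allocator is incompatible with RMM pools.

from nemo_curator import get_client

# Connect to an already-running Dask cluster; all other arguments are ignored.
client = get_client(scheduler_address="127.0.0.1:8786")

# Start a GPU-based local cluster tuned for memory efficiency.
gpu_client = get_client(
    cluster_type="gpu",
    rmm_pool_size="4GB",           # per-worker RMM pool (placeholder size)
    rmm_maximum_pool_size="12GB",  # per-worker cap; requires rmm_pool_size (placeholder size)
    rmm_async=False,               # pools are incompatible with the asynchronous allocator
    enable_spilling=True,          # spill cuDF buffers from device to host when needed
    set_torch_to_use_rmm=True,     # share one memory pool between PyTorch and RAPIDS
)

# Additional keyword arguments are forwarded to LocalCluster or LocalCUDACluster,
# for example a per-worker memory limit on a CPU cluster (memory_limit is a
# LocalCluster parameter; the value is a placeholder).
cpu_client = get_client(cluster_type="cpu", n_workers=4, memory_limit="8GB")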

nemo_curator.get_network_interfaces() → List[str]#

Gets a list of all valid network interfaces on a machine.

Returns:

A list of all valid network interfaces on the machine.
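
A short sketch of inspecting the returned interface names; the printed names are examples only, and passing an interface through cluster_kwargs is an assumed use (LocalCluster and LocalCUDACluster accept an interface argument), not something stated above.

from nemo_curator import get_client, get_network_interfaces

interfaces = get_network_interfaces()
print(interfaces)  # e.g. ['lo', 'eth0'] -- actual names depend on the machine

# Assumed usage: forward a chosen interface to the underlying cluster.
# client = get_client(cluster_type="cpu", interface=interfaces[-1])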