Dask Cluster Functions
- nemo_curator.get_client(cluster_type='cpu', scheduler_address=None, scheduler_file=None, n_workers=4, threads_per_worker=1, nvlink_only=False, protocol='tcp', rmm_pool_size='1024M', enable_spilling=True, set_torch_to_use_rmm=False) → dask.distributed.Client
Initializes or connects to a Dask cluster. The Dask cluster can be CPU-based or GPU-based (if GPUs are available). The initialization ensures maximum memory efficiency for the GPU by:
- Ensuring the PyTorch memory pool is the same as the RAPIDS memory pool (if set_torch_to_use_rmm is True).
- Enabling spilling for cuDF (if enable_spilling is True).
- Parameters
cluster_type – If scheduler_address and scheduler_file are None, sets up a local (single-node) cluster of the specified type, either “cpu” or “gpu”. Defaults to “cpu”. Many options in get_client apply only to CPU-based or only to GPU-based clusters, so check each parameter’s description.
scheduler_address – Address of an existing Dask cluster to connect to. This can be the address of a scheduler server, like the string ‘127.0.0.1:8786’, or a cluster object, like LocalCluster(). If specified, all other arguments are ignored and the client connects to the existing cluster; the remaining configuration should be done when setting up the Dask cluster.
scheduler_file – Path to a file with scheduler information, if available. If specified, all other arguments are ignored and the client connects to the existing cluster; the remaining configuration should be done when setting up the Dask cluster.
n_workers – For CPU-based clusters only. The number of workers to start. Defaults to os.cpu_count(). For GPU-based clusters, the number of workers is locked to the number of GPUs in CUDA_VISIBLE_DEVICES.
threads_per_worker – For CPU-based clusters only. The number of threads per worker. Defaults to 1. Before increasing, ensure that your functions frequently release the GIL.
nvlink_only – For GPU-based clusters only. Whether to use NVLink for communication.
protocol – For GPU-based clusters only. Protocol to use for communication. Either “tcp” or “ucx”.
rmm_pool_size – For GPU-based clusters only. RMM pool size to initialize each worker with. Can be an integer (bytes), float (fraction of total device memory), string (like “5GB” or “5000M”), or None to disable RMM pools.
enable_spilling – For GPU-based clusters only. Enables automatic spilling (and “unspilling”) of buffers from device to host to enable out-of-memory computation, i.e., computing on objects that occupy more memory than is available on the GPU.
set_torch_to_use_rmm – For GPU-based clusters only. Sets up the PyTorch memory pool to be the same as the RAPIDS memory pool. This helps avoid OOM errors when using both PyTorch and RAPIDS on the same GPU.
- Returns
A Dask client object.
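Example: a minimal sketch of starting a local cluster and submitting Dask work to it. The worker count and input path are illustrative assumptions, not library defaults; the commented-out call shows the GPU-only options documented above.

    import dask.dataframe as dd

    from nemo_curator import get_client

    # Start a local CPU cluster (worker counts here are illustrative).
    client = get_client(cluster_type="cpu", n_workers=8, threads_per_worker=1)

    # Dask work submitted from here on runs on this cluster,
    # e.g. lazily reading a set of JSONL files.
    df = dd.read_json("data/*.jsonl", lines=True)  # hypothetical input path
    print(df.head())

    # For a GPU cluster with UCX, NVLink-only transport, and an RMM pool
    # on each worker:
    # client = get_client(
    #     cluster_type="gpu",
    #     protocol="ucx",
    #     nvlink_only=True,
    #     rmm_pool_size="4GB",
    #     enable_spilling=True,
    #     set_torch_to_use_rmm=True,
    # )

    client.close()  # Shut the client down when finished.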
- nemo_curator.get_network_interfaces() → List[str]
Gets a list of all valid network interfaces on a machine.
- Returns
A list of all valid network interfaces on a machine.
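Example: a quick sketch of listing the machine’s interfaces, e.g. to pick one when configuring UCX communication. The interface names in the comment are hypothetical and depend on the machine.

    from nemo_curator import get_network_interfaces

    # Returns interface names such as ['lo', 'eth0'] (machine-dependent).
    interfaces = get_network_interfaces()
    print(interfaces)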