nemo_run.slurm#
Module Contents#
Classes#
- SlurmJobConfig: Configuration for running a NeMo Curator script on a Slurm cluster using NeMo Run
Data#
API#
- class nemo_run.slurm.SlurmJobConfig#
Configuration for running a NeMo Curator script on a Slurm cluster using NeMo Run
Args:
- job_dir: The base directory where all the files related to setting up the Dask cluster for NeMo Curator will be written.
- container_entrypoint: A path to the container-entrypoint.sh script on the cluster. container-entrypoint.sh is found in the repo here: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/slurm/container-entrypoint.sh
- script_command: The NeMo Curator CLI tool to run. Pass any additional arguments needed directly in this string.
- device: The type of script that will be running, and therefore the type of Dask cluster that will be created. Must be either "cpu" or "gpu".
- interface: The network interface the Dask cluster will communicate over. Use nemo_curator.get_network_interfaces() to get a list of available ones.
- protocol: The networking protocol to use. Can be either "tcp" or "ucx". Setting to "ucx" is recommended for GPU jobs if your cluster supports it.
- cpu_worker_memory_limit: The maximum memory per process that a Dask worker can use. "5GB" or "5000M" are examples. "0" means no limit.
- rapids_no_initialize: Will delay or disable the CUDA context creation of RAPIDS libraries, allowing for improved compatibility with UCX-enabled clusters and preventing runtime warnings.
- cudf_spill: Enables automatic spilling (and "unspilling") of buffers from device to host to enable out-of-memory computation, i.e., computing on objects that occupy more memory than is available on the GPU.
- rmm_scheduler_pool_size: Sets a small pool of GPU memory for message transfers when the scheduler is using UCX.
- rmm_worker_pool_size: The amount of GPU memory each GPU worker process may use. Recommended to set at 80-90% of available GPU memory. "72GiB" is good for A100/H100 GPUs.
- libcudf_cufile_policy: Allows reading/writing directly from storage to the GPU.
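For orientation, a minimal construction sketch follows, assuming the class is instantiated with these fields as keyword arguments. The import path mirrors this module's name, and the directory paths and script_command value are hypothetical placeholders rather than values shipped with NeMo Curator.

```python
# Minimal sketch, not a verbatim NeMo Curator recipe: the paths and the
# script_command below are hypothetical placeholders for your own cluster.
from nemo_run.slurm import SlurmJobConfig  # import path assumed from this module's name

gpu_job = SlurmJobConfig(
    job_dir="/shared/jobs/curator-run",                      # placeholder job directory
    container_entrypoint="/shared/scripts/container-entrypoint.sh",
    script_command="my_curator_tool --input-dir /data/in --output-dir /data/out",  # placeholder CLI call
    device="gpu",
    protocol="ucx",       # recommended for GPU jobs if the cluster supports it
    interface="ib0",      # choose one reported by nemo_curator.get_network_interfaces()
)
```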
- container_entrypoint: str#
None
- cpu_worker_memory_limit: str#
'0'
- cudf_spill: str#
'1'
- device: str#
'cpu'
- interface: str#
'eth0'
- job_dir: str#
None
- libcudf_cufile_policy: str#
'OFF'
- protocol: str#
'tcp'
- rapids_no_initialize: str#
'1'
- rmm_scheduler_pool_size: str#
'1GB'
- rmm_worker_pool_size: str#
'72GiB'
- script_command: str#
None
- to_script(add_scheduler_file: bool = True, add_device: bool = True)#
Converts to a script object executable by NeMo Run.
Args:
- add_scheduler_file: Automatically appends a '--scheduler-file' argument to the script_command where the value is job_dir/logs/scheduler.json. All scripts included in NeMo Curator accept and require this argument to scale properly on Slurm clusters.
- add_device: Automatically appends a '--device' argument to the script_command where the value is the member variable of device. All scripts included in NeMo Curator accept and require this argument.
Returns: A NeMo Run Script that will initialize a Dask cluster and run the specified command. It is designed to be executed on a Slurm cluster.
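Continuing the sketch above, the resulting script can be handed to NeMo Run for submission. The executor settings below are placeholders and assume the standalone nemo_run package's SlurmExecutor, SSHTunnel, and run() entry points; adapt the account, host, and limits to your site.

```python
# Sketch only: assumes the nemo_run package's SlurmExecutor/SSHTunnel/run() API;
# account, host, user, and time limit are placeholders for your own cluster.
import nemo_run as run

script = gpu_job.to_script()  # appends --scheduler-file and --device by default

executor = run.SlurmExecutor(
    account="my_account",                     # placeholder Slurm account
    nodes=2,
    time="04:00:00",
    tunnel=run.SSHTunnel(
        host="login.cluster.example",         # placeholder login node
        user="me",
        job_dir=gpu_job.job_dir,
    ),
)

run.run(script, executor=executor)  # submit the Dask cluster + command through NeMo Run
```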
- nemo_run.slurm.run#
'safe_import(…)'