---
layout: overview
slug: nemo-curator/nemo_curator/core/utils
title: nemo_curator.core.utils
---

## Module Contents

### Functions

| Name | Description |
| --- | --- |
| [`_logger_custom_deserializer`](#nemo_curator-core-utils-_logger_custom_deserializer) | - |
| [`_logger_custom_serializer`](#nemo_curator-core-utils-_logger_custom_serializer) | - |
| [`check_ray_responsive`](#nemo_curator-core-utils-check_ray_responsive) | - |
| [`get_free_port`](#nemo_curator-core-utils-get_free_port) | Checks if `start_port` is free. |
| [`init_cluster`](#nemo_curator-core-utils-init_cluster) | Initialize a new local Ray cluster or connect to an existing one. |
| [`split_table_by_group_max_bytes`](#nemo_curator-core-utils-split_table_by_group_max_bytes) | Split an Arrow table by approximate byte size without splitting group rows. |

### API

```python
nemo_curator.core.utils._logger_custom_deserializer(
    _: None
) -> loguru.Logger
```

```python
nemo_curator.core.utils._logger_custom_serializer(
    _: loguru.Logger
) -> None
```

```python
nemo_curator.core.utils.check_ray_responsive(
    timeout_s: int = RAY_CLUSTER_START_VERIFICAT...
) -> bool
```

```python
nemo_curator.core.utils.get_free_port(
    start_port: int,
    get_next_free_port: bool = True
) -> int
```

Checks if `start_port` is free. If it is not and `get_next_free_port` is True, returns the next free port, searching from `start_port`. If `get_next_free_port` is False, raises an error when `start_port` itself is not free.
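The port-selection behavior described for `get_free_port` can be illustrated with a minimal standard-library sketch. This is not the library's implementation: the names `_port_is_free` and `find_free_port_sketch`, the use of `127.0.0.1`, the upper bound of 65535, and the choice of `RuntimeError` are all assumptions made for illustration.

```python
import socket


def _port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Hypothetical helper: True if a TCP socket can bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port_sketch(start_port: int, get_next_free_port: bool = True) -> int:
    """Hypothetical stand-in mirroring the documented get_free_port semantics."""
    # Case 1: the requested port is available, so use it.
    if _port_is_free(start_port):
        return start_port
    # Case 2: the port is taken and the caller does not want a fallback.
    # The exception type is an assumption; the docs only say "an error".
    if not get_next_free_port:
        raise RuntimeError(f"Port {start_port} is already in use")
    # Case 3: scan upward for the next port that can be bound.
    for port in range(start_port + 1, 65536):
        if _port_is_free(port):
            return port
    raise RuntimeError(f"No free port found at or above {start_port}")
```

A check of this kind is inherently racy: another process can claim the port between the check and the moment the caller binds it, so callers should still handle bind failures.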
```python
nemo_curator.core.utils.init_cluster(
    ray_port: int,
    ray_temp_dir: str,
    ray_dashboard_port: int,
    ray_metrics_port: int,
    ray_client_server_port: int,
    ray_dashboard_host: str,
    num_gpus: int | None = None,
    num_cpus: int | None = None,
    object_store_memory: int | None = None,
    enable_object_spilling: bool = False,
    block: bool = True,
    ip_address: str | None = None,
    stdouterr_capture_file: str | None = None
) -> subprocess.Popen
```

Initialize a new local Ray cluster or connect to an existing one.

```python
nemo_curator.core.utils.split_table_by_group_max_bytes(
    table: pyarrow.Table,
    group_column: str,
    max_batch_bytes: int | None
) -> list[pyarrow.Table]
```

Split an Arrow table by approximate byte size without splitting group rows.

Each unique value in `group_column` is kept in a single output table. If a single group exceeds `max_batch_bytes`, it is still emitted as one chunk.

Note: null values in `group_column` are grouped together (consecutive nulls are not split). Callers should ensure the column is non-nullable or handle nulls upstream.
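The packing rule described for `split_table_by_group_max_bytes` can be sketched in plain Python, without Arrow, to make the edge cases concrete. This is an illustration under stated assumptions, not the library's code: `pack_groups_sketch` is a hypothetical name, rows are modeled as `(group_key, row_bytes)` tuples, each group's rows are assumed to be contiguous, and `max_batch_bytes=None` is assumed to mean "no limit" (the documentation does not say).

```python
from itertools import groupby
from typing import Any, Optional

Row = tuple[Any, int]  # (group_key, approximate size of the row in bytes)


def pack_groups_sketch(rows: list[Row], max_batch_bytes: Optional[int]) -> list[list[Row]]:
    """Pack rows into batches by byte budget without splitting any group.

    Assumes rows of the same group are adjacent. A None key equals the
    neighboring None keys, so consecutive nulls form one group.
    """
    if max_batch_bytes is None:
        # Assumption: no limit means a single batch (or none for empty input).
        return [rows] if rows else []

    batches: list[list[Row]] = []
    current: list[Row] = []
    current_bytes = 0
    for _key, group_iter in groupby(rows, key=lambda row: row[0]):
        group = list(group_iter)
        group_bytes = sum(size for _, size in group)
        # Close the current batch if adding this whole group would exceed the budget.
        if current and current_bytes + group_bytes > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        # The group is always added whole, even when it alone exceeds the budget.
        current.extend(group)
        current_bytes += group_bytes
    if current:
        batches.append(current)
    return batches


rows = [("a", 40), ("a", 40), ("b", 30), ("c", 200), (None, 10), (None, 10)]
batches = pack_groups_sketch(rows, max_batch_bytes=100)
```

With a 100-byte budget, group `"c"` (200 bytes) is emitted as its own oversized batch, and the two null-keyed rows stay together in the last batch.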