nemo_curator.core.utils
Module Contents
Functions
API
Checks whether start_port is free. If it is not free and get_next_free_port is True, returns the next free port found by searching upward from start_port. If get_next_free_port is False, raises an error when start_port is not free.
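The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the function names `is_port_free` and `pick_port`, the `host` parameter, the bind-based check, and the `RuntimeError` type are all assumptions made for the example; only `start_port` and `get_next_free_port` come from the description.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port). Hypothetical helper."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, so something else is already using the port.
            return False
        return True


def pick_port(start_port: int, get_next_free_port: bool = True,
              host: str = "127.0.0.1") -> int:
    """Return start_port if it is free, else search upward or raise. Hypothetical name."""
    port = start_port
    while port <= 65535:
        if is_port_free(port, host):
            return port
        if not get_next_free_port:
            # The caller asked for exactly start_port and it is taken.
            raise RuntimeError(f"Port {start_port} is already in use")
        port += 1
    raise RuntimeError(f"No free port found at or above {start_port}")
```

A check of this kind is inherently racy: another process can take the port between the check and the moment the caller binds to it, so callers should still handle a bind failure.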
Initializes a new local Ray cluster or connects to an existing one.
Splits an Arrow table into chunks of approximately max_batch_bytes without splitting the rows of any group across chunks.
Each unique value in group_column is kept in a single output table.
If a single group exceeds max_batch_bytes, it is still emitted as one chunk.
Note: null values in group_column are grouped together (consecutive
nulls are not split). Callers should ensure the column is non-nullable
or handle nulls upstream.
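The grouping rule can be illustrated without Arrow. The sketch below is an assumption-laden stand-in, not the library's code: it works on a plain list of `(group_value, row_byte_size)` pairs instead of an Arrow table, the name `split_by_group` is hypothetical, and it assumes rows of the same group are already contiguous. Only `max_batch_bytes` and the behaviour (groups stay whole, oversized groups are emitted alone, consecutive nulls stay together) are taken from the description above.

```python
from itertools import groupby
from typing import Hashable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[Hashable], int]  # (group value, approximate row size in bytes)


def split_by_group(rows: Sequence[Row], max_batch_bytes: int) -> List[List[Row]]:
    """Pack whole groups into chunks of roughly max_batch_bytes. Hypothetical name."""
    chunks: List[List[Row]] = []
    current: List[Row] = []
    current_bytes = 0

    # groupby merges runs of equal keys, so consecutive None values form one group.
    for _, group_iter in groupby(rows, key=lambda row: row[0]):
        group = list(group_iter)
        group_bytes = sum(size for _, size in group)

        # Close the current chunk if adding this group would exceed the budget.
        if current and current_bytes + group_bytes > max_batch_bytes:
            chunks.append(current)
            current, current_bytes = [], 0

        # A group is never split, even when it alone exceeds max_batch_bytes.
        current.extend(group)
        current_bytes += group_bytes

    if current:
        chunks.append(current)
    return chunks
```

For instance, with a 100-byte budget, an 80-byte group followed by a 30-byte group lands in two separate chunks, while a single 200-byte group is still returned as one chunk.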