---
layout: overview
slug: nemo-curator/nemo_curator/core/utils
title: nemo_curator.core.utils
---

## Module Contents

### Functions

| Name                                                                                        | Description                                                                 |
| ------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`_logger_custom_deserializer`](#nemo_curator-core-utils-_logger_custom_deserializer)       | -                                                                           |
| [`_logger_custom_serializer`](#nemo_curator-core-utils-_logger_custom_serializer)           | -                                                                           |
| [`check_ray_responsive`](#nemo_curator-core-utils-check_ray_responsive)                     | -                                                                           |
| [`get_free_port`](#nemo_curator-core-utils-get_free_port)                                   | Checks if start\_port is free.                                              |
| [`init_cluster`](#nemo_curator-core-utils-init_cluster)                                     | Initialize a new local Ray cluster or connect to an existing one.           |
| [`split_table_by_group_max_bytes`](#nemo_curator-core-utils-split_table_by_group_max_bytes) | Split an Arrow table by approximate byte size without splitting group rows. |

### API

<Anchor id="nemo_curator-core-utils-_logger_custom_deserializer">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils._logger_custom_deserializer(
        _: None
    ) -> loguru.Logger
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-core-utils-_logger_custom_serializer">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils._logger_custom_serializer(
        _: loguru.Logger
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-core-utils-check_ray_responsive">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils.check_ray_responsive(
        timeout_s: int = RAY_CLUSTER_START_VERIFICAT...
    ) -> bool
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_curator-core-utils-get_free_port">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils.get_free_port(
        start_port: int,
        get_next_free_port: bool = True
    ) -> int
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Checks whether `start_port` is free and returns it if so.
  If `start_port` is not free and `get_next_free_port` is True, returns the next free port starting from `start_port`.
  If `start_port` is not free and `get_next_free_port` is False, raises an error.
</Indent>
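The behavior described above can be illustrated with a small standard-library sketch. This is not the `nemo_curator` implementation: the names `is_port_free` and `find_free_port_sketch`, the loopback host, the bind-based probe, and the use of `RuntimeError` are all assumptions made for illustration.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if `port` can currently be bound on `host` (assumed probe)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port_sketch(start_port: int, get_next_free_port: bool = True) -> int:
    """Hypothetical stand-in that mirrors the documented contract."""
    if is_port_free(start_port):
        return start_port
    if not get_next_free_port:
        # The real error type is not documented; RuntimeError is an assumption.
        raise RuntimeError(f"Port {start_port} is already in use")
    # Probe the ports above start_port until one can be bound.
    for port in range(start_port + 1, 65536):
        if is_port_free(port):
            return port
    raise RuntimeError(f"No free port found at or above {start_port}")
```

A probe like this is inherently racy: another process can claim the port between the check and its later use, so callers should still be prepared for a bind failure.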

<Anchor id="nemo_curator-core-utils-init_cluster">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils.init_cluster(
        ray_port: int,
        ray_temp_dir: str,
        ray_dashboard_port: int,
        ray_metrics_port: int,
        ray_client_server_port: int,
        ray_dashboard_host: str,
        num_gpus: int | None = None,
        num_cpus: int | None = None,
        object_store_memory: int | None = None,
        enable_object_spilling: bool = False,
        block: bool = True,
        ip_address: str | None = None,
        stdouterr_capture_file: str | None = None
    ) -> subprocess.Popen
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Initialize a new local Ray cluster or connect to an existing one.
</Indent>

<Anchor id="nemo_curator-core-utils-split_table_by_group_max_bytes">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.core.utils.split_table_by_group_max_bytes(
        table: pyarrow.Table,
        group_column: str,
        max_batch_bytes: int | None
    ) -> list[pyarrow.Table]
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Split an Arrow table by approximate byte size without splitting group rows.

  Each unique value in `group_column` is kept in a single output table.
  If a single group exceeds `max_batch_bytes`, it is still emitted as one chunk.

  Note: null values in `group_column` are grouped together (consecutive
  nulls are not split). Callers should ensure the column is non-nullable
  or handle nulls upstream.
</Indent>
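The grouping rule above can be sketched in plain Python, with a list of `(group_key, row_bytes)` tuples standing in for an Arrow table. This is an illustrative greedy packing, not the library's code: the function name `split_rows_by_group_max_bytes`, the assumption that rows with equal keys are already adjacent, and treating `max_batch_bytes=None` as "no limit" are all assumptions.

```python
from itertools import groupby


def split_rows_by_group_max_bytes(rows, max_batch_bytes):
    """Pack consecutive groups into batches without splitting any group.

    rows: list of (group_key, row_bytes) tuples; equal keys are assumed adjacent.
    max_batch_bytes: approximate byte budget per batch, or None (assumed: no limit).
    """
    if max_batch_bytes is None:
        return [list(rows)] if rows else []

    batches = []
    current, current_bytes = [], 0
    # groupby merges consecutive equal keys, so a run of None keys forms one group.
    for _key, group_iter in groupby(rows, key=lambda row: row[0]):
        group = list(group_iter)
        group_bytes = sum(size for _, size in group)
        # Start a new batch if adding this whole group would exceed the budget.
        if current and current_bytes + group_bytes > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        # An oversized group is still kept whole, in a batch of its own.
        current.extend(group)
        current_bytes += group_bytes
    if current:
        batches.append(current)
    return batches


rows = [("a", 40), ("a", 40), ("b", 30), ("c", 200), ("d", 10)]
batches = split_rows_by_group_max_bytes(rows, max_batch_bytes=100)
```

With a budget of 100 bytes, the sketch yields four batches: both `"a"` rows together, then `"b"`, then the oversized `"c"` group on its own, then `"d"`.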
