# nemo_automodel.components.datasets.multimodal.parquet_utils

Parquet shard discovery and a filesystem factory for BAGEL text-to-image (T2I) and edit data.

Only the local-filesystem path is exercised by the current tests. The
HDFS branch is kept for upstream compatibility, but the cluster-specific
host, port, and `extra_conf` hooks remain stubs. Customize them in your
own deployment if your parquet shards are stored on HDFS.

## Module Contents

### Functions

| Name                                                                                                            | Description                                                    |
| --------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
| [`get_hdfs_block_size`](#nemo_automodel-components-datasets-multimodal-parquet_utils-get_hdfs_block_size)       | Return the HDFS read buffer size for pyarrow.                  |
| [`get_hdfs_extra_conf`](#nemo_automodel-components-datasets-multimodal-parquet_utils-get_hdfs_extra_conf)       | Return optional pyarrow HDFS configuration overrides.          |
| [`get_hdfs_host`](#nemo_automodel-components-datasets-multimodal-parquet_utils-get_hdfs_host)                   | Return the HDFS host URI used by BAGEL parquet readers.        |
| [`get_parquet_data_paths`](#nemo_automodel-components-datasets-multimodal-parquet_utils-get_parquet_data_paths) | Return a flat list of parquet file paths sharded across ranks. |
| [`hdfs_ls_cmd`](#nemo_automodel-components-datasets-multimodal-parquet_utils-hdfs_ls_cmd)                       | List HDFS parquet directory entries with the native hdfs CLI.  |
| [`init_arrow_pf_fs`](#nemo_automodel-components-datasets-multimodal-parquet_utils-init_arrow_pf_fs)             | Return a pyarrow filesystem for `parquet_file_path`.           |

### Data

[`logger`](#nemo_automodel-components-datasets-multimodal-parquet_utils-logger)

### API

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_block_size()
```

Return the HDFS read buffer size for pyarrow.

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_extra_conf()
```

Return optional pyarrow HDFS configuration overrides.

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_host()
```

Return the HDFS host URI used by BAGEL parquet readers.

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.get_parquet_data_paths(
    data_dir_list,
    num_sampled_data_paths,
    rank = 0,
    world_size = 1
)
```

Return a flat list of parquet file paths sharded across ranks.

Directories are split across ranks via
`chunk_size = ceil(num_dirs / world_size)`. Each rank lists its local
directories, repeats the file list to reach `num_sampled_data_paths` per
directory, then all-gathers across ranks so every rank ends up with the
same combined list.
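
The sharding arithmetic described above can be sketched in plain Python. This is an illustration only, not the module's implementation: the helper names (`local_slice`, `repeat_to_length`, `simulate_paths`) and the in-memory `listing` dictionary are hypothetical, `num_sampled_data_paths` is assumed to hold one count per directory, and the all-gather is emulated by concatenating each rank's local list in rank order.

```python
import math
from itertools import cycle, islice


def local_slice(items, rank, world_size):
    """Return the contiguous chunk of `items` owned by `rank`."""
    chunk_size = math.ceil(len(items) / world_size)
    start = rank * chunk_size
    return items[start:start + chunk_size]


def repeat_to_length(files, count):
    """Cycle through `files` until exactly `count` paths are produced."""
    if not files:
        return []
    return list(islice(cycle(files), count))


def simulate_paths(listing, data_dirs, counts, world_size):
    """Emulate the all-gather by concatenating every rank's local list."""
    combined = []
    for rank in range(world_size):
        dirs = local_slice(data_dirs, rank, world_size)
        nums = local_slice(counts, rank, world_size)
        for directory, count in zip(dirs, nums):
            combined.extend(repeat_to_length(listing[directory], count))
    return combined


# Hypothetical directory listing standing in for os.listdir / the hdfs CLI.
listing = {
    "a": ["a/0.parquet", "a/1.parquet"],
    "b": ["b/0.parquet"],
    "c": ["c/0.parquet", "c/1.parquet", "c/2.parquet"],
}
data_dirs = ["a", "b", "c"]

# With 3 directories and 2 ranks, chunk_size is 2: rank 0 owns "a" and "b",
# rank 1 owns "c". Every rank would end up with this same combined list.
paths = simulate_paths(listing, data_dirs, counts=[3, 2, 2], world_size=2)
```

Note that when `num_dirs` is not a multiple of `world_size`, the trailing ranks receive fewer directories (possibly none), so the per-rank listing cost is not perfectly balanced.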

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.hdfs_ls_cmd(
    dir
)
```

List HDFS parquet directory entries with the native hdfs CLI.
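
As a rough sketch of what listing through the CLI can look like, the snippet below shells out to `hdfs dfs -ls` and keeps the path column of each entry. It is an assumption about the general approach, not a copy of `hdfs_ls_cmd`: the function names are hypothetical, and the parser assumes the usual eight-column listing format with paths that contain no whitespace.

```python
import subprocess


def parse_hdfs_ls(output):
    """Extract the path column (last field) from `hdfs dfs -ls` output."""
    paths = []
    for line in output.splitlines():
        fields = line.split()
        # Entry lines have 8 columns: permissions, replication, owner, group,
        # size, date, time, path. Blank lines and the "Found N items" header
        # have fewer fields and are skipped.
        if len(fields) < 8:
            continue
        paths.append(fields[-1])
    return paths


def list_parquet_files(directory):
    """Run the hdfs CLI on `directory` and return only the .parquet entries."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return [p for p in parse_hdfs_ls(result.stdout) if p.endswith(".parquet")]
```

Keeping the parsing separate from the subprocess call makes the text handling testable on machines that have no `hdfs` binary installed.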

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.init_arrow_pf_fs(
    parquet_file_path
)
```

Return a pyarrow filesystem for `parquet_file_path`.

`pyarrow` is imported lazily because not every AutoModel install includes
it; the import is needed only when a parquet-backed dataset (T2I,
UnifiedEdit) is constructed.
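
The lazy-import pattern can be sketched as follows. This is a minimal illustration under stated assumptions rather than the module's code: `is_hdfs_path` and `make_filesystem` are hypothetical names, the `hdfs://` prefix check is an assumed dispatch rule, and the default `host`, `port`, and `buffer_size` values are placeholders for whatever the `get_hdfs_*` hooks return in a real deployment.

```python
def is_hdfs_path(path):
    """Assumed dispatch rule: paths with an hdfs:// scheme go to HDFS."""
    return path.startswith("hdfs://")


def make_filesystem(path, host="default", port=0, buffer_size=0, extra_conf=None):
    """Return a pyarrow filesystem suited to `path`."""
    # Deferred import: importing the enclosing module never requires pyarrow,
    # so an ImportError can only surface when a filesystem is actually built.
    import pyarrow.fs as pafs

    if is_hdfs_path(path):
        # Placeholder connection settings; a real deployment would supply
        # its own host, port, buffer size, and configuration overrides.
        return pafs.HadoopFileSystem(
            host=host,
            port=port,
            buffer_size=buffer_size,
            extra_conf=extra_conf,
        )
    return pafs.LocalFileSystem()
```

With this layout, code paths that never touch parquet data can import the module on an install without `pyarrow`, and the missing dependency is reported at the first call that needs it.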

```python
nemo_automodel.components.datasets.multimodal.parquet_utils.logger = logging.getLogger(__name__)
```