nemo_automodel.components.datasets.multimodal.parquet_utils

Parquet shard discovery + filesystem factory for BAGEL T2I / edit data.

Only the local-filesystem path is exercised in our current tests; the HDFS branch is preserved for upstream compatibility but the cluster-specific host / port / extra_conf hooks remain stubs. Customise them in your own deployment if you actually have HDFS-backed parquet shards.

Module Contents

Functions

| Name | Description |
| --- | --- |
| get_hdfs_block_size | Return the HDFS read buffer size for pyarrow. |
| get_hdfs_extra_conf | Return optional pyarrow HDFS configuration overrides. |
| get_hdfs_host | Return the HDFS host URI used by BAGEL parquet readers. |
| get_parquet_data_paths | Return a flat list of parquet file paths sharded across ranks. |
| hdfs_ls_cmd | List HDFS parquet directory entries with the native hdfs CLI. |
| init_arrow_pf_fs | Return a pyarrow filesystem for parquet_file_path. |

Data

logger

API

nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_block_size()

Return the HDFS read buffer size for pyarrow.

nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_extra_conf()

Return optional pyarrow HDFS configuration overrides.

nemo_automodel.components.datasets.multimodal.parquet_utils.get_hdfs_host()

Return the HDFS host URI used by BAGEL parquet readers.

nemo_automodel.components.datasets.multimodal.parquet_utils.get_parquet_data_paths(
data_dir_list,
num_sampled_data_paths,
rank = 0,
world_size = 1
)

Return a flat list of parquet file paths sharded across ranks.

Directories are split across ranks via chunk_size = ceil(num_dirs / world_size). Each rank lists its local directories, repeats the file list to reach num_sampled_data_paths per directory, then all-gathers across ranks so every rank ends up with the same combined list.
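The sharding arithmetic described above can be sketched in plain Python. This is a minimal illustration, not the module's implementation: the helper names `local_paths_for_rank` and `simulate_all_gather` are hypothetical, directory listings are passed in as in-memory lists instead of being read from disk, `num_sampled` is assumed to hold one target count per directory, and the all-gather is simulated by looping over ranks in a single process.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical type: (directory name, parquet files already listed in it).
DirListing = Tuple[str, Sequence[str]]


def local_paths_for_rank(
    listing: Sequence[DirListing],
    num_sampled: Sequence[int],
    rank: int,
    world_size: int,
) -> List[str]:
    """Return the paths one rank would contribute before the all-gather."""
    num_dirs = len(listing)
    # Same chunking rule as in the description above.
    chunk_size = math.ceil(num_dirs / world_size)
    start = rank * chunk_size
    end = min(start + chunk_size, num_dirs)

    local_paths: List[str] = []
    for (_, files), n in zip(listing[start:end], num_sampled[start:end]):
        if not files:
            continue  # nothing to repeat for an empty directory
        # Repeat the file list until it is long enough, then truncate to n.
        repeats = math.ceil(n / len(files))
        local_paths.extend((list(files) * repeats)[:n])
    return local_paths


def simulate_all_gather(
    listing: Sequence[DirListing],
    num_sampled: Sequence[int],
    world_size: int,
) -> List[str]:
    """Concatenate every rank's contribution in rank order (single process)."""
    combined: List[str] = []
    for rank in range(world_size):
        combined.extend(local_paths_for_rank(listing, num_sampled, rank, world_size))
    return combined
```

With three directories and `world_size = 2`, `chunk_size` is 2, so rank 0 handles the first two directories and rank 1 handles the third. Ranks beyond the last chunk contribute an empty list.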

nemo_automodel.components.datasets.multimodal.parquet_utils.hdfs_ls_cmd(
dir
)

List HDFS parquet directory entries with the native hdfs CLI.
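As a rough illustration of listing a directory through the hdfs CLI, the sketch below shells out to `hdfs dfs -ls` and extracts the path from each output line. The helper names `parse_hdfs_ls_output` and `list_hdfs_dir` are hypothetical, and the parser assumes each entry line ends with a full `hdfs://` URI, which may not match every cluster's output; it is not necessarily how `hdfs_ls_cmd` is implemented.

```python
import subprocess
from typing import List


def parse_hdfs_ls_output(stdout: str) -> List[str]:
    """Extract hdfs:// paths from `hdfs dfs -ls` output (assumed format)."""
    paths: List[str] = []
    for line in stdout.splitlines():
        if "hdfs://" not in line:
            continue  # skips headers such as "Found N items"
        # Keep everything from the last "hdfs://" to the end of the line.
        paths.append("hdfs://" + line.split("hdfs://")[-1].strip())
    return paths


def list_hdfs_dir(directory: str) -> List[str]:
    """Run the hdfs CLI (must be on PATH) and return the listed paths."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return parse_hdfs_ls_output(result.stdout)
```

A caller would typically keep only entries ending in `.parquet`, for example `[p for p in list_hdfs_dir(d) if p.endswith(".parquet")]`.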

nemo_automodel.components.datasets.multimodal.parquet_utils.init_arrow_pf_fs(
parquet_file_path
)

Return a pyarrow filesystem for parquet_file_path.

pyarrow is imported lazily because not every AutoModel (AM) install includes it, and the import is needed only when a parquet-backed dataset (T2I, UnifiedEdit) is constructed.
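The lazy-import pattern can be sketched as follows. This is an illustration under assumptions, not the module's code: `require_optional` and `make_filesystem_sketch` are hypothetical names, the `hdfs://` prefix check is an assumed way of choosing the branch, and the HDFS branch is left unimplemented because its host, port, and configuration hooks are deployment-specific.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, purpose: str) -> ModuleType:
    """Import an optional dependency at call time, with a clearer error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for {purpose}; install it to use this feature."
        ) from exc


def make_filesystem_sketch(parquet_file_path: str):
    """Return a pyarrow filesystem for a local path (HDFS left as a stub)."""
    # pyarrow is imported here, not at module import time, so installs
    # without pyarrow can still import the surrounding package.
    pafs = require_optional("pyarrow.fs", "parquet-backed datasets")
    if parquet_file_path.startswith("hdfs://"):
        # Assumption: a real deployment would build an HDFS filesystem here.
        raise NotImplementedError("HDFS support is deployment-specific in this sketch")
    return pafs.LocalFileSystem()
```

Deferring the import moves any missing-dependency error from package import time to the moment a parquet-backed dataset is actually built.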

nemo_automodel.components.datasets.multimodal.parquet_utils.logger = logging.getLogger(__name__)