nemo_automodel.components.datasets.multimodal.parquet_utils
Parquet shard discovery + filesystem factory for BAGEL T2I / edit data.
Only the local-filesystem path is exercised in our current tests. The HDFS branch is preserved for upstream compatibility, but the cluster-specific host / port / extra_conf hooks remain stubs; customise them in your own deployment if you have HDFS-backed parquet shards.
Module Contents
Functions
Data
API
Return the HDFS read buffer size for pyarrow.
Return optional pyarrow HDFS configuration overrides.
Return the HDFS host URI used by BAGEL parquet readers.
Return a flat list of parquet file paths sharded across ranks.
Directories are split across ranks via
chunk_size = ceil(num_dirs / world_size). Each rank lists its local
directories, repeats the file list to reach num_sampled_data_paths per
directory, then all-gathers across ranks so every rank ends up with the
same combined list.
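The sharding arithmetic described above can be sketched in plain Python. This is an illustrative sketch, not the module's implementation: the helper names (`shard_dirs`, `repeat_to_length`, `simulate_all_gather`) are hypothetical, the all-gather is simulated by looping over ranks in a single process, and truncating the repeated list to exactly `num_sampled_data_paths` entries is an assumption.

```python
import math


def shard_dirs(dirs, rank, world_size):
    """Return the contiguous slice of directories owned by `rank`."""
    chunk_size = math.ceil(len(dirs) / world_size)
    return dirs[rank * chunk_size:(rank + 1) * chunk_size]


def repeat_to_length(files, num_sampled_data_paths):
    """Repeat `files` until it holds `num_sampled_data_paths` entries.

    Truncating to the exact length is an assumption of this sketch.
    """
    if not files:
        return []
    repeats = math.ceil(num_sampled_data_paths / len(files))
    return (files * repeats)[:num_sampled_data_paths]


def simulate_all_gather(listing, world_size, num_sampled_data_paths):
    """Concatenate every rank's local list, in rank order.

    `listing` maps directory -> parquet files and stands in for the real
    filesystem listing; a distributed run would all-gather instead.
    """
    dirs = sorted(listing)
    combined = []
    for rank in range(world_size):
        for d in shard_dirs(dirs, rank, world_size):
            combined.extend(repeat_to_length(listing[d], num_sampled_data_paths))
    return combined


# Hypothetical listing: three directories shared by two ranks.
listing = {
    "d0": ["d0/a.parquet", "d0/b.parquet"],
    "d1": ["d1/c.parquet"],
    "d2": ["d2/d.parquet", "d2/e.parquet", "d2/f.parquet", "d2/g.parquet"],
}
combined = simulate_all_gather(listing, world_size=2, num_sampled_data_paths=3)
```

With three directories and two ranks, `chunk_size` is 2, so rank 0 owns `d0` and `d1` while rank 1 owns only `d2`. The last rank can receive fewer directories (or none) whenever `num_dirs` is not a multiple of `world_size`.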
List HDFS parquet directory entries with the native hdfs CLI.
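As a rough illustration of listing through the CLI, the sketch below shells out to `hdfs dfs -ls` and keeps the `.parquet` entries. The function names are hypothetical and the parsing assumes the usual eight-column `-ls` output (permissions, replication, owner, group, size, date, time, path); the module's actual helper may invoke the CLI or parse its output differently.

```python
import subprocess


def parse_hdfs_ls(output):
    """Extract .parquet paths from `hdfs dfs -ls` style output."""
    paths = []
    for line in output.splitlines():
        # Entry lines have eight columns; the "Found N items" header does not.
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue
        path = fields[7].strip()
        if path.endswith(".parquet"):
            paths.append(path)
    return paths


def list_hdfs_parquet(directory):
    """Run the hdfs CLI (must be on PATH) and return the parquet entries."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", directory],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_hdfs_ls(result.stdout)


# Hypothetical CLI output used to exercise the parser without a cluster.
sample = (
    "Found 2 items\n"
    "-rw-r--r--   3 user group       1024 2024-01-01 12:00 /data/t2i/part-0000.parquet\n"
    "drwxr-xr-x   - user group          0 2024-01-01 12:00 /data/t2i/_tmp\n"
)
parquet_paths = parse_hdfs_ls(sample)
```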
Return a pyarrow filesystem for parquet_file_path.
pyarrow is imported lazily because not every AM install includes it;
the import is only needed when a parquet-backed dataset
(T2I, UnifiedEdit) is constructed.
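A minimal sketch of the lazy-import pattern follows. The names `is_hdfs_path` and `make_filesystem` and the default argument values are hypothetical; the real factory takes its host, port, buffer size, and extra configuration from the stub hooks documented above, and its path-detection logic may differ. `pyarrow.fs.LocalFileSystem` and `pyarrow.fs.HadoopFileSystem` are actual pyarrow classes.

```python
def is_hdfs_path(parquet_file_path):
    """Assumption of this sketch: HDFS paths carry an explicit hdfs:// scheme."""
    return parquet_file_path.startswith("hdfs://")


def make_filesystem(parquet_file_path, host="default", port=8020,
                    buffer_size=0, extra_conf=None):
    """Return a pyarrow filesystem suited to `parquet_file_path`.

    pyarrow is imported inside the function, so importing this module
    succeeds on installs that do not ship pyarrow.
    """
    import pyarrow.fs as pafs  # deferred until a filesystem is requested

    if is_hdfs_path(parquet_file_path):
        # Placeholder connection settings; replace with cluster-specific values.
        return pafs.HadoopFileSystem(
            host, port=port, buffer_size=buffer_size, extra_conf=extra_conf
        )
    return pafs.LocalFileSystem()
```

Callers that never build a parquet-backed dataset never trigger the import, so a missing pyarrow only surfaces as an `ImportError` at the point where a filesystem is first requested.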