nemo_curator.utils.split_large_files
Module Contents
Functions
| Name | Description |
|---|---|
| _basename_and_ext | Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl). |
| _flush_jsonl_chunk | - |
| _join_out_path | Join output directory and filename using the target filesystem (local or remote). |
| _split_table | - |
| _storage_options | - |
| _write_table_to_file | - |
| main | - |
| parse_args | - |
| split_jsonl_file_by_size | - |
| split_parquet_file_by_size | - |
API
nemo_curator.utils.split_large_files._basename_and_ext( path: str ) -> tuple[str, str]
Basename and extension for local paths and fsspec URIs (e.g. s3://bucket/key/file.jsonl).
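To illustrate the kind of result the description implies, here is a minimal standard-library sketch. The name `basename_and_ext_sketch` is hypothetical, and so is the choice to return the extension with its leading dot; the library's actual implementation and return format may differ.

```python
import posixpath


def basename_and_ext_sketch(path: str) -> tuple[str, str]:
    """Hypothetical stand-in, not the library's implementation."""
    # Drop an optional "protocol://" prefix so URIs and local paths
    # are handled by the same code path.
    _, _, rest = path.rpartition("://")
    # Take the last path component, ignoring any trailing slash.
    name = posixpath.basename(rest.rstrip("/"))
    # Split "file.jsonl" into ("file", ".jsonl").
    stem, ext = posixpath.splitext(name)
    return stem, ext


print(basename_and_ext_sketch("s3://bucket/key/file.jsonl"))  # ('file', '.jsonl')
print(basename_and_ext_sketch("/data/shard.parquet"))         # ('shard', '.parquet')
```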
nemo_curator.utils.split_large_files._flush_jsonl_chunk( lines: list[bytes], output_file: str, storage_options: dict[str, typing.Any] ) -> None
nemo_curator.utils.split_large_files._join_out_path( output_path: str, filename: str, storage_options: dict[str, typing.Any] ) -> str
Join output directory and filename using the target filesystem (local or remote).
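The sketch below shows one way such a join could behave, under stated assumptions. `join_out_path_sketch` is a hypothetical helper that detects remote targets by looking for `://` in the output path; the library function takes `storage_options` and resolves the target filesystem itself, which this sketch does not attempt.

```python
import os


def join_out_path_sketch(output_path: str, filename: str) -> str:
    """Hypothetical stand-in, not the library's implementation."""
    if "://" in output_path:
        # Remote URI: always join with a forward slash, whatever the host OS.
        return output_path.rstrip("/") + "/" + filename
    # Local path: defer to the platform's own separator rules.
    return os.path.join(output_path, filename)


print(join_out_path_sketch("s3://bucket/out/", "part_0.jsonl"))
print(join_out_path_sketch("out", "part_0.jsonl"))
```

Treating remote and local paths separately avoids producing backslash-separated object keys when the code runs on Windows.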
nemo_curator.utils.split_large_files._split_table( table: pyarrow.Table, target_size: int ) -> list[pyarrow.Table]
nemo_curator.utils.split_large_files._storage_options( storage_options: dict[str, typing.Any] | None ) -> dict[str, typing.Any]
nemo_curator.utils.split_large_files._write_table_to_file( table: pyarrow.Table, output_file: str, storage_options: dict[str, typing.Any] ) -> None
nemo_curator.utils.split_large_files.main( args: argparse.ArgumentParser | None = None ) -> None
nemo_curator.utils.split_large_files.parse_args( args: argparse.ArgumentParser | None = None ) -> argparse.Namespace
nemo_curator.utils.split_large_files.split_jsonl_file_by_size( input_file: str, output_path: str, target_size_mb: int, storage_options: dict[str, typing.Any] | None = None ) -> None
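The reference gives no description for this function, so the following is only a sketch of the general technique the name suggests: splitting a JSONL file on line boundaries into chunks of roughly a target size. Everything here is an assumption made for illustration, including the helper name `split_jsonl_by_size_sketch`, the local-only I/O, the `{stem}_{index}{ext}` output naming, the byte-based size argument, and the returned list of paths (the library function returns `None`).

```python
import os


def split_jsonl_by_size_sketch(
    input_file: str, output_dir: str, target_size_bytes: int
) -> list[str]:
    """Hypothetical, local-only sketch; not the library's implementation."""
    os.makedirs(output_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(input_file))
    written: list[str] = []   # paths of the chunk files written so far
    chunk: list[bytes] = []   # buffered lines for the current chunk
    chunk_size = 0            # total bytes currently buffered

    def flush() -> None:
        nonlocal chunk, chunk_size
        if not chunk:
            return
        # Assumed naming scheme: data_0.jsonl, data_1.jsonl, ...
        out = os.path.join(output_dir, f"{stem}_{len(written)}{ext}")
        with open(out, "wb") as f:
            f.writelines(chunk)
        written.append(out)
        chunk, chunk_size = [], 0

    with open(input_file, "rb") as f:
        for line in f:
            # Start a new chunk before this line would exceed the target.
            # A single oversized line still gets a chunk of its own.
            if chunk and chunk_size + len(line) > target_size_bytes:
                flush()
            chunk.append(line)
            chunk_size += len(line)
    flush()  # write whatever remains
    return written
```

Reading in binary mode and cutting only between lines keeps every JSON record intact and makes the concatenation of the chunks byte-identical to the input. A caller working in megabytes could pass something like `target_size_mb * 1024 * 1024`; whether the library uses that exact conversion is not stated in this reference.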
nemo_curator.utils.split_large_files.split_parquet_file_by_size( input_file: str, output_path: str, target_size_mb: int, storage_options: dict[str, typing.Any] | None = None ) -> None