nemo_curator.utils.split_large_files

Module Contents

Functions

Name	Description
`_basename_and_ext`	Return the basename and extension of a local path or fsspec URI (e.g. s3://bucket/key/file.jsonl).
`_flush_jsonl_chunk`	-
`_join_out_path`	Join output directory and filename using the target filesystem (local or remote).
`_split_table`	-
`_storage_options`	-
`_write_table_to_file`	-
`main`	-
`parse_args`	-
`split_jsonl_file_by_size`	-
`split_parquet_file_by_size`	-

API

nemo_curator.utils.split_large_files._basename_and_ext(
    path: str
) -> tuple[str, str]

Return the basename and extension of a local path or fsspec URI (e.g. s3://bucket/key/file.jsonl).
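
To illustrate the kind of result this helper is described as producing, here is a minimal standard-library sketch. It is not the library's implementation: the name `basename_and_ext` (no leading underscore) is hypothetical, and whether the real function keeps the leading dot in the extension is an assumption.

```python
import posixpath


def basename_and_ext(path: str) -> tuple[str, str]:
    """Hypothetical stand-in for illustration only, not the library's code."""
    # Drop a URI scheme such as "s3://" when present; local paths pass through.
    _, sep, rest = path.partition("://")
    without_scheme = rest if sep else path
    # fsspec URIs use forward slashes, so posixpath handles both cases here.
    filename = posixpath.basename(without_scheme)
    # Assumption: the extension is returned with its leading dot.
    stem, ext = posixpath.splitext(filename)
    return stem, ext


print(basename_and_ext("s3://bucket/key/file.jsonl"))
print(basename_and_ext("/data/shard.parquet"))
```

This sketch only understands forward-slash separators, so Windows-style local paths would need `os.path` instead.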

nemo_curator.utils.split_large_files._flush_jsonl_chunk(
    lines: list[bytes],
    output_file: str,
    storage_options: dict[str, typing.Any]
) -> None

nemo_curator.utils.split_large_files._join_out_path(
    output_path: str,
    filename: str,
    storage_options: dict[str, typing.Any]
) -> str

Join output directory and filename using the target filesystem (local or remote).
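
The difference between joining local and remote paths can be sketched with the standard library alone. This is an assumption-laden illustration, not the library's code: `join_out_path` is a hypothetical name, it detects remote targets by looking for `://`, and it omits `storage_options`, which the real helper presumably uses to resolve the target filesystem.

```python
import os
import posixpath


def join_out_path(output_path: str, filename: str) -> str:
    """Hypothetical stand-in for illustration only, not the library's code."""
    if "://" in output_path:
        # Remote URIs always use "/" regardless of the host operating system.
        return posixpath.join(output_path.rstrip("/"), filename)
    # Local paths follow the conventions of the current platform.
    return os.path.join(output_path, filename)


print(join_out_path("s3://bucket/out/", "part_0.jsonl"))
print(join_out_path("out", "part_0.jsonl"))
```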

nemo_curator.utils.split_large_files._split_table(
    table: pyarrow.Table,
    target_size: int
) -> list[pyarrow.Table]

nemo_curator.utils.split_large_files._storage_options(
    storage_options: dict[str, typing.Any] | None
) -> dict[str, typing.Any]

nemo_curator.utils.split_large_files._write_table_to_file(
    table: pyarrow.Table,
    output_file: str,
    storage_options: dict[str, typing.Any]
) -> None

nemo_curator.utils.split_large_files.main(
    args: argparse.ArgumentParser | None = None
) -> None

nemo_curator.utils.split_large_files.parse_args(
    args: argparse.ArgumentParser | None = None
) -> argparse.Namespace

nemo_curator.utils.split_large_files.split_jsonl_file_by_size(
    input_file: str,
    output_path: str,
    target_size_mb: int,
    storage_options: dict[str, typing.Any] | None = None
) -> None
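
The page does not describe the splitting behaviour, so the following is only a sketch of one common way to split a JSONL file by size: accumulate whole lines until the next one would exceed the target, then write the chunk. It handles local files only, and everything in it is an assumption rather than the library's method, including the name `split_jsonl_by_size`, the `{stem}_{index}{ext}` output naming, and the choice to return the list of written paths.

```python
import os
import tempfile


def split_jsonl_by_size(input_file: str, output_dir: str, target_size_mb: int) -> list[str]:
    """Hypothetical sketch: write consecutive lines into chunks of about target_size_mb."""
    target_bytes = target_size_mb * 1024 * 1024
    stem, ext = os.path.splitext(os.path.basename(input_file))
    os.makedirs(output_dir, exist_ok=True)
    written: list[str] = []
    chunk: list[bytes] = []
    chunk_bytes = 0

    def flush() -> None:
        nonlocal chunk, chunk_bytes
        if not chunk:
            return
        # Assumed naming scheme: <stem>_<index><ext>
        out = os.path.join(output_dir, f"{stem}_{len(written)}{ext}")
        with open(out, "wb") as f:
            f.writelines(chunk)
        written.append(out)
        chunk, chunk_bytes = [], 0

    with open(input_file, "rb") as f:
        for line in f:
            # Close the current chunk before a line that would push it past
            # the target; records are never split across files.
            if chunk and chunk_bytes + len(line) > target_bytes:
                flush()
            chunk.append(line)
            chunk_bytes += len(line)
    flush()  # write whatever remains
    return written


# Small demonstration on a temporary file with three ~0.6 MB records.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "data.jsonl")
    with open(src, "wb") as f:
        for _ in range(3):
            f.write(b'{"text": "' + b"x" * 600_000 + b'"}\n')
    parts = split_jsonl_by_size(src, os.path.join(tmp, "out"), target_size_mb=1)
    print([os.path.basename(p) for p in parts])
```

Because lines are kept whole, a single record larger than the target still ends up in its own oversized file in this sketch.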

nemo_curator.utils.split_large_files.split_parquet_file_by_size(
    input_file: str,
    output_path: str,
    target_size_mb: int,
    storage_options: dict[str, typing.Any] | None = None
) -> None