nemo_curator.stages.deduplication.semantic.utils


Module Contents

Functions

| Name | Description |
| --- | --- |
| break_parquet_partition_into_groups | Break parquet files into groups to stay under cuDF's 2-billion-row limit. |
| get_array_from_df | Convert a column of lists to a 2D array. |

API

nemo_curator.stages.deduplication.semantic.utils.break_parquet_partition_into_groups(
files: list[str],
embedding_dim: int | None = None,
storage_options: dict[str, typing.Any] | None = None
) -> list[list[str]]

Break parquet files into groups to stay under cuDF's 2-billion-row limit.
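
The grouping idea can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `group_files_by_row_budget`, the `row_counts` mapping, and the `max_elements` budget are hypothetical names introduced here, and the sketch assumes that each row costs `embedding_dim` elements when an embedding dimension is given. The real function takes a list of file paths and optional `storage_options`; how it obtains row counts is not documented on this page.

```python
def group_files_by_row_budget(row_counts, embedding_dim=None, max_elements=2**31 - 1):
    """Greedily pack files into groups whose total size stays under a budget.

    row_counts:    hypothetical mapping of file path -> number of rows.
    embedding_dim: if given, each row is assumed to cost this many elements.
    max_elements:  assumed per-group budget (signed 32-bit maximum).
    """
    per_row = embedding_dim if embedding_dim else 1
    groups, current, used = [], [], 0
    for path, n_rows in row_counts.items():
        cost = n_rows * per_row
        # Start a new group when adding this file would exceed the budget.
        if current and used + cost > max_elements:
            groups.append(current)
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        groups.append(current)
    return groups


counts = {"a.parquet": 500_000, "b.parquet": 1_000_000, "c.parquet": 800_000}
groups_with_dim = group_files_by_row_budget(counts, embedding_dim=1024)
groups_rows_only = group_files_by_row_budget(counts)
print(groups_with_dim)   # [['a.parquet', 'b.parquet'], ['c.parquet']]
print(groups_rows_only)  # [['a.parquet', 'b.parquet', 'c.parquet']]
```

In this sketch a single file that exceeds the budget on its own still gets its own group; splitting within a file is out of scope here.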

nemo_curator.stages.deduplication.semantic.utils.get_array_from_df(
df: cudf.DataFrame,
embedding_col: str
) -> cupy.ndarray

Convert a column of lists to a 2D array.
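
As a CPU-only illustration of the conversion, the sketch below flattens a column of equal-length lists into one contiguous buffer and views it as a 2D array. It uses only the Python standard library as a stand-in for the cuDF/CuPy types in the signature; `lists_to_2d` is a hypothetical helper and does not reflect how the library performs the conversion on the GPU.

```python
from array import array


def lists_to_2d(rows):
    """Pack equal-length lists of floats into one flat buffer viewed as 2D."""
    if not rows or len(rows[0]) == 0:
        raise ValueError("need at least one non-empty row")
    dim = len(rows[0])
    if any(len(r) != dim for r in rows):
        raise ValueError("all rows must have the same length")
    # Flatten row by row into a contiguous buffer of doubles.
    flat = array("d", (x for r in rows for x in r))
    # memoryview.cast reshapes only through a byte view:
    # 1D doubles -> 1D bytes -> 2D doubles with shape (n_rows, dim).
    return memoryview(flat).cast("B").cast("d", shape=(len(rows), dim))


column = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
matrix = lists_to_2d(column)
print(matrix.shape)  # (2, 3)
```

The reshape is a view over the flat buffer rather than a copy, which mirrors why a fixed embedding length per row is needed for a rectangular result.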