stages.deduplication.semantic.utils

Module Contents

Functions

break_parquet_partition_into_groups

Break Parquet files into groups so that each group stays under cuDF's 2-billion-row limit.

get_array_from_df

Convert a column of lists to a 2D array.

API

stages.deduplication.semantic.utils.break_parquet_partition_into_groups(
    files: list[str],
    embedding_dim: int | None = None,
    storage_options: dict[str, Any] | None = None,
) → list[list[str]]

Break Parquet files into groups so that each group stays under cuDF's 2-billion-row limit.
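The grouping idea can be illustrated with a small, CPU-only sketch. It is not the library's implementation: the helper name `group_files_by_row_budget`, the `rows_per_file` argument, and the assumption that every element of an embedding counts toward the row limit are all hypothetical, chosen only to show why `embedding_dim` would matter for the group size.

```python
# Hypothetical sketch; names and the budgeting rule are assumptions, not the library's API.
CUDF_MAX_ROWS = 2_000_000_000  # round figure taken from the "2bn" in the description


def group_files_by_row_budget(
    files: list[str],
    rows_per_file: int,
    embedding_dim: int,
    max_rows: int = CUDF_MAX_ROWS,
) -> list[list[str]]:
    """Split files into consecutive groups that fit an assumed row budget."""
    if not files:
        return []
    # Assumption: each element of a list-typed embedding counts toward the limit,
    # so the number of records that fit in one group shrinks with the embedding size.
    max_records_per_group = max_rows // embedding_dim
    # Assumption: every file holds roughly rows_per_file records.
    files_per_group = max(1, max_records_per_group // max(1, rows_per_file))
    return [
        files[i : i + files_per_group]
        for i in range(0, len(files), files_per_group)
    ]


example_files = [f"part_{i}.parquet" for i in range(5)]
groups = group_files_by_row_budget(
    example_files, rows_per_file=1_000_000, embedding_dim=1000
)
print(groups)
```

With these illustrative numbers, each group may hold 2,000,000 records, so the five files are split into groups of at most two files each.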

stages.deduplication.semantic.utils.get_array_from_df(
    df: cudf.DataFrame,
    embedding_col: str,
) → cupy.ndarray

Convert a column of lists to a 2D array.
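The conversion can be pictured with a CPU-only analogue that uses NumPy in place of cuDF and CuPy. This is a sketch of the general technique (flatten the list elements, then reshape to one row per record), not the function's actual implementation; the helper name `lists_to_2d_array` is hypothetical, and the sketch assumes every list has the same length.

```python
import numpy as np


def lists_to_2d_array(rows: list[list[float]]) -> np.ndarray:
    """Hypothetical CPU analogue: turn equal-length lists into a 2D array."""
    if not rows:
        return np.empty((0, 0), dtype=np.float32)
    # Flatten all list elements into one contiguous 1D buffer.
    flat = np.fromiter((x for row in rows for x in row), dtype=np.float32)
    # One row per record; the column count is inferred from the buffer size.
    return flat.reshape(len(rows), -1)


embeddings = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
arr = lists_to_2d_array(embeddings)
print(arr.shape)
```

If the lists differ in length, the reshape either raises an error or silently misaligns rows, so a real pipeline would need to validate the embedding dimension first.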