stages.deduplication.semantic.utils
Module Contents
Functions
break_parquet_partition_into_groups | Break parquet files into groups to avoid the cuDF 2-billion-row limit. |
get_array_from_df | Convert a column of lists to a 2D array. |
API
stages.deduplication.semantic.utils.break_parquet_partition_into_groups(
    files: list[str],
    embedding_dim: int | None = None,
    storage_options: dict[str, Any] | None = None,
)
Break parquet files into groups to avoid the cuDF 2-billion-row limit.
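The grouping idea can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `group_files_by_element_budget`, the `(path, row_count)` input, and the `max_elements` parameter are all hypothetical, and the sketch assumes that the limit applies to the total number of embedding values (rows times `embedding_dim`) and equals the signed 32-bit maximum. How the real function obtains row counts from `files` and `storage_options` is not stated here.

```python
# Hypothetical sketch: greedily pack files into groups whose total number of
# embedding values stays within a budget. Names and the exact limit are
# assumptions, not the documented behavior of the real function.

INT32_MAX = 2**31 - 1  # assumed ceiling on elements in one column


def group_files_by_element_budget(row_counts, embedding_dim, max_elements=INT32_MAX):
    """Group files so each group's rows * embedding_dim fits in max_elements.

    row_counts: iterable of (path, n_rows) pairs, in the order to be grouped.
    Returns a list of groups, each a list of paths.
    """
    groups = []
    current = []
    current_elements = 0
    for path, n_rows in row_counts:
        n_elements = n_rows * embedding_dim
        # Start a new group when adding this file would exceed the budget.
        if current and current_elements + n_elements > max_elements:
            groups.append(current)
            current, current_elements = [], 0
        # A single file larger than the budget still gets its own group.
        current.append(path)
        current_elements += n_elements
    if current:
        groups.append(current)
    return groups


# Tiny budget so the split is visible: each file holds 4 rows * 2 values = 8.
example = [("a.parquet", 4), ("b.parquet", 4), ("c.parquet", 4)]
example_groups = group_files_by_element_budget(example, embedding_dim=2, max_elements=16)
```

With the tiny budget above, the first two files share a group and the third starts a new one. A greedy pass keeps files in their original order, which is the simplest choice; it does not try to balance group sizes.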
stages.deduplication.semantic.utils.get_array_from_df(
    df: cudf.DataFrame,
    embedding_col: str,
)
Convert a column of lists to a 2D array.
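As a rough CPU-side illustration of the conversion, the sketch below flattens a sequence of equal-length lists and reshapes the result to `(n_rows, embedding_dim)`. It uses NumPy and plain Python lists as a stand-in for a `cudf.DataFrame` column; the helper name `lists_to_2d_array` is hypothetical, and whether the real function works this way or validates row lengths is an assumption.

```python
# Hypothetical CPU stand-in: flatten a column of equal-length lists and
# reshape it into a 2D array. The real function operates on a cudf.DataFrame.
import numpy as np


def lists_to_2d_array(rows, dtype=np.float32):
    """Convert a sequence of equal-length lists to an (n_rows, width) array."""
    n_rows = len(rows)
    if n_rows == 0:
        return np.empty((0, 0), dtype=dtype)
    widths = {len(r) for r in rows}
    if len(widths) != 1:
        raise ValueError("all rows must have the same length")
    width = widths.pop()
    # Flatten every value into one 1D buffer, then view it as 2D.
    flat = np.fromiter((x for r in rows for x in r), dtype=dtype)
    return flat.reshape(n_rows, width)


embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
array_2d = lists_to_2d_array(embeddings)
```

Each row of the result corresponds to one input list, so `array_2d.shape` is `(3, 2)` for the example above.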