stages.deduplication.semantic.ranking
Module Contents
Classes
RankingStrategy | Flexible ranking strategy that allows users to specify metadata columns and sorting order.
API
- class stages.deduplication.semantic.ranking.RankingStrategy(
- metadata_cols: list[str],
- ascending: list[bool] | bool = True,
- strategy: Literal['sort', 'random'] = 'sort',
- random_seed: int = 42,
- )
Flexible ranking strategy that allows users to specify metadata columns and sorting order.
This design allows for extensible ranking based on any metadata columns with user-specified sorting criteria.
Initialization
Initialize ranking strategy.
Args:
- metadata_cols: List of metadata column names to sort by, in priority order.
- ascending: Boolean or list of booleans giving the sort order for each column. A single bool applies to all columns.
- strategy: Ranking strategy. "sort" ranks by metadata_cols; "random" ranks in random order.
- random_seed: Seed for the random strategy.
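The rule that a single `ascending` bool applies to all columns can be sketched as below. This is a minimal illustration, not the class's actual code: `normalize_ascending` is a hypothetical helper, and raising `ValueError` on a length mismatch is an assumption about how such input might be handled.

```python
def normalize_ascending(metadata_cols, ascending=True):
    """Expand `ascending` to one flag per metadata column (hypothetical helper)."""
    if isinstance(ascending, bool):
        # A single bool is broadcast to every column.
        return [ascending] * len(metadata_cols)
    flags = list(ascending)
    if len(flags) != len(metadata_cols):
        # Assumption: a list must supply exactly one flag per column.
        raise ValueError("ascending must have one entry per metadata column")
    return flags


flags_all = normalize_ascending(["score", "length"], False)
flags_each = normalize_ascending(["score", "length"], [False, True])
```

Here `flags_all` holds one `False` per column, while `flags_each` keeps the per-column values as given.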
- classmethod metadata_based(
- metadata_cols: list[str],
- ascending: list[bool] | bool = True,
- random_seed: int = 42,
- )
Create a metadata-based ranking strategy.
Args:
- metadata_cols: List of metadata column names to sort by, in priority order.
- ascending: Boolean or list of booleans indicating the sort order for each column.
- random_seed: Random seed for reproducible results.
Returns: RankingStrategy instance configured for metadata-based ranking
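To make "priority order" with per-column sort direction concrete, here is a plain-Python sketch that ranks a small cluster held as a list of dicts. It is a stand-in for illustration only: the real method operates on a `cudf.DataFrame`, and `rank_rows` and the `score` and `length` columns are hypothetical names.

```python
def rank_rows(rows, metadata_cols, ascending):
    """Order rows by metadata_cols in priority order (illustrative stand-in)."""
    ranked = list(rows)
    # Python's sort is stable, so sorting by the lowest-priority column first
    # and the highest-priority column last gives a priority-ordered result.
    for col, asc in reversed(list(zip(metadata_cols, ascending))):
        ranked.sort(key=lambda row: row[col], reverse=not asc)
    return ranked


cluster = [
    {"id": "a", "score": 0.9, "length": 120},
    {"id": "b", "score": 0.9, "length": 80},
    {"id": "c", "score": 0.4, "length": 300},
]

# Highest score first; ties broken by shortest length.
ranked = rank_rows(cluster, ["score", "length"], [False, True])
print([row["id"] for row in ranked])  # ['b', 'a', 'c']
```

Rows `a` and `b` tie on `score`, so the second column decides between them, which is what listing columns in priority order implies.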
- classmethod random(
- random_seed: int = 42,
- )
Create a random ranking strategy.
Args:
- random_seed: Random seed for reproducible results.
Returns: RankingStrategy instance configured for random ranking
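The role of `random_seed` in reproducibility can be illustrated with Python's standard `random` module. This is only a sketch of the idea: how the library draws its random order on a `cudf.DataFrame` is not stated here, and `random_rank` is a hypothetical helper.

```python
import random


def random_rank(rows, random_seed=42):
    """Return rows in a seeded random order (illustrative stand-in)."""
    ranked = list(rows)
    # A dedicated generator keeps the result independent of global random state.
    random.Random(random_seed).shuffle(ranked)
    return ranked


docs = ["doc-0", "doc-1", "doc-2", "doc-3", "doc-4"]
first = random_rank(docs, random_seed=42)
second = random_rank(docs, random_seed=42)

# The same seed yields the same ranking on every call.
assert first == second
```

Fixing the seed means repeated runs rank the same cluster identically, which is what makes a random strategy usable in a reproducible pipeline.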
- rank_cluster(cluster_df: cudf.DataFrame) -> cudf.DataFrame
Rank cluster based on the specified strategy.