stages.deduplication.semantic.ranking

Module Contents

Classes

RankingStrategy

Flexible ranking strategy that allows users to specify metadata columns and sorting order.

API

class stages.deduplication.semantic.ranking.RankingStrategy(
    metadata_cols: list[str],
    ascending: list[bool] | bool = True,
    strategy: Literal['sort', 'random'] = 'sort',
    random_seed: int = 42,
)

Flexible ranking strategy that allows users to specify metadata columns and sorting order.

This design allows for extensible ranking based on any metadata columns with user-specified sorting criteria.

Initialization

Initialize ranking strategy.

Args:
    metadata_cols: List of metadata column names to sort by, in priority order.
    ascending: Boolean or list of booleans indicating the sort order for each column. A single bool applies to all columns.
    strategy: Ranking strategy. "sort" ranks by sorting on metadata_cols; "random" ranks randomly.
    random_seed: Seed for the random strategy.
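The broadcasting rule for `ascending` can be illustrated with a small stand-alone sketch. The helper name `normalize_ascending` is hypothetical and is not part of this module; raising on a length mismatch is also an assumption, since the reference above does not say how mismatched lengths are handled.

```python
from __future__ import annotations


def normalize_ascending(
    metadata_cols: list[str], ascending: list[bool] | bool = True
) -> list[bool]:
    """Return one sort-direction flag per metadata column (hypothetical helper)."""
    if isinstance(ascending, bool):
        # A single bool applies to every column.
        return [ascending] * len(metadata_cols)
    if len(ascending) != len(metadata_cols):
        # Assumption: a mismatch is treated as an error in this sketch.
        raise ValueError("ascending must have one entry per metadata column")
    return list(ascending)


flags = normalize_ascending(["quality", "length"], False)
print(flags)
```

With a single `False`, both columns are sorted in descending order; passing `[False, True]` instead would sort the first column descending and the second ascending.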

classmethod metadata_based(
    metadata_cols: list[str],
    ascending: list[bool] | bool = True,
    random_seed: int = 42,
) -> stages.deduplication.semantic.ranking.RankingStrategy

Create a metadata-based ranking strategy.

Args:
    metadata_cols: List of metadata column names to sort by, in priority order.
    ascending: Boolean or list of booleans indicating the sort order for each column.
    random_seed: Random seed for reproducible results.

Returns: RankingStrategy instance configured for metadata-based ranking
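To make "priority order" concrete, here is a minimal pure-Python sketch of a multi-column sort with a per-column direction. It works on a list of dicts rather than a `cudf.DataFrame`, and `rank_rows` is a hypothetical name; it illustrates the sorting semantics only and is not the library's implementation.

```python
from operator import itemgetter


def rank_rows(rows, metadata_cols, ascending):
    """Sort rows by metadata_cols in priority order (hypothetical helper)."""
    ranked = list(rows)
    # Python's sort is stable, so sorting from the lowest-priority column
    # to the highest leaves the first column as the primary key.
    for col, asc in reversed(list(zip(metadata_cols, ascending))):
        ranked.sort(key=itemgetter(col), reverse=not asc)
    return ranked


cluster = [
    {"id": "a", "quality": 0.9, "length": 120},
    {"id": "b", "quality": 0.9, "length": 80},
    {"id": "c", "quality": 0.4, "length": 50},
]

# Highest quality first; ties broken by shortest length.
ranked = rank_rows(cluster, ["quality", "length"], [False, True])
ranked_ids = [row["id"] for row in ranked]
print(ranked_ids)  # ['b', 'a', 'c']
```

Rows "a" and "b" tie on `quality`, so the second column decides their order; "c" ranks last because its `quality` is lowest.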

classmethod random(
    random_seed: int = 42,
) -> stages.deduplication.semantic.ranking.RankingStrategy

Create a random ranking strategy.

Args:
    random_seed: Random seed for reproducible results.

Returns: RankingStrategy instance configured for random ranking
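The role of `random_seed` can be sketched with the standard library: a seeded shuffle yields the same order on every run. `random_rank` is a hypothetical stand-in that operates on a plain list; how the module itself shuffles a `cudf.DataFrame` is not described here, so treat this only as an illustration of seeded reproducibility.

```python
import random


def random_rank(rows, random_seed=42):
    """Return rows in a seeded random order (hypothetical helper)."""
    ranked = list(rows)
    # A dedicated Random instance avoids touching the global RNG state.
    random.Random(random_seed).shuffle(ranked)
    return ranked


doc_ids = ["a", "b", "c", "d", "e"]
first = random_rank(doc_ids, random_seed=42)
second = random_rank(doc_ids, random_seed=42)

# The same seed reproduces the same ranking.
print(first == second)  # True
```

Keeping the seed fixed means repeated runs over the same cluster produce the same ranking.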

rank_cluster(cluster_df: cudf.DataFrame) -> cudf.DataFrame

Rank cluster based on the specified strategy.