nemo_curator.stages.deduplication.semantic.ranking


Module Contents

Classes

Name             Description
RankingStrategy  Flexible ranking strategy that allows users to specify metadata columns and sorting order.

API

class nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy(
metadata_cols: list[str],
ascending: list[bool] | bool = True,
strategy: typing.Literal['sort', 'random'] = 'sort',
random_seed: int = 42
)

Flexible ranking strategy that allows users to specify metadata columns and sorting order.

Ranking can be based on any set of metadata columns, each with a user-specified sort order.

ascending = [ascending] * len(metadata_cols)
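
The expression above suggests that a single boolean is repeated once per metadata column. A minimal sketch of that normalization in plain Python follows. It assumes that a list is passed through unchanged and that a length mismatch is an error; both are assumptions, and `normalize_ascending` and the column names are hypothetical, not part of NeMo Curator.

```python
from __future__ import annotations


def normalize_ascending(
    metadata_cols: list[str], ascending: list[bool] | bool = True
) -> list[bool]:
    """Return one sort-direction flag per metadata column (illustrative helper)."""
    if isinstance(ascending, bool):
        # Broadcast the single flag, mirroring [ascending] * len(metadata_cols).
        return [ascending] * len(metadata_cols)
    # Assumption: a list must supply exactly one flag per column.
    if len(ascending) != len(metadata_cols):
        raise ValueError("ascending must have one entry per metadata column")
    return list(ascending)


# Hypothetical column names, used only for illustration.
print(normalize_ascending(["score", "id"], False))  # prints [False, False]
```
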
nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy.metadata_based(
metadata_cols: list[str],
ascending: list[bool] | bool = True,
random_seed: int = 42
) -> nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy
classmethod

Create a metadata-based ranking strategy.

Parameters:

metadata_cols
list[str]

List of metadata column names to sort by (in priority order)

ascending
list[bool] | bool, defaults to True

Sort order for the metadata columns: either a single boolean applied to every column, or a list with one boolean per column (True sorts ascending)

random_seed
int, defaults to 42

Random seed for reproducible results

Returns: RankingStrategy

RankingStrategy instance configured for metadata-based ranking
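
To illustrate what sorting by several columns "in priority order" means, here is a sketch in plain Python over a list of dictionaries. It is a stand-in for the idea only, not the library's cudf-based implementation; `sort_rows` and the `quality` and `length` columns are hypothetical.

```python
from __future__ import annotations


def sort_rows(
    rows: list[dict], metadata_cols: list[str], ascending: list[bool]
) -> list[dict]:
    """Sort rows by several columns in priority order (illustrative helper)."""
    ranked = list(rows)
    # Python's sort is stable, so sorting by the lowest-priority column first
    # and the highest-priority column last yields a priority-ordered result.
    for col, asc in reversed(list(zip(metadata_cols, ascending))):
        ranked.sort(key=lambda row: row[col], reverse=not asc)
    return ranked


cluster = [
    {"id": 1, "quality": 0.9, "length": 120},
    {"id": 2, "quality": 0.9, "length": 80},
    {"id": 3, "quality": 0.4, "length": 300},
]
# Highest quality first; ties broken by shortest length.
ranked = sort_rows(cluster, ["quality", "length"], [False, True])
print([row["id"] for row in ranked])  # prints [2, 1, 3]
```

The first column in `metadata_cols` dominates the ordering; later columns only decide between rows that tie on the earlier ones.
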

classmethod

Create a random ranking strategy.

Parameters:

random_seed
int, defaults to 42

Random seed for reproducible results

Returns: RankingStrategy

RankingStrategy instance configured for random ranking
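
The role of `random_seed` can be illustrated with a small plain-Python sketch: the same seed produces the same ordering on every call. This shows the general idea of seeded shuffling and is not the library's implementation; `shuffle_rows` is a hypothetical helper.

```python
from __future__ import annotations

import random


def shuffle_rows(rows: list[dict], random_seed: int = 42) -> list[dict]:
    """Return rows in a seeded pseudo-random order (illustrative helper)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_seed).shuffle(shuffled)
    return shuffled


cluster = [{"id": i} for i in range(5)]
first = shuffle_rows(cluster, random_seed=42)
second = shuffle_rows(cluster, random_seed=42)
print(first == second)  # prints True: same seed, same order
```
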

nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy.rank_cluster(
cluster_df: cudf.DataFrame
) -> cudf.DataFrame

Rank the rows of a cluster according to the configured strategy.