nemo_curator.stages.text.filters.token.token_count

Module Contents

Classes

Name              Description
TokenCountFilter  Discards a document if its token count is less than min_tokens or greater than max_tokens.

API

class nemo_curator.stages.text.filters.token.token_count.TokenCountFilter(
tokenizer: transformers.AutoTokenizer | None = None,
hf_model_name: str | None = None,
hf_token: str | None = None,
min_tokens: int = 0,
max_tokens: int = float('inf')
)

Bases: DocumentFilter

Discards a document if its token count is less than min_tokens or greater than max_tokens.

_name = 'token_count'
nemo_curator.stages.text.filters.token.token_count.TokenCountFilter.keep_document(
score: int
) -> bool
nemo_curator.stages.text.filters.token.token_count.TokenCountFilter.load_tokenizer() -> None
nemo_curator.stages.text.filters.token.token_count.TokenCountFilter.model_check_or_download() -> None
nemo_curator.stages.text.filters.token.token_count.TokenCountFilter.score_document(
text: str
) -> int
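
The two signatures above suggest a score-then-decide flow: score_document maps text to a token count, and keep_document decides from that count. The following is a minimal, self-contained sketch of that flow, not the NeMo Curator implementation. The class name SketchTokenCountFilter, the whitespace tokenizer standing in for a Hugging Face tokenizer, and the inclusive treatment of both bounds are all assumptions made for illustration.

```python
import math
from typing import Callable, List


class SketchTokenCountFilter:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def __init__(
        self,
        tokenize: Callable[[str], List[str]] = str.split,  # assumption: whitespace split replaces a real tokenizer
        min_tokens: int = 0,
        max_tokens: float = math.inf,
    ) -> None:
        self._tokenize = tokenize
        self._min_tokens = min_tokens
        self._max_tokens = max_tokens

    def score_document(self, text: str) -> int:
        # The score is simply the number of tokens in the text.
        return len(self._tokenize(text))

    def keep_document(self, score: int) -> bool:
        # Assumption: both bounds are inclusive.
        return self._min_tokens <= score <= self._max_tokens


docs = [
    "too short",
    "this one has five tokens",
    "a b c d e f g h i j k l",
]
flt = SketchTokenCountFilter(min_tokens=3, max_tokens=8)

# Score each document, then keep only those whose count is within bounds.
kept = [d for d in docs if flt.keep_document(flt.score_document(d))]
print(kept)
```

With the real class, the tokenizer would come from the tokenizer or hf_model_name argument rather than a whitespace split, so the counts for the same text would generally differ.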