nemo_automodel._transformers.auto_tokenizer

Module Contents

Classes

NeMoAutoTokenizer – Auto tokenizer class that dispatches to appropriate tokenizer implementations.

Functions

_get_model_type – Determine the model type from the config.

Data

API
- nemo_automodel._transformers.auto_tokenizer.logger

  'getLogger(…)'
- nemo_automodel._transformers.auto_tokenizer._get_model_type(
  - pretrained_model_name_or_path: str,
  - trust_remote_code: bool = False,
  )
Determine the model type from the config.
- Parameters:
pretrained_model_name_or_path – Model identifier or path
trust_remote_code – Whether to trust remote code
- Returns:
The model_type string, or None if it cannot be determined
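To illustrate the behavior documented above, here is a minimal, self-contained sketch of reading a model type from a checkpoint's config. The function name and the local-directory-only handling are assumptions for illustration; the real _get_model_type also resolves hub identifiers and honors trust_remote_code.

```python
import json
import os
from typing import Optional


def get_model_type_sketch(pretrained_model_name_or_path: str) -> Optional[str]:
    """Illustrative sketch: read `model_type` from a local config.json.

    Returns None when the config is missing or carries no model_type,
    matching the documented "None if it cannot be determined" contract.
    """
    config_path = os.path.join(pretrained_model_name_or_path, "config.json")
    if not os.path.isfile(config_path):
        return None
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return config.get("model_type")
```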
- class nemo_automodel._transformers.auto_tokenizer.NeMoAutoTokenizer

  Bases: transformers.AutoTokenizer

  Auto tokenizer class that dispatches to appropriate tokenizer implementations.

  Similar to HuggingFace's AutoTokenizer, but with a custom registry for specialized tokenizer implementations.
The dispatch logic is:

1. If a custom tokenizer is registered for the model type, use it.
2. Otherwise, fall back to NeMoAutoTokenizerWithBosEosEnforced.
.. rubric:: Example

# Will use MistralCommonBackend if available for Mistral models
tokenizer = NeMoAutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Force using HF AutoTokenizer with BOS/EOS enforcement
tokenizer = NeMoAutoTokenizer.from_pretrained("gpt2", force_default=True)
Initialization
- _registry

  None
- classmethod register(
  - model_type: str,
  - tokenizer_cls: Union[Type, Callable],
  )
Register a custom tokenizer for a specific model type.
- Parameters:
model_type – The model type string (e.g., "mistral", "llama")
tokenizer_cls – The tokenizer class or factory function
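The register contract above can be sketched with a plain dict keyed by model type. This is a hypothetical, self-contained illustration (the class and method names are stand-ins), not the library's actual implementation:

```python
from typing import Callable, Dict, Type, Union


class TokenizerRegistrySketch:
    """Illustrative model_type -> tokenizer registry, mirroring the
    documented register(model_type, tokenizer_cls) contract."""

    _registry: Dict[str, Union[Type, Callable]] = {}

    @classmethod
    def register(cls, model_type: str, tokenizer_cls: Union[Type, Callable]) -> None:
        # A later registration for the same model_type overwrites the earlier one.
        cls._registry[model_type] = tokenizer_cls

    @classmethod
    def lookup(cls, model_type: str):
        # Returns None when no custom tokenizer is registered for this type.
        return cls._registry.get(model_type)
```

A factory function works just as well as a class here, which is why the documented parameter accepts Union[Type, Callable].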
- classmethod from_pretrained(
  - pretrained_model_name_or_path: str,
  - *args,
  - force_default: bool = False,
  - force_hf: bool = False,
  - trust_remote_code: bool = False,
  - **kwargs,
  )
Load a tokenizer from a pretrained model.
- Parameters:
pretrained_model_name_or_path – Model identifier or path
force_default – If True, always use NeMoAutoTokenizerWithBosEosEnforced
force_hf – If True, return the raw HF AutoTokenizer without any wrapping
trust_remote_code – Whether to trust remote code when loading config
**kwargs – Additional arguments passed to the tokenizer's from_pretrained
- Returns:
A tokenizer instance appropriate for the model type
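The selection logic documented above can be sketched as a small dispatch function. The class names and the REGISTRY dict are hypothetical stand-ins, and the assumption that force_hf takes precedence over force_default is mine (the docs do not state the relative order of the two flags):

```python
from typing import Callable, Dict, Optional, Type, Union


# Hypothetical stand-ins for the real tokenizer classes.
class RawHFTokenizer: ...
class BosEosEnforcedTokenizer: ...

REGISTRY: Dict[str, Union[Type, Callable]] = {}


def dispatch_tokenizer(model_type: Optional[str],
                       force_default: bool = False,
                       force_hf: bool = False):
    """Sketch of the documented dispatch: force flags first, then the
    custom registry, then the BOS/EOS-enforced default."""
    if force_hf:
        # Raw HF AutoTokenizer, no wrapping at all.
        return RawHFTokenizer
    if force_default:
        # Always the BOS/EOS-enforced default, skipping the registry.
        return BosEosEnforcedTokenizer
    if model_type is not None and model_type in REGISTRY:
        # A custom tokenizer registered for this model type wins.
        return REGISTRY[model_type]
    # Fallback when the model type is unknown or unregistered.
    return BosEosEnforcedTokenizer
```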
- nemo_automodel._transformers.auto_tokenizer.__all__

  ['NeMoAutoTokenizer', 'NeMoAutoTokenizerWithBosEosEnforced', 'TokenizerRegistry']