nemo_deploy.nlp.hf_deployable
Module Contents
Classes
| HuggingFaceLLMDeploy | A Triton inference server compatible wrapper for HuggingFace models. |
Data
| LOGGER |
| SUPPORTED_TASKS |
API
- nemo_deploy.nlp.hf_deployable.LOGGER = 'getLogger(...)'
- nemo_deploy.nlp.hf_deployable.SUPPORTED_TASKS = ['text-generation']
- class nemo_deploy.nlp.hf_deployable.HuggingFaceLLMDeploy(
- hf_model_id_path: Optional[str] = None,
- hf_peft_model_id_path: Optional[str] = None,
- tokenizer_id_path: Optional[str] = None,
- model: Optional[transformers.AutoModel] = None,
- tokenizer: Optional[transformers.AutoTokenizer] = None,
- tokenizer_padding=True,
- tokenizer_truncation=True,
- tokenizer_padding_side='left',
- task: Optional[str] = 'text-generation',
- **hf_kwargs,
- )
- Bases: nemo_deploy.ITritonDeployable
- A Triton inference server compatible wrapper for HuggingFace models.
- This class provides a standardized interface for deploying HuggingFace models on a Triton inference server. It handles model loading, inference, and deployment configuration. The task is selected with the task argument; the module's SUPPORTED_TASKS list contains only 'text-generation'.
- Parameters:
- hf_model_id_path (Optional[str]) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub. 
- hf_peft_model_id_path (Optional[str]) – Path to the PEFT model or model identifier. Can be a local path or a model ID from HuggingFace Hub. 
- tokenizer_id_path (Optional[str]) – Path to the tokenizer or tokenizer identifier. If None, will use the same path as hf_model_id_path. 
- model (Optional[AutoModel]) – Pre-loaded HuggingFace model. 
- tokenizer (Optional[AutoTokenizer]) – Pre-loaded HuggingFace tokenizer. 
- tokenizer_padding (bool) – Whether to enable padding in tokenizer. Defaults to True. 
- tokenizer_truncation (bool) – Whether to enable truncation in tokenizer. Defaults to True. 
- tokenizer_padding_side (str) – Which side to pad on (‘left’ or ‘right’). Defaults to ‘left’. 
- task (Optional[str]) – HuggingFace task type (e.g., “text-generation”). Defaults to “text-generation”. 
- **hf_kwargs – Additional keyword arguments to pass to HuggingFace model loading. 
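The constructor arguments above can be assembled as in the following minimal sketch. The helpers `build_deploy_kwargs` and `make_deployable` are hypothetical names introduced only for illustration, the task check against `SUPPORTED_TASKS` is this sketch's own validation rather than documented behavior of the class, and `"gpt2"` and `torch_dtype="auto"` are example values assumed to be acceptable to HuggingFace model loading.

```python
from typing import Any, Dict, Optional

# Mirrors the module-level SUPPORTED_TASKS value documented on this page.
SUPPORTED_TASKS = ["text-generation"]


def build_deploy_kwargs(
    hf_model_id_path: str,
    hf_peft_model_id_path: Optional[str] = None,
    tokenizer_id_path: Optional[str] = None,
    task: str = "text-generation",
    **hf_kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for HuggingFaceLLMDeploy."""
    # Validation added by this sketch; the class itself may behave differently.
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported task {task!r}; expected one of {SUPPORTED_TASKS}")
    kwargs: Dict[str, Any] = {
        "hf_model_id_path": hf_model_id_path,
        # Documented defaults, spelled out for readability.
        "tokenizer_padding": True,
        "tokenizer_truncation": True,
        "tokenizer_padding_side": "left",
        "task": task,
    }
    # Optional paths are only passed when set, so the class defaults apply otherwise.
    if hf_peft_model_id_path is not None:
        kwargs["hf_peft_model_id_path"] = hf_peft_model_id_path
    if tokenizer_id_path is not None:
        kwargs["tokenizer_id_path"] = tokenizer_id_path
    # Extra keyword arguments are forwarded as **hf_kwargs.
    kwargs.update(hf_kwargs)
    return kwargs


def make_deployable(**kwargs: Any):
    """Instantiate the wrapper; the import is lazy so the sketch loads without nemo_deploy."""
    from nemo_deploy.nlp.hf_deployable import HuggingFaceLLMDeploy

    return HuggingFaceLLMDeploy(**kwargs)


deploy_kwargs = build_deploy_kwargs("gpt2", torch_dtype="auto")
```

With nemo_deploy installed, `make_deployable(**deploy_kwargs)` would construct the wrapper; leaving `tokenizer_id_path` unset relies on the documented fallback to `hf_model_id_path`.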
 
 - Initialization
 - _load(**hf_kwargs) -> None
- Load the HuggingFace pipeline with the specified model and task.
- This method initializes the HuggingFace AutoModel classes using the provided model configuration and task type. It handles the model and tokenizer loading process.
- Raises:
- AssertionError – If task is not specified. 
 
 - generate(**kwargs: Any) -> List[str]
- Generate text based on the provided input prompts.
- This method processes input prompts through the loaded pipeline and generates text according to the specified parameters.
- Parameters:
- **kwargs – Generation parameters including:
- text_inputs: List of input prompts 
- max_length: Maximum number of tokens to generate 
- num_return_sequences: Number of sequences to generate per prompt 
- temperature: Sampling temperature 
- top_k: Number of highest probability tokens to consider 
- top_p: Cumulative probability threshold for token sampling 
- do_sample: Whether to use sampling 
- return_full_text: Whether to return full text or only generated part 
 
- Returns:
- If output_logits and output_scores are False: List[str], a list of generated texts, one for each input prompt.
- If output_logits and output_scores are True: Dict, a dictionary containing:
- sentences: List of generated texts 
- logits: List of logits 
- scores: List of scores 
- Return type:
- List[str], or Dict when logits and scores are requested 
- Raises:
- RuntimeError – If the pipeline is not initialized. 
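Because generate can return either a list of strings or a dictionary, callers may want to normalize the result. The sketch below shows one way to do that, together with an example keyword set built from the documented parameter names. `sentences_from_result` is a hypothetical helper, and the concrete values in `generation_kwargs` are illustrative assumptions.

```python
from typing import Any, Dict, List, Union

# Either documented return shape of generate().
GenerateResult = Union[List[str], Dict[str, Any]]

# Example keyword arguments using the parameter names listed above (values are illustrative).
generation_kwargs: Dict[str, Any] = {
    "text_inputs": ["Hello, my name is"],
    "max_length": 32,
    "num_return_sequences": 1,
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.9,
    "do_sample": True,
    "return_full_text": False,
}


def sentences_from_result(result: GenerateResult) -> List[str]:
    """Hypothetical helper: return the generated texts from either result shape."""
    if isinstance(result, dict):
        # Dict shape: texts are stored under the documented "sentences" key.
        return list(result["sentences"])
    # List shape: the result already is the list of generated texts.
    return list(result)
```

Given a deployable instance, `sentences_from_result(deployable.generate(**generation_kwargs))` would then yield the texts regardless of which shape was returned.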
 
 - property get_triton_input
 - property get_triton_output
 - ray_infer_fn(inputs: Dict[Any, Any])
- Perform inference using Ray with dictionary inputs and outputs.
- Parameters:
- inputs (Dict[Any, Any]) – Dictionary containing input parameters:
- prompts: List of input prompts 
- temperature: Sampling temperature (optional) 
- top_k: Number of highest probability tokens to consider (optional) 
- top_p: Cumulative probability threshold for token sampling (optional) 
- max_length: Maximum number of tokens to generate (optional) 
- output_logits: Whether to output logits (optional) 
- output_scores: Whether to output scores (optional) 
 
- Returns:
- Dictionary containing:
- sentences: List of generated texts 
- scores: Optional array of scores if output_scores is True 
- logits: Optional array of logits if output_logits is True 
- Return type:
- Dict[str, Any] 
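The input dictionary described above can be built as in this minimal sketch. `build_ray_inputs` is a hypothetical helper, not part of nemo_deploy; it uses only the documented keys and omits optional ones that are not set, on the assumption that ray_infer_fn then falls back to its own defaults.

```python
from typing import Any, Dict, Optional, Sequence


def build_ray_inputs(
    prompts: Sequence[str],
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    max_length: Optional[int] = None,
    output_logits: Optional[bool] = None,
    output_scores: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble the inputs dictionary for ray_infer_fn."""
    if not prompts:
        raise ValueError("prompts must contain at least one prompt")
    inputs: Dict[str, Any] = {"prompts": list(prompts)}
    optional = {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "max_length": max_length,
        "output_logits": output_logits,
        "output_scores": output_scores,
    }
    # Only include optional keys the caller actually set.
    inputs.update({key: value for key, value in optional.items() if value is not None})
    return inputs


ray_inputs = build_ray_inputs(["What is Triton?"], temperature=0.8, max_length=64)
```

Given a deployable instance, `deployable.ray_infer_fn(ray_inputs)["sentences"]` would hold the generated texts; `scores` and `logits` are documented as present only when requested.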
 
 - _infer_fn_common(
- prompts,
- temperature=1.0,
- top_k=1,
- top_p=0.0,
- num_tokens_to_generate=256,
- output_logits=False,
- output_scores=False,
- cast_output_func=None,
- )
- Common internal function for inference operations.
- Parameters:
- prompts – List of input prompts 
- temperature – Sampling temperature 
- top_k – Number of highest probability tokens to consider 
- top_p – Cumulative probability threshold for token sampling 
- num_tokens_to_generate – Maximum number of tokens to generate 
- output_logits – Whether to output logits 
- output_scores – Whether to output scores 
- cast_output_func – Optional function to cast output values 
 
- Returns:
- Dict containing inference results
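To make the documented defaults concrete, the sketch below resolves a request dictionary into the argument set of _infer_fn_common. `resolve_infer_args` is a hypothetical helper, and the mapping of the `max_length` input key onto `num_tokens_to_generate` is an assumption made for this illustration; the source does not state how the two are connected.

```python
from typing import Any, Dict

# Defaults copied from the documented _infer_fn_common signature.
INFER_DEFAULTS: Dict[str, Any] = {
    "temperature": 1.0,
    "top_k": 1,
    "top_p": 0.0,
    "num_tokens_to_generate": 256,
    "output_logits": False,
    "output_scores": False,
}


def resolve_infer_args(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge request inputs with the documented defaults."""
    args = dict(INFER_DEFAULTS)
    args["prompts"] = list(inputs.get("prompts", []))
    # Keys that share a name between the request and the signature.
    for key in ("temperature", "top_k", "top_p", "output_logits", "output_scores"):
        if key in inputs:
            args[key] = inputs[key]
    # ASSUMPTION: a request's max_length corresponds to num_tokens_to_generate.
    if "max_length" in inputs:
        args["num_tokens_to_generate"] = int(inputs["max_length"])
    return args


resolved = resolve_infer_args({"prompts": ["Hi"], "max_length": 8})
```

With the defaults shown (top_k=1, top_p=0.0), a request that sets no sampling options keeps only the single most likely token at each step.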