nemo_deploy.nlp.hf_deployable_ray
Module Contents
Classes
HFRayDeployable – A Ray Serve compatible wrapper for deploying HuggingFace models.
Data
API
- nemo_deploy.nlp.hf_deployable_ray.LOGGER = 'getLogger(...)'
- nemo_deploy.nlp.hf_deployable_ray.app = 'FastAPI(...)'
- class nemo_deploy.nlp.hf_deployable_ray.HFRayDeployable(
    hf_model_id_path: str,
    task: str = 'text-generation',
    trust_remote_code: bool = True,
    model_id: str = 'nemo-model',
    device_map: Optional[str] = None,
    max_memory: Optional[str] = None,
    use_vllm_backend: bool = False,
  )
A Ray Serve compatible wrapper for deploying HuggingFace models.
This class provides a standardized interface for deploying HuggingFace models in Ray Serve. It supports various NLP tasks and handles model loading, inference, and deployment configurations.
- Parameters:
hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.
task (str) – HuggingFace task type (e.g., “text-generation”). Defaults to “text-generation”.
trust_remote_code (bool) – Whether to trust remote code when loading the model. Defaults to True.
device_map (str) – Device mapping strategy for model placement. Defaults to “auto”.
tp_plan (str) – Tensor parallelism plan for distributed inference. Defaults to None.
model_id (str) – Identifier for the model in the API responses. Defaults to “nemo-model”.
Initialization
Initialize the HuggingFace model deployment.
- Parameters:
hf_model_id_path (str) – Path to the HuggingFace model or model identifier.
task (str) – HuggingFace task type. Defaults to “text-generation”.
trust_remote_code (bool) – Whether to trust remote code. Defaults to True.
device_map (str) – Device mapping strategy. Defaults to “auto”.
model_id (str) – Model identifier. Defaults to “nemo-model”.
max_memory (str) – Maximum memory allocation when using balanced device map.
use_vllm_backend (bool, optional) – Whether to use the vLLM backend for deployment. If True, exports the HF checkpoint to vLLM format and uses the vLLM backend for inference. Defaults to False.
- Raises:
ImportError – If Ray is not installed.
Exception – If model initialization fails.
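The sketch below shows one way such a deployment might be launched with Ray Serve, assuming HFRayDeployable is exposed as a Ray Serve deployment with a FastAPI ingress (for example via @serve.deployment and @serve.ingress(app)); the HuggingFace model ID, deployment name, and cluster setup are placeholders, not values documented by this module.

    # Minimal launch sketch, under the assumptions stated above.
    import ray
    from ray import serve

    from nemo_deploy.nlp.hf_deployable_ray import HFRayDeployable

    ray.init()          # connect to (or start) a local Ray cluster
    serve.start()       # start the Ray Serve runtime

    handle = HFRayDeployable.bind(
        hf_model_id_path="meta-llama/Llama-3.2-1B",   # placeholder HF model ID
        task="text-generation",
        model_id="nemo-model",
    )
    serve.run(handle, name="hf-model")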
- _setup_unique_distributed_parameters()
Configure unique distributed communication parameters for each model replica.
This function sets up unique MASTER_PORT environment variables for each Ray Serve replica to ensure they can initialize their own torch.distributed process groups without port conflicts.
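As an illustration of the idea only (not this module's actual implementation), a replica-local setup might ask the OS for a free port and export it before torch.distributed is initialized:

    # Illustrative sketch: give each replica its own rendezvous port so that
    # torch.distributed.init_process_group() does not collide across replicas.
    import os
    import socket

    def _pick_free_port() -> int:
        # Bind to port 0 to let the OS choose an unused TCP port.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))
            return s.getsockname()[1]

    def setup_unique_distributed_parameters() -> None:
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = str(_pick_free_port())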
- async completions(request: Dict[Any, Any])
Handle text completion requests.
This endpoint processes text completion requests in OpenAI API format and returns generated completions with token usage information.
- Parameters:
request (Dict[Any, Any]) –
Request dictionary containing:
prompts: List of input prompts
max_tokens: Maximum tokens to generate (optional)
temperature: Sampling temperature (optional)
top_k: Top-k sampling parameter (optional)
top_p: Top-p sampling parameter (optional)
model: Model identifier (optional)
- Returns:
Dict containing:
id: Unique completion ID
object: Response type (“text_completion”)
created: Timestamp
model: Model identifier
choices: List of completion choices
usage: Token usage statistics
- Return type:
Dict
- Raises:
HTTPException – If inference fails.
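A hedged client-side example is shown below; the host, port, and OpenAI-style /v1/completions route are assumptions about how the deployment is exposed, not values documented by this class.

    # Example completions request; URL and route prefix are assumptions.
    import requests

    payload = {
        "model": "nemo-model",
        "prompts": ["The capital of France is"],
        "max_tokens": 32,
        "temperature": 0.7,
        "top_p": 0.9,
    }
    resp = requests.post("http://localhost:8000/v1/completions", json=payload)
    print(resp.json()["choices"][0])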
- async chat_completions(request: Dict[Any, Any])
Handle chat completion requests.
This endpoint processes chat completion requests in OpenAI API format and returns generated responses with token usage information.
- Parameters:
request (Dict[Any, Any]) –
Request dictionary containing:
messages: List of chat messages
max_tokens: Maximum tokens to generate (optional)
temperature: Sampling temperature (optional)
top_k: Top-k sampling parameter (optional)
top_p: Top-p sampling parameter (optional)
model: Model identifier (optional)
- Returns:
Dict containing:
id: Unique chat completion ID
object: Response type (“chat.completion”)
created: Timestamp
model: Model identifier
choices: List of chat completion choices
usage: Token usage statistics
- Return type:
Dict
- Raises:
HTTPException – If inference fails.
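The request below sketches a chat call in OpenAI format; as with completions, the URL and route prefix are assumptions, and the snippet simply prints the first returned choice.

    # Example chat-completion request; URL and route prefix are assumptions.
    import requests

    payload = {
        "model": "nemo-model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize Ray Serve in one sentence."},
        ],
        "max_tokens": 64,
        "temperature": 0.2,
    }
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(resp.json()["choices"][0])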
- async list_models()
List available models.
This endpoint returns information about the deployed model in OpenAI API format.
- Returns:
Dict containing:
object: Response type (“list”)
data: List of model information
- Return type:
Dict
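For completeness, a listing request might look like the following; the /v1/models URL is an assumption about the exposed route.

    # Query the model listing; the URL is an assumption.
    import requests

    resp = requests.get("http://localhost:8000/v1/models")
    print(resp.json())   # e.g. {"object": "list", "data": [...]}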
- async health_check()
Check the health status of the service.
This endpoint is used to verify that the service is running and healthy.
- Returns:
Dict containing:
status: Health status (“healthy”)
- Return type:
Dict
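A simple readiness probe against this endpoint could look like the sketch below; the /v1/health path is an assumption and may differ in an actual deployment.

    # Poll the health endpoint; the URL is an assumption.
    import requests

    resp = requests.get("http://localhost:8000/v1/health")
    assert resp.json().get("status") == "healthy"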