nemo_deploy.nlp.hf_deployable_ray#

Module Contents#

Classes#

HFRayDeployable

A Ray Serve compatible wrapper for deploying HuggingFace models.

Data#

API#

nemo_deploy.nlp.hf_deployable_ray.LOGGER = getLogger(...)#
nemo_deploy.nlp.hf_deployable_ray.app = FastAPI(...)#
class nemo_deploy.nlp.hf_deployable_ray.HFRayDeployable(
hf_model_id_path: str,
task: str = 'text-generation',
trust_remote_code: bool = True,
model_id: str = 'nemo-model',
device_map: Optional[str] = None,
max_memory: Optional[str] = None,
use_vllm_backend: bool = False,
)#

A Ray Serve compatible wrapper for deploying HuggingFace models.

This class provides a standardized interface for deploying HuggingFace models in Ray Serve. It supports various NLP tasks and handles model loading, inference, and deployment configurations.

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.

  • task (str) – HuggingFace task type (e.g., “text-generation”). Defaults to “text-generation”.

  • trust_remote_code (bool) – Whether to trust remote code when loading the model. Defaults to True.

  • device_map (str) – Device mapping strategy for model placement. Defaults to “auto”.

  • model_id (str) – Identifier for the model in the API responses. Defaults to “nemo-model”.

Initialization

Initialize the HuggingFace model deployment.

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier.

  • task (str) – HuggingFace task type. Defaults to “text-generation”.

  • trust_remote_code (bool) – Whether to trust remote code. Defaults to True.

  • device_map (str) – Device mapping strategy. Defaults to “auto”.

  • model_id (str) – Model identifier. Defaults to “nemo-model”.

  • max_memory (str) – Maximum memory allocation when using balanced device map.

  • use_vllm_backend (bool, optional) – Whether to use the vLLM backend for deployment. If True, exports the HF checkpoint to vLLM format and uses the vLLM backend for inference. Defaults to False.

Raises:
  • ImportError – If Ray is not installed.

  • Exception – If model initialization fails.
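
For reference, a minimal deployment sketch, assuming HFRayDeployable is exposed as a Ray Serve deployment bound to the FastAPI app above; the replica/GPU options, model path, and route prefix below are illustrative placeholders, not the exact NeMo deployment recipe:

```python
# Hypothetical usage sketch -- options and model path are assumptions.
from ray import serve

from nemo_deploy.nlp.hf_deployable_ray import HFRayDeployable

# Bind the deployment with constructor arguments and start it on a running
# Ray cluster. `options`, `bind`, and `serve.run` are standard Ray Serve APIs.
llm_app = HFRayDeployable.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
).bind(
    hf_model_id_path="meta-llama/Llama-3.2-1B",  # any HF Hub ID or local path
    task="text-generation",
    model_id="nemo-model",
)

serve.run(llm_app, route_prefix="/")  # OpenAI-style endpoints become reachable over HTTP
```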

_setup_unique_distributed_parameters()#

Configure unique distributed communication parameters for each model replica.

This function sets up unique MASTER_PORT environment variables for each Ray Serve replica to ensure they can initialize their own torch.distributed process groups without port conflicts.
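
The exact port-selection logic is internal to this method; the snippet below is only a rough illustration of the general idea, i.e. giving each replica process its own MASTER_PORT so its torch.distributed initialization does not collide with other replicas:

```python
# Rough illustration only -- not the module's actual implementation.
import os
import socket


def _pick_free_port() -> int:
    # Ask the OS for an unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


# Each replica sets a unique MASTER_PORT before initializing torch.distributed,
# so replicas do not contend for the same rendezvous port.
os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = str(_pick_free_port())
```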

async completions(request: Dict[Any, Any])#

Handle text completion requests.

This endpoint processes text completion requests in OpenAI API format and returns generated completions with token usage information.

Parameters:

request (Dict[Any, Any]) –

Request dictionary containing:

  • prompts: List of input prompts

  • max_tokens: Maximum tokens to generate (optional)

  • temperature: Sampling temperature (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • model: Model identifier (optional)

Returns:

  • id: Unique completion ID

  • object: Response type (“text_completion”)

  • created: Timestamp

  • model: Model identifier

  • choices: List of completion choices

  • usage: Token usage statistics

Return type:

Dict

Raises:

HTTPException – If inference fails.
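
As a sketch, a completion request can be sent as plain HTTP; the /v1/completions route, host, and port are assumptions based on the OpenAI-style format described above:

```python
# Illustrative client call -- URL and port are assumptions.
import requests

payload = {
    "prompts": ["The capital of France is"],
    "max_tokens": 32,
    "temperature": 0.7,
    "top_p": 0.9,
    "model": "nemo-model",
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload)
resp.raise_for_status()

result = resp.json()
print(result["choices"][0])   # generated completion text
print(result["usage"])        # token usage statistics
```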

async chat_completions(request: Dict[Any, Any])#

Handle chat completion requests.

This endpoint processes chat completion requests in OpenAI API format and returns generated responses with token usage information.

Parameters:

request (Dict[Any, Any]) –

Request dictionary containing:

  • messages: List of chat messages

  • max_tokens: Maximum tokens to generate (optional)

  • temperature: Sampling temperature (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • model: Model identifier (optional)

Returns:

  • id: Unique chat completion ID

  • object: Response type (“chat.completion”)

  • created: Timestamp

  • model: Model identifier

  • choices: List of chat completion choices

  • usage: Token usage statistics

Return type:

Dict

Raises:

HTTPException – If inference fails.
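
A chat request follows the same pattern; again, the /v1/chat/completions route and port are assumptions:

```python
# Illustrative client call -- URL and port are assumptions.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Ray Serve does in one sentence."},
    ],
    "max_tokens": 64,
    "temperature": 0.2,
    "model": "nemo-model",
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()

print(resp.json()["choices"][0])  # assistant reply plus finish metadata
```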

async list_models()#

List available models.

This endpoint returns information about the deployed model in OpenAI API format.

Returns:

  • object: Response type (“list”)

  • data: List of model information

Return type:

Dict
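
For example (the /v1/models route and port are assumptions):

```python
import requests

# List the deployed model(s) in OpenAI API format.
models = requests.get("http://localhost:8000/v1/models").json()
print(models["data"])
```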

async health_check()#

Check the health status of the service.

This endpoint is used to verify that the service is running and healthy.

Returns:

  • status: Health status (“healthy”)

Return type:

Dict
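
For example (the /v1/health route and port are assumptions; a readiness probe can poll this endpoint):

```python
import requests

# Poll the health endpoint and verify the service reports healthy.
status = requests.get("http://localhost:8000/v1/health").json()
assert status["status"] == "healthy"
```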