nemo_deploy.nlp.hf_deployable_ray#

Module Contents#

Classes#

HFRayDeployable

A Ray Serve compatible wrapper for deploying HuggingFace models.

Data#

API#

nemo_deploy.nlp.hf_deployable_ray.LOGGER = getLogger(...)#
nemo_deploy.nlp.hf_deployable_ray.app = FastAPI(...)#
class nemo_deploy.nlp.hf_deployable_ray.HFRayDeployable(
hf_model_id_path: str,
task: str = 'text-generation',
trust_remote_code: bool = True,
model_id: str = 'nemo-model',
device_map: Optional[str] = None,
max_memory: Optional[str] = None,
use_vllm_backend: bool = False,
)#

A Ray Serve compatible wrapper for deploying HuggingFace models.

This class provides a standardized interface for deploying HuggingFace models in Ray Serve. It supports various NLP tasks and handles model loading, inference, and deployment configurations.

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier. Can be a local path or a model ID from HuggingFace Hub.

  • task (str) – HuggingFace task type (e.g., “text-generation”). Defaults to “text-generation”.

  • trust_remote_code (bool) – Whether to trust remote code when loading the model. Defaults to True.

  • device_map (str) – Device mapping strategy for model placement. Defaults to “auto”.

  • model_id (str) – Identifier for the model in the API responses. Defaults to “nemo-model”.

Initialization

Initialize the HuggingFace model deployment.

Parameters:
  • hf_model_id_path (str) – Path to the HuggingFace model or model identifier.

  • task (str) – HuggingFace task type. Defaults to “text-generation”.

  • trust_remote_code (bool) – Whether to trust remote code. Defaults to True.

  • device_map (str) – Device mapping strategy. Defaults to “auto”.

  • model_id (str) – Model identifier. Defaults to “nemo-model”.

  • max_memory (str) – Maximum memory allocation when using balanced device map.

  • use_vllm_backend (bool, optional) – Whether to use the vLLM backend for deployment. If True, exports the HF checkpoint to vLLM format and uses the vLLM backend for inference. Defaults to False.

Raises:
  • ImportError – If Ray is not installed.

  • Exception – If model initialization fails.
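
For reference, a minimal deployment sketch, assuming HFRayDeployable is exposed as a Ray Serve deployment bound to the FastAPI app above; the replica/GPU options, model path, and route prefix below are illustrative placeholders, not the exact NeMo deployment recipe:

```python
# Hypothetical usage sketch -- options and model path are assumptions.
from ray import serve

from nemo_deploy.nlp.hf_deployable_ray import HFRayDeployable

# Bind the deployment with constructor arguments and start it on a running
# Ray cluster. `options`, `bind`, and `serve.run` are standard Ray Serve APIs.
llm_app = HFRayDeployable.options(
    num_replicas=1,
    ray_actor_options={"num_gpus": 1},
).bind(
    hf_model_id_path="meta-llama/Llama-3.2-1B",  # any HF Hub ID or local path
    task="text-generation",
    model_id="nemo-model",
)

serve.run(llm_app, route_prefix="/")  # OpenAI-style endpoints become reachable over HTTP
```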

_setup_unique_distributed_parameters()#

Configure unique distributed communication parameters for each model replica.

This function sets up unique MASTER_PORT environment variables for each Ray Serve replica to ensure they can initialize their own torch.distributed process groups without port conflicts.
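
The exact port-selection logic is internal to this method; the snippet below is only a rough illustration of the general idea, i.e. giving each replica process its own MASTER_PORT so its torch.distributed initialization does not collide with other replicas:

```python
# Rough illustration only -- not the module's actual implementation.
import os
import socket


def _pick_free_port() -> int:
    # Ask the OS for an unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


# Each replica sets a unique MASTER_PORT before initializing torch.distributed,
# so replicas do not contend for the same rendezvous port.
os.environ["MASTER_ADDR"] = os.environ.get("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = str(_pick_free_port())
```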

async completions(request: Dict[Any, Any])#

Handle text completion requests.

This endpoint processes text completion requests in OpenAI API format and returns generated completions with token usage information.

Parameters:

request (Dict[Any, Any]) –

Request dictionary containing:

  • prompts: List of input prompts

  • max_tokens: Maximum tokens to generate (optional)

  • temperature: Sampling temperature (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • model: Model identifier (optional)

Returns:

  • id: Unique completion ID

  • object: Response type (“text_completion”)

  • created: Timestamp

  • model: Model identifier

  • choices: List of completion choices

  • usage: Token usage statistics

Return type:

Dict

Raises:

HTTPException – If inference fails.
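
As a sketch, a completion request can be sent as plain HTTP; the /v1/completions route, host, and port are assumptions based on the OpenAI-style format described above:

```python
# Illustrative client call -- URL and port are assumptions.
import requests

payload = {
    "prompts": ["The capital of France is"],
    "max_tokens": 32,
    "temperature": 0.7,
    "top_p": 0.9,
    "model": "nemo-model",
}

resp = requests.post("http://localhost:8000/v1/completions", json=payload)
resp.raise_for_status()

result = resp.json()
print(result["choices"][0])   # generated completion text
print(result["usage"])        # token usage statistics
```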

async chat_completions(request: Dict[Any, Any])#

Handle chat completion requests.

This endpoint processes chat completion requests in OpenAI API format and returns generated responses with token usage information.

Parameters:

request (Dict[Any, Any]) –

Request dictionary containing:

  • messages: List of chat messages

  • max_tokens: Maximum tokens to generate (optional)

  • temperature: Sampling temperature (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • model: Model identifier (optional)

Returns:

  • id: Unique chat completion ID

  • object: Response type (“chat.completion”)

  • created: Timestamp

  • model: Model identifier

  • choices: List of chat completion choices

  • usage: Token usage statistics

Return type:

Dict

Raises:

HTTPException – If inference fails.
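
A chat request follows the same pattern; again, the /v1/chat/completions route and port are assumptions:

```python
# Illustrative client call -- URL and port are assumptions.
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Ray Serve does in one sentence."},
    ],
    "max_tokens": 64,
    "temperature": 0.2,
    "model": "nemo-model",
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()

print(resp.json()["choices"][0])  # assistant reply plus finish metadata
```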

async list_models()#

List available models.

This endpoint returns information about the deployed model in OpenAI API format.

Returns:

  • object: Response type (“list”)

  • data: List of model information

Return type:

Dict
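
For example (the /v1/models route and port are assumptions):

```python
import requests

# List the deployed model(s) in OpenAI API format.
models = requests.get("http://localhost:8000/v1/models").json()
print(models["data"])
```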

async health_check()#

Check the health status of the service.

This endpoint is used to verify that the service is running and healthy.

Returns:

  • status: Health status (“healthy”)

Return type:

Dict
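
For example (the /v1/health route and port are assumptions; a readiness probe can poll this endpoint):

```python
import requests

# Poll the health endpoint and verify the service reports healthy.
status = requests.get("http://localhost:8000/v1/health").json()
assert status["status"] == "healthy"
```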