nemo_export.vllm_exporter#

Module Contents#

Classes#

vLLMExporter

vLLMExporter enables deployment of Hugging Face or NeMo2 models using vLLM and Triton.

Data#

API#

nemo_export.vllm_exporter.LOGGER = 'getLogger(...)'#
class nemo_export.vllm_exporter.vLLMExporter#

Bases: nemo_deploy.ITritonDeployable

vLLMExporter enables deployment of Hugging Face or NeMo2 models using vLLM and Triton.

This class wraps vLLM APIs to load a model and make it deployable with Triton Inference Server. It supports exporting NeMo2 checkpoints to Hugging Face format if needed, and then loads the model with vLLM for fast inference.

Example

```python
from nemo_export import vLLMExporter
from nemo_deploy import DeployPyTriton

exporter = vLLMExporter()
exporter.export(model_path_id="/path/to/model/")

server = DeployPyTriton(
    model=exporter,
    triton_model_name="model",
)

server.deploy()
server.serve()
```

Initialization

Initializes the vLLMExporter instance.

This constructor sets up the exporter by initializing model and LoRA model attributes. It also checks for the availability of required dependencies (vLLM, PyTriton, NeMo2) and raises an UnavailableError if any are missing.

export(
model_path_id: str,
tokenizer: str = None,
trust_remote_code: bool = False,
enable_lora: bool = False,
tensor_parallel_size: int = 1,
dtype: str = 'auto',
quantization: str = None,
seed: int = 0,
gpu_memory_utilization: float = 0.9,
swap_space: float = 4,
cpu_offload_gb: float = 0,
enforce_eager: bool = False,
max_seq_len_to_capture: int = 8192,
task: Literal['auto', 'generate', 'embedding'] = 'auto',
)#

Exports a Hugging Face or NeMo2 checkpoint to vLLM and initializes the engine.

Parameters:
  • model_path_id (str) – Model name or path to the checkpoint directory. Can be a Hugging Face or NeMo2 checkpoint.

  • tokenizer (str, optional) – Path to the tokenizer or tokenizer name. Defaults to None.

  • trust_remote_code (bool, optional) – Whether to trust remote code from Hugging Face Hub. Defaults to False.

  • enable_lora (bool, optional) – Whether to enable LoRA support. Defaults to False.

  • tensor_parallel_size (int, optional) – Number of tensor parallel partitions. Defaults to 1.

  • dtype (str, optional) – Data type for model weights. Defaults to “auto”.

  • quantization (str, optional) – Quantization type. Defaults to None.

  • seed (int, optional) – Random seed. Defaults to 0.

  • gpu_memory_utilization (float, optional) – Fraction of GPU memory to use. Defaults to 0.9.

  • swap_space (float, optional) – Amount of swap space (in GB) to use. Defaults to 4.

  • cpu_offload_gb (float, optional) – Amount of CPU offload memory (in GB). Defaults to 0.

  • enforce_eager (bool, optional) – Whether to enforce eager execution. Defaults to False.

  • max_seq_len_to_capture (int, optional) – Maximum sequence length to capture. Defaults to 8192.

  • task (Literal["auto", "generate", "embedding"], optional) – Task type for vLLM. Defaults to “auto”.

Raises:

Exception – If NeMo checkpoint conversion to Hugging Face format fails.

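A minimal sketch of a typical export() call, using only the parameters documented above; the checkpoint path and the chosen values are placeholders, not recommendations.

```python
from nemo_export import vLLMExporter

exporter = vLLMExporter()

# Export a Hugging Face or NeMo2 checkpoint and initialize the vLLM engine.
# The path and settings below are placeholders; adjust them for your setup.
exporter.export(
    model_path_id="/path/to/checkpoint/",
    tensor_parallel_size=2,        # shard the model across 2 GPUs
    dtype="bfloat16",              # explicit weight dtype instead of "auto"
    gpu_memory_utilization=0.85,   # leave headroom for other processes
    enable_lora=True,              # allow LoRA adapters to be added later
    task="generate",               # text generation rather than "auto"
)
```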
add_lora_models(lora_model_name, lora_model)#

Add a LoRA (Low-Rank Adaptation) model to the exporter.

Parameters:
  • lora_model_name (str) – The name or identifier for the LoRA model.

  • lora_model – The LoRA model object to be added.

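A hedged sketch of registering a LoRA adapter after exporting with enable_lora=True. The docstring above only describes lora_model as "the LoRA model object to be added", so passing a checkpoint path here is an assumption; check the implementation for the exact type it expects.

```python
# Assumes export() was called with enable_lora=True.
# The value for lora_model is a placeholder; the expected type
# (path, config object, etc.) is an assumption here.
exporter.add_lora_models(
    lora_model_name="my-adapter",            # identifier referenced later via lora_model_name
    lora_model="/path/to/lora/checkpoint/",  # placeholder
)
```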
property get_triton_input#

Returns the expected Triton model input signature for vLLMExporter.

Returns:

A tuple of Tensor objects describing the input fields:

  • prompts (np.bytes_): Input prompt strings.

  • max_tokens (np.int_, optional): Maximum number of tokens to generate.

  • min_tokens (np.int_, optional): Minimum number of tokens to generate.

  • top_k (np.int_, optional): Top-k sampling parameter.

  • top_p (np.single, optional): Top-p (nucleus) sampling parameter.

  • temperature (np.single, optional): Sampling temperature.

  • seed (np.int_, optional): Random seed for generation.

  • n_log_probs (np.int_, optional): Number of log probabilities to return for generated tokens.

  • n_prompt_log_probs (np.int_, optional): Number of log probabilities to return for prompt tokens.

Return type:

tuple

property get_triton_output#

Returns the expected Triton model output signature for vLLMExporter.

Returns:

A tuple of Tensor objects describing the output fields:

  • sentences (np.bytes_): Generated text.

  • log_probs (np.bytes_): Log probabilities for generated tokens.

  • prompt_log_probs (np.bytes_): Log probabilities for prompt tokens.

Return type:

tuple

triton_infer_fn(**inputs: numpy.ndarray)#

Triton inference function for vLLMExporter.

This function processes input prompts and generates text using vLLM. It supports optional parameters for maximum tokens, minimum tokens, log probabilities, and random seed.

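In normal operation PyTriton calls this function with batched numpy arrays matching the signature described by get_triton_input. The direct call below is only an illustration; the exact array shapes expected when served are an assumption.

```python
import numpy as np

# Illustrative direct call; in practice Triton supplies these arrays.
outputs = exporter.triton_infer_fn(
    prompts=np.array([b"What is deep learning?"]),
    max_tokens=np.array([64]),
    temperature=np.array([0.7], dtype=np.single),
    top_p=np.array([0.9], dtype=np.single),
)
print(outputs["sentences"])
```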
_infer_fn(prompts, inputs)#

Shared helper function to prepare inference inputs and execute forward pass.

Parameters:
  • prompts – List of input prompts

  • inputs – Dictionary of input parameters

Returns:

Dictionary containing generated text and optional log probabilities

Return type:

output_dict

post_process_logprobs_to_OAI(
output_dict: Dict[str, Any],
echo: bool = False,
n_top_logprobs: int = 0,
) → Dict[str, Any]#

Post-process log probabilities (log_probs and prompt_log_probs from vLLM's generate output) into the OpenAI (OAI) API format.

This method:

  1. Extracts log probability values for actual tokens (not full dicts)

  2. Creates top_logprobs containing n_top_logprobs number of top logprobs

  3. Excludes the actual/chosen token if it’s extra (not in top N)

  4. If echo is True, merges prompt token logprobs with generated token logprobs

Parameters:
  • output_dict (Dict[str, Any]) –

    Output dictionary from forward() containing:

    • log_probs: Raw log probabilities (JSON strings in numpy array)

    • prompt_log_probs: Raw prompt log probabilities (if echo is True)

    • token_ids: Generated token IDs

    • prompt_token_ids: Prompt token IDs (if echo is True)

    • sentences: Generated text

  • echo (bool) – Whether to include prompt token logprobs

  • n_top_logprobs (int) – Number of top logprobs to return per token

Returns:

Modified output_dict with processed log_probs and top_logprobs:

  • log_probs: List of lists of float values for the actual tokens.

  • top_logprobs: List of lists of dicts with the top N tokens and their logprobs.

Return type:

Dict[str, Any]

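A sketch of the intended flow, assuming log probabilities were requested in the forward() call; the prompt and values are illustrative.

```python
# Request top-5 logprobs for both prompt and generated tokens,
# then reshape them into the OpenAI-style layout.
raw = exporter.forward(
    input_texts=["Hello, world"],
    max_tokens=8,
    n_log_probs=5,
    n_prompt_log_probs=5,
)
oai_style = exporter.post_process_logprobs_to_OAI(raw, echo=True, n_top_logprobs=5)
# oai_style["log_probs"]    -> per-token float logprobs for the chosen tokens
# oai_style["top_logprobs"] -> per-token dicts of the top-5 candidate tokens
```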
ray_infer_fn(
inputs: Dict[str, Any],
) → Dict[str, Any]#

Ray inference function that processes input dictionary and returns output without byte casting.

Parameters:

inputs (Dict[str, Any]) –

Input dictionary containing:

  • prompts: List of input prompts

  • max_tokens: Maximum number of tokens to generate (optional)

  • min_tokens: Minimum number of tokens to generate (optional)

  • top_k: Top-k sampling parameter (optional)

  • top_p: Top-p sampling parameter (optional)

  • temperature: Sampling temperature (optional)

  • seed: Random seed for generation (optional)

  • lora_model_name: Name of the LoRA model to use for generation (optional)

  • compute_logprob: Whether to compute log probabilities (optional)

  • n_top_logprobs: Number of top log probabilities to return (optional)

  • echo: Whether to include prompt token log probabilities (optional)

Returns:

Output dictionary containing:

  • sentences: List of generated text outputs.

  • log_probs: List of lists of float values containing log probabilities for the actual tokens. If echo is True, prompt token logprobs come first, followed by generated token logprobs; otherwise only generated token logprobs are included.

  • top_logprobs: List of lists of dictionaries, where each dict maps the top n_top_logprobs token strings to their logprob values at each position. If echo is True, prompt token top_logprobs come first, followed by generated token top_logprobs; otherwise only generated token top_logprobs are included. Format: [[{" token1": -0.1, " token2": -2.5}, …], …]

  • token_ids: Token IDs for generated tokens.

  • prompt_token_ids: Token IDs for prompt tokens (if echo is True).

Return type:

Dict[str, Any]

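A sketch of calling ray_infer_fn with a plain Python dictionary; the prompt and sampling values are illustrative.

```python
result = exporter.ray_infer_fn({
    "prompts": ["Summarize the plot of Hamlet in one sentence."],
    "max_tokens": 64,
    "temperature": 0.2,
    "compute_logprob": True,
    "n_top_logprobs": 3,
    "echo": False,
})
print(result["sentences"][0])      # generated text
print(result["log_probs"][0][:5])  # logprobs of the first few generated tokens
```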
_dict_to_str(messages)#

Serializes dict to str.

forward(
input_texts: List[str],
max_tokens: int = 16,
min_tokens: int = 0,
top_k: int = 1,
top_p: float = 0.1,
temperature: float = 1.0,
n_log_probs: int = None,
n_prompt_log_probs: int = None,
seed: int = None,
lora_model_name: str = None,
)#

Generate text completions for a list of input prompts using the vLLM model.

Parameters:
  • input_texts (List[str]) – List of input prompt strings.

  • max_tokens (int, optional) – Maximum number of tokens to generate for each prompt. Defaults to 16.

  • min_tokens (int, optional) – Minimum number of tokens to generate for each prompt. Defaults to 0.

  • top_k (int, optional) – The number of highest probability vocabulary tokens to keep for top-k-filtering. Defaults to 1.

  • top_p (float, optional) – If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation. Defaults to 0.1.

  • temperature (float, optional) – Sampling temperature. Defaults to 1.0.

  • n_log_probs (int, optional) – Number of log probabilities to return for generated tokens. Defaults to None.

  • n_prompt_log_probs (int, optional) – Number of log probabilities to return for prompt tokens. Defaults to None.

  • seed (int, optional) – Random seed for generation. Defaults to None.

  • lora_model_name (str, optional) – Name of the LoRA model to use for generation. Defaults to None.

Returns:

A dictionary containing:

  • sentences (List[str]): Generated text completions.

  • token_ids (List[List[int]]): Token IDs for the generated tokens.

  • log_probs (np.ndarray, optional): Top log probabilities for generated tokens if n_log_probs > 0.

  • prompt_log_probs (np.ndarray, optional): Top log probabilities for prompt tokens if n_prompt_log_probs > 0.

  • prompt_token_ids (List[List[int]], optional): Token IDs for prompt tokens at positions where prompt_logprobs is not None.

Return type:

dict
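A sketch of a direct forward() call for batch generation; the prompts and sampling parameters are illustrative.

```python
out = exporter.forward(
    input_texts=[
        "Write a haiku about GPUs.",
        "Explain tensor parallelism in one sentence.",
    ],
    max_tokens=32,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    seed=1234,
)
for sentence in out["sentences"]:
    print(sentence)
```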