nemo_export.vllm_exporter
#
Module Contents#
Classes#
vLLMExporter enables deployment of Hugging Face or NeMo2 models using vLLM and Triton.
Data#
API#
- nemo_export.vllm_exporter.LOGGER = 'getLogger(...)'#
- class nemo_export.vllm_exporter.vLLMExporter#
Bases:
nemo_deploy.ITritonDeployable
vLLMExporter enables deployment of Hugging Face or NeMo2 models using vLLM and Triton.
This class wraps vLLM APIs to load a model and make it deployable with Triton Inference Server. It supports exporting NeMo2 checkpoints to Hugging Face format if needed, and then loads the model with vLLM for fast inference.
Example

```python
from nemo_export import vLLMExporter
from nemo_deploy import DeployPyTriton

exporter = vLLMExporter()
exporter.export(model_path_id="/path/to/model/")

server = DeployPyTriton(
    model=exporter,
    triton_model_name="model",
)
server.deploy()
server.serve()
```
Initialization
Initializes the vLLMExporter instance.
This constructor sets up the exporter by initializing model and LoRA model attributes. It also checks for the availability of required dependencies (vLLM, PyTriton, NeMo2) and raises an UnavailableError if any are missing.
- export(
- model_path_id: str,
- tokenizer: str = None,
- trust_remote_code: bool = False,
- enable_lora: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- quantization: str = None,
- seed: int = 0,
- gpu_memory_utilization: float = 0.9,
- swap_space: float = 4,
- cpu_offload_gb: float = 0,
- enforce_eager: bool = False,
- max_seq_len_to_capture: int = 8192,
- task: Literal['auto', 'generate', 'embedding'] = 'auto',
- )#
Exports a Hugging Face or NeMo2 checkpoint to vLLM and initializes the engine.
- Parameters:
model_path_id (str) – Model name or path to the checkpoint directory. Can be a Hugging Face or NeMo2 checkpoint.
tokenizer (str, optional) – Path to the tokenizer or tokenizer name. Defaults to None.
trust_remote_code (bool, optional) – Whether to trust remote code from the Hugging Face Hub. Defaults to False.
enable_lora (bool, optional) – Whether to enable LoRA support. Defaults to False.
tensor_parallel_size (int, optional) – Number of tensor parallel partitions. Defaults to 1.
dtype (str, optional) – Data type for model weights. Defaults to "auto".
quantization (str, optional) – Quantization type. Defaults to None.
seed (int, optional) – Random seed. Defaults to 0.
gpu_memory_utilization (float, optional) – Fraction of GPU memory to use. Defaults to 0.9.
swap_space (float, optional) – Amount of swap space (in GB) to use. Defaults to 4.
cpu_offload_gb (float, optional) – Amount of CPU offload memory (in GB). Defaults to 0.
enforce_eager (bool, optional) – Whether to enforce eager execution. Defaults to False.
max_seq_len_to_capture (int, optional) – Maximum sequence length to capture. Defaults to 8192.
task (Literal["auto", "generate", "embedding"], optional) – Task type for vLLM. Defaults to "auto".
- Raises:
Exception – If NeMo checkpoint conversion to Hugging Face format fails.
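As a sketch of a typical call, the keyword arguments below mirror the `export` signature above. The model path is a placeholder and the values shown are illustrative, not recommendations; actually running `export` requires vLLM and a GPU, so that call is left commented out.

```python
# Keyword arguments mirroring the export() signature documented above.
# The path is a placeholder; values are illustrative only.
export_kwargs = dict(
    model_path_id="/path/to/model/",
    tensor_parallel_size=2,        # split weights across 2 GPUs
    dtype="bfloat16",
    gpu_memory_utilization=0.9,
    task="generate",
)

# Requires vLLM and GPU hardware, so not executed here:
# exporter = vLLMExporter()
# exporter.export(**export_kwargs)
```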
- add_lora_models(lora_model_name, lora_model)#
Add a LoRA (Low-Rank Adaptation) model to the exporter.
- Parameters:
lora_model_name (str) – The name or identifier for the LoRA model.
lora_model – The LoRA model object to be added.
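A minimal sketch of wiring a LoRA adapter into generation, assuming the model was exported with `enable_lora=True`. The adapter name and path below are hypothetical placeholders, and the exporter calls require vLLM, so they are commented out.

```python
lora_model_name = "my-adapter"   # hypothetical identifier used at inference time
lora_model = "/path/to/lora/"    # hypothetical adapter location (placeholder)

# Requires vLLM and a model exported with enable_lora=True, so not executed here:
# exporter.add_lora_models(lora_model_name=lora_model_name, lora_model=lora_model)
# exporter.forward(["Hello"], lora_model_name=lora_model_name)
```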
- property get_triton_input#
Returns the expected Triton model input signature for vLLMExporter.
- Returns:
A tuple of Tensor objects describing the input fields:

- prompts (np.bytes_): Input prompt strings.
- max_tokens (np.int_, optional): Maximum number of tokens to generate.
- min_tokens (np.int_, optional): Minimum number of tokens to generate.
- top_k (np.int_, optional): Top-K sampling parameter.
- top_p (np.single, optional): Top-P (nucleus) sampling parameter.
- temperature (np.single, optional): Sampling temperature.
- seed (np.int_, optional): Random seed for generation.
- n_log_probs (np.int_, optional): Number of log probabilities to return for generated tokens.
- n_prompt_log_probs (np.int_, optional): Number of log probabilities to return for prompt tokens.
- Return type:
tuple
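To make the signature concrete, the snippet below builds a numpy batch using the dtypes listed above and then decodes the prompt field back into Python strings. The shapes (one row per request, one column per value) are an assumption for illustration, not a guarantee of the server's exact layout.

```python
import numpy as np

# One-request batch using the dtypes from the input signature above.
inputs = {
    "prompts": np.array([[b"What is vLLM?"]], dtype=np.bytes_),
    "max_tokens": np.array([[64]], dtype=np.int_),
    "top_p": np.array([[0.9]], dtype=np.single),
    "temperature": np.array([[0.7]], dtype=np.single),
}

# Decode the UTF-8 byte strings back into Python strings.
prompts = [row[0].decode("utf-8") for row in inputs["prompts"]]
```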
- property get_triton_output#
Returns the expected Triton model output signature for vLLMExporter.
- Returns:
A tuple of Tensor objects describing the output fields:

- sentences (np.bytes_): Generated text.
- log_probs (np.bytes_): Log probabilities for generated tokens.
- prompt_log_probs (np.bytes_): Log probabilities for prompt tokens.
- Return type:
tuple
- triton_infer_fn(**inputs: numpy.ndarray)#
Triton inference function for vLLMExporter.
This function processes input prompts and generates text using vLLM. It supports optional parameters for maximum tokens, minimum tokens, log probabilities, and random seed.
- _infer_fn(prompts, inputs)#
Shared helper function to prepare inference inputs and execute forward pass.
- Parameters:
prompts – List of input prompts.
inputs – Dictionary of input parameters.
- Returns:
Dictionary containing generated text and optional log probabilities
- Return type:
output_dict
- ray_infer_fn(
- inputs: Dict[str, Any],
- )#
Ray inference function that processes input dictionary and returns output without byte casting.
- Parameters:
inputs (Dict[str, Any]) –
Input dictionary containing:
prompts: List of input prompts
max_tokens: Maximum number of tokens to generate (optional)
min_tokens: Minimum number of tokens to generate (optional)
top_k: Top-k sampling parameter (optional)
top_p: Top-p sampling parameter (optional)
temperature: Sampling temperature (optional)
seed: Random seed for generation (optional)
n_log_probs: Number of log probabilities to return for generated tokens (optional)
n_prompt_log_probs: Number of log probabilities to return for prompt tokens (optional)
lora_model_name: Name of the LoRA model to use for generation (optional)
- Returns:
Output dictionary containing:

- sentences: List of generated text outputs.
- log_probs: Log probabilities for generated tokens (if requested).
- prompt_log_probs: Log probabilities for prompt tokens (if requested).
- Return type:
Dict[str, Any]
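Because every sampling key is optional, a caller may want to normalize a request before handing it to `ray_infer_fn`. The helper below is hypothetical (not part of the exporter); its default values mirror the documented defaults of `forward`, and the actual `ray_infer_fn` call is commented out since it requires vLLM.

```python
from typing import Any, Dict

# Hypothetical helper: fill optional sampling keys with forward()'s documented defaults.
def with_sampling_defaults(inputs: Dict[str, Any]) -> Dict[str, Any]:
    defaults: Dict[str, Any] = {
        "max_tokens": 16,
        "min_tokens": 0,
        "top_k": 1,
        "top_p": 0.1,
        "temperature": 1.0,
    }
    return {**defaults, **inputs}

request = with_sampling_defaults({"prompts": ["Hello"], "temperature": 0.2})
# request now carries explicit values for every sampling key:
# exporter.ray_infer_fn(request)  # requires vLLM, so not executed here
```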
- _dict_to_str(messages)#
Serializes a dict to a string.
- forward(
- input_texts: List[str],
- max_tokens: int = 16,
- min_tokens: int = 0,
- top_k: int = 1,
- top_p: float = 0.1,
- temperature: float = 1.0,
- n_log_probs: int = None,
- n_prompt_log_probs: int = None,
- seed: int = None,
- lora_model_name: str = None,
- )#
Generate text completions for a list of input prompts using the vLLM model.
- Parameters:
input_texts (List[str]) – List of input prompt strings.
max_tokens (int, optional) – Maximum number of tokens to generate for each prompt. Defaults to 16.
min_tokens (int, optional) – Minimum number of tokens to generate for each prompt. Defaults to 0.
top_k (int, optional) – The number of highest-probability vocabulary tokens to keep for top-k filtering. Defaults to 1.
top_p (float, optional) – If set to a float < 1, only the most probable tokens whose probabilities add up to top_p or higher are kept for generation. Defaults to 0.1.
temperature (float, optional) – Sampling temperature. Defaults to 1.0.
n_log_probs (int, optional) – Number of log probabilities to return for generated tokens. Defaults to None.
n_prompt_log_probs (int, optional) – Number of log probabilities to return for prompt tokens. Defaults to None.
seed (int, optional) – Random seed for generation. Defaults to None.
lora_model_name (str, optional) – Name of the LoRA model to use for generation. Defaults to None.
- Returns:
Generated text completions and optionally log probabilities, depending on the input arguments.
- Return type:
dict or list