nemo_export.vllm_exporter#
Module Contents#
Classes#
vLLMExporter – Implements conversion of a NeMo checkpoint into a vLLM-compatible format, loads the model in vLLM, and binds that model to a Triton server. |
Functions#
noop_decorator – Used in place of the pytriton batch decorator when pytriton is not available. |
Data#
API#
- nemo_export.vllm_exporter.LOGGER = 'getLogger(...)'#
- nemo_export.vllm_exporter.noop_decorator(func)#
Used in place of the pytriton batch decorator when pytriton is not available.
- nemo_export.vllm_exporter.batch = None#
- nemo_export.vllm_exporter.use_pytriton = True#
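The module-level attributes above (LOGGER, noop_decorator, batch, use_pytriton) suggest a pytriton import fallback. The following is a minimal sketch of that pattern under the assumption that batch comes from pytriton.decorators when the package is installed; it is illustrative and not the module source.

    import logging

    LOGGER = logging.getLogger(__name__)

    def noop_decorator(func):
        """Used in place of the pytriton batch decorator when pytriton is not available."""
        return func

    use_pytriton = True
    try:
        # pytriton's real batching decorator, if the package is installed.
        from pytriton.decorators import batch
    except ImportError:
        LOGGER.warning("pytriton is not available; falling back to a no-op decorator.")
        use_pytriton = False
        batch = noop_decorator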
- class nemo_export.vllm_exporter.vLLMExporter#
Bases: nemo_deploy.ITritonDeployable

The vLLMExporter class implements conversion of a NeMo checkpoint into a vLLM-compatible format, loading the model in vLLM, and binding that model to a Triton server.
Example

    from nemo_export.vllm_exporter import vLLMExporter
    from nemo_deploy import DeployPyTriton

    exporter = vLLMExporter()
    exporter.export(
        nemo_checkpoint="/path/to/checkpoint.nemo",
        model_dir="/path/to/temp_dir",
        model_type="llama",
    )

    server = DeployPyTriton(
        model=exporter,
        triton_model_name="LLAMA",
    )
    server.deploy()
    server.serve()
Initialization
- export(
- nemo_checkpoint: str,
- model_dir: str,
- model_type: Optional[str] = 'auto',
- device: str = 'auto',
- tensor_parallel_size: int = 1,
- pipeline_parallel_size: int = 1,
- max_model_len: Optional[int] = None,
- lora_checkpoints: Optional[List[str]] = None,
- dtype: str = 'auto',
- seed: int = 0,
- log_stats: bool = True,
- weight_storage: str = 'auto',
- gpu_memory_utilization: float = 0.9,
- quantization: Optional[str] = None,
- delete_existing_files: bool = True,
)#
Exports the Nemo checkpoint to vLLM and initializes the engine.
- Parameters:
nemo_checkpoint (str) – path to the nemo checkpoint.
model_dir (str) – path to a temporary directory to store weights and the tokenizer model. The temp dir may persist between subsequent export operations, in which case converted weights may be reused to speed up the export.
model_type (str) – type of the model, such as “llama”, “mistral”, “mixtral”. Needs to be compatible with transformers.AutoConfig. If “auto” or None, the model type is inferred from the given NeMo checkpoint.
device (str) – type of the device to use by the vLLM engine. Supported values are “auto”, “cuda”, “cpu”, “neuron”.
tensor_parallel_size (int) – tensor parallelism.
pipeline_parallel_size (int) – pipeline parallelism. Values over 1 are not currently supported by vLLM.
max_model_len (int) – model context length.
lora_checkpoints (List[str]) – paths to LoRA checkpoints.
dtype (str) – data type for model weights and activations. Possible choices: auto, half, float16, bfloat16, float, float32. “auto” will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
seed (int) – random seed value.
log_stats (bool) – enables logging inference performance statistics by vLLM.
weight_storage (str) – controls how converted weights are stored: “file” - always write weights into a file inside ‘model_dir’; “memory” - always do an in-memory conversion; “cache” - reuse existing files if they are newer than the nemo checkpoint; “auto” - use “cache” for multi-GPU runs and “memory” for single-GPU runs.
gpu_memory_utilization (float) – The fraction of GPU memory to be used for the model executor, which can range from 0 to 1.
quantization (str) – quantization method that is used to quantize the model weights. Possible choices are None (weights not quantized, default) and “fp8”.
delete_existing_files (bool) – if True, deletes all the files in model_dir.
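To illustrate how several of these parameters combine, here is a hedged sketch of an export call. The checkpoint path, model directory, and parameter values are placeholders, not recommendations.

    from nemo_export.vllm_exporter import vLLMExporter

    exporter = vLLMExporter()
    exporter.export(
        nemo_checkpoint="/path/to/model.nemo",   # placeholder path
        model_dir="/path/to/vllm_temp_dir",      # converted weights and tokenizer are stored here
        model_type="auto",                       # infer the model type from the checkpoint
        tensor_parallel_size=2,                  # shard the model across two GPUs
        max_model_len=4096,                      # cap the context length
        dtype="bfloat16",
        weight_storage="cache",                  # reuse converted weights if still newer than the checkpoint
        gpu_memory_utilization=0.9,
    )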
- _prepare_lora_checkpoints(
- model_dir: str,
- lora_checkpoints: Optional[List[str]],
- dtype: str,
)#
- _add_request_to_engine(
- prompt: str,
- max_output_len: int,
- temperature: float = 1.0,
- top_k: int = 1,
- top_p: float = 0.0,
- lora_uid: Optional[int] = None,
)#
- _forward_regular(request_ids: List[str])#
- _forward_streaming(request_ids: List[str])#
- _add_triton_request_to_engine(
- inputs: numpy.ndarray,
- index: int,
)#
- property get_triton_input#
- property get_triton_output#
- triton_infer_fn(**inputs: numpy.ndarray)#
Performs inference on a batch of prompts.
- triton_infer_fn_streaming(**inputs: numpy.ndarray)#
Performs streaming inference on a batch of prompts.
- forward(
- input_texts: List[str],
- max_output_len: int = 64,
- top_k: int = 1,
- top_p: float = 0.0,
- temperature: float = 1.0,
- stop_words_list: Optional[List[str]] = None,
- bad_words_list: Optional[List[str]] = None,
- no_repeat_ngram_size: Optional[int] = None,
- task_ids: Optional[List[str]] = None,
- lora_uids: Optional[List[str]] = None,
- prompt_embeddings_table=None,
- prompt_embeddings_checkpoint_path: Optional[str] = None,
- streaming: bool = False,
- output_log_probs: bool = False,
- output_generation_logits: bool = False,
- output_context_logits: bool = False,
)#
The forward function performs LLM evaluation on the provided list of prompts, with the remaining parameters shared across all prompts, and returns the generated texts.
If ‘streaming’ is True, the output texts are returned incrementally through a generator, with one token appended to each output at a time. If ‘streaming’ is False, the final output texts are returned as a single list of responses.
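Below is a hedged sketch of calling forward in both modes, assuming an exporter created and exported as in the class example above. The prompts are placeholders, and the exact structure of the returned outputs is not specified here, so the prints are illustrative only.

    # Non-streaming: the final texts for all prompts are returned at once.
    outputs = exporter.forward(
        input_texts=["What is vLLM?", "Summarize tensor parallelism."],
        max_output_len=128,
        top_k=1,
        temperature=1.0,
    )
    print(outputs)

    # Streaming: iterate over the generator; each step yields the partially
    # generated outputs with roughly one more token appended per prompt.
    for partial_outputs in exporter.forward(
        input_texts=["What is vLLM?"],
        max_output_len=128,
        streaming=True,
    ):
        print(partial_outputs)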