nemo_deploy.deploy_pytriton#
Module Contents#
Classes#
| DeployPyTriton | Deploys any model that implements the ITritonDeployable interface in nemo_deploy to the Triton Inference Server. |
Data#
API#
- nemo_deploy.deploy_pytriton.LOGGER = 'getLogger(...)'#
- class nemo_deploy.deploy_pytriton.DeployPyTriton(
- triton_model_name: str,
- triton_model_version: int = 1,
- model=None,
- max_batch_size: int = 128,
- http_port: int = 8000,
- grpc_port: int = 8001,
- address='0.0.0.0',
- allow_grpc=True,
- allow_http=True,
- streaming=False,
- pytriton_log_verbose=0,
)#
Bases: nemo_deploy.deploy_base.DeployBase
Deploys any model that implements the ITritonDeployable interface in nemo_deploy to the Triton Inference Server.
Example
```python
from nemo_deploy import DeployPyTriton, NemoQueryLLM
from nemo_export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="model_name", http_port=8000)
nm.deploy()
nm.run()
nq = NemoQueryLLM(url="localhost", model_name="model_name")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(prompts=prompts, max_output_len=100)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")

prompts = [
    "Give me some info about Paris",
    "Do you think London is a good city to visit?",
    "What do you think about Rome?",
]
output = nq.query_llm(prompts=prompts, max_output_len=250)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")
```
Initialization
A NeMo checkpoint or an exported model is expected for serving on the Triton Inference Server; a minimal construction sketch follows the parameter list below.
- Parameters:
triton_model_name (str) – Name for the service.
triton_model_version (int) – Version for the service.
checkpoint_path (str) – Path to the NeMo checkpoint file.
model (ITritonDeployable) – A model that implements ITritonDeployable (from nemo_deploy import ITritonDeployable).
max_batch_size (int) – Maximum batch size.
http_port (int) – HTTP port for the Triton server.
grpc_port (int) – gRPC port for the Triton server.
address (str) – HTTP address for the Triton server to bind to.
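For illustration, a minimal construction sketch that maps these parameters onto the constructor, assuming a previously exported model object (for example the trt_llm_exporter from the example above):

```python
from nemo_deploy import DeployPyTriton

# Assumes trt_llm_exporter is an ITritonDeployable object created earlier,
# e.g. the TensorRT-LLM exporter from the example above.
nm = DeployPyTriton(
    model=trt_llm_exporter,          # ITritonDeployable instance to serve
    triton_model_name="model_name",  # name the service is registered under
    triton_model_version=1,
    max_batch_size=128,
    http_port=8000,                  # HTTP endpoint of the Triton server
    grpc_port=8001,                  # gRPC endpoint of the Triton server
    address="0.0.0.0",
    allow_http=True,
    allow_grpc=True,
    streaming=False,
)
```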
- deploy()#
Deploys the model to the Triton Inference Server.
- serve()#
Starts serving the model and waits for incoming requests.
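As a minimal sketch (assuming the nm object constructed above), serve() differs from run() in that it keeps the calling process busy while it waits for requests:

```python
nm.deploy()   # bind the model to the Triton Inference Server
nm.serve()    # waits for incoming requests until the server is shut down
```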
- run()#
Starts serving the model asynchronously.
- stop()#
Stops serving the model.
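A sketch of the non-blocking lifecycle, assuming the nm and nq objects from the example above:

```python
nm.deploy()   # set up the Triton Inference Server with the model
nm.run()      # start serving asynchronously; the call returns immediately
output = nq.query_llm(prompts=["hello"], max_output_len=32)  # query while the server runs
nm.stop()     # stop serving the model
```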