nemo_deploy.deploy_pytriton#

Module Contents#

Classes#

DeployPyTriton

Deploys any model that implements the ITritonDeployable interface in nemo_deploy to Triton Inference Server.

Data#

API#

nemo_deploy.deploy_pytriton.LOGGER = getLogger(...)#
class nemo_deploy.deploy_pytriton.DeployPyTriton(
    triton_model_name: str,
    triton_model_version: int = 1,
    model=None,
    max_batch_size: int = 128,
    http_port: int = 8000,
    grpc_port: int = 8001,
    address='0.0.0.0',
    allow_grpc=True,
    allow_http=True,
    streaming=False,
    pytriton_log_verbose=0,
)#

Bases: nemo_deploy.deploy_base.DeployBase

Deploys any model that implements the ITritonDeployable interface in nemo_deploy to Triton Inference Server.

Example

```python
from nemo_deploy import DeployPyTriton, NemoQueryLLM
from nemo_export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="model_name", http_port=8000)
nm.deploy()
nm.run()
nq = NemoQueryLLM(url="localhost", model_name="model_name")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(prompts=prompts, max_output_len=100)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")

prompts = [
    "Give me some info about Paris",
    "Do you think London is a good city to visit?",
    "What do you think about Rome?",
]
output = nq.query_llm(prompts=prompts, max_output_len=250)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")
```

Initialization

A NeMo checkpoint or model is expected for serving on Triton Inference Server.

Parameters:
  • triton_model_name (str) – Name for the service

  • triton_model_version (int) – Version for the service

  • checkpoint_path (str) – path of the NeMo checkpoint file

  • model (ITritonDeployable) – A model that implements the ITritonDeployable interface (from nemo_deploy import ITritonDeployable); see the construction sketch after this list.

  • max_batch_size (int) – maximum batch size

  • port (int) – port for the Triton server

  • address (str) – HTTP address for the Triton server to bind to.
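
The networking and batching options above map directly onto the constructor signature. Below is a minimal construction sketch, assuming the model object is an ITritonDeployable implementation such as the TensorRTLLM exporter from the Example above; the values shown are illustrative, not recommendations.

```python
from nemo_deploy import DeployPyTriton
from nemo_export.tensorrt_llm import TensorRTLLM

# Assumed: an exporter prepared as in the Example above (model_dir path is illustrative).
trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")

nm = DeployPyTriton(
    triton_model_name="model_name",   # name clients use to address the service
    triton_model_version=1,
    model=trt_llm_exporter,           # any ITritonDeployable implementation
    max_batch_size=64,
    http_port=8000,
    grpc_port=8001,
    address="0.0.0.0",
    allow_grpc=True,
    allow_http=True,
    streaming=False,
    pytriton_log_verbose=0,
)
```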

deploy()#

Deploys the model to Triton Inference Server.

serve()#

Starts serving the model and waits for requests.

run()#

Starts serving the model asynchronously.

stop()#

Stops serving the model.
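
Taken together, these methods form the serving lifecycle. A minimal sketch, assuming nm is the DeployPyTriton instance from the Example above:

```python
# Asynchronous flow: run() returns control so queries can be issued in-process.
nm.deploy()   # deploy the model to the Triton Inference Server
nm.run()      # start serving without blocking the caller
# ... send requests, e.g. via NemoQueryLLM as shown in the Example ...
nm.stop()     # stop serving the model

# Blocking alternative: serve() starts serving and waits for requests.
# nm.deploy()
# nm.serve()
```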