nemo_deploy.deploy_pytriton#
Module Contents#
Classes#
| DeployPyTriton | Deploys any model that implements the ITritonDeployable interface in nemo_deploy to the Triton Inference Server. |
Data#
API#
- nemo_deploy.deploy_pytriton.LOGGER = 'getLogger(...)'#
- class nemo_deploy.deploy_pytriton.DeployPyTriton(
- triton_model_name: str,
- triton_model_version: int = 1,
- model=None,
- max_batch_size: int = 128,
- http_port: int = 8000,
- grpc_port: int = 8001,
- address='0.0.0.0',
- allow_grpc=True,
- allow_http=True,
- streaming=False,
- pytriton_log_verbose=0,
)#
Bases: nemo_deploy.deploy_base.DeployBase
Deploys any model that implements the ITritonDeployable interface in nemo_deploy to the Triton Inference Server.
Example
```python
from nemo_deploy import DeployPyTriton, NemoQueryLLM
from nemo_export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/path/for/model/files")
trt_llm_exporter.export(
    nemo_checkpoint_path="/path/for/nemo/checkpoint",
    model_type="llama",
    tensor_parallelism_size=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="model_name", http_port=8000)
nm.deploy()
nm.run()
nq = NemoQueryLLM(url="localhost", model_name="model_name")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(prompts=prompts, max_output_len=100)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")

prompts = [
    "Give me some info about Paris",
    "Do you think London is a good city to visit?",
    "What do you think about Rome?",
]
output = nq.query_llm(prompts=prompts, max_output_len=250)
print("prompts: ", prompts)
print("")
print("output: ", output)
print("")
```
Initialization
A NeMo checkpoint or an exported model is expected for serving on the Triton Inference Server; a minimal construction sketch follows the parameter list below.
- Parameters:
triton_model_name (str) – Name for the service.
triton_model_version (int) – Version for the service.
checkpoint_path (str) – Path to the NeMo checkpoint file.
model (ITritonDeployable) – A model that implements ITritonDeployable (from nemo_deploy import ITritonDeployable).
max_batch_size (int) – Maximum batch size.
http_port (int) – HTTP port for the Triton server.
grpc_port (int) – gRPC port for the Triton server.
address (str) – HTTP address for the Triton server to bind to.
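For illustration, a minimal construction sketch that maps these parameters onto the constructor, assuming a previously exported model object (for example the trt_llm_exporter from the example above):

```python
from nemo_deploy import DeployPyTriton

# Assumes trt_llm_exporter is an ITritonDeployable object created earlier,
# e.g. the TensorRT-LLM exporter from the example above.
nm = DeployPyTriton(
    model=trt_llm_exporter,          # ITritonDeployable instance to serve
    triton_model_name="model_name",  # name the service is registered under
    triton_model_version=1,
    max_batch_size=128,
    http_port=8000,                  # HTTP endpoint of the Triton server
    grpc_port=8001,                  # gRPC endpoint of the Triton server
    address="0.0.0.0",
    allow_http=True,
    allow_grpc=True,
    streaming=False,
)
```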
- deploy()#
Deploys the model to the Triton Inference Server.
- serve()#
Starts serving the model and waits for incoming requests.
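As a minimal sketch (assuming the nm object constructed above), serve() differs from run() in that it keeps the calling process busy while it waits for requests:

```python
nm.deploy()   # bind the model to the Triton Inference Server
nm.serve()    # waits for incoming requests until the server is shut down
```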
- run()#
Starts serving the model asynchronously.
- stop()#
Stops serving the model.
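A sketch of the non-blocking lifecycle, assuming the nm and nq objects from the example above:

```python
nm.deploy()   # set up the Triton Inference Server with the model
nm.run()      # start serving asynchronously; the call returns immediately
output = nq.query_llm(prompts=["hello"], max_output_len=32)  # query while the server runs
nm.stop()     # stop serving the model
```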