nemo_deploy.service.fastapi_interface_to_pytriton#

Module Contents#

Classes#

TritonSettings

TritonSettings class that reads the TRITON_HTTP_ADDRESS and TRITON_PORT environment variables.

BaseRequest

Common parameters for completions and chat requests for the server.

CompletionRequest

Represents a request for text completion.

ChatCompletionRequest

Represents a request for chat completion.

Functions#

health_check

Health check endpoint to verify that the API is running.

check_triton_health

This method exposes the “/triton_health” endpoint.

convert_numpy

Convert NumPy arrays in output to lists.

_helper_fun

run_in_executor does not accept kwargs, so this helper function takes the arguments positionally as a list.

query_llm_async

Sends requests to NemoQueryLLMPyTorch.query_llm in a non-blocking way.

completions_v1

Defines the completions endpoint and queries the model deployed on the PyTriton server.

dict_to_str

Serializes a dict to a str.

chat_completions_v1

Defines the chat completions endpoint and queries the model deployed on the PyTriton server.

Data#

app

triton_settings

API#

class nemo_deploy.service.fastapi_interface_to_pytriton.TritonSettings#

Bases: pydantic_settings.BaseSettings

TritonSettings class that reads the TRITON_HTTP_ADDRESS and TRITON_PORT environment variables.

Initialization

_triton_service_port: int = None#
_triton_service_ip: str = None#
property triton_service_port#

Returns the port number for the Triton service.

property triton_service_ip#

Returns the IP address for the Triton service.
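
A rough, plain-Python sketch of what such a settings object conveys. The real class subclasses pydantic_settings.BaseSettings; the env-var handling and defaults below are illustrative assumptions, not the module's implementation:

```python
import os


class TritonSettingsSketch:
    """Hypothetical stand-in for TritonSettings, not the module's implementation."""

    def __init__(self):
        # Defaults are illustrative assumptions, not the module's actual values.
        self._triton_service_ip = os.environ.get("TRITON_HTTP_ADDRESS", "0.0.0.0")
        self._triton_service_port = int(os.environ.get("TRITON_PORT", "8000"))

    @property
    def triton_service_ip(self) -> str:
        """Return the IP address for the Triton service."""
        return self._triton_service_ip

    @property
    def triton_service_port(self) -> int:
        """Return the port number for the Triton service."""
        return self._triton_service_port


settings = TritonSettingsSketch()
print(settings.triton_service_ip, settings.triton_service_port)
```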

nemo_deploy.service.fastapi_interface_to_pytriton.app = 'FastAPI(...)'#
nemo_deploy.service.fastapi_interface_to_pytriton.triton_settings = 'TritonSettings(...)'#
class nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest#

Bases: pydantic.BaseModel

Common parameters for completions and chat requests for the server.

.. attribute:: model

The name of the model to use for completion.

Type:

str

.. attribute:: max_tokens

The maximum number of tokens to generate in the response.

Type:

int

.. attribute:: temperature

Sampling temperature for randomness in generation.

Type:

float

.. attribute:: top_p

Cumulative probability for nucleus sampling.

Type:

float

.. attribute:: top_k

Number of highest-probability tokens to consider for sampling.

Type:

int

model: str = None#
max_tokens: int = 512#
temperature: float = 1.0#
top_p: float = 0.0#
top_k: int = 0#
set_greedy_params()#

Validate parameters for greedy decoding.

class nemo_deploy.service.fastapi_interface_to_pytriton.CompletionRequest#

Bases: nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest

Represents a request for text completion.

.. attribute:: prompt

The input text to generate a response from.

Type:

str

.. attribute:: logprobs

Number of log probabilities to include in the response, if applicable.

Type:

int

.. attribute:: echo

Whether to return the input text as part of the response.

Type:

bool

prompt: str = None#
logprobs: int = None#
echo: bool = False#
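
An illustrative construction of a completion request using the fields documented above; the model name is a placeholder, and any additional validation the class performs is not shown here:

```python
from nemo_deploy.service.fastapi_interface_to_pytriton import CompletionRequest

# Field names follow the attribute list above; the model name is hypothetical.
request = CompletionRequest(
    model="megatron_model",
    prompt="The capital of France is",
    max_tokens=32,
    temperature=1.0,
    top_p=0.0,
    top_k=1,
    logprobs=1,
    echo=False,
)
print(request.model_dump())
```
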
class nemo_deploy.service.fastapi_interface_to_pytriton.ChatCompletionRequest#

Bases: nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest

Represents a request for chat completion.

.. attribute:: messages

A list of message dictionaries for chat completion.

Type:

list[dict]

.. attribute:: logprobs

Whether to return log probabilities for output tokens.

Type:

bool

.. attribute:: top_logprobs

Number of log probabilities to include in the response, if applicable. logprobs must be set to true if this parameter is used.

Type:

int

messages: list[dict] = None#
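
An illustrative chat request; the role/content message shape is an assumption borrowed from the common OpenAI-style chat format, since this reference only states that messages is a list of dicts:

```python
from nemo_deploy.service.fastapi_interface_to_pytriton import ChatCompletionRequest

# The model name and the role/content message layout are assumptions.
request = ChatCompletionRequest(
    model="megatron_model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three prime numbers."},
    ],
    max_tokens=64,
    temperature=0.2,
    top_p=0.9,
    top_k=40,
)
print(request.model_dump())
```
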
nemo_deploy.service.fastapi_interface_to_pytriton.health_check()#

Health check endpoint to verify that the API is running.

Returns:

A dictionary indicating the status of the application.

Return type:

dict

async nemo_deploy.service.fastapi_interface_to_pytriton.check_triton_health()#

This method exposes the “/triton_health” endpoint.

Use it to verify that the Triton server is accessible while the REST/FastAPI application is running, for example: curl http://service_http_address:service_port/v1/triton_health. The returned status indicates whether the server is reachable.
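
The equivalent check from Python; the host and port below are placeholders for your deployment's service_http_address and service_port:

```python
import requests

# Replace host and port with your service_http_address and service_port.
resp = requests.get("http://localhost:8080/v1/triton_health", timeout=10)
print(resp.status_code, resp.json())
```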

nemo_deploy.service.fastapi_interface_to_pytriton.convert_numpy(obj)#

Convert NumPy arrays in output to lists.
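
A minimal sketch of such a conversion, recursing through dicts and lists and calling tolist() on NumPy arrays; the module's actual implementation may differ:

```python
import numpy as np


def convert_numpy_sketch(obj):
    # Recursively turn NumPy arrays into plain Python lists so the result
    # is JSON-serializable; dicts and lists are walked, other values pass through.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: convert_numpy_sketch(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_numpy_sketch(v) for v in obj]
    return obj
```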

nemo_deploy.service.fastapi_interface_to_pytriton._helper_fun(
url,
model,
prompts,
temperature,
top_k,
top_p,
compute_logprob,
max_length,
apply_chat_template,
n_top_logprobs,
echo,
)#

run_in_executor does not accept kwargs, so this helper function takes the arguments positionally as a list.

async nemo_deploy.service.fastapi_interface_to_pytriton.query_llm_async(
*,
url,
model,
prompts,
temperature,
top_k,
top_p,
compute_logprob,
max_length,
apply_chat_template,
n_top_logprobs,
echo,
)#

Sends requests to NemoQueryLLMPyTorch.query_llm in a non-blocking way.

This allows the server to process concurrent requests and enables request batching in the underlying Triton server.
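
A schematic sketch of the pattern described here: the keyword arguments are flattened into positional arguments for the helper, which the event loop's default executor runs in a worker thread so the coroutine does not block. The body of the helper (the blocking NemoQueryLLMPyTorch.query_llm call) is omitted; this is not the module's exact code.

```python
import asyncio


def _helper_fun_sketch(url, model, prompts, temperature, top_k, top_p,
                       compute_logprob, max_length, apply_chat_template,
                       n_top_logprobs, echo):
    # Placeholder for the blocking NemoQueryLLMPyTorch.query_llm(...) call.
    ...


async def query_llm_async_sketch(*, url, model, prompts, temperature, top_k,
                                 top_p, compute_logprob, max_length,
                                 apply_chat_template, n_top_logprobs, echo):
    loop = asyncio.get_running_loop()
    # run_in_executor accepts only positional args, hence the flat argument list.
    return await loop.run_in_executor(
        None,
        _helper_fun_sketch,
        url, model, prompts, temperature, top_k, top_p,
        compute_logprob, max_length, apply_chat_template, n_top_logprobs, echo,
    )
```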

async nemo_deploy.service.fastapi_interface_to_pytriton.completions_v1(
request: nemo_deploy.service.fastapi_interface_to_pytriton.CompletionRequest,
)#

Defines the completions endpoint and queries the model deployed on the PyTriton server.
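
An illustrative client call; the /v1/completions/ route, host, port, and model name are assumptions about how the FastAPI app is typically served, not values stated in this reference:

```python
import requests

# Route, host, port, and model name are assumptions; adjust to your deployment.
payload = {
    "model": "megatron_model",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 1.0,
    "top_k": 1,
}
resp = requests.post("http://localhost:8080/v1/completions/", json=payload, timeout=60)
print(resp.json())
```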

nemo_deploy.service.fastapi_interface_to_pytriton.dict_to_str(messages)#

Serializes a dict to a str.

async nemo_deploy.service.fastapi_interface_to_pytriton.chat_completions_v1(
request: nemo_deploy.service.fastapi_interface_to_pytriton.ChatCompletionRequest,
)#

Defines the chat completions endpoint and queries the model deployed on the PyTriton server.
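
As with completions above, the /v1/chat/completions/ route, host, port, and model name are assumptions, not values stated in this reference:

```python
import requests

# Route, host, port, and model name are assumptions; adjust to your deployment.
payload = {
    "model": "megatron_model",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "max_tokens": 64,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8080/v1/chat/completions/", json=payload, timeout=60)
print(resp.json())
```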