nemo_deploy.service.fastapi_interface_to_pytriton#

Module Contents#

Classes#

TritonSettings

TritonSettings class that reads the TRITON_HTTP_ADDRESS and TRITON_PORT environment variables.

BaseRequest

Common parameters for completions and chat requests for the server.

CompletionRequest

Represents a request for text completion.

ChatCompletionRequest

Represents a request for chat completion.

Functions#

health_check

Health check endpoint to verify that the API is running.

check_triton_health

This method exposes the “/triton_health” endpoint.

convert_numpy

Convert NumPy arrays in output to lists.

_helper_fun

run_in_executor does not accept kwargs, so this helper function takes the arguments positionally as a list.

query_llm_async

Sends requests to NemoQueryLLMPyTorch.query_llm in a non-blocking way.

completions_v1

Defines the completions endpoint and queries the model deployed on the PyTriton server.

dict_to_str

Serializes a dict to a str.

chat_completions_v1

Defines the chat completions endpoint and queries the model deployed on the PyTriton server.

Data#

app

triton_settings

API#

class nemo_deploy.service.fastapi_interface_to_pytriton.TritonSettings#

Bases: pydantic_settings.BaseSettings

TritonSettings class that reads the TRITON_HTTP_ADDRESS and TRITON_PORT environment variables.

Initialization

_triton_service_port: int = None#
_triton_service_ip: str = None#
property triton_service_port#

Returns the port number for the Triton service.

property triton_service_ip#

Returns the IP address for the Triton service.
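
A rough, plain-Python sketch of what such a settings object conveys. The real class subclasses pydantic_settings.BaseSettings; the env-var handling and defaults below are illustrative assumptions, not the module's implementation:

```python
import os


class TritonSettingsSketch:
    """Hypothetical stand-in for TritonSettings, not the module's implementation."""

    def __init__(self):
        # Defaults are illustrative assumptions, not the module's actual values.
        self._triton_service_ip = os.environ.get("TRITON_HTTP_ADDRESS", "0.0.0.0")
        self._triton_service_port = int(os.environ.get("TRITON_PORT", "8000"))

    @property
    def triton_service_ip(self) -> str:
        """Return the IP address for the Triton service."""
        return self._triton_service_ip

    @property
    def triton_service_port(self) -> int:
        """Return the port number for the Triton service."""
        return self._triton_service_port


settings = TritonSettingsSketch()
print(settings.triton_service_ip, settings.triton_service_port)
```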

nemo_deploy.service.fastapi_interface_to_pytriton.app = 'FastAPI(...)'#
nemo_deploy.service.fastapi_interface_to_pytriton.triton_settings = 'TritonSettings(...)'#
class nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest#

Bases: pydantic.BaseModel

Common parameters for completions and chat requests for the server.

.. attribute:: model

The name of the model to use for completion.

Type:

str

.. attribute:: max_tokens

The maximum number of tokens to generate in the response.

Type:

int

.. attribute:: temperature

Sampling temperature for randomness in generation.

Type:

float

.. attribute:: top_p

Cumulative probability for nucleus sampling.

Type:

float

.. attribute:: top_k

Number of highest-probability tokens to consider for sampling.

Type:

int

model: str = None#
max_tokens: int = 512#
temperature: float = 1.0#
top_p: float = 0.0#
top_k: int = 0#
set_greedy_params()#

Validate parameters for greedy decoding.

class nemo_deploy.service.fastapi_interface_to_pytriton.CompletionRequest#

Bases: nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest

Represents a request for text completion.

.. attribute:: prompt

The input text to generate a response from.

Type:

str

.. attribute:: logprobs

Number of log probabilities to include in the response, if applicable.

Type:

int

.. attribute:: echo

Whether to return the input text as part of the response.

Type:

bool

prompt: str = None#
logprobs: int = None#
echo: bool = False#
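
An illustrative construction of a completion request using the fields documented above; the model name is a placeholder, and any additional validation the class performs is not shown here:

```python
from nemo_deploy.service.fastapi_interface_to_pytriton import CompletionRequest

# Field names follow the attribute list above; the model name is hypothetical.
request = CompletionRequest(
    model="megatron_model",
    prompt="The capital of France is",
    max_tokens=32,
    temperature=1.0,
    top_p=0.0,
    top_k=1,
    logprobs=1,
    echo=False,
)
print(request.model_dump())
```
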
class nemo_deploy.service.fastapi_interface_to_pytriton.ChatCompletionRequest#

Bases: nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest

Represents a request for chat completion.

.. attribute:: messages

A list of message dictionaries for chat completion.

Type:

list[dict]

.. attribute:: logprobs

Whether to return log probabilities for output tokens.

Type:

bool

.. attribute:: top_logprobs

Number of log probabilities to include in the response, if applicable. logprobs must be set to true if this parameter is used.

Type:

int

messages: list[dict] = None#
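
An illustrative chat request; the role/content message shape is an assumption borrowed from the common OpenAI-style chat format, since this reference only states that messages is a list of dicts:

```python
from nemo_deploy.service.fastapi_interface_to_pytriton import ChatCompletionRequest

# The model name and the role/content message layout are assumptions.
request = ChatCompletionRequest(
    model="megatron_model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name three prime numbers."},
    ],
    max_tokens=64,
    temperature=0.2,
    top_p=0.9,
    top_k=40,
)
print(request.model_dump())
```
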
nemo_deploy.service.fastapi_interface_to_pytriton.health_check()#

Health check endpoint to verify that the API is running.

Returns:

A dictionary indicating the status of the application.

Return type:

dict

async nemo_deploy.service.fastapi_interface_to_pytriton.check_triton_health()#

This method exposes the “/triton_health” endpoint.

Use it to verify that the Triton server is accessible while the REST/FastAPI application is running, for example: curl http://service_http_address:service_port/v1/triton_health. The returned status indicates whether the server is reachable.
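
The equivalent check from Python; the host and port below are placeholders for your deployment's service_http_address and service_port:

```python
import requests

# Replace host and port with your service_http_address and service_port.
resp = requests.get("http://localhost:8080/v1/triton_health", timeout=10)
print(resp.status_code, resp.json())
```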

nemo_deploy.service.fastapi_interface_to_pytriton.convert_numpy(obj)#

Convert NumPy arrays in output to lists.
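
A minimal sketch of such a conversion, recursing through dicts and lists and calling tolist() on NumPy arrays; the module's actual implementation may differ:

```python
import numpy as np


def convert_numpy_sketch(obj):
    # Recursively turn NumPy arrays into plain Python lists so the result
    # is JSON-serializable; dicts and lists are walked, other values pass through.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: convert_numpy_sketch(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_numpy_sketch(v) for v in obj]
    return obj
```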

nemo_deploy.service.fastapi_interface_to_pytriton._helper_fun(
url,
model,
prompts,
temperature,
top_k,
top_p,
compute_logprob,
max_length,
apply_chat_template,
n_top_logprobs,
echo,
)#

run_in_executor does not accept kwargs, so this helper function takes the arguments positionally as a list.

async nemo_deploy.service.fastapi_interface_to_pytriton.query_llm_async(
*,
url,
model,
prompts,
temperature,
top_k,
top_p,
compute_logprob,
max_length,
apply_chat_template,
n_top_logprobs,
echo,
)#

Sends requests to NemoQueryLLMPyTorch.query_llm in a non-blocking way.

This allows the server to process concurrent requests and enables request batching in the underlying Triton server.
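
A schematic sketch of the pattern described here: the keyword arguments are flattened into positional arguments for the helper, which the event loop's default executor runs in a worker thread so the coroutine does not block. The body of the helper (the blocking NemoQueryLLMPyTorch.query_llm call) is omitted; this is not the module's exact code.

```python
import asyncio


def _helper_fun_sketch(url, model, prompts, temperature, top_k, top_p,
                       compute_logprob, max_length, apply_chat_template,
                       n_top_logprobs, echo):
    # Placeholder for the blocking NemoQueryLLMPyTorch.query_llm(...) call.
    ...


async def query_llm_async_sketch(*, url, model, prompts, temperature, top_k,
                                 top_p, compute_logprob, max_length,
                                 apply_chat_template, n_top_logprobs, echo):
    loop = asyncio.get_running_loop()
    # run_in_executor accepts only positional args, hence the flat argument list.
    return await loop.run_in_executor(
        None,
        _helper_fun_sketch,
        url, model, prompts, temperature, top_k, top_p,
        compute_logprob, max_length, apply_chat_template, n_top_logprobs, echo,
    )
```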

async nemo_deploy.service.fastapi_interface_to_pytriton.completions_v1(
request: nemo_deploy.service.fastapi_interface_to_pytriton.CompletionRequest,
)#

Defines the completions endpoint and queries the model deployed on the PyTriton server.
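
An illustrative client call; the /v1/completions/ route, host, port, and model name are assumptions about how the FastAPI app is typically served, not values stated in this reference:

```python
import requests

# Route, host, port, and model name are assumptions; adjust to your deployment.
payload = {
    "model": "megatron_model",
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 1.0,
    "top_k": 1,
}
resp = requests.post("http://localhost:8080/v1/completions/", json=payload, timeout=60)
print(resp.json())
```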

nemo_deploy.service.fastapi_interface_to_pytriton.dict_to_str(messages)#

Serializes a dict to a str.

async nemo_deploy.service.fastapi_interface_to_pytriton.chat_completions_v1(
request: nemo_deploy.service.fastapi_interface_to_pytriton.ChatCompletionRequest,
)#

Defines the chat completions endpoint and queries the model deployed on the PyTriton server.
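
As with completions above, the /v1/chat/completions/ route, host, port, and model name are assumptions, not values stated in this reference:

```python
import requests

# Route, host, port, and model name are assumptions; adjust to your deployment.
payload = {
    "model": "megatron_model",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "max_tokens": 64,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8080/v1/chat/completions/", json=payload, timeout=60)
print(resp.json())
```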