nemo_deploy.service.fastapi_interface_to_pytriton
#
Module Contents#
Classes#
- TritonSettings: TritonSettings class that gets the values of TRITON_HTTP_ADDRESS and TRITON_PORT.
- BaseRequest: Common parameters for completions and chat requests for the server.
- CompletionRequest: Represents a request for text completion.
- ChatCompletionRequest: Represents a request for chat completion.
Functions#
- health_check: Health check endpoint to verify that the API is running.
- check_triton_health: Exposes the "/triton_health" endpoint to verify that the Triton server is accessible.
- convert_numpy: Convert NumPy arrays in output to lists.
- _helper_fun: run_in_executor doesn't allow passing kwargs, so this helper function passes args as a list.
- query_llm_async: Sends requests to NemoQueryLLMPyTorch.query_llm in a non-blocking way.
- completions_v1: Defines the completions endpoint and queries the model deployed on the PyTriton server.
- dict_to_str: Serializes a dict to a str.
- chat_completions_v1: Defines the chat completions endpoint and queries the model deployed on the PyTriton server.
Data#
API#
- class nemo_deploy.service.fastapi_interface_to_pytriton.TritonSettings#
Bases:
pydantic_settings.BaseSettings
TritonSettings class that gets the values of TRITON_HTTP_ADDRESS and TRITON_PORT.
Initialization
- _triton_service_port: int = None#
- _triton_service_ip: str = None#
- property triton_service_port#
Returns the port number for the Triton service.
- property triton_service_ip#
Returns the IP address for the Triton service.
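The environment-driven settings pattern can be sketched with only the standard library. The variable names come from the docstring above; the fallback defaults below are illustrative assumptions, not the library's actual values:

```python
import os


class TritonSettingsSketch:
    """Reads Triton connection details from the environment.

    Mirrors the documented behavior: TRITON_HTTP_ADDRESS and TRITON_PORT
    are read at construction time. The fallback values are assumptions
    for illustration only.
    """

    def __init__(self):
        self._triton_service_ip = os.environ.get("TRITON_HTTP_ADDRESS", "0.0.0.0")
        self._triton_service_port = int(os.environ.get("TRITON_PORT", "8000"))

    @property
    def triton_service_ip(self):
        """Returns the IP address for the Triton service."""
        return self._triton_service_ip

    @property
    def triton_service_port(self):
        """Returns the port number for the Triton service."""
        return self._triton_service_port
```

The actual class builds on pydantic_settings.BaseSettings, which adds validation and .env-file support on top of this basic environment lookup.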
- nemo_deploy.service.fastapi_interface_to_pytriton.app = 'FastAPI(...)'#
- nemo_deploy.service.fastapi_interface_to_pytriton.triton_settings = 'TritonSettings(...)'#
- class nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest#
Bases:
pydantic.BaseModel
Common parameters for completions and chat requests for the server.
.. attribute:: model
The name of the model to use for completion.
- Type:
str
.. attribute:: max_tokens
The maximum number of tokens to generate in the response.
- Type:
int
.. attribute:: temperature
Sampling temperature for randomness in generation.
- Type:
float
.. attribute:: top_p
Cumulative probability for nucleus sampling.
- Type:
float
.. attribute:: top_k
Number of highest-probability tokens to consider for sampling.
- Type:
int
- model: str = None#
- max_tokens: int = 512#
- temperature: float = 1.0#
- top_p: float = 0.0#
- top_k: int = 0#
- set_greedy_params()#
Validate parameters for greedy decoding.
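The docstring only says the validator enforces greedy-decoding parameters. A plausible sketch of that intent, assuming greedy decoding is requested via temperature=0 (the exact rules in the library may differ):

```python
def set_greedy_params_sketch(temperature, top_k, top_p):
    """Illustrative validator for greedy decoding (assumed semantics).

    If temperature is 0 the request is treated as greedy decoding:
    top_k is forced to 1 so only the single most likely token is kept,
    and nucleus sampling (top_p) is disabled. This mirrors the *intent*
    of the documented set_greedy_params() validator, not its exact code.
    """
    if temperature == 0.0:
        top_k = 1
        top_p = 0.0
    return temperature, top_k, top_p
```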
- class nemo_deploy.service.fastapi_interface_to_pytriton.CompletionRequest#
Bases:
nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest
Represents a request for text completion.
.. attribute:: prompt
The input text to generate a response from.
- Type:
str
.. attribute:: logprobs
Number of log probabilities to include in the response, if applicable.
- Type:
int
.. attribute:: echo
Whether to return the input text as part of the response.
- Type:
bool
- prompt: str = None#
- logprobs: int = None#
- echo: bool = False#
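A request body matching this schema can be built as plain JSON. The model name below is a placeholder, and the field values are illustrative; only the field names and types come from the attributes documented above:

```python
import json

# Fields follow the CompletionRequest schema (BaseRequest + prompt,
# logprobs, echo). "my-model" is a placeholder name.
payload = {
    "model": "my-model",
    "prompt": "The capital of France is",
    "max_tokens": 32,    # BaseRequest default is 512
    "temperature": 1.0,
    "top_p": 0.0,
    "top_k": 0,
    "logprobs": 1,       # number of log probabilities to return (int here)
    "echo": False,       # do not return the prompt in the response
}
body = json.dumps(payload)
```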
- class nemo_deploy.service.fastapi_interface_to_pytriton.ChatCompletionRequest#
Bases:
nemo_deploy.service.fastapi_interface_to_pytriton.BaseRequest
Represents a request for chat completion.
.. attribute:: messages
A list of message dictionaries for chat completion.
- Type:
list[dict]
.. attribute:: logprobs
Whether to return log probabilities for output tokens.
- Type:
bool
.. attribute:: top_logprobs
Number of log probabilities to include in the response, if applicable. logprobs must be set to true if this parameter is used.
- Type:
int
- messages: list[dict] = None#
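A chat request body differs from the completions one in that messages replaces prompt and logprobs is a boolean rather than an int. A sketch with placeholder values (only the field names and types come from the schema above):

```python
import json

# Fields follow the ChatCompletionRequest schema. "my-model" is a
# placeholder; message dicts use the conventional role/content keys.
payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "What is PyTriton?"},
    ],
    "max_tokens": 64,
    "logprobs": True,    # boolean here, unlike the completions endpoint
    "top_logprobs": 2,   # requires logprobs to be true
}
body = json.dumps(payload)
```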
- nemo_deploy.service.fastapi_interface_to_pytriton.health_check()#
Health check endpoint to verify that the API is running.
- Returns:
A dictionary indicating the status of the application.
- Return type:
dict
- async nemo_deploy.service.fastapi_interface_to_pytriton.check_triton_health()#
This method exposes the "/triton_health" endpoint.
It can be used to verify that the Triton server is accessible while the REST or FastAPI application is running. Verify by running curl http://service_http_address:service_port/v1/triton_health; the returned status indicates whether the server is accessible.
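The same probe can be issued from Python with the standard library. The URL shape follows the curl command above; the helper name is ours, not the library's:

```python
import json
from urllib.request import urlopen


def triton_health_url(service_http_address, service_port):
    """Builds the documented health-check URL for the FastAPI app."""
    return f"http://{service_http_address}:{service_port}/v1/triton_health"


def probe_triton_health(service_http_address, service_port, timeout=5.0):
    """GETs /v1/triton_health and returns the decoded JSON status dict.

    Requires the FastAPI application to be running at the given address.
    """
    url = triton_health_url(service_http_address, service_port)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())
```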
- nemo_deploy.service.fastapi_interface_to_pytriton.convert_numpy(obj)#
Convert NumPy arrays in output to lists.
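A converter like this typically has to recurse through dicts and lists, since model outputs are nested. The sketch below duck-types on .tolist() so it runs without NumPy installed; the library's version presumably checks np.ndarray directly:

```python
def convert_numpy_sketch(obj):
    """Recursively replaces array-like objects with plain Python lists.

    Anything exposing .tolist() (such as a NumPy array) is converted;
    dicts, lists, and tuples are walked recursively; everything else is
    returned unchanged. This makes the structure JSON-serializable.
    """
    if hasattr(obj, "tolist"):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: convert_numpy_sketch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [convert_numpy_sketch(v) for v in obj]
    return obj
```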
- nemo_deploy.service.fastapi_interface_to_pytriton._helper_fun(
- url,
- model,
- prompts,
- temperature,
- top_k,
- top_p,
- compute_logprob,
- max_length,
- apply_chat_template,
- n_top_logprobs,
- echo,
- )#
run_in_executor doesn't allow passing kwargs, so we have this helper function to pass args as a list.
- async nemo_deploy.service.fastapi_interface_to_pytriton.query_llm_async(
- *,
- url,
- model,
- prompts,
- temperature,
- top_k,
- top_p,
- compute_logprob,
- max_length,
- apply_chat_template,
- n_top_logprobs,
- echo,
- )#
Sends requests to
NemoQueryLLMPyTorch.query_llm
in a non-blocking way. This allows the server to process concurrent requests and enables batching of requests in the underlying Triton server.
- async nemo_deploy.service.fastapi_interface_to_pytriton.completions_v1( )#
Defines the completions endpoint and queries the model deployed on PyTriton server.
- nemo_deploy.service.fastapi_interface_to_pytriton.dict_to_str(messages)#
Serializes dict to str.
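A minimal sketch of such a serializer, assuming JSON encoding (the docstring does not name the format):

```python
import json


def dict_to_str_sketch(messages):
    """Serializes a dict (e.g. chat messages) to a JSON string."""
    return json.dumps(messages)
```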
- async nemo_deploy.service.fastapi_interface_to_pytriton.chat_completions_v1( )#
Defines the chat completions endpoint and queries the model deployed on PyTriton server.