nemo_deploy.nlp.query_llm#

Module Contents#

Classes#

NemoQueryLLMBase

Abstract base class for querying a Large Language Model (LLM).

NemoQueryLLMPyTorch

Sends a query to Triton for LLM inference.

NemoQueryLLMHF

Sends a query to Triton for LLM inference.

NemoQueryLLM

Sends a query to Triton for LLM inference.

NemoQueryTRTLLMAPI

Sends a query to Triton for TensorRT-LLM API deployment inference.

API#

class nemo_deploy.nlp.query_llm.NemoQueryLLMBase(url, model_name)[source]#

Bases: abc.ABC

Abstract base class for querying a Large Language Model (LLM).

Parameters:
  • url (str) – The URL of the inference server.

  • model_name (str) – The name of the model to be queried.

Initialization

class nemo_deploy.nlp.query_llm.NemoQueryLLMPyTorch(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
use_greedy: Optional[bool] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
add_BOS: Optional[bool] = None,
all_probs: Optional[bool] = None,
compute_logprob: Optional[bool] = None,
end_strings: Optional[List[str]] = None,
min_length: Optional[int] = None,
max_length: Optional[int] = None,
apply_chat_template: bool = False,
n_top_logprobs: Optional[int] = None,
init_timeout: float = 60.0,
echo: Optional[bool] = None,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • use_greedy (bool) – use greedy sampling, effectively the same as top_k=1

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • repetition_penalty (float) – penalty applied to repeated sequences, 1.0 means no penalty.

  • add_BOS (bool) – whether or not to add a BOS (beginning of sentence) token.

  • all_probs (bool) – when using compute_logprob, returns probabilities for all tokens in vocabulary.

  • compute_logprob (bool) – get back probabilities of all tokens in the sequence.

  • end_strings (List[str]) – list of strings which will terminate generation when they appear in the output.

  • min_length (int) – min generated tokens.

  • max_length (int) – max generated tokens.

  • apply_chat_template (bool) – applies the chat template if the model is a chat model. Default: False

  • init_timeout (float) – timeout for the connection.
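
A minimal usage sketch for this method, not taken from the source docs: it assumes a Triton server reachable at localhost that serves a model named "GPT-2B", and it requests per-token log probabilities alongside the generated text.

from nemo_deploy.nlp.query_llm import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_length=64,
    top_k=1,
    temperature=0.0,
    compute_logprob=True,  # also return log probabilities of the generated tokens
    end_strings=["."],     # stop generation when a period appears in the output
    init_timeout=120.0,    # allow extra time for the first connection
)
print(output)              # the response layout depends on the deployed model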

class nemo_deploy.nlp.query_llm.NemoQueryLLMHF(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
use_greedy: Optional[bool] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
add_BOS: Optional[bool] = None,
all_probs: Optional[bool] = None,
output_logits: Optional[bool] = None,
output_scores: Optional[bool] = None,
end_strings: Optional[List[str]] = None,
min_length: Optional[int] = None,
max_length: Optional[int] = None,
init_timeout: float = 60.0,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • use_greedy (Optional[bool]) – use greedy sampling, effectively the same as top_k=1

  • temperature (Optional[float]) – A parameter of the softmax function, which is the last layer in the network.

  • top_k (Optional[int]) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (Optional[float]) – limits us to the top tokens within a certain probability mass (p).

  • repetition_penalty (Optional[float]) – penalty applied to repeated sequences, 1.0 means no penalty.

  • add_BOS (Optional[bool]) – whether or not to add a BOS (beginning of sentence) token.

  • all_probs (Optional[bool]) – when using compute_logprob, returns probabilities for all tokens in vocabulary.

  • output_logits (Optional[bool]) – whether to return logits for each token

  • output_scores (Optional[bool]) – whether to return scores for each token

  • end_strings (Optional[List[str]]) – list of strings which will stop generation when they appear in the output.

  • min_length (Optional[int]) – min generated tokens.

  • max_length (Optional[int]) – max generated tokens.

  • init_timeout (float) – timeout for the connection.
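
A minimal sketch for the Hugging Face-backed variant, under the same assumptions (Triton server at localhost, model name "GPT-2B"); it additionally asks for logits and scores per generated token.

from nemo_deploy.nlp.query_llm import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_length=64,
    top_k=1,
    temperature=0.0,
    output_logits=True,   # return logits for each generated token
    output_scores=True,   # return scores for each generated token
)
print(output)             # the response layout depends on the deployed model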

class nemo_deploy.nlp.query_llm.NemoQueryLLM(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts,
stop_words_list=None,
bad_words_list=None,
no_repeat_ngram_size=None,
min_output_len=None,
max_output_len=None,
top_k=None,
top_p=None,
temperature=None,
random_seed=None,
lora_uids=None,
use_greedy: bool = None,
repetition_penalty: float = None,
add_BOS: bool = None,
all_probs: bool = None,
compute_logprob: bool = None,
end_strings=None,
init_timeout=60.0,
openai_format_response: bool = False,
output_context_logits: bool = False,
output_generation_logits: bool = False,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • max_output_len (int) – max generated tokens.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • random_seed (int) – Seed to condition sampling.

  • stop_words_list (List[str]) – list of stop words.

  • bad_words_list (List[str]) – list of bad words.

  • no_repeat_ngram_size (int) – no repeat ngram size.

  • init_timeout (float) – timeout for the connection.

  • openai_format_response (bool) – return the response in a format similar to the OpenAI API.

  • output_generation_logits (bool) – return generation logits from the model on PyTriton.
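
A minimal sketch, again assuming a localhost deployment of a model named "GPT-2B"; it adds stop words, a fixed random seed, and asks for an OpenAI-style response. The exact shape of that response is not documented here, so it is simply printed.

from nemo_deploy.nlp.query_llm import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
    stop_words_list=["\n\n"],     # stop generation at a blank line
    random_seed=42,               # make sampling reproducible
    openai_format_response=True,  # return an OpenAI-API-like response
)
print(output)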

class nemo_deploy.nlp.query_llm.NemoQueryTRTLLMAPI(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for TensorRT-LLM API deployment inference.

Example

from nemo_deploy import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=None,
    temperature=None,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
max_length: int = 256,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
init_timeout: float = 60.0,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • max_length (int) – max generated tokens.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • init_timeout (float) – timeout for the connection.

Returns:

A list of generated texts, one for each input prompt.

Return type:

List[str]
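
Because the documented return type is List[str] with one generated text per prompt, a short sketch (assuming the same localhost deployment as in the example above) can pair each prompt with its completion directly:

from nemo_deploy.nlp.query_llm import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
texts = nq.query_llm(prompts=prompts, max_length=100)
for prompt, text in zip(prompts, texts):
    print(prompt, "->", text)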