nemo_deploy.nlp.query_llm#

Module Contents#

Classes#

NemoQueryLLMBase

Abstract base class for querying a Large Language Model (LLM).

NemoQueryLLMPyTorch

Sends a query to Triton for LLM inference.

NemoQueryLLMHF

Sends a query to Triton for LLM inference.

NemoQueryLLM

Sends a query to Triton for LLM inference.

NemoQueryTRTLLMAPI

Sends a query to Triton for TensorRT-LLM API deployment inference.

API#

class nemo_deploy.nlp.query_llm.NemoQueryLLMBase(url, model_name)[source]#

Bases: abc.ABC

Abstract base class for querying a Large Language Model (LLM).

Parameters:
  • url (str) – The URL of the inference server.

  • model_name (str) – The name of the model to be queried.

Initialization

class nemo_deploy.nlp.query_llm.NemoQueryLLMPyTorch(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
use_greedy: Optional[bool] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
add_BOS: Optional[bool] = None,
all_probs: Optional[bool] = None,
compute_logprob: Optional[bool] = None,
end_strings: Optional[List[str]] = None,
min_length: Optional[int] = None,
max_length: Optional[int] = None,
apply_chat_template: bool = False,
n_top_logprobs: Optional[int] = None,
init_timeout: float = 60.0,
echo: Optional[bool] = None,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • use_greedy (bool) – use greedy sampling, effectively the same as top_k=1

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • repetition_penalty (float) – penalty applied to repeated sequences, 1.0 means no penalty.

  • add_BOS (bool) – whether or not to add a BOS (beginning of sentence) token.

  • all_probs (bool) – when using compute_logprob, returns probabilities for all tokens in vocabulary.

  • compute_logprob (bool) – get back probabilities of all tokens in the sequence.

  • end_strings (List[str]) – list of strings which will terminate generation when they appear in the output.

  • min_length (int) – min generated tokens.

  • max_length (int) – max generated tokens.

  • apply_chat_template (bool) – applies the chat template if the model is a chat model. Default: False

  • init_timeout (float) – timeout for the connection.
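
A minimal usage sketch for this method, not taken from the source docs: it assumes a Triton server reachable at localhost that serves a model named "GPT-2B", and it requests per-token log probabilities alongside the generated text.

from nemo_deploy.nlp.query_llm import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_length=64,
    top_k=1,
    temperature=0.0,
    compute_logprob=True,  # also return log probabilities of the generated tokens
    end_strings=["."],     # stop generation when a period appears in the output
    init_timeout=120.0,    # allow extra time for the first connection
)
print(output)              # the response layout depends on the deployed model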

class nemo_deploy.nlp.query_llm.NemoQueryLLMHF(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
use_greedy: Optional[bool] = None,
temperature: Optional[float] = None,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
repetition_penalty: Optional[float] = None,
add_BOS: Optional[bool] = None,
all_probs: Optional[bool] = None,
output_logits: Optional[bool] = None,
output_scores: Optional[bool] = None,
end_strings: Optional[List[str]] = None,
min_length: Optional[int] = None,
max_length: Optional[int] = None,
init_timeout: float = 60.0,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • use_greedy (Optional[bool]) – use greedy sampling, effectively the same as top_k=1

  • temperature (Optional[float]) – A parameter of the softmax function, which is the last layer in the network.

  • top_k (Optional[int]) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (Optional[float]) – limits us to the top tokens within a certain probability mass (p).

  • repetition_penalty (Optional[float]) – penalty applied to repeated sequences, 1.0 means no penalty.

  • add_BOS (Optional[bool]) – whether or not to add a BOS (beginning of sentence) token.

  • all_probs (Optional[bool]) – when using compute_logprob, returns probabilities for all tokens in vocabulary.

  • output_logits (Optional[bool]) – whether to return logits for each token

  • output_scores (Optional[bool]) – whether to return scores for each token

  • end_strings (Optional[List[str]]) – list of strings which will stop generation when they appear in the output.

  • min_length (Optional[int]) – min generated tokens.

  • max_length (Optional[int]) – max generated tokens.

  • init_timeout (float) – timeout for the connection.
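
A minimal sketch for the Hugging Face-backed variant, under the same assumptions (Triton server at localhost, model name "GPT-2B"); it additionally asks for logits and scores per generated token.

from nemo_deploy.nlp.query_llm import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_length=64,
    top_k=1,
    temperature=0.0,
    output_logits=True,   # return logits for each generated token
    output_scores=True,   # return scores for each generated token
)
print(output)             # the response layout depends on the deployed model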

class nemo_deploy.nlp.query_llm.NemoQueryLLM(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for LLM inference.

Example

from nemo_deploy import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts,
stop_words_list=None,
bad_words_list=None,
no_repeat_ngram_size=None,
min_output_len=None,
max_output_len=None,
top_k=None,
top_p=None,
temperature=None,
random_seed=None,
lora_uids=None,
use_greedy: bool = None,
repetition_penalty: float = None,
add_BOS: bool = None,
all_probs: bool = None,
compute_logprob: bool = None,
end_strings=None,
init_timeout=60.0,
openai_format_response: bool = False,
output_context_logits: bool = False,
output_generation_logits: bool = False,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • max_output_len (int) – max generated tokens.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • random_seed (int) – Seed to condition sampling.

  • stop_words_list (List[str]) – list of stop words.

  • bad_words_list (List[str]) – list of bad words.

  • no_repeat_ngram_size (int) – no repeat ngram size.

  • init_timeout (float) – timeout for the connection.

  • openai_format_response (bool) – return the response in a format similar to the OpenAI API.

  • output_generation_logits (bool) – return generation logits from the model on PyTriton.
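
A minimal sketch, again assuming a localhost deployment of a model named "GPT-2B"; it adds stop words, a fixed random seed, and asks for an OpenAI-style response. The exact shape of that response is not documented here, so it is simply printed.

from nemo_deploy.nlp.query_llm import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")
output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
    stop_words_list=["\n\n"],     # stop generation at a blank line
    random_seed=42,               # make sampling reproducible
    openai_format_response=True,  # return an OpenAI-API-like response
)
print(output)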

class nemo_deploy.nlp.query_llm.NemoQueryTRTLLMAPI(url, model_name)[source]#

Bases: nemo_deploy.nlp.query_llm.NemoQueryLLMBase

Sends a query to Triton for TensorRT-LLM API deployment inference.

Example

from nemo_deploy import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=None,
    temperature=None,
)
print("prompts: ", prompts)

Initialization

query_llm(
prompts: List[str],
max_length: int = 256,
top_k: Optional[int] = None,
top_p: Optional[float] = None,
temperature: Optional[float] = None,
init_timeout: float = 60.0,
)[source]#

Query the Triton server synchronously and return a list of responses.

Parameters:
  • prompts (List[str]) – list of sentences.

  • max_length (int) – max generated tokens.

  • top_k (int) – limits us to a certain number (K) of the top tokens to consider.

  • top_p (float) – limits us to the top tokens within a certain probability mass (p).

  • temperature (float) – A parameter of the softmax function, which is the last layer in the network.

  • init_timeout (float) – timeout for the connection.

Returns:

A list of generated texts, one for each input prompt.

Return type:

List[str]
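
Because the documented return type is List[str] with one generated text per prompt, a short sketch (assuming the same localhost deployment as in the example above) can pair each prompt with its completion directly:

from nemo_deploy.nlp.query_llm import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
texts = nq.query_llm(prompts=prompts, max_length=100)
for prompt, text in zip(prompts, texts):
    print(prompt, "->", text)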