nemo_deploy.nlp.query_llm#
Module Contents#
Classes#
| Class | Description |
|---|---|
| NemoQueryLLMBase | Abstract base class for querying a Large Language Model (LLM). |
| NemoQueryLLMPyTorch | Sends a query to Triton for LLM inference. |
| NemoQueryLLMHF | Sends a query to Triton for LLM inference. |
| NemoQueryLLM | Sends a query to Triton for LLM inference. |
| NemoQueryTRTLLMAPI | Sends a query to Triton for TensorRT-LLM API deployment inference. |
API#
- class nemo_deploy.nlp.query_llm.NemoQueryLLMBase(url, model_name)[source]#
Bases:
abc.ABC
Abstract base class for querying a Large Language Model (LLM).
- Parameters:
url (str) – The URL of the inference server.
model_name (str) – The name of the model to be queried.
Initialization
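All concrete query classes below share this constructor. A minimal sketch, assuming a Triton server is reachable at `localhost` and serves a model deployed under the name `GPT-2B` (both are placeholders taken from this page's examples):

```python
from nemo_deploy import NemoQueryLLM
from nemo_deploy.nlp.query_llm import NemoQueryLLMBase

# Every concrete querier takes the same two constructor arguments:
# the inference server URL and the name the model was deployed under.
nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")
assert isinstance(nq, NemoQueryLLMBase)  # all queriers derive from the ABC
```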
- class nemo_deploy.nlp.query_llm.NemoQueryLLMPyTorch(url, model_name)[source]#
Bases:
nemo_deploy.nlp.query_llm.NemoQueryLLMBase
Sends a query to Triton for LLM inference.
Example

```python
from nemo_deploy import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)
```
Initialization
- query_llm(
- prompts: List[str],
- use_greedy: Optional[bool] = None,
- temperature: Optional[float] = None,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- repetition_penalty: Optional[float] = None,
- add_BOS: Optional[bool] = None,
- all_probs: Optional[bool] = None,
- compute_logprob: Optional[bool] = None,
- end_strings: Optional[List[str]] = None,
- min_length: Optional[int] = None,
- max_length: Optional[int] = None,
- apply_chat_template: bool = False,
- n_top_logprobs: Optional[int] = None,
- init_timeout: float = 60.0,
- echo: Optional[bool] = None,
- )
Query the Triton server synchronously and return a list of responses.
- Parameters:
prompts (List[str]) – list of input sentences.
use_greedy (bool) – use greedy sampling; effectively the same as top_k=1.
temperature (float) – softmax sampling temperature; lower values make the output more deterministic.
top_k (int) – limits sampling to the K most likely tokens.
top_p (float) – limits sampling to the smallest set of top tokens whose cumulative probability mass reaches p.
repetition_penalty (float) – penalty applied to repeated sequences; 1.0 means no penalty.
add_BOS (bool) – whether to add a BOS (beginning of sentence) token.
all_probs (bool) – when using compute_logprob, return probabilities for all tokens in the vocabulary.
compute_logprob (bool) – return log probabilities for all tokens in the sequence.
end_strings (List[str]) – strings that terminate generation when they appear in the output.
min_length (int) – minimum number of generated tokens.
max_length (int) – maximum number of generated tokens.
apply_chat_template (bool) – apply the chat template if the model is a chat model. Default: False.
n_top_logprobs (int) – number of top candidate log probabilities to return per generated token.
init_timeout (float) – timeout, in seconds, for the connection.
echo (bool) – whether to echo the prompt back in the returned output.
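A hedged sketch of a call that requests log probabilities alongside the generated text. It assumes a running Triton server at `localhost` with an in-framework PyTorch deployment named `GPT-2B`; both names are placeholders, and the prompt is illustrative:

```python
from nemo_deploy import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost", model_name="GPT-2B")

output = nq.query_llm(
    prompts=["Summarize what a KV cache does."],
    max_length=128,
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    compute_logprob=True,      # also return per-token log probabilities
    apply_chat_template=True,  # only meaningful for chat-tuned models
)
print(output)
```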
- class nemo_deploy.nlp.query_llm.NemoQueryLLMHF(url, model_name)[source]#
Bases:
nemo_deploy.nlp.query_llm.NemoQueryLLMBase
Sends a query to Triton for LLM inference.
Example

```python
from nemo_deploy import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)
```
Initialization
- query_llm(
- prompts: List[str],
- use_greedy: Optional[bool] = None,
- temperature: Optional[float] = None,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- repetition_penalty: Optional[float] = None,
- add_BOS: Optional[bool] = None,
- all_probs: Optional[bool] = None,
- output_logits: Optional[bool] = None,
- output_scores: Optional[bool] = None,
- end_strings: Optional[List[str]] = None,
- min_length: Optional[int] = None,
- max_length: Optional[int] = None,
- init_timeout: float = 60.0,
- )
Query the Triton server synchronously and return a list of responses.
- Parameters:
prompts (List[str]) – list of input sentences.
use_greedy (Optional[bool]) – use greedy sampling; effectively the same as top_k=1.
temperature (Optional[float]) – softmax sampling temperature; lower values make the output more deterministic.
top_k (Optional[int]) – limits sampling to the K most likely tokens.
top_p (Optional[float]) – limits sampling to the smallest set of top tokens whose cumulative probability mass reaches p.
repetition_penalty (Optional[float]) – penalty applied to repeated sequences; 1.0 means no penalty.
add_BOS (Optional[bool]) – whether to add a BOS (beginning of sentence) token.
all_probs (Optional[bool]) – when using compute_logprob, return probabilities for all tokens in the vocabulary.
output_logits (Optional[bool]) – whether to return logits for each token.
output_scores (Optional[bool]) – whether to return scores for each token.
end_strings (Optional[List[str]]) – strings that terminate generation when they appear in the output.
min_length (Optional[int]) – minimum number of generated tokens.
max_length (Optional[int]) – maximum number of generated tokens.
init_timeout (float) – timeout, in seconds, for the connection.
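A minimal sketch of requesting logits and scores from a Hugging Face deployment; `localhost` and `GPT-2B` are placeholders for a running deployment:

```python
from nemo_deploy import NemoQueryLLMHF

nq = NemoQueryLLMHF(url="localhost", model_name="GPT-2B")

output = nq.query_llm(
    prompts=["hello, testing GPT inference"],
    max_length=100,
    top_k=1,             # greedy-like decoding
    output_logits=True,  # return per-token logits
    output_scores=True,  # return per-token scores
)
print(output)
```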
- class nemo_deploy.nlp.query_llm.NemoQueryLLM(url, model_name)[source]#
Bases:
nemo_deploy.nlp.query_llm.NemoQueryLLMBase
Sends a query to Triton for LLM inference.
Example

```python
from nemo_deploy import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("prompts: ", prompts)
```
Initialization
- query_llm(
- prompts,
- stop_words_list=None,
- bad_words_list=None,
- no_repeat_ngram_size=None,
- min_output_len=None,
- max_output_len=None,
- top_k=None,
- top_p=None,
- temperature=None,
- random_seed=None,
- lora_uids=None,
- use_greedy: bool = None,
- repetition_penalty: float = None,
- add_BOS: bool = None,
- all_probs: bool = None,
- compute_logprob: bool = None,
- end_strings=None,
- init_timeout=60.0,
- openai_format_response: bool = False,
- output_context_logits: bool = False,
- output_generation_logits: bool = False,
- )
Query the Triton server synchronously and return a list of responses.
- Parameters:
prompts (List[str]) – list of input sentences.
stop_words_list (List[str]) – list of stop words that terminate generation.
bad_words_list (List[str]) – list of words that must not appear in the output.
no_repeat_ngram_size (int) – size of n-grams that may not repeat in the output.
min_output_len (int) – minimum number of generated tokens.
max_output_len (int) – maximum number of generated tokens.
top_k (int) – limits sampling to the K most likely tokens.
top_p (float) – limits sampling to the smallest set of top tokens whose cumulative probability mass reaches p.
temperature (float) – softmax sampling temperature; lower values make the output more deterministic.
random_seed (int) – seed to condition sampling.
lora_uids (List[str]) – UIDs of LoRA weights to apply during inference.
use_greedy (bool) – use greedy sampling; effectively the same as top_k=1.
repetition_penalty (float) – penalty applied to repeated sequences; 1.0 means no penalty.
add_BOS (bool) – whether to add a BOS (beginning of sentence) token.
all_probs (bool) – when using compute_logprob, return probabilities for all tokens in the vocabulary.
compute_logprob (bool) – return log probabilities for all tokens in the sequence.
end_strings (List[str]) – strings that terminate generation when they appear in the output.
init_timeout (float) – timeout, in seconds, for the connection.
openai_format_response (bool) – return the response in an OpenAI-API-like format.
output_context_logits (bool) – return context logits from the model on PyTriton.
output_generation_logits (bool) – return generation logits from the model on PyTriton.
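A sketch combining stop words, a fixed seed, and the OpenAI-style response format. It assumes a TensorRT-LLM engine already deployed to Triton under the placeholder name `GPT-2B`; the stop word and seed values are arbitrary examples:

```python
from nemo_deploy import NemoQueryLLM

nq = NemoQueryLLM(url="localhost", model_name="GPT-2B")

output = nq.query_llm(
    prompts=["Write a haiku about GPUs."],
    max_output_len=64,
    top_k=1,
    stop_words_list=["\n\n"],     # stop at the first blank line
    random_seed=1234,             # make sampling reproducible
    openai_format_response=True,  # OpenAI-API-like response shape
)
print(output)
```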
- class nemo_deploy.nlp.query_llm.NemoQueryTRTLLMAPI(url, model_name)[source]#
Bases:
nemo_deploy.nlp.query_llm.NemoQueryLLMBase
Sends a query to Triton for TensorRT-LLM API deployment inference.
Example

```python
from nemo_deploy import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")
prompts = ["hello, testing GPT inference", "another GPT inference test?"]
output = nq.query_llm(
    prompts=prompts,
    max_length=100,
    top_k=1,
    top_p=None,
    temperature=None,
)
print("prompts: ", prompts)
```
Initialization
- query_llm(
- prompts: List[str],
- max_length: int = 256,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- temperature: Optional[float] = None,
- init_timeout: float = 60.0,
- )
Query the Triton server synchronously and return a list of responses.
- Parameters:
prompts (List[str]) – list of input sentences.
max_length (int) – maximum number of generated tokens.
top_k (int) – limits sampling to the K most likely tokens.
top_p (float) – limits sampling to the smallest set of top tokens whose cumulative probability mass reaches p.
temperature (float) – softmax sampling temperature; lower values make the output more deterministic.
init_timeout (float) – timeout, in seconds, for the connection.
- Returns:
A list of generated texts, one for each input prompt.
- Return type:
List[str]
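Since this method documents a `List[str]` return, a sketch can pair each prompt with its generation directly; as above, `localhost` and `GPT-2B` are placeholders for a running deployment:

```python
from nemo_deploy import NemoQueryTRTLLMAPI

nq = NemoQueryTRTLLMAPI(url="localhost", model_name="GPT-2B")

prompts = ["hello, testing GPT inference", "another GPT inference test?"]
texts = nq.query_llm(prompts=prompts, max_length=256, temperature=0.2)

# query_llm returns one generated string per input prompt.
for prompt, text in zip(prompts, texts):
    print(f"{prompt!r} -> {text!r}")
```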