Send Queries to the NVIDIA Triton Server for NeMo LLMs
After you start the service with the scripts supplied in the TensorRT-LLM, vLLM, and In-Framework sections, it runs in standby mode, ready to receive incoming requests. You can send queries to the service in several ways:
- Query script: execute the query script within the currently running container.
- PyTriton: use PyTriton to send requests directly.
- HTTP requests: make HTTP requests using various tools or libraries; a sketch follows this list.
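For the HTTP route, the following is a minimal sketch that uses Python's requests library against Triton's KServe v2 inference endpoint. The input tensor name "prompts", its shape, and the output layout are illustrative assumptions; inspect your deployed model's configuration (for example, at /v2/models/nemotron/config) for the actual tensor names and datatypes.

# Minimal sketch: send an inference request to Triton's HTTP (KServe v2) endpoint.
# NOTE: the input tensor name "prompts" is an assumption; check your model's
# config at /v2/models/<model_name>/config for the real input/output names.
import requests

url = "http://localhost:8000/v2/models/nemotron/infer"
payload = {
    "inputs": [
        {
            "name": "prompts",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of the United States?"],
        }
    ]
}
response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())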
Send a Query using the Script
The following example executes the query script within the currently running container:
python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of the United States?"
Change the url and the model_name based on your server and the model name of your service. The code in the script can also be used as a basis for your client code. If there is a prompt embedding table, run the following command to send a query:
python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of the United States?" --task_id "task 1"
The following parameters are defined in the query.py script (an example invocation that sets them all follows the list):

- --url: URL for the Triton server. Default="0.0.0.0".
- --model_name: Name of the Triton model to query.
- --prompt: User prompt.
- --max_output_len: Maximum output token length. Default=128.
- --top_k: Considers only the top N most likely tokens at each step.
- --top_p: Determines the cumulative probability distribution used for sampling the next token in the generated response; controls the diversity of the output.
- --temperature: Controls the randomness of the generated output. A higher value, such as 1.0, leads to more randomness and diversity in the generated text, while a lower value, like 0.2, produces more focused and deterministic responses.
- --task_id: ID of a task if p-tuning is enabled.
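As an illustration, the command below sets all of these flags explicitly. The values are arbitrary examples chosen to mirror the Python example later in this section, not recommended defaults:

python scripts/deploy/nlp/query.py \
    --url "http://localhost:8000" \
    --model_name nemotron \
    --prompt "What is the capital of the United States?" \
    --max_output_len 64 \
    --top_k 1 \
    --top_p 0.0 \
    --temperature 1.0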
Send a Query using the NeMo APIs
For convenience, the NeMo Framework provides the NemoQueryLLM API for sending queries to the Triton server. This API is only accessible from the NeMo Framework container.
To run the request example using the NeMo APIs, execute the following code:
from nemo.deploy.nlp import NemoQueryLLM

nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
output = nq.query_llm(
    prompts=["What is the capital of the United States?"],
    max_output_len=10,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print(output)
Change the url and the model_name based on your server and the model name of your service. Please check the NemoQueryLLM docstrings for details. If there is a prompt embedding table, run the following command to send a query:
output = nq.query_llm(prompts=["What is the capital of the United States?"], max_output_len=10, top_k=1, top_p=0.0, temperature=1.0, task_id="0")