Prompt Embeddings#
NVIDIA NIM for Large Language Models supports prompt embeddings, also known as prompt embeds, as a secure alternative to traditional text prompts. Applications can use precomputed embeddings for inference to support more flexible prompt engineering and improve privacy and data security. With prompt embeddings, applications transform sensitive user data into embeddings before sending requests to the inference server, which reduces the risk of exposing confidential information during the AI workflow.
Prompt embeddings support the following use cases:
Privacy-Preserving AI: Convert sensitive prompts to embeddings before sending them to the server.
Custom Embedding Models: Use specialized, domain-specific embedding models.
Embedding Caching: Precompute and cache frequently used embeddings.
Advanced Prompt Engineering: Implement sophisticated preprocessing pipelines.
Multistage Pipelines: Integrate with proxy services that operate on embeddings.
For background information about prompt embeddings, refer to the vLLM Prompt Embeds documentation.
Architecture#
The following diagram shows how prompt embeddings flow through the system:
The system uses three main components:
Client: Sends OpenAI API-compliant requests by using /v1/completions.
Transform Proxy: Acts as an intermediary that:
Validates incoming requests
Transforms text prompts into embeddings
Forwards the transformed embeddings to the upstream inference server
Processes and streams responses back to the client in the standard OpenAI API format
NIM Inference Server: Receives inference requests with prompt embeddings and returns generated responses to the Transform Proxy.
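The Transform Proxy's core step can be sketched as a request transformation: take an incoming request, replace the text prompt with encoded embeddings, and forward the result upstream. The `embed_and_encode` helper below is a hypothetical placeholder; a real proxy would run the text through an embedding model and serialize the resulting tensor.

```python
import base64

def embed_and_encode(text: str) -> str:
    # Hypothetical placeholder: a real proxy would run the text through an
    # embedding model and torch.save the resulting tensor before encoding.
    return base64.b64encode(text.encode("utf-8")).decode("utf-8")

def transform_request(request: dict) -> dict:
    """Replace the text prompt with embeddings before forwarding upstream."""
    transformed = dict(request)
    transformed["prompt-embeds"] = embed_and_encode(request["prompt"])
    transformed["prompt"] = ""  # the OpenAI client rejects None for prompt
    return transformed

upstream = transform_request({
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "sensitive text",
    "max_tokens": 100,
})
assert upstream["prompt"] == "" and "prompt-embeds" in upstream
```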
Prerequisites#
Before you begin, make sure you have the following:
A compatible NIM container and backend.
Access to the model’s embedding layer or a compatible embedding model.
Python packages: torch, transformers, and openai.
Warning
Prompt embeddings are currently supported only with the vLLM backend (V1 engine). The following backends are not supported: TRT-LLM and SGLang.
Run NIM#
Enable prompt embeddings by launching NIM with the --enable-prompt-embeds flag:
docker run --gpus=all \
-e HF_TOKEN \
-e NIM_MODEL_PATH=hf://meta-llama/Meta-Llama-3.1-8B-Instruct \
-v ~/.cache/nim:/opt/nim/.cache \
-p 8000:8000 \
nim-llm:local \
--enable-prompt-embeds
API Request Format#
Prompt embeddings are supported through the Completions API (/v1/completions).
{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "",
"prompt-embeds": "my-base64-encoded-tensor-string",
"max_tokens": 100,
"temperature": 0.1,
"stream": false
}
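The prompt-embeds value is produced by serializing a tensor to an in-memory buffer and base64-encoding the resulting bytes. The round trip looks like the following sketch, which uses Python's pickle as a stand-in for torch.save/torch.load so it runs without PyTorch installed; in a real client, substitute those torch calls.

```python
import base64
import io
import pickle

# Stand-in tensor: a real request would serialize a torch.Tensor of shape
# [sequence_length, hidden_size] with torch.save instead of pickle.dump.
prompt_embeds = [[0.1, 0.2], [0.3, 0.4]]

# Encode: tensor -> bytes -> base64 string (the "prompt-embeds" value)
buffer = io.BytesIO()
pickle.dump(prompt_embeds, buffer)
encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")

# Decode (the server-side direction): base64 string -> bytes -> tensor
decoded = pickle.loads(base64.b64decode(encoded))
assert decoded == prompt_embeds
```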
Request Parameters#
The following table describes the request parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier for inference |
| prompt | string | No* | Text prompt (use an empty string with prompt-embeds) |
| prompt-embeds | string | No* | Base64-encoded PyTorch tensor |
| max_tokens | integer | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature (0.0 to 2.0) |
| stream | boolean | No | Enable response streaming |
Note
Either prompt or prompt-embeds must be provided; they cannot be used together. The OpenAI client does not accept None for prompt, so use an empty string ("") when sending prompt-embeds.
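The mutual-exclusion rule above can be sketched as a small client-side check. The helper below is hypothetical (not part of the NIM API); it treats an empty-string prompt as absent, matching the convention of sending prompt="" alongside prompt-embeds.

```python
from typing import Optional

def validate_prompt_fields(prompt: Optional[str], prompt_embeds: Optional[str]) -> None:
    """Raise ValueError unless exactly one prompt source is supplied."""
    has_text = bool(prompt)  # empty string counts as no text prompt
    has_embeds = prompt_embeds is not None
    if has_text and has_embeds:
        raise ValueError("prompt and prompt-embeds cannot be used together")
    if not has_text and not has_embeds:
        raise ValueError("either prompt or prompt-embeds must be provided")

validate_prompt_fields("", "bXktdGVuc29y")  # OK: embeddings only
validate_prompt_fields("Hello", None)       # OK: text prompt only
```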
Response Format#
The response follows the standard OpenAI completions schema, as shown in the following example:
{
"id": "cmpl-...",
"object": "text_completion",
"created": 1234567890,
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"choices": [
{
"text": "Generated text...",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 42,
"total_tokens": 42
}
}
Note
prompt_tokens might be 0 when using embeddings because the token count is not computed from
embeddings.
cURL Example#
The following curl command sends a Completions API request with prompt embeddings:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "",
"prompt-embeds": "'"${ENCODED_EMBEDDINGS}"'",
"max_tokens": 100,
"temperature": 0.1
}'
Python Examples#
The following Python examples show common ways to send prompt embeddings to NIM:
Privacy-Preserving Pattern#
Use this pattern to send precomputed embeddings without including the original prompt text:
completion = client.completions.create(
    model=model_name,
    max_tokens=100,
    temperature=0.1,
    prompt="",
    extra_body={"prompt-embeds": transformed_embeds},
)
Complete Example#
The following example demonstrates the full workflow: tokenizing a chat prompt, computing embeddings from the model’s embedding layer, serializing the tensor, and sending the request to NIM.
import base64
import io

import torch
import transformers
from openai import OpenAI


def main():
    client = OpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1",
    )
    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # Load the tokenizer and model to extract embeddings
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize a chat prompt
    chat = [{"role": "user", "content": "Tell me about France's capital."}]
    token_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    )

    # Compute embeddings from the model's embedding layer
    embedding_layer = transformers_model.get_input_embeddings()
    prompt_embeds = embedding_layer(token_ids).squeeze(0)

    # Serialize and base64-encode the tensor
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    encoded_embeds = base64.b64encode(binary_data).decode("utf-8")

    # Send the request to NIM
    completion = client.completions.create(
        model=model_name,
        prompt="",
        max_tokens=10,
        temperature=0.0,
        extra_body={"prompt-embeds": encoded_embeds},
    )

    print("-" * 30)
    print(completion.choices[0].text)
    print("-" * 30)


if __name__ == "__main__":
    main()