Prompt Embeddings#

NVIDIA NIM for Large Language Models supports prompt embeddings, also known as prompt embeds, as a secure alternative to traditional text prompts. Applications can use precomputed embeddings for inference to support more flexible prompt engineering and improve privacy and data security. With prompt embeddings, applications transform sensitive user data into embeddings before sending requests to the inference server, which reduces the risk of exposing confidential information during the AI workflow.

Prompt embeddings support the following use cases:

  • Privacy-Preserving AI: Convert sensitive prompts to embeddings before sending them to the server.

  • Custom Embedding Models: Use specialized, domain-specific embedding models.

  • Embedding Caching: Precompute and cache frequently used embeddings.

  • Advanced Prompt Engineering: Implement sophisticated preprocessing pipelines.

  • Multistage Pipelines: Integrate with proxy services that operate on embeddings.

For background information about prompt embeddings, refer to the vLLM Prompt Embeds documentation.

Architecture#

The following diagram shows how prompt embeddings flow through the system:

sequenceDiagram
    participant Client
    participant Transform Proxy
    box NIM Inference Server
        participant NIM
        participant vLLM
    end
    Client->>Transform Proxy: POST /v1/completions<br/>(text prompt or prompt-embeds)
    Transform Proxy->>Transform Proxy: Validate request &<br/>transform text → embeddings
    Transform Proxy->>NIM: Forward request<br/>with prompt-embeds
    NIM->>vLLM: Dispatch inference<br/>with prompt embeddings
    vLLM->>NIM: Generated tokens
    NIM->>Transform Proxy: Generated response
    Transform Proxy->>Client: OpenAI API-formatted response

The system uses three main components:

  1. Client: Sends OpenAI API-compliant requests to the /v1/completions endpoint.

  2. Transform Proxy: Acts as an intermediary that:

    • Validates incoming requests

    • Transforms text prompts into embeddings

    • Forwards the transformed embeddings to the upstream inference server

    • Processes and streams responses back to the client in the standard OpenAI API format

  3. NIM Inference Server: Receives inference requests with prompt embeddings and returns generated responses to the Transform Proxy.
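
The proxy's validate-and-transform step can be sketched as a plain function. This is an illustrative sketch, not NIM code; `encode_fn` is a hypothetical hook for the actual text-to-embeddings computation:

```python
def transform_request(request: dict, encode_fn) -> dict:
    """Validate an incoming /v1/completions request and replace any
    text prompt with embeddings before forwarding it upstream.

    encode_fn is a caller-supplied function that turns prompt text into
    a base64-encoded serialized tensor.
    """
    has_text = bool(request.get("prompt"))
    has_embeds = "prompt-embeds" in request
    if has_text and has_embeds:
        raise ValueError("prompt and prompt-embeds cannot be used together")
    if not has_text and not has_embeds:
        raise ValueError("either prompt or prompt-embeds is required")

    forwarded = dict(request)
    if has_text:
        forwarded["prompt-embeds"] = encode_fn(forwarded["prompt"])
    # The OpenAI client requires a string prompt, so send an empty string
    # alongside the embeddings.
    forwarded["prompt"] = ""
    return forwarded
```

The forwarded request then matches the API request format described in the following sections.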

Prerequisites#

Before you begin, make sure you have the following:

  • A compatible NIM container and backend.

  • Access to the model’s embedding layer or a compatible embedding model.

  • Python packages: torch, transformers, openai.

Warning

Prompt embeddings are currently supported only with the vLLM backend (V1 engine). The following backends are not supported: TRT-LLM and SGLang.

Run NIM#

Enable prompt embeddings by launching NIM with the --enable-prompt-embeds flag:

docker run --gpus=all \
  -e HF_TOKEN \
  -e NIM_MODEL_PATH=hf://meta-llama/Meta-Llama-3.1-8B-Instruct \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nim-llm:local \
  --enable-prompt-embeds

API Request Format#

Prompt embeddings are supported through the Completions API (/v1/completions).

{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "prompt": "",
  "prompt-embeds": "my-base64-encoded-tensor-string",
  "max_tokens": 100,
  "temperature": 0.1,
  "stream": false
}

Request Parameters#

The following table describes the request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier for inference |
| prompt | string | No* | Text prompt (use empty string with prompt-embeds) |
| prompt-embeds | string | No* | Base64-encoded PyTorch tensor |
| max_tokens | integer | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature (0.0 - 2.0) |
| stream | boolean | No | Enable response streaming |

Note

Either prompt or prompt-embeds must be provided. They cannot be used together. The OpenAI client does not accept None for prompt. Use an empty string "" when sending prompt-embeds.

Response Format#

The response follows the standard OpenAI completions schema, as shown in the following example:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "text": "Generated text...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 42,
    "total_tokens": 42
  }
}

Note

prompt_tokens might be 0 when using embeddings because the token count is not computed from embeddings.

cURL Example#

The following curl command sends a Completions API request with prompt embeddings. The ENCODED_EMBEDDINGS variable must hold a base64-encoded, serialized PyTorch tensor; the Complete Example later in this topic shows how to produce one:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "",
    "prompt-embeds": "'"${ENCODED_EMBEDDINGS}"'",
    "max_tokens": 100,
    "temperature": 0.1
  }'
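
A helper like the following can produce the value exported as ENCODED_EMBEDDINGS. This is a sketch that assumes PyTorch is installed; the random tensor is a stand-in for real prompt embeddings, and the decode function is included only to verify the round trip:

```python
import base64
import io

import torch

def encode_prompt_embeds(t: torch.Tensor) -> str:
    """Serialize a tensor with torch.save and base64-encode the bytes,
    producing the string format used in the prompt-embeds field."""
    buf = io.BytesIO()
    torch.save(t, buf)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def decode_prompt_embeds(s: str) -> torch.Tensor:
    """Inverse transform, used here only to check the round trip."""
    return torch.load(io.BytesIO(base64.b64decode(s)))

# Stand-in for real embeddings of shape (sequence_length, hidden_size).
embeds = torch.randn(8, 16)
encoded = encode_prompt_embeds(embeds)
```

In a shell workflow, you could write the encoded string to a file or environment variable and splice it into the curl command shown above.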

Python Examples#

The following Python examples show common ways to send prompt embeddings to NIM:

Privacy-Preserving Pattern#

Use this pattern to send precomputed embeddings without including the original prompt text:

completion = client.completions.create(
    model=model_name,
    max_tokens=100,
    temperature=0.1,
    prompt="",
    extra_body={"prompt-embeds": transformed_embeds},
)

Complete Example#

The following example demonstrates the full workflow: tokenizing a chat prompt, computing embeddings from the model’s embedding layer, serializing the tensor, and sending the request to NIM.

import base64
import io

import torch
import transformers
from openai import OpenAI


def main():
    client = OpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1",
    )

    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # Load tokenizer and model to extract embeddings
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize a chat prompt
    chat = [{"role": "user", "content": "Tell me about France's capital."}]
    token_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    )

    # Compute embeddings from the model's embedding layer
    embedding_layer = transformers_model.get_input_embeddings()
    prompt_embeds = embedding_layer(token_ids).squeeze(0)

    # Serialize and base64-encode the tensor
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    encoded_embeds = base64.b64encode(binary_data).decode("utf-8")

    # Send the request to NIM
    completion = client.completions.create(
        model=model_name,
        prompt="",
        max_tokens=10,
        temperature=0.0,
        extra_body={"prompt-embeds": encoded_embeds},
    )

    print("-" * 30)
    print(completion.choices[0].text)
    print("-" * 30)


if __name__ == "__main__":
    main()