Prompt Embeddings#

NVIDIA NIM for Large Language Models supports prompt embeddings, also known as prompt embeds, as a secure alternative to traditional text prompts. Applications can use precomputed embeddings for inference to support more flexible prompt engineering and improve privacy and data security. With prompt embeddings, applications transform sensitive user data into embeddings before sending requests to the inference server, which reduces the risk of exposing confidential information during the AI workflow.

Prompt embeddings support the following use cases:

  • Privacy-Preserving AI: Convert sensitive prompts to embeddings before sending them to the server.

  • Custom Embedding Models: Use specialized, domain-specific embedding models.

  • Embedding Caching: Precompute and cache frequently used embeddings.

  • Advanced Prompt Engineering: Implement sophisticated preprocessing pipelines.

  • Multistage Pipelines: Integrate with proxy services that operate on embeddings.

For background information about prompt embeddings, refer to the vLLM Prompt Embeds documentation.

Architecture#

The following diagram shows how prompt embeddings flow through the system:

sequenceDiagram
    participant Client
    participant Transform Proxy
    box NIM Inference Server
        participant NIM
        participant vLLM
    end
    Client->>Transform Proxy: POST /v1/completions<br/>(text prompt or prompt-embeds)
    Transform Proxy->>Transform Proxy: Validate request &<br/>transform text → embeddings
    Transform Proxy->>NIM: Forward request<br/>with prompt-embeds
    NIM->>vLLM: Dispatch inference<br/>with prompt embeddings
    vLLM->>NIM: Generated tokens
    NIM->>Transform Proxy: Generated response
    Transform Proxy->>Client: OpenAI API-formatted response

The system uses three main components:

  1. Client: Sends OpenAI API-compliant requests to the /v1/completions endpoint.

  2. Transform Proxy: Acts as an intermediary that:

    • Validates incoming requests

    • Transforms text prompts into embeddings

    • Forwards the transformed embeddings to the upstream inference server

    • Processes and streams responses back to the client in the standard OpenAI API format

  3. NIM Inference Server: Receives inference requests with prompt embeddings and returns generated responses to the Transform Proxy.
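
The proxy's validate-and-transform step can be sketched as a plain function. This is an illustrative sketch, not NIM code; `encode_fn` is a hypothetical hook for the actual text-to-embeddings computation:

```python
def transform_request(request: dict, encode_fn) -> dict:
    """Validate an incoming /v1/completions request and replace any
    text prompt with embeddings before forwarding it upstream.

    encode_fn is a caller-supplied function that turns prompt text into
    a base64-encoded serialized tensor.
    """
    has_text = bool(request.get("prompt"))
    has_embeds = "prompt-embeds" in request
    if has_text and has_embeds:
        raise ValueError("prompt and prompt-embeds cannot be used together")
    if not has_text and not has_embeds:
        raise ValueError("either prompt or prompt-embeds is required")

    forwarded = dict(request)
    if has_text:
        forwarded["prompt-embeds"] = encode_fn(forwarded["prompt"])
    # The OpenAI client requires a string prompt, so send an empty string
    # alongside the embeddings.
    forwarded["prompt"] = ""
    return forwarded
```

The forwarded request then matches the API request format described in the following sections.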

Prerequisites#

Before you begin, make sure you have the following:

  • A compatible NIM container and backend.

  • Access to the model’s embedding layer or a compatible embedding model.

  • Python packages: torch, transformers, openai.

Warning

Prompt embeddings are currently supported only with the vLLM backend (V1 engine). The following backends are not supported: TRT-LLM and SGLang.

Run NIM#

Enable prompt embeddings by launching NIM with the --enable-prompt-embeds flag:

docker run --gpus=all \
  -e HF_TOKEN \
  -e NIM_MODEL_PATH=hf://meta-llama/Meta-Llama-3.1-8B-Instruct \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nim-llm:local \
  --enable-prompt-embeds

API Request Format#

Prompt embeddings are supported through the Completions API (/v1/completions).

{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "prompt": "",
  "prompt-embeds": "my-base64-encoded-tensor-string",
  "max_tokens": 100,
  "temperature": 0.1,
  "stream": false
}

Request Parameters#

The following table describes the request parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model identifier for inference |
| prompt | string | No* | Text prompt (use empty string with prompt-embeds) |
| prompt-embeds | string | No* | Base64-encoded PyTorch tensor |
| max_tokens | integer | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature (0.0 - 2.0) |
| stream | boolean | No | Enable response streaming |

Note

Either prompt or prompt-embeds must be provided. They cannot be used together. The OpenAI client does not accept None for prompt. Use an empty string "" when sending prompt-embeds.

Response Format#

The response follows the standard OpenAI completions schema, as shown in the following example:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "text": "Generated text...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 42,
    "total_tokens": 42
  }
}

Note

prompt_tokens might be 0 when using embeddings because the token count is not computed from embeddings.

cURL Example#

The following curl command sends a Completions API request with prompt embeddings. The ENCODED_EMBEDDINGS variable must hold a base64-encoded, serialized PyTorch tensor; the Complete Example later in this topic shows how to produce one:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "",
    "prompt-embeds": "'"${ENCODED_EMBEDDINGS}"'",
    "max_tokens": 100,
    "temperature": 0.1
  }'
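
A helper like the following can produce the value exported as ENCODED_EMBEDDINGS. This is a sketch that assumes PyTorch is installed; the random tensor is a stand-in for real prompt embeddings, and the decode function is included only to verify the round trip:

```python
import base64
import io

import torch

def encode_prompt_embeds(t: torch.Tensor) -> str:
    """Serialize a tensor with torch.save and base64-encode the bytes,
    producing the string format used in the prompt-embeds field."""
    buf = io.BytesIO()
    torch.save(t, buf)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def decode_prompt_embeds(s: str) -> torch.Tensor:
    """Inverse transform, used here only to check the round trip."""
    return torch.load(io.BytesIO(base64.b64decode(s)))

# Stand-in for real embeddings of shape (sequence_length, hidden_size).
embeds = torch.randn(8, 16)
encoded = encode_prompt_embeds(embeds)
```

In a shell workflow, you could write the encoded string to a file or environment variable and splice it into the curl command shown above.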

Python Examples#

The following Python examples show common ways to send prompt embeddings to NIM:

Privacy-Preserving Pattern#

Use this pattern to send precomputed embeddings without including the original prompt text:

completion = client.completions.create(
    model=model_name,
    max_tokens=100,
    temperature=0.1,
    prompt="",
    extra_body={"prompt-embeds": transformed_embeds},
)

Complete Example#

The following example demonstrates the full workflow: tokenizing a chat prompt, computing embeddings from the model’s embedding layer, serializing the tensor, and sending the request to NIM.

import base64
import io

import torch
import transformers
from openai import OpenAI


def main():
    client = OpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1",
    )

    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # Load tokenizer and model to extract embeddings
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

    # Tokenize a chat prompt
    chat = [{"role": "user", "content": "Tell me about France's capital."}]
    token_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    )

    # Compute embeddings from the model's embedding layer
    embedding_layer = transformers_model.get_input_embeddings()
    prompt_embeds = embedding_layer(token_ids).squeeze(0)

    # Serialize and base64-encode the tensor
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    encoded_embeds = base64.b64encode(binary_data).decode("utf-8")

    # Send the request to NIM
    completion = client.completions.create(
        model=model_name,
        prompt="",
        max_tokens=10,
        temperature=0.0,
        extra_body={"prompt-embeds": encoded_embeds},
    )

    print("-" * 30)
    print(completion.choices[0].text)
    print("-" * 30)


if __name__ == "__main__":
    main()