> Generate text embeddings using vLLM for high-throughput GPU-accelerated inference with large embedding models

# vLLM Embedder

Generate text embeddings using vLLM's optimized inference engine. The `VLLMEmbeddingModelStage` provides high-throughput embedding generation, particularly for large embedding models where vLLM's batching and GPU memory management provide significant performance advantages over Sentence Transformers.

<Note>
  **Installation**: The vLLM embedder is included in the `text_cuda12` installation. Install it with:

  ```bash
  uv pip install "nemo_curator[text_cuda12]"
  ```

  vLLM is only available on x86\_64 Linux systems.
</Note>

## How It Works

`VLLMEmbeddingModelStage` is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike `EmbeddingCreatorStage` (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM's inference engine.

Key features:

* **Optional pretokenization**: When `pretokenize=True`, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
* **vLLM-managed batching**: Leverages vLLM's built-in request scheduling for optimal GPU utilization
* **Model download caching**: Automatically downloads and caches models from Hugging Face Hub
* **Character truncation**: Optional `max_chars` parameter to limit input length before tokenization

## Quick Start

```python
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        # Read input Parquet files, keeping only the text column
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        # Generate embeddings on GPU with vLLM
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        # Write the text and its embeddings back to Parquet
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

# Run the pipeline on the Xenna executor backend
executor = XennaExecutor()
pipeline.run(executor)
```

## Configuration

### Parameters

| Parameter          | Type   | Default        | Description                                                                                      |
| ------------------ | ------ | -------------- | ------------------------------------------------------------------------------------------------ |
| `model_identifier` | `str`  | Required       | Hugging Face model name or path for the embedding model                                          |
| `vllm_init_kwargs` | `dict` | `None`         | Additional keyword arguments passed to `vllm.LLM()` for engine configuration                     |
| `text_field`       | `str`  | `"text"`       | Name of the input text column in the data                                                        |
| `pretokenize`      | `bool` | `False`        | Tokenize text on CPU before passing to vLLM. Whether this improves throughput is model-dependent |
| `embedding_field`  | `str`  | `"embeddings"` | Name of the output embedding column                                                              |
| `max_chars`        | `int`  | `None`         | Maximum characters per document (truncates before tokenization)                                  |
| `cache_dir`        | `str`  | `None`         | Directory for caching downloaded model files                                                     |
| `hf_token`         | `str`  | `None`         | Hugging Face token for accessing gated models                                                    |
| `verbose`          | `bool` | `False`        | Enable verbose logging and progress bars                                                         |
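
A configuration that combines several of these parameters might look like the following sketch. The cache directory and token values are placeholders, not values from the original documentation:

```python
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    text_field="text",
    embedding_field="embeddings",
    max_chars=20_000,              # truncate very long documents before tokenization
    cache_dir="/models/hf_cache",  # reuse downloaded model files across runs (placeholder path)
    hf_token="hf_...",             # only needed for gated models (placeholder token)
    verbose=True,                  # enable verbose logging and progress bars
)
```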

### vLLM Engine Options

Pass additional vLLM configuration through `vllm_init_kwargs`:

```python
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,       # Disable CUDA graph for debugging
        "tensor_parallel_size": 2,   # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)
```

Default vLLM settings applied by the stage (can be overridden):

* `enforce_eager=False` — Uses CUDA graphs for faster inference
* `runner="pooling"` — Configures vLLM for embedding (pooling) tasks
* `model_impl="vllm"` — Uses vLLM's native model implementation
* `disable_log_stats=True` — Suppresses stats logging when `verbose=False`
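
As a rough illustration of how overriding works, user-supplied `vllm_init_kwargs` take precedence over these defaults. The merge shown here is an assumption about the effective behavior, not the stage's exact internal code:

```python
# Illustration only: assumed precedence of user kwargs over stage defaults
stage_defaults = {
    "enforce_eager": False,
    "runner": "pooling",
    "model_impl": "vllm",
    "disable_log_stats": True,
}
user_kwargs = {"enforce_eager": True, "max_model_len": 512}
engine_kwargs = {**stage_defaults, **user_kwargs}  # user values win
# The stage then constructs the engine roughly as: vllm.LLM(model=model_identifier, **engine_kwargs)
```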

### Pretokenization

When `pretokenize=True`, the stage:

1. Loads a Hugging Face `AutoTokenizer` for the specified model
2. Tokenizes the input text batch on CPU with truncation to `max_model_len`
3. Passes token IDs directly to vLLM using `TokensPrompt`
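
The following standalone sketch illustrates these three steps outside the stage. It is not the stage's internal implementation and assumes vLLM's `LLM.embed()` API for pooling models; the model identifier and text are examples only:

```python
# Sketch of the pretokenize=True path, assuming vLLM's LLM.embed() API
from transformers import AutoTokenizer
from vllm import LLM, TokensPrompt

model_id = "intfloat/e5-large-v2"
max_model_len = 512

# Step 1: load the Hugging Face tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, runner="pooling", max_model_len=max_model_len)

texts = ["NeMo Curator builds GPU-accelerated data curation pipelines."]

# Step 2: tokenize on CPU with truncation to max_model_len
token_ids = tokenizer(texts, truncation=True, max_length=max_model_len)["input_ids"]

# Step 3: pass token IDs directly to vLLM as TokensPrompt objects
prompts = [TokensPrompt(prompt_token_ids=ids) for ids in token_ids]
embeddings = [out.outputs.embedding for out in llm.embed(prompts)]
```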

Whether to use pretokenization depends on the model. For `google/embeddinggemma-300m` (the default for semantic deduplication), `pretokenize=False` is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

```python
# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)
```

## Resources

The `VLLMEmbeddingModelStage` requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure `tensor_parallel_size` in `vllm_init_kwargs`.
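
For example, a tensor-parallel configuration might look like the sketch below; the 7B model identifier is illustrative, chosen only to represent a model too large for a single GPU:

```python
# Sketch: shard a larger embedding model across 2 GPUs via vLLM tensor parallelism
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-mistral-7b-instruct",  # illustrative large model
    vllm_init_kwargs={
        "tensor_parallel_size": 2,       # vLLM splits the model across 2 GPUs
        "gpu_memory_utilization": 0.9,
    },
)
```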