vLLM Embedder

Generate text embeddings using vLLM’s optimized inference engine. The VLLMEmbeddingModelStage provides high-throughput embedding generation, particularly for large embedding models where vLLM’s batching and GPU memory management provide significant performance advantages over Sentence Transformers.

Installation: The vLLM embedder is included in the text_cuda12 extra. Install it with:

$ uv pip install "nemo_curator[text_cuda12]"

vLLM is only available on x86_64 Linux systems.

How It Works

VLLMEmbeddingModelStage is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike EmbeddingCreatorStage (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM’s inference engine.

Key features:

  • Optional pretokenization: When pretokenize=True, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
  • vLLM-managed batching: Leverages vLLM’s built-in request scheduling for optimal GPU utilization
  • Model download caching: Automatically downloads and caches models from Hugging Face Hub
  • Character truncation: Optional max_chars parameter to limit input length before tokenization (see the example below)
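
For example, character truncation and verbose logging can be set on a single stage. This is a minimal sketch using only parameters documented in the Configuration table below; the max_chars value is an arbitrary illustration:

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    max_chars=20_000,  # illustrative limit; truncates documents before tokenization
    verbose=True,      # enable verbose logging and progress bars
)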

Quick Start

from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)

Configuration

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_identifier | str | Required | Hugging Face model name or path for the embedding model |
| vllm_init_kwargs | dict | None | Additional keyword arguments passed to vllm.LLM() for engine configuration |
| text_field | str | "text" | Name of the input text column in the data |
| pretokenize | bool | False | Tokenize text on CPU before passing to vLLM; whether this improves throughput is model-dependent |
| embedding_field | str | "embeddings" | Name of the output embedding column |
| max_chars | int | None | Maximum characters per document (truncates before tokenization) |
| cache_dir | str | None | Directory for caching downloaded model files |
| hf_token | str | None | Hugging Face token for accessing gated models |
| verbose | bool | False | Enable verbose logging and progress bars |
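
For gated models such as google/embeddinggemma-300m, pass a Hugging Face token with hf_token. A minimal sketch follows; the environment variable name and cache path are illustrative:

import os

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    hf_token=os.environ["HF_TOKEN"],  # illustrative: read the token from the environment
    cache_dir="/models/hf_cache",     # illustrative cache location
)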

vLLM Engine Options

Pass additional vLLM configuration through vllm_init_kwargs:

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,      # Disable CUDA graphs for debugging
        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)

Default vLLM settings applied by the stage (can be overridden):

  • enforce_eager=False — Uses CUDA graphs for faster inference
  • runner="pooling" — Configures vLLM for embedding (pooling) tasks
  • model_impl="vllm" — Uses vLLM’s native model implementation
  • disable_log_stats=True — Suppresses stats logging when verbose=False

Pretokenization

When pretokenize=True, the stage performs three steps (illustrated in the sketch below):

  1. Loads a Hugging Face AutoTokenizer for the specified model
  2. Tokenizes the input text batch on CPU with truncation to max_model_len
  3. Passes token IDs directly to vLLM using TokensPrompt
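
The following standalone sketch illustrates these three steps. It is an assumed reconstruction for illustration, not the stage's actual source; the model name and max_length value are placeholders:

from transformers import AutoTokenizer
from vllm import LLM
from vllm.inputs import TokensPrompt

model_id = "intfloat/e5-large-v2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)  # step 1: load the tokenizer
llm = LLM(model=model_id, runner="pooling")          # embedding (pooling) engine

texts = ["First document.", "Second document."]
# step 2: tokenize on CPU, truncating to the model's maximum length
batch = tokenizer(texts, truncation=True, max_length=512)
# step 3: pass token IDs directly to vLLM, bypassing its internal tokenizer
prompts = [TokensPrompt(prompt_token_ids=ids) for ids in batch["input_ids"]]
outputs = llm.embed(prompts)
embeddings = [o.outputs.embedding for o in outputs]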

Whether to use pretokenization depends on the model. For google/embeddinggemma-300m (the default for semantic deduplication), pretokenize=False is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)

Resources

The VLLMEmbeddingModelStage requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure tensor_parallel_size in vllm_init_kwargs.
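
For example, a larger embedding model can be sharded across two GPUs through tensor_parallel_size. This is a minimal sketch; the 7B model is an illustrative choice, not a tested recommendation:

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-mistral-7b-instruct",  # illustrative large model
    vllm_init_kwargs={
        "tensor_parallel_size": 2,  # shard model weights across 2 GPUs
        "gpu_memory_utilization": 0.9,
    },
)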