vLLM Embedder

Generate text embeddings using vLLM’s optimized inference engine. The VLLMEmbeddingModelStage provides high-throughput embedding generation, particularly for large embedding models where vLLM’s batching and GPU memory management provide significant performance advantages over Sentence Transformers.

Installation: The vLLM embedder is included in the text_cuda12 extra. Install it with:

$ uv pip install "nemo_curator[text_cuda12]"

vLLM is only available on x86_64 Linux systems.

How It Works

VLLMEmbeddingModelStage is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike EmbeddingCreatorStage (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM’s inference engine.

Key features:

  • Optional pretokenization: When pretokenize=True, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
  • vLLM-managed batching: Leverages vLLM’s built-in request scheduling for optimal GPU utilization
  • Model download caching: Automatically downloads and caches models from Hugging Face Hub
  • Character truncation: Optional max_chars parameter to limit input length before tokenization (see the example below)
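
For example, character truncation and verbose logging can be set on a single stage. This is a minimal sketch using only parameters documented in the Configuration table below; the max_chars value is an arbitrary illustration:

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    max_chars=20_000,  # illustrative limit; truncates documents before tokenization
    verbose=True,      # enable verbose logging and progress bars
)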

Quick Start

from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="vllm_embeddings",
    stages=[
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)

Configuration

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_identifier | str | Required | Hugging Face model name or path for the embedding model |
| vllm_init_kwargs | dict | None | Additional keyword arguments passed to vllm.LLM() for engine configuration |
| text_field | str | "text" | Name of the input text column in the data |
| pretokenize | bool | False | Tokenize text on CPU before passing to vLLM; whether this improves throughput is model-dependent |
| embedding_field | str | "embeddings" | Name of the output embedding column |
| max_chars | int | None | Maximum characters per document (truncates before tokenization) |
| cache_dir | str | None | Directory for caching downloaded model files |
| hf_token | str | None | Hugging Face token for accessing gated models |
| verbose | bool | False | Enable verbose logging and progress bars |
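
For gated models such as google/embeddinggemma-300m, pass a Hugging Face token with hf_token. A minimal sketch follows; the environment variable name and cache path are illustrative:

import os

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    hf_token=os.environ["HF_TOKEN"],  # illustrative: read the token from the environment
    cache_dir="/models/hf_cache",     # illustrative cache location
)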

vLLM Engine Options

Pass additional vLLM configuration through vllm_init_kwargs:

VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=True,
    vllm_init_kwargs={
        "enforce_eager": True,      # Disable CUDA graphs for debugging
        "tensor_parallel_size": 2,  # Distribute across 2 GPUs
        "gpu_memory_utilization": 0.9,
        "max_model_len": 512,
    },
)

Default vLLM settings applied by the stage (can be overridden):

  • enforce_eager=False — Uses CUDA graphs for faster inference
  • runner="pooling" — Configures vLLM for embedding (pooling) tasks
  • model_impl="vllm" — Uses vLLM’s native model implementation
  • disable_log_stats=True — Suppresses stats logging when verbose=False

Pretokenization

When pretokenize=True, the stage performs three steps (illustrated in the sketch below):

  1. Loads a Hugging Face AutoTokenizer for the specified model
  2. Tokenizes the input text batch on CPU with truncation to max_model_len
  3. Passes token IDs directly to vLLM using TokensPrompt
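
The following standalone sketch illustrates these three steps. It is an assumed reconstruction for illustration, not the stage's actual source; the model name and max_length value are placeholders:

from transformers import AutoTokenizer
from vllm import LLM
from vllm.inputs import TokensPrompt

model_id = "intfloat/e5-large-v2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)  # step 1: load the tokenizer
llm = LLM(model=model_id, runner="pooling")          # embedding (pooling) engine

texts = ["First document.", "Second document."]
# step 2: tokenize on CPU, truncating to the model's maximum length
batch = tokenizer(texts, truncation=True, max_length=512)
# step 3: pass token IDs directly to vLLM, bypassing its internal tokenizer
prompts = [TokensPrompt(prompt_token_ids=ids) for ids in batch["input_ids"]]
outputs = llm.embed(prompts)
embeddings = [o.outputs.embedding for o in outputs]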

Whether to use pretokenization depends on the model. For google/embeddinggemma-300m (the default for semantic deduplication), pretokenize=False is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.

# Direct text mode (recommended for google/embeddinggemma-300m)
VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",
    pretokenize=False,  # vLLM handles tokenization internally
)

# Pretokenize mode (can improve throughput for other models)
VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",
    pretokenize=True,  # Tokenize on CPU, embed on GPU
)

Resources

The VLLMEmbeddingModelStage requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure tensor_parallel_size in vllm_init_kwargs.
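
For example, a larger embedding model can be sharded across two GPUs through tensor_parallel_size. This is a minimal sketch; the 7B model is an illustrative choice, not a tested recommendation:

from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-mistral-7b-instruct",  # illustrative large model
    vllm_init_kwargs={
        "tensor_parallel_size": 2,  # shard model weights across 2 GPUs
        "gpu_memory_utilization": 0.9,
    },
)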