vLLM Embedder
Generate text embeddings using vLLM’s optimized inference engine. The VLLMEmbeddingModelStage provides high-throughput embedding generation, particularly for large embedding models, where vLLM’s batching and GPU memory management offer significant performance advantages over Sentence Transformers.
Installation: The vLLM embedder is included in the text_cuda12 installation extra.
vLLM is only available on x86_64 Linux systems.
How It Works
VLLMEmbeddingModelStage is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike EmbeddingCreatorStage (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM’s inference engine.
Key features:
- Optional pretokenization: When pretokenize=True, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- vLLM-managed batching: Leverages vLLM’s built-in request scheduling for optimal GPU utilization
- Model download caching: Automatically downloads and caches models from Hugging Face Hub
- Character truncation: Optional max_chars parameter to limit input length before tokenization
Quick Start
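The sketch below shows one way to wire the stage into a pipeline. The import paths, the reader stage, the Pipeline methods, and the model_identifier parameter name are assumptions for illustration and are not confirmed by this section; only pretokenize and max_chars are parameters named here.

```python
# Minimal sketch: import paths, JsonlReader, and model_identifier are assumptions.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.embedders import VLLMEmbeddingModelStage

pipeline = Pipeline(name="vllm_embedding_pipeline")
pipeline.add_stage(JsonlReader(file_paths="input_data/"))  # hypothetical reader stage
pipeline.add_stage(
    VLLMEmbeddingModelStage(
        model_identifier="google/embeddinggemma-300m",  # default model for semantic dedup
        pretokenize=False,  # recommended for embeddinggemma-300m (see Pretokenization)
        max_chars=20000,    # optional character truncation before tokenization
    )
)
pipeline.run()
```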
Configuration
Parameters
vLLM Engine Options
Pass additional vLLM configuration through vllm_init_kwargs:
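For example (a sketch: the vllm_init_kwargs keys are standard vLLM engine arguments, while the stage parameter names outside the dictionary are assumptions):

```python
stage = VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",  # assumed parameter name
    vllm_init_kwargs={
        "max_model_len": 2048,          # cap input length to reduce memory use
        "gpu_memory_utilization": 0.9,  # fraction of GPU memory vLLM may claim
        "dtype": "bfloat16",            # weight/activation precision
    },
)
```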
Default vLLM settings applied by the stage (can be overridden):
- enforce_eager=False: Uses CUDA graphs for faster inference
- runner="pooling": Configures vLLM for embedding (pooling) tasks
- model_impl="vllm": Uses vLLM’s native model implementation
- disable_log_stats=True: Suppresses stats logging when verbose=False
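To override one of these defaults, pass the same key through vllm_init_kwargs; for example, disabling CUDA graphs while debugging (a sketch, with the same assumed parameter names as above):

```python
stage = VLLMEmbeddingModelStage(
    model_identifier="google/embeddinggemma-300m",  # assumed parameter name
    verbose=True,                                   # keep vLLM stats logging enabled
    vllm_init_kwargs={"enforce_eager": True},       # skip CUDA graph capture
)
```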
Pretokenization
When pretokenize=True, the stage:
- Loads a Hugging Face AutoTokenizer for the specified model
- Tokenizes the input text batch on CPU with truncation to max_model_len
- Passes token IDs directly to vLLM using TokensPrompt
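Conceptually, this path looks like the following standalone sketch built directly on the public transformers and vllm APIs. It illustrates the steps above rather than the stage’s actual internals, and the runner="pooling" argument assumes a recent vLLM release (older versions used task="embed").

```python
from transformers import AutoTokenizer
from vllm import LLM
from vllm.inputs import TokensPrompt

model_name = "google/embeddinggemma-300m"
max_model_len = 2048
texts = ["first document", "second document"]

# 1. Load the Hugging Face tokenizer for the embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Tokenize on CPU, truncating to the model's maximum length
token_ids = tokenizer(texts, truncation=True, max_length=max_model_len)["input_ids"]

# 3. Hand token IDs straight to vLLM as TokensPrompt inputs and pool embeddings
llm = LLM(model=model_name, runner="pooling", max_model_len=max_model_len)
outputs = llm.embed([TokensPrompt(prompt_token_ids=ids) for ids in token_ids])
embeddings = [out.outputs.embedding for out in outputs]
```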
Whether to use pretokenization depends on the model. For google/embeddinggemma-300m (the default for semantic deduplication), pretokenize=False is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.
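For a model where pretokenization helps, the flag is simply flipped (same assumed parameter names as in the earlier sketches; the model shown is only an example):

```python
stage = VLLMEmbeddingModelStage(
    model_identifier="intfloat/e5-large-v2",  # example model, not a documented default
    pretokenize=True,  # tokenize on CPU so the GPU stays busy with inference
)
```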
Resources
The VLLMEmbeddingModelStage requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure tensor_parallel_size in vllm_init_kwargs.
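For example, a two-GPU sketch (the model shown and the parameter names outside vllm_init_kwargs are assumptions):

```python
stage = VLLMEmbeddingModelStage(
    model_identifier="Qwen/Qwen3-Embedding-8B",    # example large embedding model
    vllm_init_kwargs={"tensor_parallel_size": 2},  # shard the model across 2 GPUs
)
```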