Generate text embeddings using vLLM’s optimized inference engine. The VLLMEmbeddingModelStage provides high-throughput embedding generation, particularly for large embedding models where vLLM’s batching and GPU memory management provide significant performance advantages over Sentence Transformers.
Installation: The vLLM embedder is included in the text_cuda12 installation. Install it with:
vLLM is only available on x86_64 Linux systems.
VLLMEmbeddingModelStage is a single-stage embedder that handles both tokenization and embedding generation within one stage. Unlike EmbeddingCreatorStage (which splits tokenization and model inference into separate stages), the vLLM embedder delegates all GPU operations to vLLM’s inference engine.
Key features:
- With pretokenize=True, the stage tokenizes text on CPU before passing tokens to vLLM, reducing GPU idle time and improving throughput
- The max_chars parameter limits input length before tokenization
Pass additional vLLM configuration through vllm_init_kwargs:
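As a minimal sketch of what such a configuration might look like, the dictionary below uses standard vLLM engine arguments (gpu_memory_utilization, max_model_len, dtype). The values are illustrative placeholders, not recommendations. The import path and the model_identifier parameter name inside make_stage are assumptions that this section does not state, so check them against your installed version.

```python
# Illustrative vLLM engine arguments; values are placeholders, not tuned settings.
vllm_init_kwargs = {
    "gpu_memory_utilization": 0.8,  # fraction of GPU memory vLLM may reserve
    "max_model_len": 2048,          # maximum sequence length in tokens
    "dtype": "bfloat16",            # precision for weights and activations
}


def make_stage():
    """Build the stage lazily so this module imports without vLLM or a GPU."""
    # ASSUMPTION: the import path and `model_identifier` are not confirmed
    # by this section; adjust them to match your installation.
    from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage

    return VLLMEmbeddingModelStage(
        model_identifier="google/embeddinggemma-300m",
        pretokenize=False,
        max_chars=10_000,  # placeholder character limit
        vllm_init_kwargs=vllm_init_kwargs,
    )
```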
Default vLLM settings applied by the stage (can be overridden):
- enforce_eager=False: uses CUDA graphs for faster inference
- runner="pooling": configures vLLM for embedding (pooling) tasks
- model_impl="vllm": uses vLLM’s native model implementation
- disable_log_stats=True: suppresses stats logging when verbose=False
When pretokenize=True, the stage:
- tokenizes the text on CPU, truncating to max_model_len
- passes the resulting token IDs to vLLM as TokensPrompt inputs
Whether to use pretokenization depends on the model. For google/embeddinggemma-300m (the default for semantic deduplication), pretokenize=False is recommended and is the default. For other models, benchmarks show pretokenization can provide better per-task throughput by reducing GPU idle time during tokenization.
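The pretokenization path can be pictured with the following sketch. It is not the stage's implementation: toy_tokenize and pretokenize_batch are hypothetical stand-ins for a real tokenizer, and applying the max_model_len truncation on the CPU side is an assumption. The output dictionaries use the prompt_token_ids key that vLLM's TokensPrompt type defines.

```python
from typing import Optional


def toy_tokenize(text: str) -> list[int]:
    """Hypothetical tokenizer: maps each whitespace-separated word to a fake ID."""
    return [sum(ord(ch) for ch in word) % 50_000 for word in text.split()]


def pretokenize_batch(
    texts: list[str],
    max_model_len: int,
    max_chars: Optional[int] = None,
) -> list[dict]:
    """Tokenize on CPU and return TokensPrompt-shaped dictionaries."""
    prompts = []
    for text in texts:
        if max_chars is not None:
            text = text[:max_chars]  # cheap character cut before tokenizing
        token_ids = toy_tokenize(text)[:max_model_len]  # cap at the model limit
        prompts.append({"prompt_token_ids": token_ids})
    return prompts


batch = pretokenize_batch(
    ["a short document", "one two three four five"], max_model_len=4
)
print([len(p["prompt_token_ids"]) for p in batch])  # → [3, 4]
```

With a real tokenizer, lists like these would be handed to vLLM in place of raw strings, so the GPU does not wait on tokenization.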
The VLLMEmbeddingModelStage requests 1 CPU and 1 GPU per worker by default. For multi-GPU models, configure tensor_parallel_size in vllm_init_kwargs.
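For example, the settings for a model sharded across two GPUs could be assembled as sketched below. The resolve_vllm_kwargs helper is hypothetical and only illustrates the override behavior described earlier (stage defaults first, then user-supplied keys); tensor_parallel_size itself is a standard vLLM engine argument.

```python
# Defaults as listed in this section; the merge logic is an illustration,
# not the stage's actual code.
STAGE_DEFAULTS = {
    "enforce_eager": False,
    "runner": "pooling",
    "model_impl": "vllm",
}


def resolve_vllm_kwargs(user_kwargs=None, verbose=False):
    """Return the stage defaults with user-supplied keys taking precedence."""
    resolved = dict(STAGE_DEFAULTS)
    if not verbose:
        resolved["disable_log_stats"] = True  # quiet stats logging by default
    resolved.update(user_kwargs or {})  # user keys win over defaults
    return resolved


kwargs = resolve_vllm_kwargs({"tensor_parallel_size": 2, "enforce_eager": True})
print(kwargs)
```

Here enforce_eager is overridden by the caller, while runner and model_impl keep their default values.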