Using Speculative Decoding with the vLLM backend.
See also: Speculative Decoding Overview for cross-backend documentation.
This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.
First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.
The Meta-Llama-3.1-8B-Instruct model is gated. Request access on Hugging Face: Meta-Llama-3.1-8B-Instruct repository
Approval time varies depending on Hugging Face review traffic.
Once approved, set your access token inside the container:
Once the weights finish downloading, the server will be ready for inference requests.
Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:
meta-llama/Meta-Llama-3.1-8B-InstructSee examples/backends/vllm/launch/agg_spec_decoding.sh for the full configuration.