Copyright © 2026, NVIDIA Corporation.

Speculative Decoding with vLLM

Using Speculative Decoding with the vLLM backend.

See also: Speculative Decoding Overview for cross-backend documentation.

Prerequisites

  • vLLM container with Eagle3 support
  • GPU with at least 16GB VRAM
  • Hugging Face access token (for gated models)
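
The VRAM prerequisite can be checked before building anything. The following is a minimal sketch, not part of the official tooling: the function name `has_enough_vram` and the `MIN_VRAM_MIB` cutoff of 16000 MiB are assumptions made for illustration, since drivers usually report slightly less memory than the nominal card size.

```shell
#!/usr/bin/env bash
# Minimum VRAM in MiB. The 16 GB figure comes from the prerequisites; the
# exact MiB cutoff is an assumption and can be overridden via the environment.
MIN_VRAM_MIB="${MIN_VRAM_MIB:-16000}"

# Succeeds if the given memory size (in MiB) meets the minimum.
has_enough_vram() {
  [ "$1" -ge "$MIN_VRAM_MIB" ]
}

# Query each GPU only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
    while read -r mib; do
      if has_enough_vram "$mib"; then
        echo "GPU with ${mib} MiB: OK"
      else
        echo "GPU with ${mib} MiB: below ${MIN_VRAM_MIB} MiB"
      fi
    done
fi
```

If a card is reported as too small even though it is sold as a 16 GB part, lowering `MIN_VRAM_MIB` slightly is reasonable.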

Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3

This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.

Step 1: Set Up Your Docker Environment

First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.

# Launch infrastructure services
$ docker compose -f deploy/docker-compose.yml up -d

# Build the container
$ ./container/build.sh --framework VLLM

# Run the container
$ ./container/run.sh -it --framework VLLM --mount-workspace

Step 2: Get Access to the Llama-3 Model

The Meta-Llama-3.1-8B-Instruct model is gated. Request access on Hugging Face: Meta-Llama-3.1-8B-Instruct repository

Approval time varies depending on Hugging Face review traffic.

Once approved, set your access token inside the container:

$ export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
$ export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
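
A missing or placeholder token typically surfaces only when the gated weights fail to download, so it can help to fail early. The helper below is a hypothetical sketch (the name `require_hf_token` is not part of the repository); it only inspects the two environment variables set above.

```shell
# Fail early if no usable Hugging Face token is exported.
# require_hf_token is a hypothetical helper, not part of the example scripts.
require_hf_token() {
  token="${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}"
  # Treat an empty value or the untouched placeholder as "not set".
  if [ -z "$token" ] || [ "$token" = "insert_your_token_here" ]; then
    echo "error: export a real HF_TOKEN or HUGGING_FACE_HUB_TOKEN first" >&2
    return 1
  fi
  return 0
}
```

Calling `require_hf_token || exit 1` before the launch step stops the run before any download is attempted.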

Step 3: Run Aggregated Speculative Decoding

# Requires only one GPU
$ cd examples/backends/vllm
$ bash launch/agg_spec_decoding.sh

Once the weights finish downloading, the server will be ready for inference requests.
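
Because the download time is hard to predict, a small retry loop can stand in for watching the logs by hand. This is a generic sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, and the probe command it runs is supplied by the caller.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until_ready <attempts> <delay_seconds> <command> [args...]
# wait_until_ready is a hypothetical helper, not part of the example scripts.
wait_until_ready() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the probe quietly; any zero exit status counts as "ready".
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

A possible invocation is `wait_until_ready 60 5 curl -sf http://localhost:8000/v1/models`. The port matches the request in Step 4, but the `/v1/models` path is an assumption based on the usual OpenAI-compatible API layout, so substitute whichever endpoint the deployment actually exposes.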

Step 4: Test the Deployment

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [
        {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
      ],
      "max_tokens": 250
    }'

Example Output

{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
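
The `usage` block gives a quick sanity check on a response: `total_tokens` should equal `prompt_tokens` plus `completion_tokens` (here 16 + 250 = 266). The sketch below verifies this with `sed` on the usage values copied from the example output. It is a naive pattern match that assumes integer fields formatted as shown, not a JSON parser; a dedicated JSON tool would be more robust for real responses. The `field` helper is hypothetical.

```shell
# Usage block copied from the example output above.
usage='"usage": {"prompt_tokens": 16, "completion_tokens": 250, "total_tokens": 266}'

# Print the integer value of a named field (naive match, not a JSON parser).
field() {
  printf '%s\n' "$usage" | sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p"
}

prompt=$(field prompt_tokens)
completion=$(field completion_tokens)
total=$(field total_tokens)

# The reported total should be the sum of prompt and completion tokens.
if [ $((prompt + completion)) -eq "$total" ]; then
  echo "usage is consistent: $prompt + $completion = $total"
else
  echo "usage mismatch: $prompt + $completion != $total"
fi
```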

Configuration

The example launch script enables speculative decoding in vLLM with Eagle3 as the draft model. It configures:

  • Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Draft model: Eagle3 variant
  • Aggregated serving mode

See examples/backends/vllm/launch/agg_spec_decoding.sh for the full configuration.
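
As a rough illustration of the shape such a configuration can take, vLLM's speculative settings are commonly expressed as a JSON object with fields such as `method`, `model`, and `num_speculative_tokens`. The sketch below only assembles and prints such a string. The draft checkpoint is left as a placeholder and the token count is a hypothetical value; neither is read from the launch script, which remains the authoritative source.

```shell
# Target model named in this guide.
TARGET_MODEL="meta-llama/Meta-Llama-3.1-8B-Instruct"
# Placeholder (assumption): substitute the Eagle3 draft checkpoint used by the script.
DRAFT_MODEL="${DRAFT_MODEL:-<eagle3-draft-checkpoint>}"
# Hypothetical value chosen for illustration only.
NUM_SPEC_TOKENS="${NUM_SPEC_TOKENS:-2}"

# Assemble the JSON speculative configuration as a single string.
SPEC_CONFIG=$(printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %s}' \
  "$DRAFT_MODEL" "$NUM_SPEC_TOKENS")

echo "target model:       $TARGET_MODEL"
echo "speculative config: $SPEC_CONFIG"
```

Comparing the printed string with the arguments in the launch script shows which values the example actually uses.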

Limitations

  • Only Eagle3 is currently supported as the draft model
  • The target and draft models must have compatible architectures

See Also

Document                        Path
Speculative Decoding Overview   README.md
vLLM Backend Guide              vLLM README
Meta-Llama-3.1-8B-Instruct      Hugging Face