Running Meta-Llama-3.1-8B-Instruct with Speculative Decoding (Eagle3)
This guide walks through how to deploy Meta-Llama-3.1-8B-Instruct using aggregated speculative decoding with Eagle3 on a single node. At 8B parameters, the model fits on a single GPU: the FP16 weights alone occupy roughly 15 GiB, so a GPU with at least 16 GB of VRAM is the practical minimum, and extra headroom is needed for the KV cache and the Eagle3 draft model.
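As a rough sanity check, you can estimate the weight memory yourself. This is a minimal sketch assuming FP16/BF16 (2 bytes per parameter); it deliberately ignores the KV cache and the Eagle3 draft head, which add to the total:

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate GPU memory for the model weights alone (FP16/BF16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1024**3

# Llama-3.1-8B has roughly 8.03e9 parameters, so the weights alone
# come to about 15 GiB before any KV cache or draft-model overhead.
print(f"{weight_memory_gib(8.03e9):.1f} GiB")
```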
Step 1: Set Up Your Docker Environment
First, we’ll initialize a Docker container using the vLLM backend. You can refer to the vLLM Quickstart Guide, or follow the full steps below.
1. Launch Docker Compose:

```bash
docker compose -f deploy/docker-compose.yml up -d
```
2. Build the Container:

```bash
./container/build.sh --framework VLLM
```
3. Run the Container:

```bash
./container/run.sh -it --framework VLLM --mount-workspace
```
Step 2: Get Access to the Llama 3.1 Model
The Meta-Llama-3.1-8B-Instruct model is gated, so you’ll need to request access on Hugging Face. Go to the official Meta-Llama-3.1-8B-Instruct repository and fill out the access form. Approval usually takes around 5 minutes.
Once you have access, generate a Hugging Face access token with permission for gated repositories, then set it inside your container:
```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
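A failed gated download can be confusing to debug, so it can help to confirm the variables are actually set before launching. This is a minimal sketch; `require_hf_token` is a hypothetical helper written for this guide, not part of any toolkit:

```python
import os

def require_hf_token(env: dict = os.environ) -> str:
    """Return the Hugging Face token, preferring HF_TOKEN, or fail loudly."""
    token = env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")
    if not token or token == "insert_your_token_here":
        raise RuntimeError(
            "Set HUGGING_FACE_HUB_TOKEN / HF_TOKEN to your real Hugging Face token"
        )
    return token
```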
Step 3: Run Aggregated Speculative Decoding
Now that your environment is ready, start the aggregated server with speculative decoding.
```bash
# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```
Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
Step 4: Example Request
To verify your setup, try sending a simple prompt to your model:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
    ],
    "max_tokens": 250
  }'
```
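If you prefer Python over curl, the same request can be sent with the standard library. This sketch mirrors the curl payload above; the endpoint URL and model name are taken from this guide, and the network call is deferred to the `__main__` block so the helpers can be reused on their own:

```python
import json
from urllib.request import Request, urlopen

def build_chat_request(prompt: str, max_tokens: int = 250) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(payload: dict,
         url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload as JSON and decode the JSON response."""
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=120) as resp:
        return json.load(resp)

if __name__ == "__main__":
    out = send(build_chat_request(
        "Write a poem about why Sakura trees are beautiful."))
    print(out["choices"][0])
```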
Example Output
```json
{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```
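The `usage` block is plain token accounting: `prompt_tokens` plus `completion_tokens` should equal `total_tokens`. A small sketch for checking that invariant when parsing a response like the one above (the helper name is made up for this guide):

```python
import json

def usage_summary(response_json: str) -> str:
    """Validate and summarize token usage from an OpenAI-style response."""
    usage = json.loads(response_json)["usage"]
    assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
    return (f'{usage["completion_tokens"]} completion tokens '
            f'({usage["total_tokens"]} total)')

example = '{"usage": {"prompt_tokens": 16, "completion_tokens": 250, "total_tokens": 266}}'
print(usage_summary(example))  # 250 completion tokens (266 total)
```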