Running Meta-Llama-3.1-8B-Instruct with Speculative Decoding (Eagle3)

This guide walks through how to deploy Meta-Llama-3.1-8B-Instruct using aggregated speculative decoding with Eagle3 on a single node. Since the model is only 8B parameters, you can run it on any GPU with at least 16GB VRAM.

Step 1: Set Up Your Docker Environment

First, we’ll initialize a Docker container using the VLLM backend. You can refer to the VLLM Quickstart Guide — or follow the full steps below.

1. Launch Docker Compose

$ docker compose -f deploy/docker-compose.yml up -d

2. Build the Container

$ ./container/build.sh --framework VLLM

3. Run the Container

$ ./container/run.sh -it --framework VLLM --mount-workspace

Step 2: Get Access to the Llama-3 Model

The Meta-Llama-3.1-8B-Instruct model is gated, so you’ll need to request access on Hugging Face. Go to the official Meta-Llama-3.1-8B-Instruct repository and fill out the access form. Approval usually takes around 5 minutes.

Once you have access, generate a Hugging Face access token with permission for gated repositories, then set it inside your container:

$ export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
$ export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN

Step 3: Run Aggregated Speculative Decoding

Now that your environment is ready, start the aggregated server with speculative decoding.

$ # Requires only one GPU
$ cd examples/backends/vllm
$ bash launch/agg_spec_decoding.sh

Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.

Step 4: Example Request

To verify your setup, try sending a simple prompt to your model:

$ curl http://localhost:8000/v1/chat/completions \
>    -H "Content-Type: application/json" \
>    -d '{
>      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
>      "messages": [
>        {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
>      ],
>      "max_tokens": 250
>    }'

Example Output

1 {
2   "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
3   "choices": [
4     {
5       "text": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes.",
6       "index": 0,
7       "finish_reason": "stop"
8     }
9   ],
10   "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
11   "usage": {
12     "prompt_tokens": 16,
13     "completion_tokens": 250,
14     "total_tokens": 266
15   }
16 }