# Running Meta-Llama-3.1-8B-Instruct with Speculative Decoding (Eagle3)


This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
Because the model is only **8B parameters**, it fits on a single GPU with **at least 16 GB of VRAM**.



## Step 1: Set Up Your Docker Environment

First, we’ll build and start a Docker container with the vLLM backend.
You can refer to the [vLLM Quickstart Guide](/dynamo/v-0-8-1/components/backends/v-llm#vllm-quick-start), or follow the full steps below.

### 1. Launch Docker Compose

```bash
docker compose -f deploy/docker-compose.yml up -d
```
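
This compose file brings up the supporting services Dynamo depends on (such as etcd and NATS). If you want to confirm they started cleanly before moving on, a quick status check looks like this:

```bash
# Optional: verify the supporting services are up before continuing
docker compose -f deploy/docker-compose.yml ps

# Tail their logs if anything reports an error or a restart loop
docker compose -f deploy/docker-compose.yml logs --tail=50
```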

### 2. Build the Container

```bash
./container/build.sh --framework VLLM
```

### 3. Run the Container

```bash
./container/run.sh -it --framework VLLM --mount-workspace
```
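
Once inside the container, it’s worth confirming the GPU is visible before downloading any weights. A quick sanity check from the container shell:

```bash
# Confirm the GPU is visible from inside the container
nvidia-smi

# Optional: check that PyTorch inside the container can see it as well
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```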



## Step 2: Get Access to the Llama 3.1 Model

The **Meta-Llama-3.1-8B-Instruct** model is gated, so you’ll need to request access on Hugging Face.
Go to the official [Meta-Llama-3.1-8B-Instruct repository](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and fill out the access form.
Approval is typically granted within a few minutes.

Once you have access, generate a **Hugging Face access token** with permission for gated repositories, then set it inside your container:

```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```
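
Before launching the server, it can save time to confirm the token is valid and the gated repository is reachable. Assuming the container ships the `huggingface_hub` CLI (a standard vLLM dependency), a quick check looks like:

```bash
# Confirm the token authenticates correctly
huggingface-cli whoami

# Confirm you can read the gated repo (this fails if access was not granted)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct config.json --local-dir /tmp/llama-check
```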



## Step 3: Run Aggregated Speculative Decoding

Now that your environment is ready, start the aggregated server with **speculative decoding**.

```bash
# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```
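
The launch script wires up a Dynamo frontend and a vLLM worker configured for Eagle3 speculative decoding. If you want to see the shape of that configuration without reading the script, the sketch below shows an equivalent standalone vLLM invocation; the draft-model name and speculative token count are illustrative assumptions, not values taken from the script.

```bash
# Illustrative sketch only -- the real settings live in launch/agg_spec_decoding.sh.
# vLLM accepts a JSON speculative-decoding config; the draft model below is an assumption.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "method": "eagle3",
    "num_speculative_tokens": 3
  }'
```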

Once the weights finish downloading and serving begins, you’ll be ready to send inference requests to your model.
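
The frontend exposes an OpenAI-compatible HTTP API on port 8000, so a simple readiness check is to query the standard models endpoint:

```bash
# Wait for the server to come up, then confirm the model is registered
curl http://localhost:8000/v1/models
```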




## Step 4: Example Request

To verify your setup, try sending a simple prompt to your model:

```bash
curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
     "messages": [
       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
     ],
     "max_tokens": 250
   }'
```

### Example Output

The response follows the standard OpenAI chat-completions format (the content below is truncated for brevity):

```json
{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In cherry blossom’s gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```
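
Because the endpoint follows the OpenAI chat-completions API, the standard streaming flag should also work if you want tokens back incrementally, which is a convenient way to eyeball the latency benefit of speculative decoding:

```bash
# Same request, but streamed back as server-sent events
curl -N http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
     "messages": [
       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
     ],
     "max_tokens": 250,
     "stream": true
   }'
```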



## Additional Resources

* [vLLM Quickstart](/dynamo/v-0-8-1/components/backends/v-llm#vllm-quick-start)
* [Meta-Llama-3.1-8B-Instruct on Hugging Face](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)