Llama4 + Eagle | NVIDIA Dynamo Documentation

This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the multi-node deployment instructions to set up the environment for the following scenarios:

Aggregated Serving: Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
Disaggregated Serving: Distribute the workload across two GB200x4 nodes:
- One node runs the decode worker.
- The other node runs the prefill worker.

Notes

Make sure the (eagle3_one_model: true) is set in the LLM API config inside the examples/backends/trtllm/engine_configs/llama4/eagle folder.

Setup

Assuming you have already allocated your nodes via salloc, and are inside an interactive shell on one of the allocated nodes, set the following environment variables based:

$ cd $DYNAMO_HOME/examples/backends/trtllm
$ 
$ export IMAGE="<dynamo_trtllm_image>"
$ # export MOUNTS="${PWD}/:/mnt,/lustre:/lustre"
$ export MOUNTS="${PWD}/:/mnt"
$ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
$ export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"

See the multinode setup instructions to learn more about the above options.

Aggregated Serving

$ export NUM_NODES=1
$ export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yml"
$ ./multinode/srun_aggregated.sh

Disaggregated Serving

$ export NUM_PREFILL_NODES=1
$ export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yml"
$ export NUM_DECODE_NODES=1
$ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yml"
$ ./multinode/srun_disaggregated.sh

Example Request

See the example request section to learn how to send a request to the deployment.

curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
        "messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
        "max_tokens": 1024
    }' -w "\n"
# output:
{"id":"cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8","choices":[{"text":"NVIDIA is considered a great company for several reasons:\n\n1. **Technological Innovation**: NVIDIA is a leader in the field of graphics processing units (GPUs) and has been at the forefront of technological innovation.
...
and the broader tech industry.\n\nThese factors combined have contributed to NVIDIA's status as a great company in the technology sector.","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1753329671,"model":"nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":16,"completion_tokens":562,"total_tokens":578,"prompt_tokens_details":null,"completion_tokens_details":null}}