Llama4 + Eagle
Llama4 + Eagle
Llama4 + Eagle
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the multi-node deployment instructions to set up the environment for the following scenarios:
Aggregated Serving: Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
Disaggregated Serving: Distribute the workload across two GB200x4 nodes:
eagle3_one_model: true) is set in the LLM API config inside the examples/backends/trtllm/engine_configs/llama4/eagle folder.Assuming you have already allocated your nodes via salloc, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
See the multinode setup instructions to learn more about the above options.
See the example request section to learn how to send a request to the deployment.