# Multi-Node Triton + TRT-LLM Deployment on EKS
This repository provides instructions for multi-node deployment of LLMs on EKS (Amazon Elastic Kubernetes Service). It includes instructions for building a custom container image that enables features such as EFA (Elastic Fabric Adapter), along with a Helm chart and an associated Python script. The deployment flow uses NVIDIA TensorRT-LLM as the inference engine and NVIDIA Triton Inference Server as the model server.
With one pod per node, the main challenge in deploying models that require multiple nodes is that one instance of the model spans multiple nodes and therefore multiple pods. Consequently, the atomic unit that must be ready before requests can be served, and likewise the unit that must be scaled, becomes a group of pods. This example shows how to address these problems and provides code to set up the following:
- **LeaderWorkerSet for launching Triton + TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes, you use MPI to have one node launch the TRT-LLM processes on all the nodes (including itself) that make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes, so we need to spawn groups of pods and know which model instance group each pod belongs to. To achieve this we use LeaderWorkerSet, which lets us create "megapods" (a group of pods made up of one leader pod and a specified number of worker pods) and which provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton + TRT-LLM via MPI in `deployment.yaml` and `server.py`. An illustrative sketch follows this list.
- **Gang Scheduling:** Gang scheduling here simply means ensuring that all pods that make up a model instance are ready before Triton + TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of `server.py`.
- **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod" as a unit. However, since these are GPU workloads, we don't want to use CPU and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in `triton-metrics_prometheus-rule.yaml`. We also demonstrate how to properly set up PodMonitors and an HPA in `pod-monitor.yaml` and `hpa.yaml` (the key is to scrape metrics only from the leader pods); sketches of these pieces follow this list. Instructions for properly setting up Prometheus and exposing GPU metrics are found in Configure EKS Cluster and Install Dependencies. To enable the deployment to dynamically add more nodes in response to the HPA, we also set up the Cluster Autoscaler.
- **LoadBalancer Setup:** Although each instance of the model consists of multiple pods, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service that allows external clients to submit requests in `service.yaml`. A sketch follows this list.
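
For orientation, below is a minimal, hypothetical LeaderWorkerSet manifest showing the leader/worker structure described above. The resource name, labels, image, group size, GPU counts, and leader command are placeholders rather than the repository's actual values; the real `deployment.yaml` (templated by the Helm chart) is more involved.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: triton-trtllm
spec:
  replicas: 1                        # number of model instances ("megapods")
  leaderWorkerTemplate:
    size: 2                          # pods per instance: 1 leader + 1 worker
    leaderTemplate:
      metadata:
        labels:
          app: triton-trtllm-leader  # leader-only label, reused by the PodMonitor and Service sketches below
      spec:
        containers:
        - name: triton
          image: custom-triton-trtllm-image:latest  # placeholder for the custom image with EFA support
          # The leader runs the launch script, which waits for all workers to be
          # reachable (gang scheduling) and then starts Triton + TRT-LLM across
          # the group via MPI. Real arguments are omitted; see deployment.yaml.
          command: ["python3", "server.py"]
          resources:
            limits:
              nvidia.com/gpu: 8
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: custom-triton-trtllm-image:latest
          # Workers idle until the leader launches TRT-LLM processes on them;
          # the actual worker command is defined in deployment.yaml.
          resources:
            limits:
              nvidia.com/gpu: 8
```

Scaling `replicas` adds or removes whole groups of `size` pods, which is what makes "megapod"-level autoscaling possible.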
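
For the GPU utilization signal, a PrometheusRule of roughly the following shape can aggregate Triton's per-GPU utilization gauge (`nv_gpu_utilization`, exposed when GPU metrics are enabled) into a per-pod recording rule suitable for autoscaling. The rule and label names here are illustrative and are not taken from `triton-metrics_prometheus-rule.yaml`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: triton-metrics
  labels:
    release: prometheus              # must match the Prometheus operator's rule selector
spec:
  groups:
  - name: triton-gpu-utilization
    rules:
    - record: triton:gpu_utilization:avg
      # Average utilization across the GPUs of each scraped (leader) pod.
      expr: avg by (namespace, pod) (nv_gpu_utilization)
```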
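
The PodMonitor and HPA could then look roughly like the sketch below: the PodMonitor selects only leader pods (via the hypothetical leader label above), and the HPA targets the LeaderWorkerSet so that whole groups are added or removed. The metric name assumes the recording rule is surfaced through a custom-metrics adapter such as prometheus-adapter; the names, port, and target value are placeholders, and the repository's `pod-monitor.yaml` and `hpa.yaml` differ in their details.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triton-leader-pods
  labels:
    release: prometheus              # must match the Prometheus podMonitorSelector
spec:
  selector:
    matchLabels:
      app: triton-trtllm-leader      # scrape metrics only from leader pods
  podMetricsEndpoints:
  - port: metrics                    # container port name for Triton's metrics endpoint (8002 by default)
    path: /metrics
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-trtllm
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet            # scales whole "megapods", not individual pods
    name: triton-trtllm
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_gpu_utilization_avg  # recording rule as exposed by the metrics adapter
      target:
        type: AverageValue
        averageValue: "800m"              # i.e. 0.8: scale out above ~80% average GPU utilization
```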
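
Because only the leader pod in each group serves requests, the Service must route traffic to leader pods only. Here is a minimal sketch, assuming the hypothetical leader label from the LeaderWorkerSet sketch above and Triton's default ports; the repository's `service.yaml` may differ.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-trtllm
spec:
  type: LoadBalancer
  selector:
    app: triton-trtllm-leader        # send client traffic only to leader pods
  ports:
  - name: http
    port: 8000
    targetPort: 8000                 # Triton HTTP endpoint
  - name: grpc
    port: 8001
    targetPort: 8001                 # Triton gRPC endpoint
  - name: metrics
    port: 8002
    targetPort: 8002                 # Triton Prometheus metrics endpoint
```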