Scaling Triton Inference Server

As the number of requests to the server increases, the concurrency and batch size need to be tuned to keep up with latency demands. In a production deployment, Service Level Agreement (SLA) requirements can dictate latency targets, and it may become necessary to share the inference load across multiple GPUs. There are two ways to do this: the DevOps Engineer can scale the server vertically (adding more GPUs to the VM) or horizontally (adding more VMs with GPUs to the deployment). To scale horizontally across multiple Triton Inference Server VMs, IT admins can either take a traditional approach with a load balancer or use Kubernetes to deploy and autoscale Triton. The following sections describe the scaling options in further detail, along with the two approaches for deploying at scale.

Vertical Scaling on vSphere VMs

Scaling vertically involves assigning more than one full vGPU profile to a single Triton VM. The following graphic illustrates a multi-GPU configuration in which a single VM is assigned four shared PCIe devices.

../_images/cb-mn-02.png

Note

VMware supports up to four shared PCIe devices.

For the Triton Inference Server VM to use the additional GPUs, the DevOps Engineer can modify the startup.sh file, which was created within the IT Administrator workflow. This file is located at /home/nvidia/startup.sh on the Triton Inference Server VM.

The following file contents pass the --gpus all flag to the docker run command so that Triton uses all GPUs.

#!/bin/bash
docker rm -f $(docker ps -a -q)
docker run --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 --name triton_server_cont -v $HOME/triton_models:/models nvcr.io/nvaie/tritonserver-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG> tritonserver --model-store=/models --strict-model-config=false --log-verbose=1
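After restarting the container, it can be useful to confirm how many GPUs the VM actually exposes. The following is a minimal Python sketch, not part of the original workflow, that counts the devices listed by nvidia-smi -L. The helper names count_gpus and visible_gpu_count are hypothetical, and the sketch assumes nvidia-smi is on the PATH of the machine where it runs.

```python
import re
import subprocess

# Each device reported by `nvidia-smi -L` starts with "GPU <index>:".
GPU_LINE = re.compile(r"^GPU \d+:", re.MULTILINE)


def count_gpus(listing: str) -> int:
    """Count the GPU entries in the text printed by `nvidia-smi -L`."""
    return len(GPU_LINE.findall(listing))


def visible_gpu_count() -> int:
    """Run `nvidia-smi -L` on this machine and count the GPUs it lists."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return count_gpus(result.stdout)
```

On a VM with several assigned devices, visible_gpu_count() should return the number of devices assigned to that VM.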

Horizontal Scaling on vSphere VMs (Advanced)

Scaling horizontally involves more than one Triton Inference Server VM. Since the IT admin created a Triton Inference Server template, cloning or creating additional VMs is quick. NVIDIA MIG GPU instances can be added to these VMs. MIG can partition an A100 GPU into as many as seven instances. One instance from one GPU can be added to one VM, and another MIG instance from a different GPU can be added to another VM. This provides flexibility and better resource utilization in your cluster. Once the VMs are created, the DevOps Engineer repeats the workflow to add the trained model to each Triton Inference Server VM’s model directory. In the following steps, the client points to a load balancer, which serves as a reverse proxy in front of the Triton Server VMs. We use HAProxy to deploy a load balancer on a separate VM. NVIDIA AI Enterprise doesn’t provide deployment support for HAProxy, and the user is free to use any available load balancer.

Deploying a load balancer for Horizontal Scaling

  • On each Triton Server, edit the /etc/hosts file to add the IP address of the load balancer.

    sudo nano /etc/hosts
    
  • Add the following entry to the /etc/hosts file (IP address first, then hostname).

    IP-address-of-HAproxy hostname-of-HAproxy
    
  • On the load balancer, add the IPs of the Triton Servers to /etc/hosts and install HAproxy.

    sudo apt install haproxy
    
  • Configure HAProxy to point to the gRPC endpoint of both servers.

    sudo nano /etc/haproxy/haproxy.cfg
    
    frontend triton-frontend
            bind     10.110.16.221:80 #IP of load balancer and port
            mode    http
            default_backend    triton-backend
    
    backend triton-backend
            balance roundrobin
            server triton 10.110.16.186:8001 check #gRPC endpoint of Triton
            server triton2 10.110.16.218:8001 check
    
    listen stats
            bind 10.110.16.221:8080 #port for showing load balancer statistics
            mode http
            option forwardfor
            option httpclose
            stats enable
            stats show-legends
            stats refresh 5s
            stats uri /stats
            stats realm Haproxy\ Statistics
            stats auth nvidia:nvidia #auth for statistics
    
  • Now point to the load balancer IP for the server URL in the client as shown in Using Triton gRPC client to run Inference on the VM.

  • Restart the haproxy server.

    sudo systemctl restart haproxy.service
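When more Triton VMs are cloned, the backend section grows by one server line per VM. As a hedged illustration, the Python sketch below renders that section from a list of addresses. The function name render_backend and the generated server names (triton1, triton2, and so on) are assumptions; the directives mirror the haproxy.cfg shown above.

```python
from typing import List


def render_backend(addresses: List[str], port: int = 8001) -> str:
    """Render a round-robin HAProxy backend with one server line per Triton VM."""
    lines = ["backend triton-backend", "        balance roundrobin"]
    for index, address in enumerate(addresses, start=1):
        # `check` enables HAProxy health checks for the server.
        lines.append(f"        server triton{index} {address}:{port} check")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_backend(["10.110.16.186", "10.110.16.218"]))
```

The printed text can be pasted over the backend section of /etc/haproxy/haproxy.cfg before restarting the service.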
    

Running Scale-Out Inference Using a Load Balancer

Follow the same steps in Using Triton gRPC client to run Inference on the VM, but instead of pointing to localhost, point to the load balancer IP.

An example is as follows:

python triton/run_squad_triton_client.py --triton_model_name=bert --triton_model_version=1 --vocab_file=/workspace/bert/data/download/nvidia_pretrained/bert_tf_squad11_large_384/vocab.txt --predict_batch_size=1 --max_seq_length=384 --doc_stride=128 --triton_server_url=10.110.16.221:80 --context="Good password practices fall into a few broad categories. Maintain an 8-character minimum length requirement. Don't require character composition requirements. For example, *&(^%$. Don't use a password that is the same or similar to one you use on any other websites. Don't use a single word, for example, password, or a commonly-used phrase. Most people use similar patterns, for example, a capital letter in the first position, a symbol in the last, and a number in the last 2. Cyber criminals know this, so they run their dictionary attacks using the most common substitutions for example are "@" for "a," "1" for "l". " --question="What are the common substitutions for letters in password?"

The console output shows that the predicted answer is @ for a, 1 for l.

Deploying Triton Inference Server on Kubernetes

Autoscaling, which provides elasticity, is one of the significant benefits of deploying Triton Inference Server on Kubernetes. Without autoscaling, DevOps Engineers would need to manually provision resources every time utilization increases and then scale down again to keep resource utilization optimal. This guide discusses the critical topics for successfully deploying Triton on Kubernetes. Using the same user personas referred to throughout this guide, we focus primarily on the IT Administrator and DevOps Engineer workflows, since they require modification.

Note

AI Practitioner and Software Engineer workflows do not change.

What is Kubernetes?

Kubernetes is an open-source container orchestration platform that makes the job of a DevOps engineer easier. Applications can be deployed on Kubernetes as logical units that are easy to manage, upgrade, and deploy with zero downtime (rolling upgrades) and high availability (replication). Deploying Triton Inference Server on Kubernetes offers these same benefits to AI in the Enterprise. To manage GPU resources in the Kubernetes cluster easily, the NVIDIA GPU Operator is used.

What is the NVIDIA GPU Operator?

The GPU Operator allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster. Instead of providing a special OS image for GPU nodes, administrators can deploy a standard OS image for both CPU and GPU nodes and then rely on the GPU Operator to provide the required software components for GPUs. The components include the NVIDIA drivers, Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labeling, DCGM based monitoring, etc.

IT Administrator Workflow

The following graphic illustrates the IT Administrator workflow to support Triton Inference Server on Kubernetes:

../_images/cb-ti-07.png

Clone VM - Standard OS Ubuntu 20.04

To test the optional autoscaling in this guide, two or more VMs with GPUs are a prerequisite. Using the same VM hardware configuration as in the IT Administrator workflow, create as many VMs as the number of nodes needed in the cluster.

Note

A vGPU profile should be assigned to the VMs by the IT Administrator. The NVIDIA driver and container runtime are installed as part of the DevOps workflow during the installation of the GPU Operator.

Install Kubernetes

Follow the steps on the Kubernetes website to install and configure the cluster using kubeadm.

DevOps Engineer Workflow

Now that Kubernetes VMs have been provisioned, the DevOps Engineer needs to perform specific application-level configurations inside the Kubernetes VMs.

  • Install GPU Operator.

  • Deploy Triton Inference Server on Kubernetes.

  • Optional - Add a Horizontal Pod Autoscaler to scale the deployment automatically.

Install GPU Operator

Prerequisites:

  • NVIDIA vGPU Driver

  • Access to a private container registry or NVIDIA’s NGC registry

Follow the steps in the NVIDIA GPU Operator documentation to install the GPU Operator using Helm.

Deploy Triton Inference Server on Kubernetes

To deploy Triton Inference Server on Kubernetes, the DevOps Engineer needs access to an NFS share that serves as the Triton model repository and must then manage the storage. A Kubernetes Deployment is then created for Triton Inference Server, along with a Kubernetes Service for the Deployment. These steps are detailed below.

Access an NFS share to hold all Trained GPU TensorFlow models

Copy the exported Triton model, produced in the Export Model to Triton Inference Server format step of the AI Practitioner workflow, to the NFS share.

Manage the Storage

Containers are ephemeral, meaning that all the data created during a container's lifetime is lost when it shuts down. Containers therefore need a place to store information persistently. Kubernetes provides a persistent storage mechanism for containers that is based on the Persistent Volume.

A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. The persistent volume abstracts away the underlying storage resource from Kubernetes. This means that the underlying storage can be anything (S3, GCFS, NFS, etc.).

For a compute resource like a pod to start using these volumes, it must request a volume by issuing a Persistent Volume Claim (PVC). A PVC binds to a Persistent Volume whose resources match what the claim requests.
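The matching rule can be pictured with a small model. The Python sketch below is a simplification for illustration only, not the Kubernetes implementation: it treats a claim as bindable when the volume offers every requested access mode and at least the requested capacity, and it ignores storage classes, selectors, and volume modes. The class and function names are hypothetical, and capacities are plain integers in GiB.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Volume:
    name: str
    capacity_gib: int
    access_modes: FrozenSet[str]


@dataclass(frozen=True)
class Claim:
    name: str
    request_gib: int
    access_modes: FrozenSet[str]


def can_bind(claim: Claim, volume: Volume) -> bool:
    """Return True if the volume satisfies the claim in this simplified model."""
    has_modes = claim.access_modes <= volume.access_modes
    has_space = volume.capacity_gib >= claim.request_gib
    return has_modes and has_space


pv = Volume("nfs-pv", 500, frozenset({"ReadWriteMany"}))
pvc = Claim("nfs-pvc", 500, frozenset({"ReadWriteMany"}))
print(can_bind(pvc, pv))  # True
```

A claim that asks for more capacity than the volume provides, or for an access mode the volume lacks, is rejected by can_bind and would stay unbound in this model.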

Create a Persistent Volume (PV) for Kubernetes to access the NFS server

We will use a YAML file to create a Persistent Volume (PV) of type NFS within this guide.

Some things to note:

  • Storage capacity in the YAML file is set to 500 GiB (500Gi). Change it to match the storage capacity available in your cluster.

  • Change the server IP address to your own NFS server IP.

  • Change the path to the path you copied the BERT Triton model to.

The following is the contents of the YAML file; in this guide, the file is named nfs-pv.yaml.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    path: /Datasets/triton/triton_models
    server: 10.136.144.98
    readOnly: false

Use the YAML above and run the following command to create a PV on the cluster.

kubectl apply -f nfs-pv.yaml

Create a Persistent Volume Claim (PVC)

Within this guide, we will use a YAML file for creating a PVC. The following is the contents of the file, which is named nfs-pvc.yaml.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi

Use the YAML above and run the following command to create a PVC on the cluster.

kubectl apply -f nfs-pvc.yaml

After it is created, the PVC should bind to the PV. To confirm this, run the command below.

kubectl get pvc

The following output indicates that the NFS PVC is bound to the NFS PV.

nvidia@node1:~$ kubectl get pvc
NAME      STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv   500Gi      RWX                           3d1h

Create a Triton Inference Server Kubernetes Deployment

Kubernetes does not run containers directly; it wraps one or more containers into a higher-level structure called a Pod. A Kubernetes Pod is a group of one or more containers with shared storage and network resources and a specification for running the containers. Pods are typically managed by a layer of abstraction, the Deployment. With a Deployment, you don’t have to manage Pods manually; it creates and destroys Pods dynamically. A Kubernetes Deployment manages a set of Pods as a replica set.

Multiple replicas of the same pod can be used to provide high availability. Using Kubernetes to deploy a Triton Inference Server offers these same benefits to AI in the Enterprise. Because the deployment has replication, if one of the Triton Inference Server pods fails, other replica pods which are part of the deployment can still serve the end-user. Rolling updates allow Deployment updates to take place, such as upgrading an application, with zero downtime.
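To make the availability argument concrete, the following Python sketch simulates requests being spread over replicas and skipping one that has failed. It is a toy model under stated assumptions: the replica names are made up, and real traffic distribution is handled by Kubernetes itself, not by code like this.

```python
from itertools import cycle
from typing import List, Set


def dispatch(replicas: List[str], failed: Set[str], requests: int) -> List[str]:
    """Assign each request to the next healthy replica in round-robin order."""
    healthy = [name for name in replicas if name not in failed]
    if not healthy:
        raise RuntimeError("no healthy replicas available")
    rotation = cycle(healthy)
    return [next(rotation) for _ in range(requests)]


print(dispatch(["pod-a", "pod-b"], failed=set(), requests=4))
# ['pod-a', 'pod-b', 'pod-a', 'pod-b']
print(dispatch(["pod-a", "pod-b"], failed={"pod-a"}, requests=3))
# ['pod-b', 'pod-b', 'pod-b']
```

With both replicas healthy the requests alternate; with one replica marked as failed, every request still lands on the surviving replica.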

Within this guide, we will use triton_deployment.yaml (shown below) to deploy the same Natural Language Processing BERT use case that we have used throughout, as a Kubernetes Deployment. A few things to note about the Deployment object:

  • The replicas field specifies the number of Triton Inference Server pods we want in our deployment. These pods share the incoming load, so multiple requests can be served in parallel. In our example, each replica would receive requests in a round-robin fashion, and if one of the replicas goes down, the other is still available to serve requests.

  • The nfs-pvc is mounted as a volume into the pod at /models directory, and the pod runs a Triton Inference Server pointing to that directory inside the command field.

  • Ports 8000, 8001, and 8002 inside the container/pod are exposed because the HTTP server, the gRPC server, and the metrics server of Triton Inference Server run on these ports, respectively.

  • We attach one GPU per pod in the resources field.

The following is the contents of the triton_deployment.yaml file.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: bert-qa
  labels:
    app: triton-server
spec:
  selector:
    matchLabels:
      app: triton-server
  replicas: 2
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      volumes:
      - name: model-repo
        persistentVolumeClaim:
          claimName: nfs-pvc
      containers:
      - name: serving
        image: nvcr.io/nvaie/tritonserver-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG>
        ports:
        - name: grpc
          containerPort: 8001
        - name: http
          containerPort: 8000
        - name: metrics
          containerPort: 8002
        volumeMounts:
        - name: model-repo
          mountPath: "/models"
        resources:
          limits:
            nvidia.com/gpu: 1
        command: ["tritonserver", "--model-store=/models"]

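Use the YAML above and run the following command to create a Triton Kubernetes Deployment on the cluster.

kubectl apply -f triton_deployment.yaml

Create a Kubernetes Service to point to your deployment

A Kubernetes Service gives the Triton pods a single, stable address inside the cluster. In this guide, the Service manifest is named triton_service.yaml. Run the following command to create the Service.

kubectl apply -f triton_service.yaml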
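The guide does not show the contents of triton_service.yaml. As a hedged reconstruction, the Python sketch below builds a ClusterIP Service manifest whose name, ports, and selector are inferred from the Deployment above and from the kubectl get svc output in this guide; treat every field as an assumption to check against your own cluster. Because kubectl accepts JSON as well as YAML, the printed document could be saved to a file and applied with kubectl apply -f.

```python
import json


def triton_service(name: str = "bert-qa", app_label: str = "triton-server") -> dict:
    """Build a ClusterIP Service that selects the Triton pods by their app label."""
    ports = [("grpc", 8001), ("http", 8000), ("metrics", 8002)]
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": app_label},
            # Each Service port forwards to the container port with the same number.
            "ports": [
                {"name": port_name, "port": number, "targetPort": number}
                for port_name, number in ports
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(triton_service(), indent=2))
```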

Run the following command to verify the service was added.

kubectl get svc

The following output describes the newly added service:

nvidia@node1:~/yaml$ kubectl get svc
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
bert-qa                               ClusterIP   10.111.115.59   <none>        8001/TCP,8000/TCP,8002/TCP   4h29m

Check the health of Kubernetes deployment

Once the Service is deployed, you can check that the Triton Service is accessible inside the cluster and healthy by creating a client pod and querying the health endpoint of the service.

Within this guide, we will use the client.yaml (shown below) for the Triton client.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: client
  labels:
    app: client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
        - name: serving
          image: nvcr.io/nvaie/tritonserver-<NVAIE-MAJOR-VERSION>:<NVAIE-CONTAINER-TAG>
          # Run curl through a shell; replace $SERVICE_CLUSTER_IP with the ClusterIP of the Triton Service.
          command: ["/bin/sh", "-c", "curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVICE_CLUSTER_IP:8000/v2/health/ready"]

Note

$SERVICE_CLUSTER_IP is the cluster IP of the Triton Service, which can be obtained by running kubectl get svc. In our case, it is 10.111.115.59, as shown below.

nvidia@node1:~$ kubectl get svc
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
bert-qa                               ClusterIP   10.111.115.59   <none>        8001/TCP,8000/TCP,8002/TCP   4h39m

Use the YAML above and run the following command to create a client deployment.

kubectl apply -f client.yaml

Run the following command to check the Pod logs.

kubectl logs -f <name of the triton client pod>

A 200 OK HTTP code indicates good health.
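The same readiness probe can be scripted. The following Python sketch, offered as an illustration rather than as part of the guide's workflow, requests the /v2/health/ready endpoint used in client.yaml and treats HTTP 200 as healthy. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def ready_url(host: str, port: int = 8000) -> str:
    """Build the readiness URL on Triton's HTTP port."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host: str, port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True only if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error codes all count as not ready.
        return False
```

From a pod or node that can reach the Service, calling is_ready("10.111.115.59") would play the role of the curl command, using the example ClusterIP from this guide.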

Autoscaling the Triton Inference Server deployment with Kubernetes

Horizontal Pod Autoscaler

The Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales the number of Pods in a Deployment, replication controller, or replica set based on that resource’s CPU utilization. Custom metrics such as GPU utilization are not readily available to the Horizontal Pod Autoscaler. Once GPU metrics are provided, the Triton Inference Server Deployment can autoscale based on custom metrics such as average GPU utilization and GPU duty cycle.
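The scaling decision follows the formula published in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The Python sketch below illustrates that calculation with the default 10 percent tolerance and clamping to the replica bounds. It is a simplified model; the function name and the sample numbers are assumptions, and the real controller also accounts for pod readiness and stabilization windows.

```python
import math


def desired_replicas(
    current: int,
    metric: float,
    target: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate the Horizontal Pod Autoscaler replica calculation."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One pod averaging 90 percent GPU utilization against a target of 40.
print(desired_replicas(1, 90, 40, min_replicas=1, max_replicas=3))  # 3
```

When the average metric falls well below the target, the same function returns a smaller count, which is how the deployment scales back down to the lower bound.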

The following components need to be installed on the cluster so that the Horizontal Pod Autoscaler can use GPU metrics.

  • Custom Metrics server

  • NVIDIA Data Center GPU Manager (DCGM) exporter service

  • Prometheus server

  • Prometheus adapter

Custom Metrics server

The metrics server, referred to in this guide as the custom metrics server, exposes metrics for the Horizontal Pod Autoscaler through the Kubernetes API server. Run the following command to install it (Kubernetes Metrics Server v0.4.1) on your Kubernetes cluster.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.4.1/components.yaml

NVIDIA DCGM Exporter Service

To gather GPU telemetry in Kubernetes, the NVIDIA Data Center GPU Manager (DCGM) is used. This suite of data center management tools allows you to manage and monitor GPU resources in an accelerated data center. Since the DevOps Engineer already installed the GPU Operator, the NVIDIA DCGM exporter service is already installed on the cluster.

To verify, run the following command:

kubectl get svc -A | grep dcgm

The following output verifies that the NVIDIA DCGM exporter service is installed.

nvidia@node1:~$ kubectl get svc -A | grep dcgm
gpu-operator-resources nvidia-dcgm-exporter ClusterIP 10.102.108.202 <none> 9400/TCP 4d6h

Prometheus Server

To expose cluster-level and node-level metrics, Prometheus is used. Prometheus, a Cloud Native Computing Foundation project, is a systems and service monitoring system. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays the results, and can trigger alerts when specific conditions are observed. Refer to the guide on the GPU Operator website to set up Prometheus on your cluster. To validate the installation, check that the Prometheus server is installed as a NodePort service and that you can access it from any node of your cluster. Access the server from a browser at http://$node_IP:30090.

../_images/cb-ti-08.png

Install Prometheus Adapter

The Prometheus adapter exposes the Prometheus metrics from the DCGM exporter through the Kubernetes custom metrics API. This makes the adapter suitable for use with the autoscaling/v2 Horizontal Pod Autoscaler in Kubernetes 1.16+. It can also replace the metrics server on clusters that already run Prometheus and collect the appropriate metrics.

To install the Prometheus adapter run the following command:

helm install --name-template prometheus-adapter --set rbac.create=true,prometheus.url=http://$PROMETHEUS_SERVICE_IP.prometheus.svc.cluster.local,prometheus.port=9090 prometheus-community/prometheus-adapter

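Look up the Prometheus service to fill in $PROMETHEUS_SERVICE_IP in the command above. Because the value forms the first label of a cluster DNS name, it is the name of the service rather than a numeric address. Run the following command to list the service: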

kubectl get svc -n prometheus -lapp=kube-prometheus-stack-prometheus

Verify if the Custom Metrics are Available to the Metrics Server

The GPU metrics from the NVIDIA DCGM exporter should now be available to the metrics server. You can verify this by running the following command.

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
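The response to this command is a JSON resource list, so it can be filtered with a short script as well as with jq and grep. The Python sketch below is an assumption-labeled illustration: metric_names is a hypothetical helper, and the sample document is a hand-written fragment shaped like the API response, not captured output.

```python
import json
from typing import List


def metric_names(raw: str, needle: str) -> List[str]:
    """Return the names of custom metrics whose name contains `needle`."""
    document = json.loads(raw)
    return [
        resource["name"]
        for resource in document.get("resources", [])
        if needle in resource["name"]
    ]


sample = json.dumps(
    {
        "kind": "APIResourceList",
        "resources": [
            {"name": "pods/DCGM_FI_DEV_MEM_COPY_UTIL"},
            {"name": "pods/DCGM_FI_DEV_GPU_UTIL"},
        ],
    }
)
print(metric_names(sample, "DCGM_FI_DEV_GPU_UTIL"))
# ['pods/DCGM_FI_DEV_GPU_UTIL']
```

In practice, the input string would be the text returned by the kubectl command rather than the sample.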

The following output verifies the metrics are available.

nvidia@node1:~/yaml$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_MEM_COPY_UTIL
    "name": "pods/DCGM_FI_DEV_MEM_COPY_UTIL",
    "name": "jobs.batch/DCGM_FI_DEV_MEM_COPY_UTIL",
    "name": "namespaces/DCGM_FI_DEV_MEM_COPY_UTIL",

Create a Horizontal Pod Auto Scaler Object

Within this guide, we will use hpa_gpu.yaml (shown below) to create a Kubernetes Horizontal Pod Autoscaler.

A couple of interesting points to note:

  • minReplicas is the lower bound on the number of pods the autoscaler can scale down to, and maxReplicas is the upper bound. The custom metric on which the pods are autoscaled is DCGM_FI_DEV_GPU_UTIL (GPU utilization, averaged across the pods). If it exceeds the target average value of 40 percent, a new pod is scheduled.

  • We are targeting the pods of the deployment we created before.

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bert-qa
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_FI_DEV_GPU_UTIL # Average GPU usage of the pod.
      targetAverageValue: 40

Using the YAML file above, run the following command to create an HPA object.

kubectl apply -f hpa_gpu.yaml

Check if the HPA deployment succeeded.

nvidia@node1:~/yaml$ kubectl get hpa
NAME      REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa   Deployment/bert-qa   0/40      1         2         1          4h18m

Generating Load to Show Auto Scaling

The last step is to test the autoscaling capabilities that we just deployed. We will create a perf client deployment that generates an artificial load, and then verify that pods are added automatically.

Within this guide, we will use the gpu_load.yaml (shown below) to create a perf client deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: perf-client
  labels:
    app: perf-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: perf-client
  template:
    metadata:
      labels:
        app: perf-client
    spec:
      containers:
        - name: serving
          image: vbagade/bert
          # Run perf_client through a shell; replace $TRITON_SERVICE_IP with the ClusterIP of the Triton Service.
          command: ["/bin/sh", "-c", "/workspace/install/bin/perf_client --max-threads 10 -m bert -x 1 -p 200000 -d -v -z -i gRPC -u $TRITON_SERVICE_IP:8001 -b 1 -l 100 -c 50"]

Note

The Triton service IP is the cluster IP of the Service we created as part of Create a Kubernetes Service to point to your deployment. Run kubectl get svc to get the IP.

nvidia@node1:~$ kubectl get svc
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
bert-qa                               ClusterIP   10.111.115.59   <none>        8001/TCP,8000/TCP,8002/TCP   4h39m

Cluster IP here is 10.111.115.59.

Before running the client, execute the following command to output the number of pods.

kubectl get pods

This output shows there is only one pod.

nvidia@node1:~/yaml$ kubectl get pods
NAME                                                          READY   STATUS    RESTARTS   AGE
bert-qa-74469c8b8-62472                                       1/1     Running   0          5h56m

Using the YAML file above, run the following command to start the client.

kubectl apply -f gpu_load.yaml

Execute the following command to output the number of pods again.

kubectl get pods

After the client is launched, the Kubernetes autoscaler kicks in: a new pod is being created alongside the existing pod.

nvidia@node1:~$ kubectl get pods
NAME                                                          READY   STATUS              RESTARTS   AGE
bert-qa-74469c8b8-62472                                       1/1     Running             0          97m
bert-qa-74469c8b8-h4zls                                       0/1     ContainerCreating   0          1s
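
To follow the scale-out without reading the table by eye, the pod list can be tallied by status. The following Python sketch is illustrative only; count_by_status is a hypothetical helper that parses the default kubectl get pods table, using the output above (with column padding shortened) as sample input.

```python
from collections import Counter

SAMPLE = """\
NAME                      READY   STATUS              RESTARTS   AGE
bert-qa-74469c8b8-62472   1/1     Running             0          97m
bert-qa-74469c8b8-h4zls   0/1     ContainerCreating   0          1s
"""


def count_by_status(table: str, prefix: str = "bert-qa") -> Counter:
    """Count pods whose name starts with `prefix`, grouped by the STATUS column."""
    counts: Counter = Counter()
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith(prefix):
            counts[fields[2]] += 1
    return counts


print(count_by_status(SAMPLE))
```

Running the tally repeatedly while the perf client is active should show the ContainerCreating entry turn into a second Running pod.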