Triton Management Service Deployment Guide

Triton Management Service (TMS) is a Kubernetes microservice, and expects to be deployed into a Kubernetes managed cluster. To facilitate its installation and deployment into your Kubernetes cluster, TMS provides a Helm chart.

To deploy TMS, the Helm command-line tool (helm) must be installed and the TMS Helm chart must be available on the local system. Additionally, the local user must have cluster administrator privileges.

Preparing Your Cluster

To run TMS, a properly-configured Kubernetes cluster is required. Depending on which TMS features you want to leverage and whether you plan to run inference on GPUs, you must install some dependencies in addition to the default installation.

As a baseline, production TMS installations are recommended to have at least two nodes:

  • one on which to run the API server and database

  • one on which to run inference

Typical deployments have many nodes on which to run inference.

Inference nodes require the ability to run large container images. The default images for Triton can exceed fourteen gigabytes. You must configure your cluster to handle container images of at least fourteen gigabytes.

Because an image of this size takes time to transfer, the first start of Triton on each node can be noticeably slower than later starts.

If you plan to run inference on GPUs, you must ensure that your inference nodes properly recognize the GPUs and list them as resources. You can verify recognition of the GPU by running kubectl describe node $NODE_NAME and seeing whether there is an entry with a key of nvidia.com/gpu in the Capacity and Allocatable sections. If your cluster is not already properly configured, see the documentation for the GPU operator or your cloud service provider.
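
The check described above can be scripted. The following is a minimal sketch, not part of TMS: the function name count_gpu_entries is hypothetical, and it assumes the usual "key: value" layout of the Capacity and Allocatable sections in kubectl describe node output.

```shell
# Count nvidia.com/gpu entries in `kubectl describe node` output read from stdin.
# A node that recognizes its GPUs is expected to show two entries:
# one under Capacity and one under Allocatable.
count_gpu_entries() {
  # Match only "key: value" lines so that labels and the
  # "Allocated resources" table are ignored. `|| true` keeps the exit
  # status at 0 when grep finds nothing (it still prints 0).
  grep -c '^[[:space:]]*nvidia\.com/gpu:' || true
}

# Usage (requires cluster access):
#   kubectl describe node "$NODE_NAME" | count_gpu_entries
```

A result of 0 suggests that the node does not advertise any GPUs yet.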

If your deployment requires the autoscaling feature, see the autoscaling section.

For the specifics about the versions of Kubernetes and other tools with which TMS was tested, see the release notes for the version of TMS you are deploying.

Obtaining TMS Helm Chart

The TMS Helm chart can be downloaded from NVIDIA NGC using the following command:

helm fetch https://helm.ngc.nvidia.com/nvaie/charts/triton-management-service-1.1.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>

You can extract the values.yaml file from the downloaded chart’s TAR file using the following command:

helm show values triton-management-service-1.1.0.tgz > values.yaml

This creates a values.yaml file in the current directory, which you can modify to meet your deployment needs.

See Helm Chart Values for a listing of the configurable values.

Configuring the API Server Pod

By default, TMS requests minimal CPU and memory resources from Kubernetes to run the pod containing the API server and database. While this works for initial testing of TMS’s features and for smaller, more stable deployments, it is likely to be insufficient if many clients are expected to be making concurrent API calls. It is highly recommended that system administrators change the default settings.

To change the default settings, use the configuration options in server.resources in the values.yaml file. The API server requires relatively little CPU and memory compared to the database. For that reason, it is recommended that the database initially be allocated 75% of the available resources, and the API server the other 25%. The following sample configuration applies the 75%/25% allocation on a node with 8 CPUs and 16Gi of memory:

resources:
  apiServer:
    cpu: 2
    memory: 4Gi
  database:
    cpu: 6
    memory: 12Gi
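
For nodes of other sizes, the same 25%/75% arithmetic can be sketched as a small shell helper. The function name split_server_resources is hypothetical and not part of TMS. It assumes whole CPUs and memory in whole Gi, and it rounds the API server share down so that the database receives the remainder.

```shell
# Hypothetical helper: print a 25%/75% resources block for a node.
# $1 = whole CPUs on the node, $2 = memory on the node in Gi.
split_server_resources() {
  total_cpu=$1
  total_mem_gi=$2
  # Integer division rounds the API server share down.
  api_cpu=$(( total_cpu / 4 ))
  api_mem=$(( total_mem_gi / 4 ))
  # The database gets everything that is left.
  db_cpu=$(( total_cpu - api_cpu ))
  db_mem=$(( total_mem_gi - api_mem ))
  printf 'resources:\n'
  printf '  apiServer:\n    cpu: %s\n    memory: %sGi\n' "$api_cpu" "$api_mem"
  printf '  database:\n    cpu: %s\n    memory: %sGi\n' "$db_cpu" "$db_mem"
}

# Example: an 8-CPU node with 16Gi of memory.
split_server_resources 8 16
```

On very small nodes the API server share can round down to 0, so review the output before copying it into values.yaml.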

Kubernetes Secrets

This section covers the basics of setting up secrets in Kubernetes for TMS.

Creation of Kubernetes secrets requires sufficient cluster privileges. If you lack sufficient privileges, ask a cluster administrator to create them on your behalf.

Container Pull Secrets

The TMS Helm chart includes any secrets listed under values.yaml#images.secrets. The default values.yaml file contains an example secret named “ngc-container-pull”.

To create an image-pull secret, use:

kubectl create secret docker-registry <secret-name> --docker-server=<docker-server-urn> --docker-username=<username> --docker-password=<password>

Then add the value of <secret-name> to the values.yaml#images.secrets list.
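
As an illustration, the command above could be filled in as follows for the example secret from the default values.yaml. This is a sketch under assumptions: it assumes the images are pulled from the NGC registry at nvcr.io, that the user name is the literal string $oauthtoken (as in the helm fetch command earlier), and that an environment variable named NGC_API_KEY (a name chosen here for illustration) holds your API key.

```shell
# Sketch: create the image-pull secret named in the default values.yaml.
# Assumptions: registry is nvcr.io; NGC_API_KEY holds your NGC API key.
create_ngc_pull_secret() {
  # Stop with an error if the API key variable is unset or empty.
  : "${NGC_API_KEY:?set NGC_API_KEY first}"
  # Single quotes keep $oauthtoken literal; it is not a shell variable.
  kubectl create secret docker-registry ngc-container-pull \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY"
}

# Usage (requires cluster access):
#   NGC_API_KEY=<YOUR API KEY> create_ngc_pull_secret
```

Adjust the registry and credentials if your images come from a different source.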

Configuring Model Repositories

To connect to a model repository, see the model repository page.

Configuring Autoscaling

To enable and configure autoscaling, see the separate autoscaling configuration guide.

Configuring Triton Containers

TMS allows the TMS administrator to configure some aspects of the containers that are created for Triton instances. These can be configured using the top-level triton object in values.yaml.

Resource constraints are all listed under resources. TMS admins can specify the default resources that Triton containers get, and the limits.minimum and limits.maximum values that bound what users can request on a per-lease basis.

A sample configuration is:

triton:
  resources:
    default:
      cpu: 2
      gpu: 1
      systemMemory: 4Gi
      sharedMemory: 256Mi
    limits:
      minimum:
        cpu: 1
        gpu: 1
        systemMemory: 1Gi
        sharedMemory: 128Mi
      maximum:
        cpu: 4
        gpu: 2
        systemMemory: 8Gi
        sharedMemory: 512Mi

The fields in default, minimum, and maximum sections are defined as follows:

  • Each value in the maximum section must be at least as large as the corresponding default and minimum values.

  • Each value in the minimum section must be no larger than the corresponding default and maximum values.

  • cpu: The number of whole or fractional CPUs assigned to Triton. Can be specified as a number of cores (for example, 4), or a number followed by m, which represents milli-CPUs (for example, 1500m).

    • Minimum value: 1 (or 1000m).

    • Default: 2

  • gpu: The number of whole GPUs assigned to Triton. Must be a whole number. GPUs cannot be fractionally assigned.

    • Minimum value: 0

    • Default: 1

  • repositorySize: The amount of disk space allocated for the Triton model repository, as a number plus units (for example, 4Gi).

    • Units allowed: Mi, Gi, Ti

    • Minimum value: 256Mi

    • Default: 2Gi

  • systemMemory: The amount of system memory, as a number plus units (for example, 4Gi).

    • Units allowed: Ki, Mi, Gi, Ti

    • Minimum value: 256Mi, and at least 128Mi more than sharedMemory

    • Default: 4Gi

  • sharedMemory: The amount of shared memory, as a number plus units (same units as systemMemory).

    • Minimum value: 32Mi

    • Default: 256Mi

    Note

    Some backends (for example, PyTorch) allow you to use shared memory to allocate tensors. If you plan on using this, you must set a higher value.
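
The memory rules above can be checked before installing. The following is a minimal sketch, not a TMS tool: the helper names to_ki, check_memory, and check_order are hypothetical, and the sketch handles whole-number quantities only (for example, 4Gi but not 1.5Gi).

```shell
# Convert a quantity such as 256Mi or 4Gi to an integer number of Ki.
to_ki() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *Ti) echo $(( ${1%Ti} * 1024 * 1024 * 1024 )) ;;
    *) echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# check_memory <systemMemory> <sharedMemory>
# Applies the documented minimums and the 128Mi gap between the two values.
check_memory() {
  sys=$(to_ki "$1") || return 1
  shm=$(to_ki "$2") || return 1
  [ "$sys" -ge $(( 256 * 1024 )) ] || { echo "systemMemory below 256Mi"; return 1; }
  [ "$shm" -ge $(( 32 * 1024 )) ] || { echo "sharedMemory below 32Mi"; return 1; }
  [ $(( sys - shm )) -ge $(( 128 * 1024 )) ] || {
    echo "systemMemory must exceed sharedMemory by at least 128Mi"; return 1; }
  echo ok
}

# check_order <minimum> <default> <maximum>
# Succeeds when minimum <= default <= maximum.
check_order() {
  lo=$(to_ki "$1") || return 1
  mid=$(to_ki "$2") || return 1
  hi=$(to_ki "$3") || return 1
  [ "$lo" -le "$mid" ] && [ "$mid" -le "$hi" ]
}

# Example: the default values from the sample configuration.
check_memory 4Gi 256Mi
```

For example, check_order 1Gi 4Gi 8Gi succeeds for the systemMemory values in the sample configuration, while check_memory 256Mi 256Mi fails because the two values leave no 128Mi gap.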

Configuring Persisted Database

To enable and configure TMS to persist database contents, a volume claim bound to a sufficiently large Kubernetes persistent volume must be specified in values.yaml#server.databaseStorage.volumeClaimName.

If a server failure or restart happens, TMS can reload the contents of the database from this volume.

Server performance can be affected by slow or unreliable storage solutions used for the persisted volume.
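
If no suitable volume claim exists yet, one way to create it is sketched below. The helper name create_tms_db_pvc, the claim name, and the size are placeholders chosen for illustration. The sketch assumes the cluster has a default storage class that can satisfy a ReadWriteOnce claim; your cluster may need a storageClassName or a different access mode.

```shell
# Hypothetical helper: create a PersistentVolumeClaim for the TMS database.
# $1 = claim name, $2 = requested size (for example, 10Gi).
create_tms_db_pvc() {
  claim_name=$1
  size=$2
  # The manifest is passed to kubectl on stdin.
  kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${claim_name}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage (requires cluster access):
#   create_tms_db_pvc tms-database 10Gi
```

After the claim is bound, set server.databaseStorage.volumeClaimName in values.yaml to the claim name (tms-database in this example).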

Deploying TMS

Assuming you’ve followed all the steps of this deployment guide, use the following command to install and deploy TMS:

helm install <name-of-tms-installation> -f values.yaml triton-management-service-1.1.0.tgz
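
As an optional precaution, the install can be simulated first and its status checked afterward. The following is a sketch only: the function name deploy_tms is hypothetical, and it assumes values.yaml is in the current directory. It relies on the --dry-run flag of helm install and on helm status, both standard Helm commands.

```shell
# Hypothetical wrapper: simulate the install, then install and report status.
# $1 = release name, $2 = path to the chart archive.
deploy_tms() {
  release=$1
  chart=$2
  # Simulate the install first; Helm renders the chart but creates nothing.
  helm install "$release" -f values.yaml "$chart" --dry-run >/dev/null || return 1
  # Perform the real install only if the dry run succeeded.
  helm install "$release" -f values.yaml "$chart" || return 1
  # Show the state of the new release.
  helm status "$release"
}

# Usage (requires cluster access):
#   deploy_tms <name-of-tms-installation> <path-to-chart.tgz>
```

Running kubectl get pods afterward shows whether the pod that contains the API server and database has started.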

Security Considerations

The Kubernetes cluster where TMS is installed must be properly secured according to best practices and the security posture of your organization.

Any additional, optional services connected to TMS such as Prometheus and Prometheus adapter must also be secured. We recommend that the cluster administrator properly secure access to any S3 or other external model repositories which TMS uses. We recommend leveraging encryption in transit and at rest, scoping access to cluster resources following the principle of least privilege, and configuring audit logging for your cluster.

The TMS default configuration does not allow connections from outside of the Kubernetes cluster. You assume responsibility for securing any external connections when changing the default configuration values.

© Copyright 2023, NVIDIA. Last updated on Dec 11, 2023.