Deployment Guide (1.1.0)

Helm Chart Values

The TMS Helm chart includes a values.yaml file that defines all of the available deployment configuration options.

TMS configuration is broken into four sections:

  • images: Container image information used by TMS.

    • server: Name of the container image containing TMS Server. (required)

    • sidecar: Name of the container image containing TMS Triton Sidecar. (required)

    • triton: Name of the container image containing Triton Inference Server. (required)

    • mongodb: Name of the container image containing MongoDB database used by TMS Server. (required)

    • rest: Name of the container image containing TMS HTTP API Server. (optional)

    • secrets: Names of Kubernetes secrets used to pull container images during pod deployment.
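
As a rough sketch, the images section of values.yaml might look like the following. The key names follow the list above; the registry paths, tags, and secret name are placeholders, and the list form of secrets is an assumption to check against the values.yaml shipped with your chart.

```yaml
images:
  # Placeholder image references; substitute the ones for your release.
  server: registry.example.com/tms/server:example-tag
  sidecar: registry.example.com/tms/triton-sidecar:example-tag
  triton: registry.example.com/tritonserver:example-tag
  mongodb: registry.example.com/mongodb:example-tag
  # Assumed to be a list of image pull secret names (placeholder value).
  secrets:
    - example-registry-pull-secret
```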

  • kubernetes: Configuration options affecting how TMS deploys objects with Kubernetes. (optional)

    • customAnnotations: Custom annotations added to the metadata of pods deployed by TMS.

    • customLabels: Custom labels added to the metadata of pods deployed by TMS.

    • partOf: Name of a higher level application that TMS is a part of, applied as a label ‘app.kubernetes.io/part-of’.
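
A possible kubernetes section is sketched below. It assumes that customAnnotations and customLabels accept key/value maps, which the list above does not state explicitly; all keys and values shown are placeholders.

```yaml
kubernetes:
  # Assumed to be key/value maps; entries are placeholders.
  customAnnotations:
    example.com/owner: ml-platform
  customLabels:
    environment: staging
  # Applied to pods as the app.kubernetes.io/part-of label.
  partOf: example-inference-platform
```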

  • server: Configuration options related to how a TMS Server is deployed and operates.

    • apiService: Configuration options related to the network services provided by the TMS Server.

      • port: Port to use to connect to the gRPC API service when external to the Kubernetes cluster (default: 30345).

        External ports must be in the range [30000, 32767].

      • type: Type of Kubernetes service connection used by the server to provide network API services (default: ClusterIP).

        Valid options are ExternalName, ClusterIP, NodePort, and LoadBalancer.
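
For illustration, the fragment below exposes the gRPC API outside the cluster using the documented default port. It is a sketch of the apiService options only, not a complete server section.

```yaml
server:
  apiService:
    # One of ExternalName, ClusterIP, NodePort, or LoadBalancer.
    type: NodePort
    # External ports must be in the range [30000, 32767]; 30345 is the default.
    port: 30345
```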

    • resources: Defines the computing and memory resources allocated and reserved for the pod hosting the API server and database. If you expect many concurrent requests to the API server from many different clients, changing these values is recommended. Finding the best values for your situation can require several adjustments. Dedicating 25% of the resources to the API server and 75% to the database is a typical starting point.

      • apiServer: Defines the resources allocated and reserved for the API server’s container.

        • cpu: The number of CPUs to be allocated and reserved for the API server container (default: 0). If 0, no CPUs are requested, but a limit of 1 is set. If it is any other value, that value is used as the request and limit.

        • memory: The amount of memory to be allocated and reserved for the API server container (default: 1Gi). Must be a number with memory units (for example, Mi, Gi).

      • database: Defines the resources allocated and reserved for the database container.

        • cpu: The number of CPUs to be allocated and reserved for the database container (default: 1). This is used for the request and limit.

        • memory: The amount of memory to be allocated and reserved for the database container (default: 2Gi). Must be a number with memory units (for example, Mi, Gi).
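
To make the 25% / 75% starting point concrete, the sketch below assumes a hypothetical budget of 4 CPUs and 8Gi of memory for the pod. The budget is an illustrative assumption, not a recommendation for any particular workload.

```yaml
server:
  resources:
    apiServer:
      cpu: 1        # 25% of the assumed 4-CPU budget
      memory: 2Gi   # 25% of the assumed 8Gi budget
    database:
      cpu: 3        # 75% of the assumed 4-CPU budget
      memory: 6Gi   # 75% of the assumed 8Gi budget
```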

    • lease: Configuration options for the creation and management of leases.

      • timeout: Configures the amount of time a lease is allowed to attempt loading before timing out.

      • duration: Configuration options related to lease durations.

        • initial: Configuration options related to initial requested duration of leases.

          • default: Default requested duration of a lease (default: 10m).

          • maximum: Maximum requested duration of a lease (default: 30m).

        • renewal: Configures the options related to requested renewal duration of leases.

          • default: Default requested renewal duration of a lease (default: 10m).

          • maximum: Maximum requested renewal duration of a lease (default: 30m).

      • automaticRenewal: Configuration options related to automatic renewal of leases. (optional)

        • enabled: Determines if the service supports automatically renewed leases or not (default: true).

          When not enabled, leases are not allowed to request automatic renewal.

        • window: Configuration options related to the lease last active time and the eligibility for automatic renewal.

          • default: Default amount of time since a lease was last active that it is still eligible for automatic renewal (default: 5m).

          • maximum: Maximum amount of time since a lease has last been active that it is still eligible for automatic renewal (default: 5m).

      • databaseStorage: Configuration option that defines persistent storage for the server’s database.

        • volumeClaimName: Kubernetes persistent volume claim (pvc) attached to the volume where TMS server’s database is stored.

      • shareTriton: Configuration options related to the sharing of Triton Server instances by leases. (optional)

        • enabled: Determines if the service supports the sharing of Triton Server instances by leases or not (default: false).

        • byDefault: Default value applied to lease requests when not specified (default: false).
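
The lease duration and automatic renewal options described above could be combined as in the following sketch. The durations are example values chosen so that each default stays at or below its maximum; only the minute suffix used by the documented defaults is shown.

```yaml
server:
  lease:
    duration:
      initial:
        default: 15m   # used when a lease request does not specify a duration
        maximum: 60m   # longest initial duration a lease may request
      renewal:
        default: 10m
        maximum: 30m
    automaticRenewal:
      enabled: true
      window:
        default: 5m    # documented default
        maximum: 5m    # documented default
```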

    • modelRepositories: Configuration options related to model repositories with models available to instances of Triton.

      • s3: Model repositories that contain models stored in an S3 bucket.

        Access is managed by the ARN specified by server.security.aws.role.

        • repositoryName: Name used to reference this model repository as part of lease acquisition.

          Can contain only lowercase alphanumeric characters and hyphens (-), without spaces.

        • bucketName: Name of the S3 bucket used to fetch models.

        • awsRegion: Region code of the S3 bucket.

          Must be a valid code designating an existing AWS region (for example, “us-west-2”).

          For additional information, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html.

        • endpoint: Service URL of the S3 bucket.

          When both ‘endpoint’ and ‘awsRegion’ fields are specified, the ‘endpoint’ value is used instead of the awsRegion value.

          Must be a valid URL designating an existing endpoint (for example, “http://s3.us-west-2.amazonaws.com” or “http://play.min.io:9000”).

          For additional information, see https://docs.aws.amazon.com/general/latest/gr/s3.html#amazon_s3_website_endpoints.

        • accessKey: Name of the Kubernetes secret to read and provide as the access key ID for downloading objects from the S3 bucket.

          Optional; used when neither IAM nor the default AWS environment variables authorize TMS to read from the S3 bucket.

        • accessSecret: Name of the Kubernetes secret containing the secret access key to read from the S3 bucket.

          Optional; used when neither IAM nor the default AWS environment variables authorize TMS to read from the S3 bucket.

      • https: Model repositories that provide models as compressed archive downloads through a web-service using HTTP GET.

        • secretName: Name of the Kubernetes secret to read and provide as an Authorization header for download requests.

        • targetUri: URL of the remote web server in <domain_label_or_ip_address>/<path> format, used to determine if secrets apply to a model request or not.

      • volumes: Model repositories that contain models stored in a file-system-like structure.

        • repositoryName: Name used to reference this model repository as part of lease acquisition.

          Can contain only lowercase alphanumeric characters and hyphens (-), without spaces.

        • volumeClaimName: Kubernetes persistent volume claim (pvc) used to fetch models.
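
The three repository types might be declared together as sketched below. This assumes that each type takes a list of entries, which the descriptions above imply but do not state; every repository, bucket, secret, host, and claim name is a placeholder.

```yaml
server:
  modelRepositories:
    s3:
      - repositoryName: example-s3-repo       # lowercase alphanumerics and hyphens only
        bucketName: example-model-bucket
        awsRegion: us-west-2
        # Optional Kubernetes secret names (placeholders).
        accessKey: example-s3-access-key
        accessSecret: example-s3-access-secret
    https:
      - targetUri: models.example.com/archives  # <domain_label_or_ip_address>/<path>
        secretName: example-https-auth
    volumes:
      - repositoryName: example-volume-repo
        volumeClaimName: example-models-pvc
```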

    • autoscaling: Configuration options related to autoscaling Triton instances. If this section is missing, autoscaling is disabled. (optional)

      • enabled: Determines if the TMS Server supports autoscaling Triton instances or not (default: false).

        When not enabled, requests for autoscaling leases are not allowed.

      • replicas: Configuration options related to replication of autoscaling Triton instances.

        • default: Values used for autoscaling Triton instances when values are not provided during lease acquisition.

          • maximum: The maximum number of replicas. Must be within the limits specified in the “limits” section (default: 5).

          • minimum: The minimum number of replicas. Must be within the limits specified in the “limits” section (default: 1).

            Must be a positive integer.

        • limits: Defines the limits imposed on the number of replicas that can be requested for a lease.

          • maximum: The maximum number of replicas (default: 10). Must be a non-negative number greater than or equal to maximum-idle.

          • maximum-idle: The maximum number of idle instances that are allowed (default: 1). That is, the maximum value for the minimum number of replicas a user can request. Must be a non-negative number less than or equal to maximum.

          • minimum: The minimum number of replicas (default: 1). Must be a non-negative number less than or equal to maximum-idle.

      • metrics: Configuration options related to how metrics are used by autoscaling Triton instances to determine availability and scale.

        At least one of the following metrics must be enabled when support for autoscaling Triton instances is enabled:

        • cpuUtilization: Metric used to determine scaling based on CPU utilization.

          • allowed: Determines whether autoscaling based on CPU utilization is allowed (default: false).

          • enabled: Determines if scaling based on CPU utilization is enabled by default (default: false).

          • threshold: Threshold, expressed as a percentage, used to determine scaling (default: 90).

            Must be a positive integer in the exclusive range (0, 100).

            • default: Default value used for the threshold, as a percentage (default: 90).

            • minimum: Minimum value for the threshold, as a percentage (default: 50).

            • maximum: Maximum value for the threshold, as a percentage (default: 100).

        • gpuUtilization: Metric used to determine scaling based on GPU utilization.

          • allowed: Determines whether autoscaling based on GPU utilization is allowed (default: false).

          • enabled: Determines if scaling based on GPU utilization is enabled by default (default: false).

          • threshold: Threshold, expressed as a percentage, used to determine scaling (default: 90).

            Must be a positive integer in the exclusive range (0, 100).

            • default: Default value used for the threshold, as a percentage (default: 90).

            • minimum: Minimum value for the threshold, as a percentage (default: 50).

            • maximum: Maximum value for the threshold, as a percentage (default: 100).

        • queueTime: Metric used to determine scaling, where scaling is based on Triton inference-query queue times.

          • allowed: Determines whether autoscaling based on Triton inference-query queue times is allowed (default: false).

          • enabled: Determines if scaling based on Triton inference-query queue times is enabled by default (default: false).

          • threshold: Threshold, in microseconds, used to determine scaling.

            • default: Default value for the threshold, as a time in microseconds (default: 10000).

            • minimum: Minimum value for the threshold, as a time in microseconds (default: 10000).

            • maximum: Maximum value for the threshold, as a time in microseconds, with 0 and negative numbers meaning no limit (default: 0).
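
As one possible configuration, the sketch below enables autoscaling with GPU utilization as the only metric. The replica values respect the documented ordering (minimum ≤ maximum-idle ≤ maximum, with the replica defaults inside the limits); the threshold values are examples chosen to stay inside the (0, 100) range.

```yaml
server:
  autoscaling:
    enabled: true
    replicas:
      default:
        minimum: 1
        maximum: 5
      limits:
        minimum: 1
        maximum-idle: 1
        maximum: 10
    metrics:
      # At least one metric must be enabled when autoscaling is enabled.
      gpuUtilization:
        allowed: true
        enabled: true
        threshold:
          default: 80   # example value, as a percentage
          minimum: 50
          maximum: 95   # example value
```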

    • metrics: Configuration options for the collection and reporting of runtime metrics by the TMS Server.

      • verbosity: Verbosity (volume of total metrics) of metrics collected and reported (default: 0). (optional)

        Must be in the range [0, 3].

      • reportingWindow: Period of time from the time of request used when determining metric values reported (default: 60s).

      • port: Port used to connect to the metrics service when external to the Kubernetes cluster (default: 30543).

        Must be in the range [30000, 32767].

      • models: Configuration options controlling model deployment (fetching, loading into Triton, etc.) metrics collection.

        • verbosity: Verbosity (volume of total metrics) of metrics collected and reported (default: 0).

          Must be in the range [0, 3].

        • reportingWindow: Interval at which the model metrics are pushed from the Triton sidecar to TMS Server (default: 15s).
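
A metrics section using the documented default ports and windows could look like this sketch; the verbosity level of 1 is an arbitrary example within the allowed range.

```yaml
server:
  metrics:
    verbosity: 1           # must be in the range [0, 3]
    reportingWindow: 60s   # documented default
    port: 30543            # must be in the range [30000, 32767]
    models:
      verbosity: 1         # must be in the range [0, 3]
      reportingWindow: 15s # documented default
```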

    • security: Configuration options related to the Transport Layer Security (TLS) connection encryption and security.

      • aws: Configuration options for instances deployed using Amazon EKS.

        • role: AWS IAM role used to read model S3 buckets configured in server.modelRepositories.s3.

      • tls: Configuration options related to the Transport Layer Security (TLS) connection encryption.

        • enabled: Determines if TLS is enabled or not (default: false).

          When enabled, TMS provisions a certificate issuer as part of its deployment. The issuer is used to issue TLS certificates for each Triton Inference Server instance deployed by TMS.

        • certManager: Configuration options related to cert-manager supplied TLS certificates that are used to encrypt network traffic.

          TMS manages and applies certificates for TLS based secure communications using cert-manager.

          • group: Kubernetes resource group of the CA issuer to use when creating service certificates (default: cert-manager.io).

          • kind: Kubernetes resource kind of the CA issuer to use when creating service certificates (default: ClusterIssuer).

          • name: Name of the issuer to use when creating service certificates.

          • privateKey: Configuration options related to the creation of certificate private keys.

            • algorithm: Algorithm of the private key for the certificate (default: RSA).

              Supported values are RSA, ECDSA, or Ed25519.

            • size: Size, in bits, of the corresponding private key for the certificate (default: 4096).

              Supported values depend on the value of algorithm:

              • RSA: 2048, 4096 or 8192

              • ECDSA: 256, 384 or 521

              • Ed25519: (property is ignored)
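
The TLS options above might be set as in the following sketch, which pairs the ECDSA algorithm with one of its supported key sizes. The issuer name is a placeholder and assumes that a matching cert-manager ClusterIssuer already exists in the cluster.

```yaml
server:
  security:
    tls:
      enabled: true
      certManager:
        group: cert-manager.io       # documented default
        kind: ClusterIssuer          # documented default
        name: example-ca-issuer      # placeholder issuer name
        privateKey:
          algorithm: ECDSA
          size: 384                  # ECDSA supports 256, 384, or 521
```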

    • traceLevel: Configures the verbosity of the logging produced by the server. (optional)

      Typically, TMS produces logs for Kubernetes to collect through standard output and standard error when this value is not provided.

  • triton: Configuration options related to the deployment of Triton Inference Server.

    Values can be customized based on capacity of your cluster’s hardware and expected workload characteristics.

    • resources: Configuration options related to default and maximum resource requests per Triton instance.

      • default: Values used to determine the resources assigned to a Triton instance when not provided during lease acquisition.

        • cpu: Number of logical CPU cores to assign to a Triton instance (default: 2).

          Must be a positive integer.

        • gpu: Number of logical GPU devices to assign to a Triton instance (default: 1).

          Must be a positive integer.

        • sharedMemory: Amount of a Triton instance’s memory to reserve for shared-memory (default: 256Mi).

          Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.

        • systemMemory: Amount of main memory to assign to a Triton instance (default: 4Gi).

          Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.

      • limits: Range restrictions on resources allowed to be assigned to a Triton instance.

        • minimum: Minimum resources allowed to be assigned to a Triton instance.

          • cpu: Number of logical CPU cores to assign to a Triton instance (default: 2).

            Must be a positive integer.

          • gpu: Number of logical GPU devices to assign to a Triton instance (default: 1).

            Must be a positive integer.

          • sharedMemory: Amount of a Triton instance’s memory to reserve for shared-memory (default: 128Mi).

            Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.

          • systemMemory: Amount of main memory to assign to a Triton instance (default: 1Gi).

            Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.

        • maximum: Maximum resources allowed to be assigned to a Triton instance.

          • cpu: Number of logical CPU cores to assign to a Triton instance (default: 16).

            Must be a positive integer.

          • gpu: Number of logical GPU devices to assign to a Triton instance (default: 4).

            Must be a positive integer.

          • sharedMemory: Amount of a Triton instance’s memory to reserve for shared-memory (default: 2Gi).

            Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.

          • systemMemory: Amount of main memory to assign to a Triton instance (default: 32Gi).

            Must be a positive integer, followed by a scale suffix of Ki, Mi, or Gi.
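
For a cluster with less capacity than the defaults assume, the triton section could be narrowed as sketched below. The maximum values are illustrative assumptions; the defaults and minimums shown are the documented ones, and each default stays between its minimum and maximum.

```yaml
triton:
  resources:
    default:
      cpu: 2
      gpu: 1
      sharedMemory: 256Mi
      systemMemory: 4Gi
    limits:
      minimum:
        cpu: 2
        gpu: 1
        sharedMemory: 128Mi
        systemMemory: 1Gi
      maximum:
        cpu: 8             # example value, lower than the default of 16
        gpu: 2             # example value, lower than the default of 4
        sharedMemory: 1Gi  # example value
        systemMemory: 16Gi # example value
```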

© Copyright 2023, NVIDIA. Last updated on Dec 11, 2023.