Deployment Guide

The NVIDIA Cloud Native Service Add-on Pack is a set of packaged components for AI Workflows. These include enterprise-ready implementation examples for authentication, monitoring, reporting, and load balancing, while leaving a path for you to deviate where your requirements differ.

These packaged components follow guidelines for enterprise production requirements and serve as standard building blocks, compatible with NVIDIA’s AI frameworks, for building and deploying AI solutions as microservices. The guidelines generally fall within the following categories:

  • Deployment and Orchestration

    • OCI-Compliant Container Images

    • Liveness/Readiness/Startup Probes

    • Security and Vulnerability Scanning/Patching

  • Security

    • OIDC/OAuth2 User Authentication

    • External Secrets Management

    • Secure API Endpoints

  • Networking

    • Ingress Control

    • Proxy Sidecar

  • Logging and Reporting

    • OpenTelemetry Protocol (OTLP) monitoring

    • OTLP support within application containers

    • Log Aggregation

The packaged components within AI Workflows also include the AI framework specific to your use case, which is delivered as an OCI-compliant base container image. The following graphic illustrates the additional opinionated components which are included within NVIDIA AI Workflows to meet the above guideline requirements:

[Figure: opinionated components included within NVIDIA AI Workflows]

Keycloak

Keycloak is an open-source identity provider that supplies standard OIDC/OAuth2 user management and authentication. It can also federate with other identity providers, allowing customers to connect their existing environments to the Workflows with minimal additional development.

Cert-Manager

Cert-Manager is deployed with a custom Certificate Authority that generates and rotates the HTTPS certificates used by applications and AI Workflow components.

Trust Manager

Trust Manager is deployed to inject the Certificate Authority’s public key into AI Workflow or application namespaces.

Ingress Controller

An HAProxy ingress controller is deployed with a wildcard DNS certificate to manage access to the services deployed within the cluster.
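
As an illustrative sketch, an application could expose itself under the wildcard domain with a standard Kubernetes Ingress resource like the following; the ingressClassName, host, and service values here are assumptions for this example, not part of the add-on pack:

# Illustrative only: route a hostname under the wildcard domain to a service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-namespace
spec:
  ingressClassName: haproxy
  rules:
    - host: my-app.my-cluster.my-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 8080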

Prometheus

A Prometheus operator and a centralized Prometheus service have been deployed on the cluster to scrape metrics from the application services and provide an OTLP-compliant monitoring system and database.
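
As a sketch, an application service can surface its metrics to this Prometheus instance through the standard Prometheus Operator ServiceMonitor resource. The names, labels, and scrape port below are illustrative, and whether the centralized instance selects ServiceMonitors from your namespace depends on its configuration:

# Illustrative only: scrape a service that exposes /metrics on a named port
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics   # named port on the Service that serves /metrics
      interval: 30s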

Grafana

A Grafana operator and a centralized Grafana service are deployed and can be used to create and host dashboards visualizing the appropriate metrics and monitoring data for the particular use case or AI Workflow. The centralized Grafana service is connected to the centralized Prometheus server by default.

Postgres Operator

The Postgres Operator from Crunchy Data has been deployed for creating relational databases. One Postgres database is instantiated and used to back Keycloak.
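
As a rough sketch, an additional database could be requested through the operator’s PostgresCluster custom resource along these lines; the names and sizes are illustrative, and the exact required fields vary by operator version:

# Illustrative only: request a small Postgres cluster from the operator
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: my-db
  namespace: my-namespace
spec:
  postgresVersion: 14
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1G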

Elastic Operator

The Elastic Operator for Elasticsearch, Kibana, and other Elastic tools has been deployed on the cluster. No Elastic services are configured out of the box, but they can be provisioned by applications or AI Workflows if needed.
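
For example, a minimal single-node Elasticsearch cluster could be requested through the operator’s custom resource roughly as follows; the version, sizing, and names are illustrative:

# Illustrative only: request a single-node Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: my-logs
  namespace: my-namespace
spec:
  version: 8.6.0
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false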

Note

Each user is responsible for checking the content and the applicable licenses of third party software and determining if they are suitable for the intended use.

To deploy NVIDIA Cloud Native Service Add-on Pack, the following requirements must be met:

DNS

A wildcard DNS A record must be created for the system along with the DNS A record for the system itself. Reverse lookup PTR records should also exist for both entries. Both DNS records should be resolvable within and outside of the system.
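
A quick way to sanity-check the records, assuming the example domain used later in this guide (203.0.113.10 stands in for your system’s actual address):

# Forward lookups: the host itself and an arbitrary name under the wildcard
dig +short my-cluster.my-domain.com
dig +short anything.my-cluster.my-domain.com

# Reverse lookup for the system's address
dig +short -x 203.0.113.10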


Kubernetes

The Cloud Native Service Add-on Pack requires a Kubernetes (K8s) cluster to deploy to. A K8s distribution such as NVIDIA Cloud Native Stack must be available before deployment.

Networking

This guide assumes that the cluster will be externally accessible through ports 22 and 443. Additional ports may be required for your specific use case.
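
To verify external reachability, a check along these lines can be run from a machine outside the cluster (the hostname is illustrative):

# Confirm SSH and HTTPS are reachable from outside the cluster
nc -zv my-cluster.my-domain.com 22
nc -zv my-cluster.my-domain.com 443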

Storage

A storage class must be available on the K8S cluster for the Cloud Native Service Add-on Pack to be configured to use. A simple local storage solution such as Local Path Provisioner can be used. Alternatively, storage classes provided by third-party Kubernetes-based platforms such as Red Hat OpenShift or VMware Tanzu will also work.
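
As a sketch, Local Path Provisioner can be installed and the resulting storage class confirmed as follows; the manifest URL reflects the upstream project and may change:

kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
kubectl get storageclass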

NVIDIA AI Enterprise

Since NVIDIA AI Workflows are available on NVIDIA NGC for NVIDIA AI Enterprise software customers, you must have an NVIDIA AI Enterprise entitlement to pull down the resources required for the workflow:

Warning

NVIDIA AI Enterprise licensing is required for accessing AI Workflow resources.

Note

Cloud service providers may include licenses through on-demand NVIDIA AI Enterprise instances.

  1. Ensure that the prerequisite requirements are met. Instructions to provision an example cluster based on NVIDIA Cloud Native Stack can be found in the AI Workflow Guides, or NVIDIA Cloud Native Stack can be manually installed using the guides found here: https://github.com/NVIDIA/cloud-native-stack.

  2. Ensure that a kubeconfig is available and set via the KUBECONFIG environment variable or your user’s default location (.kube/config).
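
    For example, a minimal sketch (the path shown is the default location):

    export KUBECONFIG=$HOME/.kube/config
    kubectl get nodes   # confirm the cluster is reachable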

  3. Download the NVIDIA Cloud Native Service Add-on Pack from the Enterprise Catalog (available here) onto the instance you have provisioned.

    ngc registry resource download-version "nvaie/nvidia_cnpack:0.2.1"

    Note

    If you have not yet installed and set up the NGC CLI with your API key, please do so before downloading the resource. Instructions can be found here.
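
    As a rough sketch, first-time CLI configuration typically looks like the following, which interactively prompts for your API key, org, and team:

    ngc config set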


  4. Navigate to the installer’s directory using the following command:

    cd nvidia_cnpack_v0.2.1


  5. Create a config file for the installation using the following template. Ensure that the environment-specific fields, such as wildcardDomain and storageClassName, are customized for your specific instance.

    nano config.yaml

    apiVersion: v1alpha1
    kind: NvidiaPlatform
    spec:
      platform:
        wildcardDomain: "*.my-cluster.my-domain.com"
        externalPort: 443
      ingress:
        enabled: true
      postgres:
        enabled: true
      certManager:
        enabled: true
      trustManager:
        enabled: true
      keycloak:
        databaseStorage:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1G
          storageClassName: local-path
          volumeMode: Filesystem
      prometheus:
        storage:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1G
          storageClassName: local-path
          volumeMode: Filesystem
      grafana:
        enabled: true
      elastic:
        enabled: true

    Note

    If you installed local-path-provisioner, the storageClassName can be left as shown: local-path


  6. Make the installer executable via the following command:

    chmod +x ./nvidia-cnpack-linux-x86_64


  7. Run the following command on the instance to set up NVIDIA Cloud Native Service Add-on Pack:

    ./nvidia-cnpack-linux-x86_64 create -f config.yaml


  8. Once the install is complete, check that all the pods are healthy via the following command:

    kubectl get pods -A


    The output should look similar to the screenshot below:

    [Screenshot: example kubectl get pods -A output]

  9. As part of the installation, the installer creates the nvidia-platform and nvidia-monitoring namespaces, which contain most of the components and information required for interacting with the deployed services.

    • The default Keycloak instance URL is at: https://auth.my-cluster.my-domain.com

    • Default admin credentials can be found within the nvidia-platform namespace, in a secret called keycloak-initial-admin via the following commands:

    kubectl get secret keycloak-initial-admin -n nvidia-platform -o jsonpath='{.data.username}' | base64 -d
    kubectl get secret keycloak-initial-admin -n nvidia-platform -o jsonpath='{.data.password}' | base64 -d

    • The default Grafana instance URL is at: https://dashboards.my-cluster.my-domain.com

    • The default Grafana credentials can be found within the nvidia-monitoring namespace, in a secret called grafana-admin-credentials via the following commands:

    kubectl get secret grafana-admin-credentials -n nvidia-monitoring -o jsonpath='{.data.GF_SECURITY_ADMIN_USER}' | base64 -d
    kubectl get secret grafana-admin-credentials -n nvidia-monitoring -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d
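
    If preferred, these secret values can be captured into shell variables for scripting; the variable names below are illustrative:

    KEYCLOAK_ADMIN_PASSWORD=$(kubectl get secret keycloak-initial-admin -n nvidia-platform -o jsonpath='{.data.password}' | base64 -d)
    GRAFANA_ADMIN_PASSWORD=$(kubectl get secret grafana-admin-credentials -n nvidia-monitoring -o jsonpath='{.data.GF_SECURITY_ADMIN_PASSWORD}' | base64 -d)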


  10. You can configure the components and services installed on the cluster as required for your use case. Specific examples can be found in the NVIDIA AI Workflow Guides.

Installer Flags and Commands

  • Available flags:

    -h, --help

    -v, --version

  • Available Commands:

    • Completion - Generates the autocompletion script for ./nvidia-cnpack-linux-x86_64 for the specified shell. See each sub-command’s help for details on how to use the generated script.

      Usage:

      ./nvidia-cnpack-linux-x86_64 completion [command]

      Available Commands:

      bash - Generate the autocompletion script for bash

      fish - Generate the autocompletion script for fish

      powershell - Generate the autocompletion script for powershell

      zsh - Generate the autocompletion script for zsh

      Flags:

      -h, --help - Help for completion

    • Create/Install - Creates the NVIDIA cloud-native platform.

      -d, --directory - String, if non-empty, write working files to this directory. (default ".")

      -f, --filename - String, the path to a file that contains the configuration to apply.

      -h, --help - Help for create

      --kubeconfig - String, the path to the kubeconfig file to use for CLI requests. By default, the installer will look for a KUBECONFIG environment variable to determine the location of the kubeconfig, followed by the default $HOME/.kube/config location, unless the kubeconfig location is specified manually via this flag.

      -v, --verbose - Enables more detailed logging for debugging purposes.

    • Delete - Deletes the NVIDIA cloud-native platform.

      Usage:

      ./nvidia-cnpack-linux-x86_64 delete [flags]

      Aliases:

      delete, destroy

      Flags:

      -d, --directory - String, if non-empty, write working files to this directory. (default ".")

      -h, --help - Help for delete

      --kubeconfig - String, the path to the kubeconfig file to use for CLI requests. By default, the installer will look for a KUBECONFIG environment variable to determine the location of the kubeconfig, followed by the default $HOME/.kube/config location, unless the kubeconfig location is specified manually via this flag.

      -v, --verbose - Increase the verbosity.

Enabling/Disabling Components

Most of the components deployed by the installer can be enabled or disabled by setting the corresponding "enabled" value within the configuration YAML created earlier.

Use the following template as an example, where Grafana and Elasticsearch have been disabled:

apiVersion: v1alpha1
kind: NvidiaPlatform
spec:
  platform:
    wildcardDomain: "*.my-cluster.my-domain.com"
    externalPort: 443
  ingress:
    enabled: true
  postgres:
    enabled: true
  certManager:
    enabled: true
  trustManager:
    enabled: true
  keycloak:
    databaseStorage:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1G
      storageClassName: local-path
      volumeMode: Filesystem
  prometheus:
    storage:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1G
      storageClassName: local-path
      volumeMode: Filesystem
  grafana:
    enabled: false
  elastic:
    enabled: false

Ingress Controller Default Certificate Configuration

As part of the HAProxy ingress controller deployment, a secret called nvidia-ingress-kubernetes-ingress-default-cert is created in the nvidia-platform namespace; it contains the TLS certificate and TLS key used for the wildcard domain name. This certificate can be replaced by a certificate of the user’s choosing that is signed for the wildcard domain name *.my-cluster.my-domain.com.
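
As a sketch, the secret could be replaced with your own certificate and key along these lines; the file names are illustrative, and the ingress controller may need to be restarted to pick up the change:

# Illustrative only: swap in your own signed wildcard certificate
kubectl delete secret nvidia-ingress-kubernetes-ingress-default-cert -n nvidia-platform
kubectl create secret tls nvidia-ingress-kubernetes-ingress-default-cert \
  -n nvidia-platform \
  --cert=wildcard.crt \
  --key=wildcard.key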

© Copyright 2022-2023, NVIDIA. Last updated on Mar 20, 2023.