Overview

The NVIDIA Cloud Native Service Add-on Pack is a set of packaged components for AI Workflows designed to provide the basic functionalities required for enterprise deployments of AI applications on Kubernetes-based infrastructure.

Currently, the following variants of Kubernetes deployments and integrations are available:

More information about the specific platforms that the add-on pack supports can be found in the platform-specific sections of the document, linked above.

The packaged components in the add-on pack include implementation examples for authentication, monitoring, reporting, and load balancing that can be used as-is or customized and connected to your own environment.

These examples follow general guidelines for enterprise production requirements and serve as standards compatible with NVIDIA’s AI frameworks for building and deploying AI solutions as microservices.

The guidelines generally fall within the following categories:

  • Deployment and Orchestration

    • OCI-Compliant Container Images

    • Liveness/Readiness/Startup Probes

    • Security and Vulnerability Scanning/Patching

  • Security

    • OIDC/OAuth2 User Authentication

    • External Secrets Management

    • Secure API Endpoints

  • Networking

    • Ingress Control

    • Proxy Sidecar

  • Logging and Reporting

    • OpenTelemetry Protocol (OTLP) monitoring

    • OTLP support within application containers

    • Log Aggregation
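
As an illustration of the Liveness/Readiness/Startup Probes guideline above, the following is a minimal sketch of how an application container might expose separate probe endpoints using only the Python standard library. The paths `/healthz`, `/readyz`, and `/startupz`, the port, and the `ProbeState` helper are assumptions made for this example; the add-on pack does not prescribe them.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeState:
    """Tracks the lifecycle flags that the probe endpoints report on."""

    def __init__(self) -> None:
        self.started = threading.Event()  # set once initialization finishes
        self.ready = threading.Event()    # set while the service can take traffic


def probe_status(path: str, state: ProbeState) -> int:
    """Map a probe path to an HTTP status code (the paths are hypothetical)."""
    if path == "/healthz":    # liveness: the process is up and responding
        return 200
    if path == "/startupz":   # startup: one-time initialization is complete
        return 200 if state.started.is_set() else 503
    if path == "/readyz":     # readiness: started and able to serve requests
        return 200 if state.started.is_set() and state.ready.is_set() else 503
    return 404


def make_handler(state: ProbeState):
    """Build a request handler class bound to the given probe state."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status = probe_status(self.path, state)
            body = json.dumps({"path": self.path, "status": status}).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep probe traffic out of the application log

    return ProbeHandler


def serve(state: ProbeState, port: int = 8080) -> None:
    """Serve the probe endpoints until the process is stopped."""
    HTTPServer(("0.0.0.0", port), make_handler(state)).serve_forever()
```

An application would call `state.started.set()` once initialization completes and toggle `state.ready` as its dependencies come and go, with the Kubernetes probe definitions pointing at the matching paths. Keeping the three signals separate lets a slow-starting container avoid being restarted by the liveness check before it has finished loading.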

AI Workflows also include the AI framework specific to your use case, which is delivered as an OCI-compliant base container image. The following graphic illustrates the additional opinionated components that are included within NVIDIA AI Workflows to meet the above guideline requirements:

image1.png

Keycloak

Keycloak is an open-source, OIDC/OAuth2-compliant identity provider that handles user management and authentication. It can also integrate with other identity providers, allowing customers to connect their existing environments to the Workflows with minimal additional development.
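
To make the OIDC/OAuth2 integration more concrete, here is a minimal sketch of how a client might locate a realm's OIDC discovery document and prepare an OAuth2 client-credentials token request. It assumes the `/realms/<realm>/...` URL layout used by recent Keycloak releases (older releases prefix the path with `/auth`); the host name, realm, client ID, and secret are hypothetical placeholders. The request is only constructed, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def discovery_url(base_url: str, realm: str) -> str:
    """Return the OIDC discovery document URL for a Keycloak realm.

    Assumes the `/realms/<realm>` layout of recent Keycloak releases.
    """
    base = base_url.rstrip("/")
    return f"{base}/realms/{quote(realm, safe='')}/.well-known/openid-configuration"


def build_token_request(token_endpoint: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) an OAuth2 client-credentials token request."""
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,          # hypothetical client registered in the realm
        "client_secret": client_secret,  # load from a secret store, never hard-code
    }).encode("ascii")
    return Request(
        token_endpoint,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

In practice the `token_endpoint` value would be read from the discovery document rather than hard-coded, which keeps the client working if the provider's URL layout changes.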

Cert-Manager

Cert-Manager is deployed with a custom Certificate Authority (CA) that generates and rotates the HTTPS certificates used by applications and AI Workflow components.

Trust Manager

Trust Manager is deployed to inject the Certificate Authority public key into AI Workflow or application namespaces.
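
An application that calls other in-cluster HTTPS services needs to trust the custom Certificate Authority. The following is a minimal sketch of how a Python client might do that, assuming the CA bundle is made available to the pod as a file. The helper name `build_tls_context` and any mount path passed to it are hypothetical and depend on how the bundle is mounted in your deployment.

```python
import ssl
from pathlib import Path
from typing import Optional


def build_tls_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Create a client-side TLS context that also trusts an extra CA bundle.

    `ca_bundle` is a hypothetical path to the injected CA file. If the file
    is absent, only the system trust store is used.
    """
    context = ssl.create_default_context()  # verifies certificates and hostnames
    if ca_bundle and Path(ca_bundle).is_file():
        context.load_verify_locations(cafile=ca_bundle)
    return context
```

The returned context can be passed to `urllib.request.urlopen(url, context=...)` or to `http.client.HTTPSConnection`. Certificate and hostname verification stay enabled in both branches, so the extra bundle widens the set of trusted issuers without weakening the checks.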

Ingress Controller

An HAProxy ingress controller is deployed with a wildcard DNS certificate to manage access to the services deployed within the cluster.
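
As a rough illustration of what a wildcard certificate covers, the sketch below checks whether a service host name falls under a wildcard name. It follows the common rule that `*` matches exactly one left-most DNS label; the function name and the `apps.example.com` domain are assumptions for this example, not values used by the add-on pack.

```python
def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name such as `*.apps.example.com` covers `hostname`."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname       # non-wildcard names must match exactly
    suffix = pattern[1:]                 # e.g. ".apps.example.com"
    if not hostname.endswith(suffix):
        return False
    left_label = hostname[: -len(suffix)]
    # The wildcard stands for exactly one non-empty label, so no dots are allowed.
    return bool(left_label) and "." not in left_label
```

For example, `wildcard_covers("*.apps.example.com", "grafana.apps.example.com")` is true, while a nested name such as `a.b.apps.example.com` or the bare domain `apps.example.com` is not covered, so services are typically published one label below the wildcard domain.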

Prometheus

A Prometheus operator and a centralized Prometheus service have been deployed on the cluster to scrape metrics from the application services and provide an OTLP-compliant monitoring system and database.
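
For Prometheus to scrape an application service, the service has to expose its metrics over HTTP in the Prometheus text exposition format. The sketch below renders a single counter in that format using only the standard library; the metric name, the `endpoint` label, and the helper names are hypothetical. A real service would more likely use the `prometheus_client` package than hand-rolled formatting.

```python
from typing import Dict


def escape_label(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: Dict[str, float]) -> str:
    """Render one counter, labelled by `endpoint`, in the Prometheus text format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for endpoint, value in sorted(samples.items()):
        lines.append(f'{name}{{endpoint="{escape_label(endpoint)}"}} {value}')
    return "\n".join(lines) + "\n"
```

The returned text would be served from a metrics path (commonly `/metrics`) that the Prometheus service is configured to scrape.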

Grafana

A Grafana operator and a centralized Grafana service are deployed and can be used to create and host dashboards that visualize the metrics and monitoring data relevant to the particular use case or AI Workflow. The centralized Grafana service is connected to the centralized Prometheus server by default.

Postgres Operator

The Postgres Operator from Crunchy Data has been deployed for creating relational databases. One Postgres database is instantiated and is used to back Keycloak.

Elastic Operator

The Elastic Operator for Elasticsearch, Kibana, and other Elastic tools has been deployed on the cluster. No Elastic services are configured out of the box, but the operator can be leveraged by applications or AI Workflows if needed.

Note

Each user is responsible for checking the content and the applicable licenses of third-party software and determining whether they are suitable for the intended use.

© Copyright 2022-2023, NVIDIA. Last updated on May 23, 2023.