Overview#
The NVIDIA Cloud Native Service Add-on Pack is a set of packaged components for AI Workflows designed to provide the basic functionalities required for enterprise deployments of AI applications on Kubernetes-based infrastructure.
Currently, the following variants of K8S deployments and integrations are available:
More information about the specific platforms that the add-on pack supports can be found in the platform-specific sections of the document, linked above.
The packaged components in the add-on pack include implementation examples for authentication, monitoring, reporting, and load balancing, that can be used as-is, or customized and connected to your own environment.
These examples follow general guidelines for enterprise production requirements and serve as standards compatible with NVIDIA’s AI frameworks for building and deploying AI solutions as microservices.
The guidelines generally fall within the following categories:
Deployment and Orchestration
OCI-Compliant Container Images
Liveness/Readiness/Startup Probes
Security and Vulnerability Scanning/Patching
Security
OIDC/OAuth2 User Authentication
External Secrets Management
Secure API Endpoints
Networking
Ingress Control
Proxy Sidecar
Logging and Reporting
Open Telemetry Protocol (OTLP) monitoring
OTLP support within application containers
Log Aggregation
AI Workflows also include the AI framework specific to your use case, which is delivered as an OCI-compliant base container image. The following graphic illustrates the additional opinionated components which are included within NVIDIA AI Workflows to meet the above guideline requirements:
- Keycloak
Standard OIDC/OAuth2 provider to provide user management and authentication. It is an open-source identity provider that is OIDC/OAuth2 compliant and can also integrate with other identity providers, allowing the customer to connect their existing environments into the Workflows with minimal additional development.
- Cert-Manager
Cert Manager has been deployed with a custom Certificate Authority set up to generate/rotate HTTPS certs used by applications and AI Workflow components
- Trust Manager
Trust Manager is deployed for injecting the Certificate Authority public key into AI Workflow or application namespaces
- Ingress Controller
An HAProxy ingress controller is deployed with a wildcard DNS certificate to manage access to the services deployed within the cluster.
- Prometheus
A Prometheus operator and a centralized Prometheus service have been deployed on the cluster to scrape and pull metrics from the application services and provide an OTLP-compliant monitoring system and database.
- Grafana
A Grafana operator and a centralized Grafana service is deployed and can be used to create and host dashboards visualizing the appropriate metrics and monitoring data for the particular use case or AI Workflow. The centralized Grafana service is connected to the centralized Prometheus server by default.
- Postgres Operator
The Postgres Operator from CrunchyDB has been deployed for creating relational databases. One Postgres database is instantiated and is used to back Keycloak.
- Elastic Operator
The Elastic Operator for ElasticSearch, Kibana, and other Elastic tools has been deployed on the cluster. No elastic services are configured out of the box but can be leveraged by applications or AI Workflows if needed.
Note
Each user is responsible for checking the content and the applicable licenses of third party software and determining if they are suitable for the intended use.