External Datastore Support for NVSentinel
This document tracks the design decisions, implementation plan, and per-CSP configuration for connecting NVSentinel to externally managed database services.
Overview
By default, NVSentinel deploys its own MongoDB database inside the same Kubernetes cluster it monitors. This works well for general deployments, but creates an overhead problem for lean GPU clusters that are focused purely on compute workloads.
This feature adds support for connecting NVSentinel to an externally managed database — such as a hosted MongoDB or PostgreSQL service from a cloud provider — instead of running one inside the monitored cluster.
Supported External Services (Target)
Problem Statement
- Clusters focused on GPU workloads should not have to host a database
- Running a 3-replica MongoDB replicaset inside a GPU cluster wastes resources
- Customers want to use managed, hosted database services from their cloud provider
- Managed services offer built-in HA, backups, and scaling without operator burden
Design Decision: Configuration Approach
Options Considered
Option A — Extend existing global.datastore section
The global.datastore section already exists in values.yaml as a foundation for
switching between database providers (MongoDB, PostgreSQL). This same section is extended
to also support external/hosted databases by adding uri, tls, and auth sub-sections.
Pros:
- `configmap-datastore.yaml` already exists and creates `nvsentinel-datastore-config` when `global.datastore` is set — the Helm scaffolding is in place
- All 7 DB-consuming service templates (`daemonset`, `fault-quarantine`, `fault-remediation`, `health-events-analyzer`, `node-drainer`, `event-exporter`, `csp-health-monitor`) already contain the switch logic: `{{ if .Values.global.datastore }}nvsentinel-datastore-config{{ else }}mongodb-config{{ end }}`
- Services that don’t use the DB (`janitor`, `labeler`, `syslog-health-monitor`) correctly don’t reference either ConfigMap — no changes needed there
- No new config section needed — less user-facing complexity
Cons:
- Was originally designed for switching providers (mongodb vs postgresql), not for external DB support — the concept is being stretched beyond its original intent
- Currently lacks `uri`, `tls`, `auth` fields — needs significant extension
- `configmap-datastore.yaml` no longer embeds `MONGODB_URI` for provider `mongodb`; the chart requires a pre-created Secret (`MONGODB_URI` key) and `credentialsFromSecret.name`
- The init job (`mongodb-store/templates/jobs.yaml`) hardcodes `mongodb-config` — it does NOT participate in the `global.datastore` switching path and must be separately updated
- Risk of confusion between “which provider” and “where is it running” (internal vs external)
Option B — New externalMongodb / externalPostgresql sections
Pros:
- Very explicit and clear — user immediately knows this section is for external databases only
- No risk of breaking existing teams already using `global.datastore` for PostgreSQL — that section is untouched
- Each DB type has its own dedicated, focused config — easier to reason about per provider
- Cleaner separation of concerns: `global.mongodbStore.enabled` controls internal deployment, `externalMongodb` controls external connection — no overlap
Cons:
- Requires a new separate section per database type (`externalMongodb`, `externalPostgresql`) — grows as more providers are added in the future
- All 7 service templates that already switch on `global.datastore` would need to also handle the new section — more template logic to maintain across the codebase
- TLS and auth configuration would be duplicated per DB type section (e.g., both `externalMongodb.tls` and `externalPostgresql.tls`)
- The existing `configmap-datastore.yaml` and switching logic already present in 7 service templates would be bypassed entirely — prior work not reused
Common Requirements
Whichever option is chosen, the user will need to supply these to connect to an external DB: a connection URI, credentials (delivered via a Kubernetes Secret), and TLS settings (including a CA certificate when the provider uses a private CA).
Expected Behavior
Implementation Plan
Note: The detailed Helm chart implementation plan (Phase 1) depends on the design decision above (Option A vs Option B) and will be filled in once that is finalized. Phases 2 and 3 are common to both options.
Phase 1 — Helm Chart Changes
To be detailed after design decision is made. At a high level, regardless of which option is chosen, the following areas will need changes:
- `values.yaml` — new config fields for external DB connection (URI, TLS, auth)
- `configmap-datastore.yaml` (Option A) or a new ConfigMap template (Option B) — to pass the external URI and credentials to services as environment variables
- `mongodb-store/templates/jobs.yaml` — make TLS flags and X.509 user creation conditional so the init job can connect to an external DB
- Deployment templates (all DB-consuming subcharts) — conditional cert mounting based on whether TLS and/or client certs are configured
Phase 2 — Per-CSP Testing and Documentation
- AWS DocumentDB
- Azure Cosmos DB for MongoDB (Azure DocumentDB vCore)
- Google Cloud (MongoDB Atlas on GCP)
- OCI
Phase 3 — Example Values Files
- `values-aws-docdb.yaml`
- `values-azure-cosmosdb.yaml`
- `values-atlas-gcp.yaml`
- `values-oci-mongodb.yaml`
What Does NOT Change
- Go application code — already reads `MONGODB_URI` as a plain string from the environment. External MongoDB supplies that via `envFrom.secretRef` only.
- Default behavior — all existing deployments with `mongodbStore.enabled: true` are completely unaffected. This is purely additive.
- PostgreSQL internal support — existing `postgresql.enabled: true` path is unchanged.
MONGODB_URI only from a Kubernetes Secret (external MongoDB)
For external managed MongoDB (`global.mongodbStore.enabled: false` and
`global.datastore.provider: mongodb`), the full URI is not written to the datastore
ConfigMap. You must create a Secret whose data key is exactly `MONGODB_URI` and
set `global.datastore.credentialsFromSecret.name` to its name. Helm fails if that name is
missing, or if `global.datastore.uri` is set in this mode.
In-cluster MongoDB (for example Tilt / `global.mongodbStore.enabled: true`): when no
`credentialsFromSecret` is set, the chart builds a credential-free `MONGODB_URI` in the
ConfigMap from `global.datastore.connection` (host, port, extraParams). You can still set
`credentialsFromSecret` to override with a Secret.
- Create the Secret. Set `MONGODB_URI` in your shell to the full string from your provider (for example MongoDB Atlas Connect → Drivers), or put that single line in a local file that is never committed. Alternatively: `--from-file=MONGODB_URI=./mongodb-uri.txt` (one line; keep the file out of version control).
- Reference it in values:
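A sketch of both steps, assuming a Secret named `nvsentinel-mongodb-uri` (any name works as long as the values file references it); the URI shown is a placeholder:

```shell
# Local Secret manifest; apply with `kubectl apply -f mongodb-uri-secret.yaml`.
MONGODB_URI='mongodb+srv://user:pass@cluster.example.net/HealthEventsDatabase?retryWrites=false'
cat > mongodb-uri-secret.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: nvsentinel-mongodb-uri
type: Opaque
stringData:
  MONGODB_URI: ${MONGODB_URI}
EOF

# Values override referencing the Secret by name:
cat > values-external-mongodb.yaml <<'EOF'
global:
  mongodbStore:
    enabled: false
  datastore:
    provider: mongodb
    credentialsFromSecret:
      name: nvsentinel-mongodb-uri
EOF
```

Keep `mongodb-uri-secret.yaml` out of version control, since it embeds the credential.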
DB-consuming workloads use `envFrom.secretRef` when `credentialsFromSecret.name` is set;
otherwise they read `MONGODB_URI` from the ConfigMap. The embedded `datastore.yaml` does not
include a MongoDB `uri` field from Helm values in either mode.
Per-CSP Configuration Guide
This section will be filled in as each CSP is tested.
Connection URI format reference (Secret MONGODB_URI value)
Use the exact connection string your provider shows in its console as the literal value of key `MONGODB_URI` in the Kubernetes Secret (never as `global.datastore.uri` for MongoDB). Do not paste real credentials into docs or source control.
- MongoDB Atlas: `mongodb+srv` SRV string from Connect → Drivers; include the database path (for example `/HealthEventsDatabase`) and query options your deployment needs (for example `retryWrites=false`).
- AWS DocumentDB: `mongodb` URL to the cluster endpoint with TLS and replica-set query parameters per AWS DocumentDB connection documentation.
- Azure DocumentDB vCore: `mongodb+srv` string from Settings → Connection Strings (primary); includes host under `mongocluster.cosmos.azure.com` and the TLS/auth query parameters the portal provides.
Database and Collection Setup (Automatic)
You do not need to manually create the `HealthEventsDatabase` database or any collections
in any of the CSPs below. On every `helm install` or `helm upgrade`, NVSentinel automatically
runs the `nvsentinel-external-mongodb-setup` Job which:
- Creates the `HealthEventsDatabase` database (MongoDB creates it lazily on first write)
- Creates the `HealthEvents`, `ResumeTokens`, and `MaintenanceEvents` collections if they don’t exist
- Creates TTL indexes (auto-expire old events) and query indexes on all collections
All you need is a cluster endpoint and a database user with read/write access.
AWS DocumentDB
Service: Amazon DocumentDB (MongoDB 5.0 compatible)
Step 1 — Obtain a DocumentDB Cluster
Choose one of the two options below depending on whether you are creating a new cluster or connecting to one that already exists.
Option A — Create a New Cluster
In the AWS Console, navigate to Amazon DocumentDB → Clusters → Create with these settings:
- Engine version: 5.0.0 (minimum; supports change streams)
- Cluster type: Instance-based cluster
- Authentication: Username/password (SCRAM)
- VPC: Same VPC as your EKS cluster (or VPC-peered)
- Subnet group: Create a subnet group covering your private subnets
- Security group: Create a new security group — allow TCP port `27017` inbound from your EKS node CIDR ranges (see Networking below)
After creation, note the Cluster endpoint (read/write):
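The endpoint has this shape (the placeholder hostname used later in this guide):

```
nvsentinel-test-1.cluster-c9g8sagiqhcr.us-east-1.docdb.amazonaws.com
```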
Option B — Use an Existing Cluster
Verify the following before proceeding:
- Engine version — must be `5.0.0` or higher (change streams require this).
- Change streams — must be explicitly enabled (see Step 2 below).
- Network access — EKS node CIDR ranges must be allowed on TCP port `27017` in the cluster’s security group.
- VPC — cluster must be in the same VPC as your EKS cluster, or reachable via VPC peering / Transit Gateway.
- Credentials — have a username and password ready for NVSentinel.
Step 2 — Enable Change Streams
DocumentDB does not enable change streams by default. You must explicitly enable them.
2a. Create or update a cluster parameter group:
In Amazon DocumentDB → Parameter groups, create a custom parameter group (or modify an existing one) and set:
Apply the parameter group to your cluster and reboot.
2b. Enable change streams on the database via mongosh:
From a pod with network access to DocumentDB (e.g. a debug pod in your EKS cluster):
Set `DOCUMENTDB_ADMIN_URI` to an admin connection URL from the AWS console for your cluster. Do not store that value in this repository.
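A sketch of the enable step, written to a script file so it can be reviewed first. `modifyChangeStreams` is the DocumentDB admin command for this; the database name matches the one NVSentinel uses:

```shell
# Review the script, then run it from the debug pod with:
#   mongosh "$DOCUMENTDB_ADMIN_URI" enable-change-streams.js
cat > enable-change-streams.js <<'EOF'
// Enable change streams for every collection in HealthEventsDatabase.
db.adminCommand({
  modifyChangeStreams: 1,
  database: "HealthEventsDatabase",
  collection: "",
  enable: true
});
EOF
```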
Without this, `health-events-analyzer` and `fault-quarantine` will fail to start their change stream watchers.
Step 3 — Configure DNS Resolution from EKS
DocumentDB cluster endpoints use private DNS names (e.g. nvsentinel-test-1.cluster-c9g8sagiqhcr.us-east-1.docdb.amazonaws.com) that resolve within AWS VPC. EKS pods must be able to resolve these names.
Create an AWS Route 53 Private Hosted Zone:
- Go to Route 53 → Hosted zones → Create hosted zone.
- Set the Domain name to match the DocumentDB endpoint suffix (e.g. `cluster-c9g8sagiqhcr.us-east-1.docdb.amazonaws.com`).
- Set Type to Private hosted zone and associate it with your EKS VPC.
- Create an A record pointing the cluster endpoint hostname to the DocumentDB cluster’s private IP address.
The private IP of the DocumentDB cluster can be found via `nslookup` from any EC2 instance in the VPC, or via the AWS Console under the cluster’s instance details.
Verify from an EKS pod:
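For example, with a throwaway busybox pod (saved as a script so the endpoint, here the placeholder from above, can be swapped in):

```shell
cat > verify-docdb-dns.sh <<'EOF'
#!/bin/sh
# Resolve the DocumentDB endpoint from inside the cluster.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup nvsentinel-test-1.cluster-c9g8sagiqhcr.us-east-1.docdb.amazonaws.com
EOF
sh -n verify-docdb-dns.sh   # syntax check only; run it with cluster access
```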
Expected output: a valid IP address. NXDOMAIN means the Route 53 hosted zone or A record is incorrect.
Step 4 — Create CA Certificate Secret
DocumentDB uses a private Amazon RDS CA that is not trusted by default Go TLS. You must mount it into NVSentinel pods.
Download the AWS RDS CA bundle:
Create a Kubernetes secret:
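A sketch of both steps; the bundle URL is AWS’s published RDS global trust store, and the Secret name `documentdb-ca` is illustrative:

```shell
cat > create-docdb-ca-secret.sh <<'EOF'
#!/bin/sh
# Download the Amazon RDS global CA bundle (DocumentDB uses the RDS CA):
curl -fsSLO https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem
# Store it under the key the chart's volume mounts expect (ca.crt):
kubectl create secret generic documentdb-ca \
  --from-file=ca.crt=global-bundle.pem
EOF
sh -n create-docdb-ca-secret.sh   # syntax check only; run it with cluster access
```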
The key must be named `ca.crt` — this is what the Helm chart volume mounts reference.
Step 5 — Helm Values Configuration
Create a values override file (e.g. values-aws-docdb.yaml) with your connection details.
Put the full connection string (with password) in a Kubernetes Secret with data key
`MONGODB_URI` (see “MONGODB_URI only from a Kubernetes Secret”)
and set `credentialsFromSecret.name`.
Example values:
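A minimal sketch, assuming the `global.datastore` fields described earlier in this document; the Secret names are placeholders:

```yaml
global:
  mongodbStore:
    enabled: false                   # do not deploy the in-cluster MongoDB
  datastore:
    provider: mongodb
    credentialsFromSecret:
      name: nvsentinel-mongodb-uri   # Secret with data key MONGODB_URI
```

The TLS fields for mounting the `documentdb-ca` Secret depend on the finalized chart schema from Phase 1 and are omitted here.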
Percent-encode any special characters in the password: `#` → `%23`, `@` → `%40`, `%` → `%25`
Networking and Security Groups
This was the most common source of connectivity failures during testing. EKS pods can have IPs from multiple CIDR ranges depending on the CNI configuration — not just the primary VPC CIDR.
In the DocumentDB cluster’s security group, add an inbound rule for each CIDR range your pods use:
If some pods connect successfully but others get `dial tcp: i/o timeout`, check the pod IPs — pods with IPs outside the allowed CIDR range will be blocked by the security group.
To find the pod CIDRs in use:
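A sketch, saved as a script so it can be run wherever `kubectl` has cluster access:

```shell
cat > list-pod-ips.sh <<'EOF'
#!/bin/sh
# Print every pod IP in the cluster, deduplicated, to compare against the
# security-group CIDR rules.
kubectl get pods -A -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' \
  | sort -u
EOF
sh -n list-pod-ips.sh   # syntax check only; run it with cluster access
```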
Azure Cosmos DB for MongoDB
Service: Azure DocumentDB with MongoDB Compatibility (vCore)
MongoDB Version: 8.0
Important: Choosing the Right Azure Service
Azure has two different Cosmos DB MongoDB offerings. They are NOT interchangeable:
Always create the “Azure DocumentDB (with MongoDB compatibility)” resource, NOT “Azure Cosmos DB for MongoDB”.
Step 1 — Obtain a DocumentDB Cluster
Choose one of the two options below depending on whether you are creating a new cluster or connecting to one that already exists.
Option A — Create a New Cluster
In Azure Portal, create a new “Azure DocumentDB (with MongoDB compatibility)” resource (vCore tier, MongoDB 8.0, SCRAM auth). Match the region to your AKS cluster.
Refer to the Azure DocumentDB quickstart for full creation steps.
Option B — Use an Existing Cluster
Verify the following before proceeding:
- Service type — confirm the resource is Azure DocumentDB vCore (not RU/Serverless). Change Streams are only supported on vCore.
- MongoDB version — must be 5.0 or higher (8.0 recommended).
- Network access — AKS outbound IPs must be allowed in Settings → Networking to reach port `27017`.
- Credentials — have an admin username and password ready for NVSentinel.
Connection String (both options)
Go to Settings → Connection Strings and copy the Primary Connection String into your Kubernetes Secret only (see MONGODB_URI only from a Kubernetes Secret).
If you assemble the URI yourself, percent-encode reserved characters in the database user’s password: `#` → `%23`, `@` → `%40`, `%` → `%25`
If special characters are not encoded, `platform-connectors` will fail with `MongoParseError: Password contains unescaped characters` on startup.
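If you prefer not to encode by hand, Python’s standard `urllib.parse.quote` does it; a sketch with a throwaway password:

```shell
PASSWORD='p@ss#word%1'   # throwaway example, not a real credential
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$PASSWORD")
echo "$ENCODED"          # p%40ss%23word%251
```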
Step 2 — Helm Values Configuration
Create a values override file (e.g. values-cosmosdb-test.yaml) with your connection details.
Put the connection string in a Secret (`MONGODB_URI` key) and set
`credentialsFromSecret.name` — see
“MONGODB_URI only from a Kubernetes Secret”.
Google Cloud — MongoDB Atlas
Service: MongoDB Atlas (M0 free tier or M10+) hosted on GCP
Why MongoDB Atlas on GCP?
GCP does not offer a native first-party MongoDB-compatible managed service the way AWS offers DocumentDB or Azure offers Cosmos DB. The only production-grade options on GCP are:
MongoDB Atlas is the recommended solution. Atlas uses the standard MongoDB driver protocol so all aggregation features work natively — no compatibility workarounds needed.
Step 1 — Obtain a MongoDB Atlas Cluster
Choose one of the two options below depending on whether you are creating a new cluster or connecting to one that already exists.
Option A — Create a New Cluster
- Go to cloud.mongodb.com → Create a deployment
- Choose M0 (Free) for testing or M10+ for production
- Select GCP as the provider and choose the region closest to your GKE cluster (e.g. `us-central1`)
- Name the cluster and click Create
- When prompted to create a database user, configure SCRAM-SHA-256 auth in the Atlas UI
- Copy the full connection string from Connect → Drivers and supply it only via the Kubernetes Secret (`MONGODB_URI`); do not commit it to git
Option B — Use an Existing Cluster
Verify the following before proceeding:
- Change streams — Atlas supports change streams on all tiers (M0 and above). No extra configuration needed.
- Network access — GKE node/pod IPs must be in the Atlas IP Access List (see Step 2).
- Credentials — have a database user with SCRAM-SHA-256 auth credentials ready.
Step 2 — Configure Network Access
MongoDB Atlas M0 does not support VPC peering. IP allowlisting is the only option.
In the Atlas UI, go to Security → Network Access → Add IP Address:
- For testing: click Allow Access from Anywhere to add `0.0.0.0/0`
- For production (M10+): add your GKE node pool CIDR ranges only. Find them with:
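A sketch using `gcloud` (cluster name and region are placeholders; `clusterIpv4Cidr` covers pod IPs):

```shell
cat > gke-cidrs.sh <<'EOF'
#!/bin/sh
# Print the pod and service CIDR ranges of the GKE cluster.
gcloud container clusters describe my-gke-cluster --region us-central1 \
  --format='value(clusterIpv4Cidr,servicesIpv4Cidr)'
EOF
sh -n gke-cidrs.sh   # syntax check only; run it with gcloud configured
```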
M10+ clusters support VPC peering with GCP. Refer to the Atlas VPC Peering documentation for production setup.
Step 3 — Helm Values Configuration
No CA certificate secret is needed — Atlas uses DigiCert public CA which is trusted by the Go TLS runtime by default.
Store the Atlas connection string in a Kubernetes Secret (`MONGODB_URI` key)
and set `credentialsFromSecret.name` — see
“MONGODB_URI only from a Kubernetes Secret”.
Create a values override file values-atlas-gcp.yaml:
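A minimal sketch (no CA certificate secret is needed for Atlas); the Secret name is a placeholder:

```yaml
global:
  mongodbStore:
    enabled: false                   # no in-cluster MongoDB
  datastore:
    provider: mongodb
    credentialsFromSecret:
      name: nvsentinel-mongodb-uri   # Secret holding the Atlas MONGODB_URI
```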
Percent-encode password characters in the Secret value if needed: `#` → `%23`, `@` → `%40`, `%` → `%25`
OCI
Status: Skipped — no viable managed MongoDB service available on OCI.
Research Summary
All available OCI-native options were evaluated:
The only viable option is a self-managed MongoDB deployment on OCI Compute VMs. However, when attempting to create a Compute instance in the OCI Console (both US West San Jose and UK South London regions, multiple compartments), no VM images appeared under any OS tab (Oracle Linux, Ubuntu, Red Hat) and no shapes were available for selection.
OCI testing is skipped for now.