Config Store Architecture
This document describes the system architecture of the NVIDIA Config Manager Config Store service.
System Architecture
Components
API Service (FastAPI)
The API service is a FastAPI application that exposes a REST API for configuration management. It provides:
- Versioned configuration storage with 1-year retention
- Gzip compression (level 6) for storage efficiency
- PostgreSQL advisory locks for fine-grained concurrency control
- RESTful API with OpenAPI documentation
- Diff generation between versions
- Bulk operations and batch endpoints
Web UI
- Next.js web interface for browsing device configurations
- See Web UI for features and access instructions
PostgreSQL (CNPG)
- Primary data store for versioned configs
- Advisory locks for concurrent writes
- Automatic versioning per device/filename/file_type
- Compressed content storage
Redis
- Cache for Nautobot device metadata
- Reduces load on Nautobot API
Nautobot
- Source of truth for device metadata (site, platform, role, rack)
- Accessed through the GraphQL API
- Metadata cached in Redis for performance
High Availability Architecture
In this high-availability architecture:
- Any API replica can handle any request
- If one replica crashes, others continue serving
- Fine-grained locking allows concurrent writes from all replicas
- No single point of failure (SPOF)
Data Flows
Write Operation Flow
- Client sends HTTP POST request to API endpoint
- FastAPI receives request and validates input
- Storage layer acquires PostgreSQL advisory lock (device+filename+file_type)
- Content is compressed using gzip (level 6)
- Content hash is calculated for deduplication
- New version is inserted into the config_files table
- Lock is automatically released on transaction commit
- Response returned with version number
Read Operation Flow
- Client sends HTTP GET request to API endpoint
- FastAPI queries PostgreSQL for latest version
- Content is decompressed from storage
- Device metadata is enriched from the Redis cache (or from Nautobot on a cache miss)
- Response returned with config content and metadata
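A minimal sketch of the read path, under stated assumptions: `rows` is a plain list standing in for a PostgreSQL query result, `cache` is a dict standing in for Redis, and `fetch_metadata` is a hypothetical callable standing in for the Nautobot GraphQL lookup. None of these names come from the service itself.

```python
import gzip
from typing import Callable, Dict, List


def read_latest(
    rows: List[dict],
    cache: Dict[str, dict],
    fetch_metadata: Callable[[str], dict],
    device: str,
) -> dict:
    """Return the latest config for a device, enriched with device metadata."""
    device_rows = [r for r in rows if r["device"] == device]
    if not device_rows:
        raise KeyError(device)

    # Pick the highest version number for this device.
    latest = max(device_rows, key=lambda r: r["version"])

    # Stored content is gzip-compressed, so decompress on read.
    content = gzip.decompress(latest["content"]).decode("utf-8")

    # Cache-aside lookup: use the cached metadata, or fetch and cache on a miss.
    metadata = cache.get(device)
    if metadata is None:
        metadata = fetch_metadata(device)
        cache[device] = metadata

    return {
        "device": device,
        "version": latest["version"],
        "content": content,
        "metadata": metadata,
    }
```

After the first read for a device, later reads are served from the cache and do not call `fetch_metadata` again.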
Concurrent Write Handling
- PostgreSQL advisory locks provide fine-grained locking at device+filename+file_type level
- Different devices can write simultaneously without blocking
- Intended and backup configs have independent locks
- Locks are released automatically on transaction commit or rollback, so a failed transaction never leaves a lock held
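PostgreSQL advisory locks are keyed by integers, so the device+filename+file_type triple has to be mapped to a lock key somehow. The sketch below shows one plausible mapping; the hashing scheme, the separator, and the function name are assumptions, not the service's actual implementation. `pg_advisory_xact_lock(bigint)` is the transaction-scoped PostgreSQL function whose lock is released at commit or rollback; the `%s` placeholder assumes a psycopg-style driver.

```python
import hashlib
import struct

# Transaction-scoped advisory lock: released automatically at commit/rollback.
# The %s placeholder style is an assumption about the database driver.
LOCK_SQL = "SELECT pg_advisory_xact_lock(%s)"


def advisory_lock_key(device: str, filename: str, file_type: str) -> int:
    """Derive a signed 64-bit lock key from the (device, filename, file_type) triple."""
    # Join with a separator that is unlikely to appear in the parts, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input.
    material = "\x1f".join((device, filename, file_type)).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Interpret the first 8 bytes as a signed big-endian integer so the value
    # fits in a PostgreSQL bigint.
    (key,) = struct.unpack(">q", digest[:8])
    return key
```

Because the key includes file_type, intended and backup configs for the same device map to different locks. A rare hash collision between two triples would only make those two writers wait on each other; it would not affect correctness.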
Storage Architecture
Database Schema
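The config_files table stores all versioned configuration content. Each row holds one immutable version for a device/filename/file_type combination, along with its compressed content, content hash, and audit fields (author, commit message, timestamp).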
Compression
- Content is compressed using gzip level 6 before storage
- Typical size reduction: ~90-93% (for example, 50KB → ~3.5-5KB)
- Decompression happens on read operations
- Content hash is calculated on uncompressed content for deduplication
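The following sketch shows how the size reduction and the deduplication hash could be computed. The helper names are hypothetical and SHA-256 is an assumed choice of hash; only gzip level 6 and hashing of the uncompressed content come from the description above.

```python
import gzip
import hashlib
from typing import Tuple


def compress_for_storage(text: str) -> Tuple[bytes, str]:
    """Return (gzip-compressed bytes, hash of the uncompressed content)."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw, compresslevel=6)
    content_hash = hashlib.sha256(raw).hexdigest()
    return compressed, content_hash


def reduction_percent(raw_size: int, compressed_size: int) -> float:
    """Size reduction as a percentage of the original size."""
    return 100.0 * (1 - compressed_size / raw_size)
```

Hashing the uncompressed bytes keeps the deduplication check independent of the compressor. Gzip output is not guaranteed to be byte-identical between runs (for example, the gzip header can carry a timestamp), so compressed blobs are a less reliable comparison key.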
Versioning
- Automatic version increment per device/filename/file_type combination
- Versions start at 1 and increment sequentially
- Each version is immutable (no updates, only new versions)
- Full audit trail with author, commit message, and timestamp
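The versioning rules can be illustrated with an immutable record type. This is a sketch under assumptions: the field names mirror the audit data listed above, but the class, the function, and the in-memory `history` list are hypothetical stand-ins for the real table.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)  # frozen: a stored version cannot be modified afterwards
class ConfigVersion:
    version: int
    content_hash: str
    author: str
    commit_message: str
    created_at: datetime


def append_version(
    history: List[ConfigVersion], content_hash: str, author: str, commit_message: str
) -> ConfigVersion:
    """Append a new version to the history of one device/filename/file_type."""
    # Versions start at 1 and increase by one for each new entry.
    next_version = history[-1].version + 1 if history else 1
    record = ConfigVersion(
        version=next_version,
        content_hash=content_hash,
        author=author,
        commit_message=commit_message,
        created_at=datetime.now(timezone.utc),
    )
    history.append(record)
    return record
```

Changing a config always goes through `append_version`; attempting to assign to a field of an existing `ConfigVersion` raises an error, which matches the "no updates, only new versions" rule.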
Nautobot Integration
Device metadata is fetched from Nautobot through GraphQL and cached in Redis:
- Site information
- Platform details
- Device role
- Rack location
- Other device attributes
This metadata enriches API responses and enables device-centric views in the UI.
Caching Strategy:
- Metadata is cached in Redis with a TTL
- Cache misses trigger GraphQL queries to Nautobot
- A cache refresh service periodically updates stale entries
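As a rough illustration of this strategy, the sketch below implements a TTL cache with a refresh pass in plain Python. It is not the service's code: a dict stands in for Redis, `fetch` stands in for the Nautobot GraphQL query, and the class name, method names, and injectable `clock` are assumptions made to keep the example self-contained.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache-aside store with per-entry expiry and a refresh pass."""

    def __init__(
        self,
        ttl_seconds: float,
        fetch: Callable[[str], dict],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._fetch = fetch
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}  # device -> (expires_at, value)

    def get(self, device: str) -> dict:
        """Return cached metadata, fetching it on a miss or after expiry."""
        entry = self._entries.get(device)
        if entry is not None and entry[0] > self._clock():
            return entry[1]
        return self._store(device)

    def refresh_stale(self) -> int:
        """Re-fetch every expired entry; return how many were refreshed."""
        now = self._clock()
        stale = [d for d, (expires_at, _) in self._entries.items() if expires_at <= now]
        for device in stale:
            self._store(device)
        return len(stale)

    def _store(self, device: str) -> dict:
        value = self._fetch(device)
        self._entries[device] = (self._clock() + self._ttl, value)
        return value
```

A periodic job calling `refresh_stale()` plays the role of the cache refresh service, so that most reads find a fresh entry instead of paying for an upstream query.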
Deployment Architecture
Kubernetes Deployment
The service is deployed as a Kubernetes application with:
- API Service: 3-5 replicas for high availability
- PostgreSQL: CNPG cluster (primary + 2 replicas)
- Redis: Shared service for Nautobot metadata caching
- Web UI: Optional Next.js frontend
- Gateway: For external access
Infrastructure Requirements
- PostgreSQL: CNPG cluster (primary + 2 replicas)
  - Memory: 16GB per instance
  - CPU: 4-8 cores per instance
  - Storage: 200GB SSD
- Redis: Shared service for Nautobot metadata caching
- API Replicas: 3-5 replicas for high availability
  - Memory: 1GB per replica
  - CPU: 500m per replica
Monitoring and Observability
Prometheus Metrics
Prometheus metrics are available at the operational /metrics endpoint. Config Store exposes the default set of metrics, as documented in the Instrumentator documentation:
- http_requests_total: Total number of requests
- http_request_size_bytes: Sum of the content lengths of all incoming requests
- http_response_size_bytes: Sum of the content lengths of all outgoing responses
- http_request_duration_seconds: Total duration of requests, limited to only a few buckets
- http_request_duration_highr_seconds: Higher resolution duration of requests, with a large number of buckets
Health Checks
- Health Check Route: GET /healthcheck
- Readiness Probe: Database connectivity check
- Liveness Probe: Application responsiveness check
Logging
- Structured logging with request IDs
- Audit logging for all configuration changes
- Error logging with stack traces
- Performance logging for slow operations
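One way to get structured logs that carry a request ID is a JSON formatter combined with a context variable, sketched below using only the standard library. The field names, the `request_id_var` variable, and the helper function are assumptions for illustration; the service's actual log schema is not specified here.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Include the stack trace when the record carries exception info.
        if record.exc_info:
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def new_request_id() -> str:
    """Generate a request ID and bind it to the current context."""
    rid = uuid.uuid4().hex
    request_id_var.set(rid)
    return rid
```

In a web application, `new_request_id()` would typically be called once per request (for example, from middleware), and the formatter attached to a handler with `handler.setFormatter(JsonFormatter())`.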