Production Considerations#

This page covers operational guidance for running the AI-Q blueprint in production environments.

Database#

Use Managed PostgreSQL#

The default compose stack includes a PostgreSQL container, but for production workloads consider a managed database service:

Amazon RDS for PostgreSQL
Google Cloud SQL for PostgreSQL
Azure Database for PostgreSQL

Set the following environment variables to point to your managed database:

Variable	Driver	Example
`NAT_JOB_STORE_DB_URL`	`asyncpg`	`postgresql+asyncpg://<user>:<pw>@rds-host:5432/aiq_jobs`
`AIQ_CHECKPOINT_DB`	`psycopg2`	`postgresql://<user>:<pw>@rds-host:5432/aiq_checkpoints`
`AIQ_SUMMARY_DB`	`psycopg`	`postgresql+psycopg://<user>:<pw>@rds-host:5432/aiq_jobs`
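
A startup check can catch a URL that names the wrong driver before the backend tries to connect. The following is a minimal sketch, assuming the expected URL schemes are exactly those shown in the Example column above. `EXPECTED_SCHEMES` and `check_db_urls` are illustrative names and are not part of the blueprint.

```python
from urllib.parse import urlsplit

# Assumption: the expected schemes are copied from the Example column above.
EXPECTED_SCHEMES = {
    "NAT_JOB_STORE_DB_URL": "postgresql+asyncpg",
    "AIQ_CHECKPOINT_DB": "postgresql",
    "AIQ_SUMMARY_DB": "postgresql+psycopg",
}


def check_db_urls(env):
    """Return a list of problems; an empty list means every check passed.

    Messages never include the URL itself, because it contains a password.
    """
    problems = []
    for name, scheme in EXPECTED_SCHEMES.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        parts = urlsplit(value)
        if parts.scheme != scheme:
            problems.append(f"{name} should start with {scheme}://")
        if not parts.hostname:
            problems.append(f"{name} has no host")
        if not parts.path.strip("/"):
            problems.append(f"{name} has no database name")
    return problems
```

Calling `check_db_urls(os.environ)` in a deployment script and failing when the list is non-empty would surface a misconfigured variable early.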

Database Initialization#

When using a managed database, run the initialization SQL yourself, either manually or as a migration step. The init-db.sql Docker entrypoint script runs only when the bundled PostgreSQL container starts with an empty data volume, so it never executes against an external database. The script:

Creates the aiq_checkpoints database.
Grants permissions to the application user.
Creates the job_info table with performance indices in aiq_jobs.

Refer to deploy/compose/init-db.sql for the full schema.
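
One way to apply the script as a migration step is to run it through `psql`. The sketch below only illustrates that idea. It assumes an administrative role that may create databases, a maintenance database named `postgres`, and a password supplied through `PGPASSWORD` or `~/.pgpass`. The function names are hypothetical. Review init-db.sql first, because it may assume the role names used by the compose stack.

```python
import subprocess

INIT_SQL = "deploy/compose/init-db.sql"  # path named in this guide


def build_init_command(host, admin_user, port=5432, maintenance_db="postgres"):
    """Build a psql argv list that applies init-db.sql and stops at the first error."""
    return [
        "psql",
        "--host", host,
        "--port", str(port),
        "--username", admin_user,
        "--dbname", maintenance_db,  # assumption: default maintenance database
        "--set", "ON_ERROR_STOP=1",  # abort instead of continuing after a failed statement
        "--file", INIT_SQL,
    ]


def run_init(host, admin_user):
    """Run the script; raises CalledProcessError if psql exits with a non-zero status."""
    subprocess.run(build_init_command(host, admin_user), check=True)
```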

Backup Strategy#

Back up the following databases regularly:

aiq_jobs – Contains the job_info table (job metadata) and the job_events table (event stream). This is the critical operational data store.
aiq_checkpoints – Contains LangGraph agent state checkpoints. These allow resumption of interrupted research workflows.

For managed databases, enable automated daily backups with at least 7 days of retention. For self-managed PostgreSQL, use pg_dump on a schedule:

pg_dump -U aiq -d aiq_jobs > aiq_jobs_$(date +%Y%m%d).sql
pg_dump -U aiq -d aiq_checkpoints > aiq_checkpoints_$(date +%Y%m%d).sql
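
A scheduled job also needs to prune old dumps. The sketch below builds the same date-stamped `pg_dump` commands and selects expired files by the date in their names. Applying the 7-day retention figure to self-managed dumps is an assumption carried over from the managed-database guidance, and the helper names are illustrative.

```python
from datetime import date, datetime, timedelta

DATABASES = ("aiq_jobs", "aiq_checkpoints")


def dump_filename(db, day):
    """Match the shell naming scheme: <db>_<YYYYMMDD>.sql."""
    return f"{db}_{day:%Y%m%d}.sql"


def dump_command(db, day, user="aiq"):
    """Build a pg_dump argv list that writes to the date-stamped file."""
    return ["pg_dump", "-U", user, "-d", db, "-f", dump_filename(db, day)]


def expired_dumps(filenames, today, keep_days=7):
    """Return dump files whose date stamp is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        if not name.endswith(".sql"):
            continue
        db, _, stamp = name[: -len(".sql")].rpartition("_")
        if db not in DATABASES:
            continue  # ignore files this job did not create
        try:
            day = datetime.strptime(stamp, "%Y%m%d").date()
        except ValueError:
            continue  # not a date-stamped dump
        if day < cutoff:
            expired.append(name)
    return expired
```

The caller decides what to do with the returned names, for example deleting them or moving them to cold storage.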

Scaling#

Horizontal Backend Scaling#

The backend is stateless apart from database connections, so it can be horizontally scaled behind a load balancer.

Docker Compose: Run multiple backend containers by scaling the service, and place a reverse proxy (such as Traefik or NGINX) in front of them:

docker compose --env-file ../.env -f docker-compose.yaml up -d --scale aiq-agent=3

Each scaled instance starts its own embedded Dask scheduler and workers. To share one Dask cluster across instances, deploy Dask separately and set NAT_DASK_SCHEDULER_ADDRESS to the address of the external scheduler.

Dask Workers#

Each backend container runs an embedded Dask scheduler with a configurable number of workers and threads:

Variable	Default	Guidance
`DASK_NWORKERS`	`1`	Increase for higher job throughput. Each worker consumes memory proportional to the research workflow depth.
`DASK_NTHREADS`	`4`	Increase for I/O-bound workloads (web searches, API calls).
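
To reason about throughput, it can help to estimate how many tasks the deployment can run at once. The sketch below assumes `DASK_NTHREADS` is a per-worker thread count (this guide does not state that explicitly) and that each thread runs one task at a time. `task_slots` is an illustrative helper.

```python
def task_slots(env, replicas=1):
    """Estimate concurrent task slots across all backend replicas.

    Falls back to the defaults from the table above when a variable is unset.
    """
    nworkers = int(env.get("DASK_NWORKERS", "1"))
    nthreads = int(env.get("DASK_NTHREADS", "4"))  # assumption: threads per worker
    return replicas * nworkers * nthreads
```

With the defaults, one replica gives 4 slots. Three replicas with `DASK_NWORKERS=2` give 24.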

Resource Requirements#

Deep research workflows are memory- and compute-intensive due to multi-phase LLM calls. Recommended minimums:

Component	CPU	Memory	Notes
Backend	2 cores	4 GB	Increase for deep research or multiple concurrent users.
Frontend	0.5 cores	512 MB	Lightweight Next.js server.
PostgreSQL	1 core	2 GB	Increase for high write throughput.

Security#

Non-Root Execution#

The Docker image runs as a non-root user (aiq, UID 1000) in both dev and release targets. The NVIDIA distroless base image has no shell and no package manager, reducing the attack surface.

Read-Only Configuration Mounts#

The compose stack mounts configs/ as read-only (:ro), preventing the application from modifying its own configuration at runtime.

Secrets Management#

Store API keys in deploy/.env and ensure the file is not committed to version control (it is listed in .gitignore). Never embed keys in configuration files or Dockerfiles.

Monitoring#

Health Endpoint#

The backend exposes a health endpoint at /health for liveness and readiness probes.

curl http://localhost:8000/health
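
Deployment scripts often need to wait until the backend reports healthy before routing traffic to it. The following sketch polls the endpoint with the standard library. It assumes that an HTTP 200 response means healthy, because the response body format is not described here. The function names and retry defaults are illustrative.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or None if the request could not complete."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except OSError:
        return None  # connection refused, DNS failure, timeout, and similar


def wait_until_healthy(url, attempts=10, delay=2.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after the given number of attempts."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:8000/health")` returns `True` once the backend responds with 200, or `False` after ten failed attempts.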

Log Tailing#

Backend logs show agent execution, tool calls, LLM interactions, and job lifecycle events.

docker logs aiq-agent -f

Set LOG_LEVEL=DEBUG for verbose output during troubleshooting. Use LOG_LEVEL=WARNING in production to reduce log volume.

Tracing#

The backend supports OpenTelemetry-compatible tracing. See Observability for setup guides covering Phoenix, LangSmith, Weave, and the OTEL Collector with privacy redaction.

If you deploy the aiq_api front end and want request correlation on NAT-exported spans, set the following environment variables at deploy time rather than hardcoding them:

AIQ_TRACE_USER_IDENTITY_MODE
AIQ_TRACE_USER_IDENTITY_HMAC_SECRET
AIQ_TRACE_CLIENT_ID_MODE
AIQ_TRACE_CLIENT_ID_HMAC_SECRET
AIQ_TRACE_CLIENT_IP_HEADERS

Metrics to Watch#

Metric	Source	What to look for
Backend response time	Health endpoint, access logs	Increasing latency indicates resource pressure or LLM API slowdowns.
Job queue depth	`job_info` table (`status='pending'`)	Growing backlog means Dask workers cannot keep up.
Database connections	PostgreSQL `pg_stat_activity`	Connection counts approaching the server limit; too many backend replicas can exhaust connections.
Container restarts	Docker	Frequent restarts indicate OOM kills or startup failures.
Dask worker memory	Dask dashboard (port 8787)	Memory growth in workers during deep research.
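
The two database-backed metrics can be sampled with plain SQL. The sketch below builds `psql` commands for them and flags a backlog that keeps growing. It assumes the `aiq` user from the backup examples and the `status='pending'` filter from the table above. The helper names and the three-sample rule are illustrative choices.

```python
# Pending jobs in the job store (table and filter taken from the metrics table).
QUEUE_DEPTH_SQL = "SELECT count(*) FROM job_info WHERE status = 'pending'"
# Open connections to the job database, from the standard pg_stat_activity view.
CONNECTIONS_SQL = "SELECT count(*) FROM pg_stat_activity WHERE datname = 'aiq_jobs'"


def psql_scalar_command(sql, dbname="aiq_jobs", user="aiq"):
    """Build a psql argv list that prints a single bare value (-A unaligned, -t tuples only)."""
    return ["psql", "-U", user, "-d", dbname, "-At", "-c", sql]


def backlog_is_growing(samples):
    """True when at least three samples are given and each is larger than the previous one."""
    return len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:]))
```

Running `psql_scalar_command(QUEUE_DEPTH_SQL)` on a schedule and feeding the recent values to `backlog_is_growing` gives a simple alert condition for the job queue.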