Resource Scaling Modes for NeMo Retriever Extraction
This guide covers how resource scaling modes work across stages in NeMo Retriever extraction, and how to configure it with docker-compose.
- Static scaling: Each pipeline stage runs a fixed number of replicas based on heuristics (memory-aware). Good for consistent latency; higher steady-state memory usage.
- Dynamic scaling: Only the source stage is fixed; other stages scale up/down based on observed resource pressure. Better memory efficiency; may briefly pause to spin replicas back up after idle periods.
Note
NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
When to choose which
- Choose Static when latency consistency and warm pipelines matter more than memory minimization.
- Choose Dynamic when memory headroom is constrained or workloads are bursty/idle for long periods.
Configure (docker-compose)
Edit services > nv-ingest-ms-runtime > environment in docker-compose.yaml.
Select mode
- Dynamic (default)
INGEST_DISABLE_DYNAMIC_SCALING=false-
INGEST_DYNAMIC_MEMORY_THRESHOLD=0.80(fraction of memory; worker scaling reacts around this level) -
Static
INGEST_DISABLE_DYNAMIC_SCALING=true- Optionally set a static memory threshold:
INGEST_STATIC_MEMORY_THRESHOLD=0.85(fraction of total memory reserved for static replicas)
Example (Static):
services:
nv-ingest-ms-runtime:
environment:
- INGEST_DISABLE_DYNAMIC_SCALING=true
- INGEST_STATIC_MEMORY_THRESHOLD=0.85
Example (Dynamic):
services:
nv-ingest-ms-runtime:
environment:
- INGEST_DISABLE_DYNAMIC_SCALING=false
- INGEST_DYNAMIC_MEMORY_THRESHOLD=0.80
Pipeline config mapping
pipeline.disable_dynamic_scaling⇐INGEST_DISABLE_DYNAMIC_SCALINGpipeline.dynamic_memory_threshold⇐INGEST_DYNAMIC_MEMORY_THRESHOLDpipeline.static_memory_threshold⇐INGEST_STATIC_MEMORY_THRESHOLD
Trade-offs recap
- Dynamic
- Pros: Better memory efficiency; stages scale down when idle; can force scale-down under spikes.
-
Cons: After long idle, stages may scale to 0 replicas causing brief warm-up latency when work resumes.
-
Static
- Pros: Stable, predictable latency; stages remain hot.
- Cons: Higher baseline memory usage over time.
Sources of memory utilization
- Workload size and concurrency
- More in‑flight jobs create more objects (pages, images, tables, charts) and large artifacts (for example, embeddings).
- Example: 1 MB text file → paragraphs with 20% overlap → 4k‑dim embeddings base64‑encoded to JSON
- Assumptions: ~600 bytes per paragraph. 20% overlap ⇒ effective step ≈ 480 bytes. Chunks ≈ 1,000,000 / 480 ≈ 2,083.
- Per‑embedding size: 4,096 dims × 4 bytes (float32) = 16,384 bytes; base64 expansion × 4/3 ≈ 21,845 bytes (≈21.3 KB).
- Total embeddings payload: ≈ 2,083 × 21.3 KB ≈ 45 MB, excluding JSON keys/metadata.
- Takeaway: a 1 MB source can yield ≳40× memory just for embeddings, before adding extracted text, images, or other artifacts.
- Example: PDF rendering and extracted images (A4 @ 72 DPI)
- Rendering a page is a large in‑memory buffer; each extracted sub‑image adds more, and base64 inflates size.
- Page pixels ≈ 8.27×72 by 11.69×72 ≈ 595×842 ≈ 0.50 MP.
- RGB (3 bytes/pixel) ≈ 1.5 MB per page buffer; RGBA (4 bytes/pixel) ≈ 2.0 MB.
- Ten 1024×1024 RGB crops ≈ 3.0 MB each in memory → base64 (+33%) ≈ 4.0 MB each ⇒ ~40 MB just for crops (JSON not included).
- If you also base64 the full page image, expect another ~33% over the raw byte size (compression varies by format).
- Library behavior
- Components like PyArrow may retain memory longer than expected (delayed free).
- Queues and payloads
- Base64‑encoded, fragmented documents in Redis consume memory proportional to concurrent jobs, clients, and drain speed.
Where to look in docker-compose
Open docker-compose.yaml and locate:
services > nv-ingest-ms-runtime > environment:INGEST_DISABLE_DYNAMIC_SCALINGINGEST_DYNAMIC_MEMORY_THRESHOLDINGEST_STATIC_MEMORY_THRESHOLD