Example: CLI with Local NIMs#
Run the full research pipeline interactively from the command line using self-hosted NIM models. This is useful for air-gapped environments, custom fine-tuned models, or reducing latency by running inference locally.
This example is based on configs/config_cli_default.yml with modifications to point LLMs at locally-hosted NIM containers. To use local NIMs, copy the default config to configs/config_cli_local_nims.yml and update the base_url fields as shown below.
Prerequisites#
You need Docker and an NVIDIA GPU with sufficient VRAM to run the NIM containers. The Nemotron Nano 30B model requires approximately 40 GB of VRAM (or two 24 GB GPUs).
Running NIM Containers#
Start the NIM model server locally using Docker:
```bash
# Pull and run the Nemotron NIM container
# Adjust --gpus and CUDA_VISIBLE_DEVICES for your hardware
docker run -d \
  --name nemotron-nim \
  --gpus all \
  -p 8001:8000 \
  -e NVIDIA_API_KEY="${NVIDIA_API_KEY}" \
  nvcr.io/nim/nvidia/nemotron-3-nano-30b-a3b:latest
```
Verify the model is ready:
```bash
curl http://localhost:8001/v1/models
```
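The readiness check can also be scripted. The sketch below (standard library only) polls the OpenAI-compatible `/v1/models` route that NIM containers expose until it lists a model; the exact payload check (`data` being non-empty) is an assumption about the response shape, not project code.

```python
import json
import time
import urllib.request


def nim_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True once /v1/models responds and lists at least one model."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            payload = json.load(resp)
        # Assumed response shape: {"data": [{"id": ...}, ...]}
        return bool(payload.get("data"))
    except (OSError, ValueError):
        # Connection refused or timeout while the model is still loading.
        return False


def wait_for_nim(base_url: str, max_wait_s: float = 300.0, poll_s: float = 10.0) -> bool:
    """Poll until the server is ready; NIMs can take minutes on first start."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if nim_ready(base_url):
            return True
        time.sleep(poll_s)
    return False
```

For example, `wait_for_nim("http://localhost:8001")` blocks until the container on port 8001 is serving (and `wait_for_nim("http://localhost:8002")` does the same for a second container).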
For multi-model setups (for example, separate intent and research models), run additional containers on different ports:
```bash
# Smaller model for intent classification
docker run -d \
  --name nemotron-mini-nim \
  --gpus '"device=1"' \
  -p 8002:8000 \
  -e NVIDIA_API_KEY="${NVIDIA_API_KEY}" \
  nvcr.io/nim/nvidia/nemotron-mini-4b-instruct:latest
```
Configuration#
```yaml
# configs/config_cli_local_nims.yml
# Copy of config_cli_default.yml with base_url pointing to local NIM containers

general:
  telemetry:
    logging:
      console:
        _type: console
        level: INFO
    # Optional: trace to local Phoenix for debugging
    # tracing:
    #   phoenix:
    #     _type: phoenix
    #     endpoint: http://localhost:6006/v1/traces
    #     project: local-dev

# ===========================================================================
# LLMs - pointing to local NIM containers
# ===========================================================================
# The key difference from cloud configs: base_url points to localhost
# instead of integrate.api.nvidia.com. No NVIDIA_API_KEY is needed for
# inference (only for pulling the container image).
llms:
  nemotron_llm_intent:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "http://localhost:8001/v1"  # <-- Local NIM
    temperature: 0.5
    top_p: 0.9
    max_tokens: 4096
    num_retries: 3
    chat_template_kwargs:
      enable_thinking: true

  nemotron_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "http://localhost:8001/v1"  # <-- Local NIM
    temperature: 0.1
    top_p: 0.3
    max_tokens: 16384
    num_retries: 3
    chat_template_kwargs:
      enable_thinking: true

  nemotron_llm_deep:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "http://localhost:8001/v1"  # <-- Local NIM
    temperature: 1.0
    top_p: 1.0
    max_tokens: 128000
    num_retries: 3
    chat_template_kwargs:
      enable_thinking: true

# ===========================================================================
# Functions
# ===========================================================================
functions:
  web_search_tool:
    _type: tavily_web_search
    max_results: 5
    max_content_length: 1000

  advanced_web_search_tool:
    _type: tavily_web_search
    max_results: 2
    advanced_search: true

  paper_search_tool:
    _type: paper_search
    max_results: 5
    serper_api_key: ${SERPER_API_KEY}

  intent_classifier:
    _type: intent_classifier
    llm: nemotron_llm_intent
    tools:
      - web_search_tool
      - paper_search_tool

  clarifier_agent:
    _type: clarifier_agent
    llm: nemotron_llm
    planner_llm: nemotron_llm
    tools:
      - web_search_tool
    max_turns: 3
    enable_plan_approval: true  # Interactive plan approval in CLI
    log_response_max_chars: 2000
    verbose: true

  shallow_research_agent:
    _type: shallow_research_agent
    llm: nemotron_llm
    tools:
      - web_search_tool
    max_llm_turns: 10
    max_tool_iterations: 5

  deep_research_agent:
    _type: deep_research_agent
    orchestrator_llm: nemotron_llm_deep
    max_loops: 2
    tools:
      - paper_search_tool
      - advanced_web_search_tool

# ===========================================================================
# Workflow
# ===========================================================================
workflow:
  _type: chat_deepresearcher_agent
  enable_escalation: true
  enable_clarifier: true
  checkpoint_db: ${AIQ_CHECKPOINT_DB:-./checkpoints.db}
```
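Before running the full pipeline against this config, you can smoke-test a `base_url` directly. The sketch below assumes the OpenAI-compatible `/chat/completions` route that NIM containers serve; the helper names are illustrative, not part of the project.

```python
import json
import urllib.request


def build_chat_body(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """OpenAI-compatible chat payload, as accepted by a local NIM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat_once(base_url: str, model: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send one chat completion to base_url and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_body(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

For example, `chat_once("http://localhost:8001/v1", "nvidia/nemotron-3-nano-30b-a3b", "Say hello")` should return a short reply once the container is ready. No API key header is needed for local inference.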
Required Environment Variables#
```bash
# Only needed for pulling NIM container images (not for inference)
export NVIDIA_API_KEY="nvapi-..."  # pragma: allowlist secret

# Web search still requires API keys (runs externally)
export TAVILY_API_KEY="tvly-..."   # pragma: allowlist secret
export SERPER_API_KEY="..."
```
How to Run#
```bash
# Interactive CLI mode (recommended)
./scripts/start_cli.sh --config_file configs/config_cli_local_nims.yml

# Single query mode
dotenv -f deploy/.env run .venv/bin/nat run \
  --config_file configs/config_cli_local_nims.yml \
  --input "What are the latest advances in quantum error correction?"
```
The CLI script starts an interactive session. Type your research query and the system will:

1. Classify the intent (shallow vs. deep)
2. For deep queries: present a research plan for your approval
3. Execute the research with tool calls shown in real time
4. Output a structured report
Example Session#
```text
> What are the latest advances in quantum error correction?

[Intent: shallow]
[Tool: web_search] Searching: "quantum error correction advances 2026"
[Tool: web_search] Searching: "quantum error correction codes recent breakthroughs"
[Tool: web_search] Found 5 results

# Quantum Error Correction: Recent Advances
...
```
Tips for Local NIMs#
- **GPU memory:** Monitor usage with `nvidia-smi`. The 30B model needs ~40 GB of VRAM.
- **Startup time:** NIM containers take 2–5 minutes to load the model on first start. Wait until `/v1/models` returns a response before sending queries.
- **Multiple GPUs:** Use `--gpus '"device=0,1"'` to spread a model across GPUs, or run separate containers per GPU for different model roles.
- **Networking:** If running inside Docker Compose, use container names instead of `localhost` in `base_url`.
- **`num_retries`:** Lower retry counts (3 vs. 5) are appropriate for local NIMs since failures are less likely to be transient.
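The networking tip can be illustrated with a minimal Docker Compose sketch; the service name and GPU reservation block below are illustrative, not from the project:

```yaml
# docker-compose.yml (illustrative sketch)
services:
  nemotron-nim:
    image: nvcr.io/nim/nvidia/nemotron-3-nano-30b-a3b:latest
    ports:
      - "8001:8000"
    environment:
      - NVIDIA_API_KEY=${NVIDIA_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

From another container on the same Compose network, the config's `base_url` would then be `http://nemotron-nim:8000/v1` — the service name and the container's internal port, not the published host port 8001.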