Swapping Models

LLMs are defined in the llms section and referenced by name from agents and tools. You can swap NIM models, change sampling parameters, or add alternative providers.

Example: NIM model (default)

llms:
  nemotron_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "https://integrate.api.nvidia.com/v1"
    temperature: 0.7
    top_p: 0.7
    max_tokens: 8192
    num_retries: 5

Example: NIM with thinking (for example, for deep research)

llms:
  nemotron_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "https://integrate.api.nvidia.com/v1"
    temperature: 1.0
    top_p: 1.0
    max_tokens: 128000
    chat_template_kwargs:
      enable_thinking: true

Model roles: the workflow maps LLMs to roles (orchestrator, researcher, planner, and so on) through the LLMProvider. In YAML you assign a named LLM to each agent (for example, orchestrator_llm: nemotron_llm or llm: nemotron_llm). To swap models per role, define additional keys under llms and point different agents at them.
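As a minimal sketch of that indirection (the agent names, LLM keys, and configs below are illustrative, not taken from a real workflow file), resolving an agent's LLM is just a dictionary lookup, and swapping a model per role is just changing the name an agent points at:

```python
# Hypothetical parsed config: agents reference entries in `llms` by name.
config = {
    "llms": {
        "local_llm": {"_type": "nim", "base_url": "http://localhost:8080/v1"},
        "hosted_llm": {"_type": "nim", "base_url": "https://integrate.api.nvidia.com/v1"},
    },
    "functions": {
        "shallow_research_agent": {"llm": "local_llm"},
        "deep_research_agent": {"orchestrator_llm": "hosted_llm"},
    },
}

# Keys that are assumed to reference a named entry in `llms`.
LLM_KEYS = ("llm", "orchestrator_llm")

def resolve_llms(cfg):
    """Map each (agent, key) pair to the LLM config it points at."""
    resolved = {}
    for agent, spec in cfg["functions"].items():
        for key in LLM_KEYS:
            if key in spec:
                resolved[(agent, key)] = cfg["llms"][spec[key]]
    return resolved

roles = resolve_llms(config)
# Swapping the shallow researcher to the hosted model is a one-line change:
config["functions"]["shallow_research_agent"]["llm"] = "hosted_llm"
```

The same lookup happens for every role-specific key, which is why adding a new entry under llms and re-pointing one agent at it swaps a single role without touching the others.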

Using Downloadable NIMs (Self-Hosted)

By default, configs use NVIDIA’s hosted NIM API (integrate.api.nvidia.com). You can also run NIMs locally or on your own infrastructure for lower latency, data privacy, or offline use.

1. Find Downloadable NIMs

Browse available NIMs at build.nvidia.com. Each model page includes a “Self-Host” tab with Docker pull commands and setup instructions.

2. Run a NIM Locally

# Example: run Nemotron on port 8080
docker run --gpus all -p 8080:8000 \
  nvcr.io/nim/nvidia/nemotron-3-nano-30b-a3b:latest

Refer to the NIM documentation for GPU requirements, environment variables, and multi-GPU setup.
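Once the container is up, a quick sanity check is to list the served models through the OpenAI-compatible API that NIMs expose (the port assumes the -p 8080:8000 mapping from the docker run example above):

```shell
# List models served by the local NIM (OpenAI-compatible endpoint).
# Assumes the host port 8080 mapping from the docker run example above.
curl -sf http://localhost:8080/v1/models || echo "NIM is not reachable on port 8080"
```

The model id returned here is the value to use for model_name in the config below.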

3. Update Your Config

Change base_url to point at your local NIM instance instead of the hosted API; model_name stays the same. If your config sets api_key, you can usually remove it, since local NIMs typically don't require one.

llms:
  nemotron_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "http://localhost:8080/v1"    # local NIM
    temperature: 0.7
    max_tokens: 8192
    num_retries: 5

You can mix hosted and local NIMs in the same config. For example, use a local NIM for the high-volume shallow researcher and a hosted NIM for the orchestrator:

llms:
  local_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "http://localhost:8080/v1"
    temperature: 0.7
    max_tokens: 8192

  hosted_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    base_url: "https://integrate.api.nvidia.com/v1"
    temperature: 1.0
    max_tokens: 128000

functions:
  shallow_research_agent:
    _type: shallow_research_agent
    llm: local_llm          # fast, local inference
    # ...

  deep_research_agent:
    _type: deep_research_agent
    orchestrator_llm: hosted_llm   # hosted for deep thinking
    # ...