NemoClaw can route inference to a model server running on your machine instead of a cloud API. This page covers Ollama, compatible-endpoint paths for other servers, and experimental managed options for vLLM and NVIDIA NIM.
All approaches use the same inference.local routing model.
The agent inside the sandbox never connects to your model server directly.
OpenShell intercepts inference traffic and forwards it to the local endpoint you configure.
Ollama is the default local inference option. The onboard wizard detects Ollama automatically when you have installed it or started it on the host.
If you installed Ollama but have not started it, NemoClaw starts it for you.
On macOS and Linux, the wizard can also offer to install Ollama when it is not present.
When the host Ollama is below the minimum version NemoClaw expects for its starter models (currently 0.7.0), the wizard surfaces an explicit Upgrade Ollama entry in the provider menu instead of silently reusing the older daemon, and the express setup path resolves to that entry.
The wizard inspects both the CLI binary (ollama --version) and the locally running daemon (/api/version on :11434) so the upgrade entry still appears when only one side is stale, for example a fresh user-local binary paired with the original system daemon.
The gate skips Windows-host Ollama reached from WSL through host.docker.internal.
The separate Use / Start / Install Ollama on Windows host entries handle that case and run their own actions on the Windows side.
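The version gate described above can be sketched as follows. This is a minimal illustration, not NemoClaw's actual code: parse_version and needs_upgrade are hypothetical helper names, and the two inputs stand in for the text printed by ollama --version and the version string returned by /api/version.

```python
import re

# Minimum version quoted in the text above; NemoClaw may raise it over time.
MIN_OLLAMA_VERSION = (0, 7, 0)


def parse_version(text):
    """Return the first MAJOR.MINOR.PATCH triple found in text, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text or "")
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def needs_upgrade(cli_output, daemon_version):
    """True when either the CLI binary or the running daemon is below the minimum."""
    versions = [parse_version(cli_output), parse_version(daemon_version)]
    # A side that could not be probed (None) is ignored rather than treated as stale.
    return any(v is not None and v < MIN_OLLAMA_VERSION for v in versions)
```

Checking both values independently is what lets a fresh binary paired with an older daemon still trigger the upgrade entry.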
On macOS, the wizard runs the platform install or upgrade path with brew upgrade ollama.
On Linux, the wizard runs the official https://ollama.com/install.sh path.
Upgrades on Linux always take the sudo-driven system path because the sudo-free user-local fallback would leave the existing system daemon on :11434 serving the stale binary.
If sudo is not available in a non-interactive run, NemoClaw refuses to silently downgrade the path and asks you to rerun interactively or upgrade Ollama manually.
After an upgrade finishes, NemoClaw re-probes the running daemon’s /api/version and fails the run if the daemon still reports below the minimum.
Fresh installs skip this re-probe because the bundled installers ship a daemon at or above the minimum.
On WSL, the wizard can use, start, restart, or install Ollama on the Windows host through PowerShell interop.
On native Linux, the install path picks between a system install (under /usr/local, using the official https://ollama.com/install.sh) and a sudo-free user-local install (under ${HOME}/.local).
NemoClaw selects the mode automatically:
Passwordless sudo (sudo -n true returns 0) selects the system install.
A non-interactive run (NEMOCLAW_NON_INTERACTIVE=1 or no TTY on stdin) without passwordless sudo selects the user-local install.
This is the path that lets headless hosts complete onboarding without prompting for a sudo password.
Override the detection with NEMOCLAW_OLLAMA_INSTALL_MODE=system or NEMOCLAW_OLLAMA_INSTALL_MODE=user.
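The selection rules can be expressed as a small decision function. This is a sketch under stated assumptions: choose_install_mode is a hypothetical name, the caller supplies the result of the sudo -n true probe, and every case the text does not spell out is assumed to fall through to the system install.

```python
def choose_install_mode(env, stdin_is_tty, has_passwordless_sudo):
    """Pick "system" or "user" for the Ollama install on native Linux."""
    # An explicit override always wins over detection.
    override = env.get("NEMOCLAW_OLLAMA_INSTALL_MODE")
    if override in ("system", "user"):
        return override

    non_interactive = env.get("NEMOCLAW_NON_INTERACTIVE") == "1" or not stdin_is_tty
    if non_interactive and not has_passwordless_sudo:
        # Headless host that cannot elevate: avoid a sudo password prompt.
        return "user"

    # Assumption: all remaining cases use the system install.
    return "system"
```

A caller would typically pass os.environ, sys.stdin.isatty(), and the exit status of the sudo probe converted to a boolean.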
The user-local install replicates only the binary extraction step of the official installer.
It downloads the release tarball, extracts it to ${HOME}/.local, and launches ${HOME}/.local/bin/ollama serve one time.
It does not configure a systemd service, does not create the ollama system user, and does not install CUDA drivers, so you must relaunch the daemon manually after a reboot.
NemoClaw also prints a one-line PATH hint if ${HOME}/.local/bin is not already on your PATH; you can add export PATH="${HOME}/.local/bin:$PATH" to your shell profile to invoke ollama directly.
Both modes rely on zstd for archive extraction. On Debian and Ubuntu, the system path uses sudo apt-get to install zstd automatically and explains the prompt before continuing.
The user-local path cannot bootstrap system packages without elevation.
If zstd is missing, it prints per-distro install hints and exits.
Install zstd manually, then rerun onboarding.
Run the onboard wizard.
Select Local Ollama from the provider list.
NemoClaw lists installed models or offers starter models if you have not installed any.
On hosts where the larger starter models fit the currently available GPU memory, the starter list includes qwen3.6:35b and selects it by default.
When another GPU workload is using most of the memory at onboard time, NemoClaw downgrades the menu to the largest model that still fits.
It pulls the selected model, loads it into memory, and validates it before continuing.
When Ollama reports a loaded-model context length, NemoClaw uses that value for the contextWindow baked into openclaw.json unless you set NEMOCLAW_CONTEXT_WINDOW yourself.
If the selected model declares that it does not support tool calling, onboarding stops with guidance to choose a model whose ollama show <model> capabilities include tools.
The validation also requires structured chat-completions tool calls.
If the model leaks tool-call JSON as plain message text, onboarding stops so you can choose a model that returns tool calls in the expected response field.
If a host-side validation probe times out, NemoClaw retries the Ollama tool-call validation with a larger timeout before failing the setup.
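The two validation checks above can be illustrated with plain dictionaries. This is a hedged sketch, not NemoClaw's validator: the function names are hypothetical, the first function assumes the model details are available as a dict with a capabilities list, and the leak detection is a simple heuristic chosen for illustration.

```python
import json


def model_declares_tools(show_payload):
    """True when the capabilities list of a model's details includes "tools"."""
    return "tools" in show_payload.get("capabilities", [])


def classify_tool_response(completion):
    """Classify a chat-completions response as "structured", "leaked", or "none"."""
    message = completion["choices"][0]["message"]
    if message.get("tool_calls"):
        # The call arrived in the expected response field.
        return "structured"

    content = (message.get("content") or "").strip()
    try:
        parsed = json.loads(content)
    except ValueError:
        return "none"

    # Heuristic (assumption): JSON text that names a function and its
    # arguments looks like a tool call leaked into the message text.
    if isinstance(parsed, dict) and "name" in parsed and (
        "arguments" in parsed or "parameters" in parsed
    ):
        return "leaked"
    return "none"
```

Under this sketch, only a "structured" result would let onboarding continue. A "leaked" result corresponds to the case where onboarding stops and asks for a different model.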
On WSL, if you choose the Windows-host Ollama path, NemoClaw uses host.docker.internal:11434 and pulls missing models through the Ollama HTTP API instead of requiring the ollama CLI inside WSL.
When NemoClaw runs inside WSL, the provider menu can include Windows-host Ollama actions to use, start, restart, or install Ollama on the Windows host.
The install and restart paths set OLLAMA_HOST=0.0.0.0:11434 on the Windows side so Docker and WSL can reach the daemon through host.docker.internal.
After an install or restart action, NemoClaw relaunches Ollama from the detected Windows tray app or verified ollama.exe path and waits until host.docker.internal:11434 responds.
If the HTTP endpoint is not reachable yet, NemoClaw also checks for the Windows ollama.exe process through PowerShell interop so it can offer a start or restart action instead of hiding the Windows-host path.
If the daemon does not become reachable, onboarding prints PowerShell commands you can run to inspect the Windows-side process and port state.
Use one Ollama instance on port 11434 at a time.
If both WSL and Windows-host Ollama are running, pick the intended menu entry during onboarding so NemoClaw validates and pulls models against the right daemon.
Windows-host Ollama requires Docker Desktop WSL integration because the sandbox reaches the Windows daemon through Docker Desktop’s WSL routing path. If NemoClaw detects native Docker Engine inside WSL, the provider menu labels Windows-host Ollama actions as requiring Docker Desktop integration. Selecting one of those actions in the unsupported native Docker topology exits early with a remediation message instead of trying to start or install Ollama on Windows.
On non-WSL hosts, NemoClaw keeps Ollama bound to 127.0.0.1:11434 and starts a token-gated reverse proxy on 0.0.0.0:11435.
The native install/start paths also reset NemoClaw-managed systemd launches to the loopback binding.
Containers and other hosts on the local network reach Ollama only through the proxy, which validates a Bearer token before forwarding requests.
On that native path, NemoClaw never exposes Ollama without authentication.
WSL Ollama paths do not use this proxy.
Windows-host Ollama uses the Windows daemon through host.docker.internal.
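The Bearer check that a token-gated proxy performs before forwarding a request can be sketched as below. This illustrates the general technique only: is_authorized is a hypothetical name, and the constant-time comparison is a design choice of this sketch rather than a documented detail of the NemoClaw proxy.

```python
import hmac


def is_authorized(headers, expected_token):
    """True when the Authorization header carries the expected Bearer token."""
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

A proxy built this way would answer 401 whenever the function returns False and forward the request to 127.0.0.1:11434 otherwise.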
For non-WSL Ollama setups, the onboard wizard manages the proxy automatically, including the proxy token stored in ~/.nemoclaw/ollama-proxy-token with 0600 permissions.
On native Linux hosts, a firewall can allow the host proxy health check while still blocking sandbox containers on the OpenShell Docker bridge.
When the sandbox-side proxy probe fails with a TCP error, onboarding exits before it saves the inference route and prints a suggested remediation command.
If the probe cannot run, for example because Docker Desktop or WSL uses a different host routing model, onboarding continues and relies on the regular proxy health check.
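Storing a token file with 0600 permissions, as mentioned for ~/.nemoclaw/ollama-proxy-token, can be done as in the sketch below. The function name write_proxy_token and the 64-character hex token format are assumptions made for illustration; the source does not describe how NemoClaw generates the token.

```python
import os
import secrets


def write_proxy_token(path):
    """Create a random token and store it in a file readable only by its owner."""
    token = secrets.token_hex(32)  # Assumed format: 64 hex characters.
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Create the file with mode 0600 from the start instead of tightening it later.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    # The mode passed to os.open applies only on creation, so reset it for existing files.
    os.chmod(path, 0o600)
    return token
```

Passing the mode to os.open closes the short window in which a file created with default permissions would be readable by other users.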
NemoClaw configures the sandbox provider to use proxy port 11435 with the generated token as its OPENAI_API_KEY credential.
OpenShell’s L7 proxy injects the token at egress, so the agent inside the sandbox never sees the token directly.
All proxy endpoints require the Bearer token, including GET /api/tags.
Internal health and reachability checks that run through the proxy treat any HTTP response, including 401, as proof that the proxy is alive.
They fail only when nothing answers at all.
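A liveness probe with these semantics can be written with the Python standard library. This is a minimal sketch: proxy_is_alive is a hypothetical name, and the opener parameter exists only so the function can be exercised without a real server.

```python
import urllib.error
import urllib.request


def proxy_is_alive(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True when anything answers over HTTP at url, whatever the status code."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A 401 or any other error status still proves that the proxy answered.
        return True
    except OSError:
        # URLError, refused connections, and timeouts all mean nothing answered.
        return False
```

HTTPError must be caught before OSError because it is a subclass of URLError, which is itself a subclass of OSError.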
If Ollama is already running on a non-loopback address when you start onboard, the wizard restarts it on 127.0.0.1:11434 so the proxy is the only network path to the model server.
When you switch away from Ollama, stop host services, or destroy an Ollama-backed sandbox, NemoClaw asks Ollama to unload currently loaded models from GPU memory.
The cleanup sends keep_alive: 0 for each model reported by Ollama and runs on a best-effort basis, so shutdown continues if Ollama is already stopped.
This does not delete downloaded model files.
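The best-effort unload can be sketched with the Ollama HTTP API. The source states only that keep_alive: 0 is sent for each model Ollama reports. This sketch assumes that the list of loaded models comes from a /api/ps response and that the unload request goes to /api/generate. The function names and the opener parameter are hypothetical.

```python
import json
import urllib.request


def build_unload_payloads(ps_payload):
    """Return one keep_alive: 0 request body per model listed in a /api/ps response."""
    return [
        {"model": entry["name"], "keep_alive": 0}
        for entry in ps_payload.get("models", [])
        if entry.get("name")
    ]


def unload_all(base_url, ps_payload, opener=urllib.request.urlopen):
    """Best-effort unload: failures are ignored so shutdown can continue."""
    unloaded = []
    for payload in build_unload_payloads(ps_payload):
        request = urllib.request.Request(
            f"{base_url}/api/generate",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        try:
            with opener(request, timeout=5):
                unloaded.append(payload["model"])
        except OSError:
            continue  # Ollama may already be stopped.
    return unloaded
```

Because every network error is swallowed, a stopped daemon yields an empty result instead of interrupting the shutdown sequence.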
If NEMOCLAW_MODEL is not set, NemoClaw selects a default model based on available memory.
If NEMOCLAW_MODEL names a known bootstrap model (for example qwen3.6:35b) that does not fit the host’s currently available GPU memory, NemoClaw warns and falls back to the largest known model that does fit.
Unknown or custom tags (any value the bootstrap registry has not seen) are still passed through; the Ollama runner validates the choice itself.
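The fallback rules can be sketched with a small registry lookup. Everything numeric here is a placeholder: only the qwen3.6:35b tag comes from the text, while the other tags, the memory figures, and the names BOOTSTRAP_MODELS and resolve_model are hypothetical. The sketch also assumes that the default for an unset NEMOCLAW_MODEL is the largest known model that fits.

```python
# Hypothetical registry: tag -> GPU memory needed in GiB. Only qwen3.6:35b appears
# in the text; the other tags and every number are placeholders for illustration.
BOOTSTRAP_MODELS = {
    "qwen3.6:35b": 24.0,
    "example-medium:9b": 8.0,
    "example-small:4b": 4.0,
}


def resolve_model(requested, available_gib):
    """Return (model, warning) following the fallback rules described above."""

    def largest_fitting():
        fitting = [
            (need, tag) for tag, need in BOOTSTRAP_MODELS.items() if need <= available_gib
        ]
        return max(fitting)[1] if fitting else None

    if requested is None:
        # Assumption: the default is the largest known model that fits.
        return largest_fitting(), None
    if requested not in BOOTSTRAP_MODELS:
        # Unknown or custom tags pass through unchanged.
        return requested, None
    if BOOTSTRAP_MODELS[requested] <= available_gib:
        return requested, None
    fallback = largest_fitting()
    return fallback, f"{requested} does not fit in {available_gib} GiB; using {fallback}"
```

Custom tags skip the memory check entirely in this sketch, which mirrors the statement that the Ollama runner validates those choices itself.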
--yes (or NEMOCLAW_YES=1) authorizes the Ollama model download without an interactive confirmation prompt.
Under --non-interactive, include --yes (or NEMOCLAW_YES=1) to authorize the download.
Onboard exits otherwise because it cannot prompt.
Run onboard without --non-interactive to get the interactive [y/N] prompt that shows the model size before downloading.
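The interaction between the two flags can be summarized in a short decision function. This is a sketch with hypothetical names: authorize_download is not a NemoClaw function, and ask stands in for whatever displays the [y/N] prompt.

```python
def authorize_download(non_interactive, yes, ask=None):
    """Return True when the model download may proceed.

    ask is a callable that shows the [y/N] prompt and returns the typed answer.
    """
    if yes:
        # --yes or NEMOCLAW_YES=1 authorizes the download without a prompt.
        return True
    if non_interactive:
        # There is no way to prompt, so the run has to stop.
        raise SystemExit("model download needs --yes or NEMOCLAW_YES=1 in non-interactive mode")
    answer = ask() if ask is not None else input("Download model? [y/N] ")
    # Anything other than an explicit yes keeps the default answer, N.
    return answer.strip().lower() in ("y", "yes")
```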
Use Other OpenAI-compatible endpoint for vLLM, TensorRT-LLM, llama.cpp, LocalAI, NIM, SGLang, or another server that implements /v1/chat/completions.
For compatible endpoints, NemoClaw uses /v1/chat/completions by default because some local backends accept /v1/responses but drop system prompts or tool definitions.
Set NEMOCLAW_PREFERRED_API=openai-responses only after you have verified that the backend streams the events OpenClaw requires.
For the full compatible-endpoint prompt flow, non-interactive variables, API-path controls, managed vLLM profiles, NIM setup, and timeout settings, refer to Inference Options.
NemoClaw can use an already-running vLLM server on localhost:8000, start managed vLLM on supported NVIDIA GPU hosts, or manage a local NIM container when NEMOCLAW_EXPERIMENTAL=1 is set.
Managed vLLM records the model returned by /v1/models and uses runtime metadata such as max_model_len when available.
NIM uses the same chat-completions API path restriction as vLLM.
For registry slugs, Hugging Face token requirements, NGC login behavior, and non-interactive examples, refer to Inference Options.
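Reading the served model and its context length from a /v1/models response, as described for managed vLLM, might look like the sketch below. The function name describe_served_model is hypothetical, and taking the first listed entry is an assumption of this example rather than documented NemoClaw behavior.

```python
def describe_served_model(models_payload):
    """Return (model_id, max_model_len) from a /v1/models response body."""
    entries = models_payload.get("data", [])
    if not entries:
        return None, None
    first = entries[0]  # Assumption: the first entry is the served model.
    # max_model_len is runtime metadata and may be absent on some servers.
    return first.get("id"), first.get("max_model_len")
```

When max_model_len is missing, the second element is None and the caller needs another source for the context window.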
After onboarding completes, confirm the active provider and model.
The output shows the provider label (for example, “Local vLLM” or “Other OpenAI-compatible endpoint”) and the active model.
For Local Ollama, status also checks the authenticated proxy when a proxy token is available.
If Inference is healthy but Inference (auth proxy) is not, rerun onboarding to repair the proxy path that sandbox requests use.
You can change the model without re-running onboard. Refer to Switch Inference Models for the full procedure.
For compatible endpoints, the command is:
If the provider itself needs to change (for example, switching from vLLM to a cloud API), pass the new provider to nemohermes inference set.