NemoClaw supports multiple inference providers. During onboarding, the wizard presents a numbered list of providers to choose from. Your selection determines where NemoClaw routes the agent’s inference traffic.
For OpenClaw onboarding, use nemoclaw onboard.
The provider flow is the same, with the NVIDIA Endpoints route available for OpenClaw Agent.
The agent inside the sandbox talks to inference.local.
It never connects to a provider directly.
OpenShell intercepts inference traffic on the host and forwards it to the provider you selected.
Provider credentials stay on the host.
The sandbox does not receive your API key.
Local Ollama and local vLLM do not require your host OPENAI_API_KEY.
NemoClaw uses provider-specific local tokens for those routes, and rebuilds of legacy local-inference sandboxes migrate away from stale OpenAI credential requirements.
The onboard wizard presents the following provider options by default. The first six are always available. Ollama appears when you have installed or started it on the host. Local vLLM appears when NemoClaw detects a running vLLM server. The managed install/start vLLM entry appears by default on DGX Spark and DGX Station, and appears on generic Linux NVIDIA GPU hosts after opt-in.
NVIDIA Nemotron models expose OpenAI-compatible APIs across every supported deployment surface, so two onboarding options can route to Nemotron.
For Option 3, the API key environment variable is COMPATIBLE_API_KEY. Set it to whatever credential your endpoint expects, or any non-empty placeholder if your endpoint does not require auth.
The Model Router option uses the routed inference profile in nemoclaw-blueprint/blueprint.yaml.
When you select it, NemoClaw starts the router proxy on the host, waits for its health endpoint, registers the nvidia-router provider with OpenShell, and creates the sandbox with the same inference.local route the agent uses for other providers.
The sandbox does not call the router port directly.
The router model pool lives in nemoclaw-blueprint/router/pool-config.yaml.
Edit that file to define which models the router can choose from.
The default pool routes between NVIDIA-hosted Nemotron models and uses the tolerance value to choose the lowest-cost model whose predicted quality stays within the configured threshold.
The tolerance parameter controls the accuracy-cost tradeoff.
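The selection rule can be sketched in shell. This is an illustration only: pick_model is a hypothetical helper, the candidate names, costs, and quality scores are made up, and the sketch assumes that "within the configured threshold" means a predicted quality no more than the tolerance below the best candidate. The router's real scoring and the pool-config.yaml schema are not shown here.

```shell
# pick_model TOLERANCE: reads "name cost predicted_quality" lines on stdin and
# prints the cheapest candidate whose quality is within TOLERANCE of the best.
# Hypothetical helper; the numbers fed to it below are illustrative only.
pick_model() {
  awk -v tol="$1" '
    {
      name[NR] = $1; cost[NR] = $2 + 0; quality[NR] = $3 + 0
      if (NR == 1 || quality[NR] > best) best = quality[NR]
    }
    END {
      for (i = 1; i <= NR; i++) {
        if (quality[i] >= best - tol && (pick == "" || cost[i] < pick_cost)) {
          pick = name[i]; pick_cost = cost[i]
        }
      }
      print pick
    }'
}

printf 'small 1 0.80\nmedium 3 0.88\nlarge 10 0.90\n' | pick_model 0.05
```

With a tolerance of 0.05 the sketch prints medium, because small falls too far below the best score. Widening the tolerance to 0.15 lets the cheaper small candidate qualify, and a tolerance of 0 always picks the highest-quality candidate.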
The router runs on the host, not inside the sandbox.
Credentials flow through the OpenShell provider system. The sandbox never sees raw API keys.
To use the router in scripted setup, set:
The Model Router runs in a host-side virtual environment that NemoClaw creates during onboarding.
NemoClaw probes python3.13, python3.12, python3.11, python3.10, and bare python3, and adopts the first interpreter that satisfies both conditions: its version is in the range [3.10, 3.14), and ensurepip, pyexpat, ssl, and venv all import without error.
If no candidate qualifies, onboarding aborts and prints the real failure for each candidate.
This surfaces issues like Homebrew python@3.14 whose pyexpat extension fails to dlopen against the older system libexpat on macOS.
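To check a host by hand before onboarding, the two conditions can be approximated in shell. This is a rough sketch, not NemoClaw's own probe: probe_python is a hypothetical helper, and the loop only mirrors the candidate order described above.

```shell
# probe_python INTERPRETER: exits non-zero and prints the reason on stderr
# when the interpreter is outside [3.10, 3.14) or a required module fails
# to import. Hypothetical helper that approximates the documented checks.
probe_python() {
  "$1" -c '
import sys
if not ((3, 10) <= sys.version_info[:2] < (3, 14)):
    sys.exit("unsupported version %d.%d" % sys.version_info[:2])
import ensurepip, pyexpat, ssl, venv
'
}

# Walk the candidates in the documented order and stop at the first match.
for candidate in python3.13 python3.12 python3.11 python3.10 python3; do
  if command -v "$candidate" >/dev/null 2>&1 && probe_python "$candidate"; then
    echo "first qualifying interpreter: $(command -v "$candidate")"
    break
  fi
done
```

A failing import prints the interpreter's own traceback, which is usually enough to spot a broken extension module such as pyexpat.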
To pin a specific interpreter, set NEMOCLAW_MODEL_ROUTER_PYTHON to its absolute path before running nemoclaw onboard.
The pin is strict.
NemoClaw probes only that interpreter and aborts with the failure reason if it does not qualify, rather than silently falling back to a different python on PATH.
NemoClaw rejects relative command names such as python3.12.
Use command -v python3.12 to find the absolute path.
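A minimal sketch of the pin, assuming python3.12 is the interpreter you want. The variable pinned is a local name used only in this example.

```shell
# Resolve the absolute path first; relative names such as python3.12 are rejected.
pinned="$(command -v python3.12 || true)"

if [ -n "$pinned" ]; then
  export NEMOCLAW_MODEL_ROUTER_PYTHON="$pinned"
  echo "pinned Model Router interpreter: $NEMOCLAW_MODEL_ROUTER_PYTHON"
else
  echo "python3.12 is not on PATH; nothing pinned" >&2
fi

# Then run the wizard:
#   nemoclaw onboard
```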
If python -m venv itself fails for a probe-clean interpreter (for example, a corrupt ensurepip seed), NemoClaw retries with the next healthy candidate when no pin is set; with a pin set, the failure stops onboarding so you can fix or repoint the pinned python.
The following local inference options have caveats.
Local NIM and generic Linux managed vLLM install/start require NEMOCLAW_EXPERIMENTAL=1; DGX Spark and DGX Station managed vLLM entries appear by default.
An already-running vLLM server appears directly in the onboarding selection list.
For setup instructions, refer to Use a Local Inference Server.
NemoClaw validates the selected provider and model before creating the sandbox.
If credential validation fails, the wizard asks whether to re-enter the API key, choose a different provider, retry, or exit.
The wizard retries transient upstream validation failures before it reports a provider failure.
The nvapi- prefix check applies only to NVIDIA_API_KEY.
Other provider credentials, such as OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, and compatible endpoint keys, use provider-aware validation during retry.
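For scripted setups, a quick pre-check can catch a wrong key type before the wizard runs. The helper below is hypothetical and only mirrors the prefix rule described above; it does not verify the key with NVIDIA.

```shell
# has_nvapi_prefix KEY: succeeds only when KEY starts with "nvapi-".
# Hypothetical helper; it checks the prefix and nothing else.
has_nvapi_prefix() {
  case "$1" in
    nvapi-*) return 0 ;;
    *) return 1 ;;
  esac
}

if ! has_nvapi_prefix "${NVIDIA_API_KEY:-}"; then
  echo "NVIDIA_API_KEY is unset or does not start with nvapi-" >&2
fi
```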
The sections below collect the detailed setup prompts and environment variables for local and compatible inference providers. Use them when the quickstart or local inference guide points you here for exact command shapes.
This option works with any server that implements /v1/chat/completions, including vLLM, TensorRT-LLM, llama.cpp, LocalAI, and others.
For compatible endpoints, NemoClaw uses /v1/chat/completions by default.
This avoids a class of failures where local backends accept /v1/responses requests but silently drop the system prompt and tool definitions.
To opt in to /v1/responses, set NEMOCLAW_PREFERRED_API=openai-responses before running onboard.
Start your model server. The examples below use vLLM, but any OpenAI-compatible server works.
Run the onboard wizard.
When the wizard asks you to choose an inference provider, select Other OpenAI-compatible endpoint.
Enter the base URL of your local server, for example http://localhost:8000/v1.
The wizard prompts for an API key.
If your server does not require authentication, enter any non-empty string (for example, dummy).
NemoClaw validates the endpoint by sending a test inference request before continuing.
The wizard probes /v1/chat/completions by default for the compatible-endpoint provider.
If you set NEMOCLAW_PREFERRED_API=openai-responses, NemoClaw probes /v1/responses instead and only selects it when the response includes the streaming events OpenClaw requires.
If a reasoning model returns only reasoning content before producing a final answer, NemoClaw retries the smoke request with a larger response budget.
Route, configuration, and authentication failures still fail immediately.
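Before running the wizard, you can send a similar request yourself to confirm that the server answers on /v1/chat/completions. This is a manual check under stated assumptions, not the exact request NemoClaw sends: smoke_test is a hypothetical helper, and the model id, prompt, and token limit are placeholders.

```shell
# smoke_test BASE_URL MODEL API_KEY: posts one short chat completion request.
# Hypothetical helper; pass any non-empty key if the server has no auth.
smoke_test() {
  curl -sS --fail --max-time 30 "$1/chat/completions" \
    -H "Authorization: Bearer $3" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$2\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 16}"
}

# Example against the base URL used above (requires a running server):
#   smoke_test http://localhost:8000/v1 <model-id> dummy
```

The --fail flag makes curl return a non-zero exit code on HTTP errors, so route and authentication problems show up immediately.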
Set the following environment variables for scripted or CI/CD deployments.
For the compatible-endpoint provider, /v1/chat/completions is the default.
NemoClaw tests streaming events during onboarding and uses chat completions without probing the Responses API.
To opt in to /v1/responses, set NEMOCLAW_PREFERRED_API=openai-responses before running onboard.
The wizard then probes /v1/responses and only selects it when streaming support is complete.
If the probe fails, the wizard falls back to /v1/chat/completions automatically.
You can use this variable in both interactive and non-interactive mode.
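The opt-in is a single export in the shell that runs the wizard:

```shell
# Ask the wizard to probe /v1/responses instead of assuming chat completions.
export NEMOCLAW_PREFERRED_API=openai-responses

# Then run the wizard:
#   nemoclaw onboard
```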
If you already onboarded and the sandbox is failing at runtime, re-run nemoclaw onboard to re-probe the endpoint and bake the correct API path into the image.
Refer to Switch Inference Models for more information.
If your local server implements the Anthropic Messages API (/v1/messages), choose Other Anthropic-compatible endpoint during onboarding instead.
For non-interactive setup, use NEMOCLAW_PROVIDER=anthropicCompatible and set COMPATIBLE_ANTHROPIC_API_KEY.
When vLLM is already running on localhost:8000, NemoClaw can detect it automatically and query the /v1/models endpoint to determine the loaded model.
On supported Linux hosts with NVIDIA GPUs, the onboard wizard can also install or start a managed vLLM container for you.
For an already-running vLLM server, run nemoclaw onboard and select Local vLLM [experimental] from the provider list.
If vLLM is already running, NemoClaw detects the running model and validates the endpoint.
When vLLM exposes runtime metadata such as max_model_len, NemoClaw uses that value for the contextWindow baked into openclaw.json unless you set NEMOCLAW_CONTEXT_WINDOW yourself.
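To see what NemoClaw will detect, you can query the same endpoint by hand. The sketch below assumes the OpenAI-style list response (a data array of objects with an id) and treats max_model_len as optional; model_ids is a hypothetical helper and requires python3 on the host.

```shell
# model_ids: reads a /v1/models JSON body on stdin and prints one line per
# model with its id and max_model_len ("unknown" when the field is absent).
# Hypothetical helper for manual inspection only.
model_ids() {
  python3 -c '
import json, sys
for model in json.load(sys.stdin).get("data", []):
    print(model["id"], model.get("max_model_len", "unknown"))
'
}

# Example against a running server:
#   curl -s http://localhost:8000/v1/models | model_ids
```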
If vLLM is not running and your host matches a DGX Spark or DGX Station managed profile, NemoClaw shows the Install vLLM or Start vLLM entry by default.
Generic Linux NVIDIA GPU hosts still require NEMOCLAW_EXPERIMENTAL=1 or NEMOCLAW_PROVIDER=install-vllm before the managed entry appears.
NemoClaw pulls the vLLM image, downloads model weights into ~/.cache/huggingface, starts the nemoclaw-vllm container on localhost:8000, streams Hugging Face download progress, and polls /v1/models until the model is ready.
Managed DGX Spark and DGX Station profiles use the stable NGC nvcr.io/nvidia/vllm:26.05.post1-py3 container image.
A watchdog monitors Docker pull output and stops the pull only when it stops making progress, so slow but active downloads do not fail on a fixed wall-clock timeout.
If vLLM never becomes ready, NemoClaw prints a short tail of the vLLM container logs before exiting.
The first run can take 10 to 30 minutes.
Later runs reuse the cached image and model weights.
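If you start vLLM yourself and want a similar readiness gate in a script, a polling loop along these lines works. It is a sketch, not NemoClaw's implementation: wait_for_models is a hypothetical helper, and the attempt count and interval are arbitrary.

```shell
# wait_for_models BASE_URL ATTEMPTS INTERVAL: polls BASE_URL/v1/models until it
# answers successfully or the attempts run out. Hypothetical helper.
wait_for_models() {
  attempt=0
  while [ "$attempt" -lt "$2" ]; do
    if curl -sS --fail --max-time 5 "$1/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$3"
  done
  return 1
}

# Example: poll every 10 seconds for up to 30 minutes.
#   wait_for_models http://localhost:8000 180 10
```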
Managed vLLM uses these profiles:
NemoClaw forces the chat/completions API path for vLLM.
The vLLM /v1/responses endpoint does not run the --tool-call-parser, so tool calls arrive as raw text.
Use an already-running vLLM server:
Install or start managed vLLM when NemoClaw detects a supported profile.
On DGX Spark and DGX Station, NEMOCLAW_PROVIDER=install-vllm is enough for non-interactive runs; add NEMOCLAW_EXPERIMENTAL=1 on generic Linux NVIDIA GPU hosts.
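For example, a non-interactive managed vLLM run can be prepared like this. Only the variables named above are used; any other settings your deployment needs are left out.

```shell
# Select the managed vLLM install/start path.
export NEMOCLAW_PROVIDER=install-vllm

# Generic Linux NVIDIA GPU hosts also need the experimental flag:
#   export NEMOCLAW_EXPERIMENTAL=1

# Then run:
#   nemoclaw onboard --non-interactive
```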
NemoClaw records the model returned by vLLM’s /v1/models endpoint.
Start vLLM with the model you want before onboarding if you manage the server yourself.
Managed vLLM serves the profile default unless you select a different registry entry.
Export NEMOCLAW_VLLM_MODEL=<slug> before invoking the installer to choose a different model from the registry.
NemoClaw uses the matching vllm serve flags, including the reasoning parser, tool-call parser, and --max-model-len.
Recognized slugs are:
The slug is case-insensitive; the full Hugging Face id is also accepted. An unrecognized value fails fast with a list of valid slugs.
Gated models require a Hugging Face token; export it before onboarding so NemoClaw can forward it into the managed vLLM container:
NemoClaw accepts HUGGING_FACE_HUB_TOKEN as an alternative.
The token check runs on the host before any docker pull, so a missing or empty token aborts onboarding before bandwidth is spent on a 401.
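A script can run a comparable check before it calls the wizard. The sketch below only looks at HUGGING_FACE_HUB_TOKEN, the alternative variable named above, and check_hf_token is a hypothetical helper; it does not validate the token against Hugging Face.

```shell
# check_hf_token: fails when HUGGING_FACE_HUB_TOKEN is missing or empty.
# Hypothetical helper; it does not contact Hugging Face.
check_hf_token() {
  if [ -z "${HUGGING_FACE_HUB_TOKEN:-}" ]; then
    echo "HUGGING_FACE_HUB_TOKEN is not set; gated models will be rejected" >&2
    return 1
  fi
}

# Example:
#   check_hf_token && nemoclaw onboard
```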
NemoClaw can pull, start, and manage a NIM container on hosts with a NIM-capable NVIDIA GPU.
Set the experimental flag and run onboard.
Select Local NVIDIA NIM [experimental] from the provider list.
NemoClaw filters available models by GPU VRAM, pulls the NIM container image, starts it, and waits for it to become healthy before continuing.
On hosts with mixed NVIDIA GPU models, the preflight summary shows each detected GPU model and the total VRAM so you can confirm which device class the model selection used.
On Docker 29.x or containerd image-store hosts, NemoClaw resolves the host-platform manifest digest before pulling multi-architecture NIM images when the registry exposes an index.
It pulls repo@digest and retags the local image so NGC attestation metadata on other architectures does not block the selected platform.
If the registry does not expose a matching index, NemoClaw falls back to the tag pull.
NVIDIA hosts NIM container images on nvcr.io, and docker pull requires NGC registry authentication.
If Docker is not already logged in to nvcr.io, onboard prompts for an NGC API key and runs docker login nvcr.io with --password-stdin so the key is never written to disk or shell history.
The prompt masks the key during input and retries once on a bad key before failing.
In non-interactive mode, onboard exits with login instructions if Docker is not already authenticated; run docker login nvcr.io yourself, then re-run nemoclaw onboard --non-interactive.
If NGC_API_KEY or NVIDIA_API_KEY is already exported, NemoClaw passes it into the managed NIM container through the process environment instead of command-line arguments.
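For non-interactive runs, you can log in ahead of time in the same way. The sketch assumes NGC_API_KEY is exported and uses the literal username $oauthtoken, which is NGC's registry convention rather than something this page states; ngc_login is a hypothetical helper.

```shell
# ngc_login: pipes the key to docker login on stdin so it does not appear in
# process arguments or shell history. Hypothetical helper.
ngc_login() {
  # The single quotes keep $oauthtoken literal, as the NGC registry expects.
  printf '%s' "$NGC_API_KEY" |
    docker login nvcr.io --username '$oauthtoken' --password-stdin
}

# Example:
#   ngc_login && nemoclaw onboard --non-interactive
```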
If the NIM container exits before the health endpoint becomes ready, onboarding stops early and prints the last container log lines.
After NIM becomes healthy, NemoClaw reads /v1/models and uses the served model id for validation when it differs from the catalog name.
Unsafe served ids are rejected instead of being written into the sandbox config.
NIM uses vLLM internally.
The same chat/completions API path restriction applies.
To select a specific model, set NEMOCLAW_MODEL.
Local inference requests use a default timeout of 180 seconds. Large prompts on hardware such as DGX Spark can exceed shorter timeouts, so NemoClaw sets a higher default for Ollama, vLLM, NIM, and compatible-endpoint setup.
To override the timeout, set the NEMOCLAW_LOCAL_INFERENCE_TIMEOUT environment variable before onboarding.
The value is in seconds.
NemoClaw bakes this setting into the sandbox at build time.
Changing it after onboarding requires re-running nemoclaw onboard.
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT only governs the inference-server validation probe.
During local Ollama setup, NemoClaw treats host-side curl process timeouts as retryable probe failures and retries with a larger timeout before it reports a validation failure.
NemoClaw also retries Docker runtime detection with a longer docker info timeout before it chooses the local inference route.
The post-create readiness wait (image build, gateway upload, in-sandbox boot) has its own budget, NEMOCLAW_SANDBOX_READY_TIMEOUT, also defaulting to 180 seconds.
On hosts where the sandbox image takes minutes to build or upload, raise both settings together.
Examples include large quantized models, DGX Station first runs, and remote VMs over a slow link.
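For example, to raise both budgets on a slow host, export them before onboarding. The value of 600 seconds is only an illustration; choose values that match your hardware and link speed.

```shell
# Both values are in seconds; 600 is an illustrative value, not a recommendation.
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600
export NEMOCLAW_SANDBOX_READY_TIMEOUT=600

# Then run the wizard:
#   nemoclaw onboard
```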
If onboard ends with Sandbox '<name>' was created but did not become ready within 180s, refer to Troubleshooting.