> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemoclaw/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nemoclaw/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemoclaw/_mcp/server.

# Tool-Calling Reliability for Local Inference

> Diagnose local inference setups where tool calls leak as plain text and choose when to use Ollama or vLLM.

Local inference is useful for privacy, cost control, and offline development, but
tool-calling agents place stricter demands on the model server than simple chat.
The model server must return structured `tool_calls`, not a JSON-looking string
inside normal assistant text.

Use this page when the TUI shows raw JSON such as:

```json
{"arguments":{"query":"robotics"},"name":"memory_search"}
```

If that appears as text in the assistant reply, OpenClaw cannot dispatch the
tool because the inference response did not include a structured tool call.

## Quick Choice Guide

| Workload                               | Ollama is usually sufficient | Prefer vLLM with a parser |
| -------------------------------------- | ---------------------------- | ------------------------- |
| Plain chat                             | Yes                          | Optional                  |
| Embeddings-only or retrieval setup     | Yes                          | Optional                  |
| One simple tool with short prompts     | Often                        | Optional                  |
| Agent loops with several tools         | Risky                        | Yes                       |
| Long system prompts or sender metadata | Risky                        | Yes                       |
| Multi-turn tool dispatch               | Risky                        | Yes                       |

Ollama can work well for lightweight local chat and some simple tool surfaces.
For OpenClaw-style agent loops with multiple tools, long instructions, or
multi-turn dispatch, use a server that exposes OpenAI-compatible
`/v1/chat/completions` with a tool-call parser. vLLM is the common local choice.

## Symptom

The common failure mode is:

* The model emits text that looks like a tool call.
* The response does not include a structured `tool_calls` field.
* The gateway treats the response as normal text.
* No tool runs, and the user sees raw JSON in the TUI.

This is different from a network or policy block. `nemoclaw <name> status`,
`nemoclaw <name> logs`, and `nemoclaw debug --quick` can all look healthy while
tool dispatch still fails inside the conversation.

## Recommended Fix

For persistent NemoClaw use, start vLLM with auto tool choice and the parser that
matches your model family, then rerun onboarding and select **Local vLLM
\[experimental]** or **Other OpenAI-compatible endpoint**.

For Hermes 3 style models, a known-good vLLM command shape is:

```console
$ vllm serve /models/Hermes-3-Llama-3.1-8B \
  --served-model-name hermes-3-llama-3.1-8b \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --port 8000
```

For a Docker Compose setup:

```yaml
services:
  vllm-nemoclaw:
    image: vllm/vllm-openai:latest
    container_name: vllm-nemoclaw
    restart: unless-stopped
    ports:
      - "8002:8000"
    volumes:
      - /path/to/models:/models:ro
      - /path/to/hf-cache:/root/.cache/huggingface
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              count: all
    command: >
      --model /models/Hermes-3-Llama-3.1-8B
      --served-model-name hermes-3-llama-3.1-8b
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --gpu-memory-utilization 0.20
      --max-model-len 32768
      --api-key ${VLLM_API_KEY}
```

Then onboard against that endpoint:

```console
$ NEMOCLAW_PROVIDER=custom \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8002/v1 \
  NEMOCLAW_MODEL=hermes-3-llama-3.1-8b \
  COMPATIBLE_API_KEY=$VLLM_API_KEY \
  nemoclaw onboard --non-interactive
```

If the endpoint does not require authentication, set `COMPATIBLE_API_KEY` to any
non-empty placeholder, such as `dummy`.

## Advanced Temporary Repointing

NemoClaw-managed sandboxes normally block direct `openclaw config set` writes
inside the sandbox because those edits do not survive rebuilds. Prefer rerunning
`nemoclaw onboard` for a persistent provider change.

If you are intentionally testing a mutable OpenClaw config, prepare a batch file
like this:

```json
{
  "models": {
    "providers": {
      "vllm-local": {
        "baseUrl": "http://host.openshell.internal:8002/v1",
        "api": "openai",
        "apiKey": "${VLLM_API_KEY}"
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "vllm-local/hermes-3-llama-3.1-8b"
      }
    }
  }
}
```

Apply it only in environments where OpenClaw config writes are allowed:

```console
$ openclaw config set --batch-file /sandbox/.openclaw/vllm-tool-calls.json
```

After testing, persist the working provider through `nemoclaw onboard` so the
sandbox image, OpenShell inference route, and host-managed credentials stay in
sync.

## Verify the Fix

After switching to vLLM, ask for an action that should use a tool. Good signs:

* The TUI does not show JSON blobs as assistant text.
* The gateway log shows tool dispatch and a follow-up answer.
* `nemoclaw <name> status` reports the local vLLM or compatible endpoint as the
  active provider.

If JSON still appears as text, confirm that vLLM was started with both
`--enable-auto-tool-choice` and the correct `--tool-call-parser` value for your
model.

## Next Steps

* [Use a Local Inference Server](/inference/use-local-inference)
* [Inference Options](/inference/inference-options)
* [Switch Inference Models](/inference/switch-inference-providers)