Model Capability Audit Matrix

Use this matrix to maintain model and provider audit evidence for NemoClaw agent behavior. The matrix tracks whether a supported model works as an agent model, not only whether it can answer a one-shot chat prompt.

Do not mark a row as completed without committed evidence or a stable CI link. Rows seeded from source inventory start as not-yet-run until a maintainer imports or records evidence.

Result States

Every audit row must use one of these states.

| State | Use when |
| --- | --- |
| pass | The row completes required scenarios without model-specific changes. |
| pass-with-affordance | The row completes required scenarios with a documented model or provider affordance. |
| degraded | The row is usable but has documented limits, retries, latency risk, or partial surface coverage. |
| blocked | The row cannot complete required scenarios and needs a linked follow-up issue or PR. |
| unsupported | The model, provider, or surface is intentionally unsupported. |
| not-yet-run | The row is in scope but has no completed evidence yet. |
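
If the matrix is ever linted by a script, the six states can be modeled as a closed set so that typos such as `passed` are rejected early. The following is a minimal sketch, not code from the NemoClaw repo; `RESULT_STATES`, `isResultState`, and `needsFollowUp` are hypothetical names introduced for illustration.

```typescript
// The six result states from the table above, kept as a closed list.
const RESULT_STATES = [
  "pass",
  "pass-with-affordance",
  "degraded",
  "blocked",
  "unsupported",
  "not-yet-run",
] as const;

type ResultState = (typeof RESULT_STATES)[number];

// Type guard: narrows an arbitrary string to one of the known states.
function isResultState(value: string): value is ResultState {
  return (RESULT_STATES as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: "blocked" is the one state whose definition
// explicitly requires a linked follow-up issue or PR.
function needsFollowUp(state: ResultState): boolean {
  return state === "blocked";
}
```

Keeping the list in one place means a new state only has to be added once for both the type and the runtime check to pick it up.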

Required Row Schema

Use these fields for every completed row. If a field is not applicable, write n/a and explain why in the evidence notes.

| Field | Required content |
| --- | --- |
| Model ID | Exact model identifier used by onboarding or runtime config. |
| Provider path | Provider class and route, such as NVIDIA Endpoints, OpenAI, Anthropic, Gemini, Local Ollama, Local vLLM, or another compatible endpoint. |
| Agent surface | Exact agent path, such as OpenClaw primary agent, OpenClaw CLI prompt path, OpenClaw browser or gateway path, OpenClaw sub-agent delegation, Hermes sandbox API, or auxiliary model path. |
| NemoClaw commit SHA | Full commit SHA for the repo state used during validation. |
| Runtime versions | OpenShell, OpenClaw, Hermes, provider server, and local serving versions when available. |
| Endpoint/API path selected | Concrete API path, base URL class, and provider key selected by NemoClaw. |
| Workflow used | Exact command sequence or CI workflow used to run the scenario. |
| State | One result state from this page. |
| Evidence | Trajectory file path, session log path, request dump path, or CI artifact link. |
| Observed tool-call count | Count and names of structured tool calls observed in the scenario. |
| Final-response behavior | Whether the assistant produced a final response after tool results, stopped empty, stopped reasoning-only, or emitted raw tool text. |
| Multi-turn behavior | Whether turn 2 used turn 1 tool results without re-running unrelated tools. |
| Latency and timeout notes | Validation time, first token or first event time when available, total duration, retries, and timeout budget used. |
| Required affordance | Model-specific setup, provider-class transport behavior, request mutation, API path forcing, streaming requirement, or none. |
| Follow-up | Linked issue, PR, or registry decision when remediation or setup work is needed. |
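
The schema rules (every field filled, `n/a` explained in the evidence notes, a full commit SHA) lend themselves to a mechanical check. The sketch below is illustrative only: the camelCase field names, the `evidenceNotes` key, and `validateRow` are assumptions of this example, and the 40-character SHA test assumes a SHA-1 Git repository.

```typescript
// One key per schema field; the camelCase names are hypothetical.
const REQUIRED_FIELDS = [
  "modelId", "providerPath", "agentSurface", "commitSha", "runtimeVersions",
  "apiPath", "workflow", "state", "evidence", "toolCallCount",
  "finalResponse", "multiTurn", "latencyNotes", "requiredAffordance", "followUp",
] as const;

type FieldName = (typeof REQUIRED_FIELDS)[number];
type AuditRow = Record<FieldName, string> & { evidenceNotes?: string };

// Returns a list of problems; an empty list means the row passes these checks.
function validateRow(row: AuditRow): string[] {
  const problems: string[] = [];
  let usesNotApplicable = false;

  for (const field of REQUIRED_FIELDS) {
    const value = row[field].trim();
    if (value === "") problems.push(`${field}: must not be empty`);
    if (value.toLowerCase() === "n/a") usesNotApplicable = true;
  }

  // An n/a value must be explained in the evidence notes.
  if (usesNotApplicable && (row.evidenceNotes ?? "").trim() === "") {
    problems.push("evidenceNotes: explain why each n/a field does not apply");
  }

  // Assumption: a "full" SHA is 40 lowercase hex characters (SHA-1 Git).
  if (!/^[0-9a-f]{40}$/.test(row.commitSha.trim())) {
    problems.push("commitSha: expected a full 40-character SHA");
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a reviewer see every gap in a row at once.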

Required Scenario Coverage

Completed rows should state which required scenarios were exercised. Rows can remain degraded, blocked, or not-yet-run when a scenario cannot be exercised yet.

| Scenario | Required checks |
| --- | --- |
| Baseline chat | Deterministic response works, provider validation is actionable, and credentials do not leak into sandbox-visible files, logs, or prompts. |
| Shell tool loop | Separate structured hostname, date, and uptime tool calls are emitted, persisted, correlated with tool results, and followed by a final assistant response. |
| Multi-turn continuation | Turn 2 uses a tool result from turn 1 and does not ask the user to continue after a complete tool result. |
| Sub-agent delegation | The primary agent emits a structured sessions_spawn request, the sub-agent receives the intended task and workspace, and the primary agent consumes the result. |
| Hermes path | Hermes starts with the selected provider/model, returns the expected OpenAI-compatible response shape, and separates Hermes failures from OpenClaw-only request-shape issues. |
| Performance and operability | The row records validation duration, first event timing when available, retry behavior, timeout budget, streaming requirement, request mutation requirement, API path forcing, and cold-start differences. |
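
As an illustration of how the shell tool loop checks might be applied to recorded evidence, the sketch below walks a simplified trajectory. The `TrajectoryEvent` shape is hypothetical and is not the actual NemoClaw or OpenClaw trajectory format; a real checker would need to be adapted to whatever the trajectory files contain.

```typescript
// Hypothetical, simplified trajectory events for illustration only.
type TrajectoryEvent =
  | { kind: "tool_call"; id: string; command: string }
  | { kind: "tool_result"; callId: string; output: string }
  | { kind: "assistant_final"; text: string };

type ToolCall = Extract<TrajectoryEvent, { kind: "tool_call" }>;

const REQUIRED_COMMANDS = ["hostname", "date", "uptime"];

function checkShellToolLoop(events: TrajectoryEvent[]): string[] {
  const problems: string[] = [];
  const calls = events.filter((e): e is ToolCall => e.kind === "tool_call");

  // Collect the call ids that have a recorded tool result.
  const answeredIds: string[] = [];
  for (const e of events) {
    if (e.kind === "tool_result") answeredIds.push(e.callId);
  }

  for (const command of REQUIRED_COMMANDS) {
    // Exact match: a combined "hostname && date && uptime" call does not
    // count as separate structured calls.
    const matching = calls.filter((c) => c.command.trim() === command);
    if (matching.length === 0) {
      problems.push(`missing separate tool call: ${command}`);
    } else if (!matching.some((c) => answeredIds.indexOf(c.id) !== -1)) {
      problems.push(`no tool result correlated with: ${command}`);
    }
  }

  // The trajectory must end with a non-empty final assistant response.
  const last = events[events.length - 1];
  if (!last || last.kind !== "assistant_final" || last.text.trim() === "") {
    problems.push("no final assistant response after tool results");
  }
  return problems;
}
```

An empty result means the trajectory satisfies these three checks; it does not cover persistence, which depends on where the session is stored.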

Audit Matrix

These seed rows come from current repo source files, not from live benchmark claims. Keep them as not-yet-run until the row has evidence that satisfies the schema above. When importing a completed row from an issue comment, preserve the exact commit SHA, workflow, evidence paths, and observed behavior.

| Agent surface | Provider class | Model or route | API path | State | Evidence | Required affordance | Follow-up | Source |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenClaw primary agent | NVIDIA Endpoints | nvidia/nemotron-3-super-120b-a12b | Managed inference.local OpenAI-compatible completions | not-yet-run | Add trajectory and session evidence before changing state. | Existing OpenClaw setup manifest disables tool_search for this route. | Verify evidence before changing state. | src/lib/inference/config.ts, nemoclaw-blueprint/model-specific-setup/openclaw/nemotron-3-super-120b-managed-inference.json. |
| OpenClaw primary agent | NVIDIA Endpoints | moonshotai/kimi-k2.6 | Managed inference.local OpenAI-compatible completions | not-yet-run | Add trajectory and session evidence before changing state. | Existing OpenClaw setup manifest applies Kimi compatibility and plugin loading. | Verify Kimi regression evidence before changing state. | src/lib/inference/config.ts, nemoclaw-blueprint/model-specific-setup/openclaw/kimi-k2.6-managed-inference.json. |
| OpenClaw primary agent | NVIDIA Endpoints | Any model from CLOUD_MODEL_OPTIONS | Managed inference.local OpenAI-compatible completions unless config selects another API. | not-yet-run | Add one evidence row per model before changing state. | Record none, model-specific setup, or provider-class transport behavior. | Expand into per-model rows as evidence lands. | src/lib/inference/config.ts. |
| OpenClaw primary agent | OpenAI | Any model from REMOTE_MODEL_OPTIONS.openai | openai provider through https://inference.local/v1. | not-yet-run | Add one evidence row per model before changing state. | Record Responses or Chat Completions behavior explicitly. | Expand into per-model rows as evidence lands. | src/lib/inference/model-prompts.ts, src/lib/inference/config.ts. |
| OpenClaw primary agent | Anthropic | Any model from REMOTE_MODEL_OPTIONS.anthropic | anthropic provider through https://inference.local with anthropic-messages. | not-yet-run | Add one evidence row per model before changing state. | Record native Anthropic Messages behavior explicitly. | Expand into per-model rows as evidence lands. | src/lib/inference/model-prompts.ts, src/lib/inference/config.ts. |
| OpenClaw primary agent | Gemini | Any model from REMOTE_MODEL_OPTIONS.gemini | Managed inference.local OpenAI-compatible route. | not-yet-run | Add one evidence row per model before changing state. | Record provider state and tool-result continuation behavior. | Expand into per-model rows as evidence lands. | src/lib/inference/model-prompts.ts, src/lib/inference/config.ts. |
| OpenClaw primary agent | Local Ollama | Default nemotron-3-nano:30b or any installed model selected by onboarding. | Managed inference.local route to the host Ollama proxy. | not-yet-run | Add local daemon, model tag, and trajectory evidence before changing state. | Record tool capability, streaming usage, and local proxy behavior. | Add one row per audited local model tag. | src/lib/inference/local.ts, src/lib/inference/config.ts. |
| OpenClaw primary agent | Local vLLM | Any model from VLLM_MODELS. | Managed inference.local route to the host vLLM server. | not-yet-run | Add vLLM serve flags, model id, and trajectory evidence before changing state. | Record parser flags, reasoning parser, and tool-call parser behavior. | Add one row per audited vLLM model id. | src/lib/inference/vllm-models.ts, src/lib/inference/config.ts. |
| OpenClaw primary agent | Other OpenAI-compatible endpoint | User-selected custom-model or another configured model id. | Managed inference.local route to the compatible endpoint. | not-yet-run | Add endpoint class and trajectory evidence before changing state. | Record endpoint API path forcing and store/streaming assumptions. | Add one row per endpoint class that is validated. | src/lib/inference/config.ts. |
| OpenClaw primary agent | Other Anthropic-compatible endpoint | User-selected custom-anthropic-model or another configured model id. | anthropic route when supported, otherwise managed compatible route. | not-yet-run | Add endpoint class and trajectory evidence before changing state. | Record native Anthropic Messages or compatible-route transport behavior. | Add one row per endpoint class that is validated. | src/lib/inference/config.ts. |
| Hermes sandbox API | Hermes Provider | Default moonshotai/kimi-k2.6 or any model from HERMES_PROVIDER_MODEL_OPTIONS. | Hermes Provider route through NemoClaw managed inference. | not-yet-run | Add Hermes session, request dump, logs, and local API evidence before changing state. | Record Hermes-specific config, transport, and response-shape behavior. | Keep Hermes rows separate from OpenClaw rows. | src/lib/inference/config.ts, src/lib/inference/model-prompts.ts. |
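
The rule that seed rows stay at not-yet-run until evidence exists can be expressed as a guard on state changes. This is a sketch under stated assumptions: `SeedRow` and `changeState` are hypothetical names, the evidence list is assumed to hold committed paths or stable CI links, and the guard treats every state other than not-yet-run the same way, which is a simplification.

```typescript
// Hypothetical in-memory form of one matrix row.
interface SeedRow {
  agentSurface: string;
  providerClass: string;
  modelOrRoute: string;
  state: string;
  evidence: string[]; // committed paths or stable CI links; empty for seed rows
  commitSha?: string;
  workflow?: string;
}

// Returns an updated copy of the row, or throws when the row lacks the
// evidence, commit SHA, or workflow that an imported result must preserve.
function changeState(row: SeedRow, newState: string): SeedRow {
  if (newState !== "not-yet-run") {
    const missing: string[] = [];
    if (row.evidence.length === 0) missing.push("evidence");
    if (!row.commitSha) missing.push("commit SHA");
    if (!row.workflow) missing.push("workflow");
    if (missing.length > 0) {
      throw new Error(
        `cannot set state to ${newState}: missing ${missing.join(", ")}`
      );
    }
  }
  return { ...row, state: newState };
}
```

Returning a copy instead of mutating the row keeps the seed inventory intact if the change is rejected.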

Completed Row Template

Copy this template when adding evidence for a specific model/provider/agent combination. Do not leave placeholder text in a completed row.

| Field | Value |
| --- | --- |
| Model ID | <provider/model-id>. |
| Provider path | <provider class and route>. |
| Agent surface | <OpenClaw primary agent, OpenClaw CLI prompt path, OpenClaw browser or gateway path, OpenClaw sub-agent delegation, Hermes sandbox API, or auxiliary model path>. |
| NemoClaw commit SHA | <full SHA>. |
| Runtime versions | <OpenShell version, OpenClaw version, Hermes version, local server version, or n/a>. |
| Endpoint/API path selected | <provider key, base URL class, API mode, and endpoint path>. |
| Workflow used | <exact commands or CI workflow>. |
| State | <pass, pass-with-affordance, degraded, blocked, unsupported, or not-yet-run>. |
| Evidence | <trajectory, session log, request dump, CI artifact, or n/a>. |
| Observed tool-call count | <count, names, and shape>. |
| Final-response behavior | <final answer, empty stop, reasoning-only stop, raw tool text, or other behavior>. |
| Multi-turn behavior | <turn 1 and turn 2 behavior>. |
| Latency and timeout notes | <validation time, first event timing, total duration, retry behavior, timeout budget, and streaming notes>. |
| Required affordance | <none, setup manifest, request mutation, parser flag, API path forcing, streaming requirement, or transport policy>. |
| Follow-up | <issue, PR, registry decision, or n/a>. |

  • nemoclaw-blueprint/model-specific-setup/README.md documents where model-specific setup belongs once an intervention is justified.
  • docs/inference/tool-calling-reliability explains the local inference tool-call failure mode that audit rows should classify separately from provider connectivity.
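
The requirement that a completed row contain no leftover template placeholders can be checked with a simple pattern match. The sketch below is an assumption-laden illustration: `findPlaceholders` is a hypothetical helper, and the pattern treats any `<...>` span as a placeholder, so a value that legitimately quotes angle-bracketed text (for example raw tool text) would need manual review.

```typescript
// Matches template placeholders such as "<full SHA>" or "<provider/model-id>".
const PLACEHOLDER = /<[^<>\n]+>/;

// Returns the names of fields whose values still contain placeholder text.
function findPlaceholders(row: Record<string, string>): string[] {
  const offending: string[] = [];
  for (const field of Object.keys(row)) {
    const value = row[field];
    if (typeof value === "string" && PLACEHOLDER.test(value)) {
      offending.push(field);
    }
  }
  return offending;
}
```

For example, a row whose `Model ID` is still `<provider/model-id>.` is reported by field name, while a filled-in `State` of `pass` is not.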

Next Steps