Inference Routing
OpenShell handles inference traffic through two paths: external endpoints and inference.local.
How inference.local Works
When code inside a sandbox calls https://inference.local, the privacy router routes the request to the configured backend for that gateway. The configured model is applied to generation requests, provider credentials come from OpenShell rather than from code inside the sandbox, and only approved inference headers are forwarded upstream.
If code calls an external inference host directly, OpenShell evaluates that traffic only through network_policies.
Supported API Patterns
Supported request patterns depend on the provider configured for inference.local.
OpenAI-compatible
Anthropic-compatible
Requests to inference.local that do not match the configured provider’s supported patterns are denied.
Google Vertex AI does not expose every OpenAI-compatible path through inference.local. Vertex routes for Gemini and other non-Anthropic models currently support Chat Completions. Vertex routes for Claude models use the Anthropic Messages pattern. Base URL overrides are only supported for non-Anthropic Vertex routes.
Configure Inference Routing
The managed local inference endpoint uses three values:
For tested providers and base URLs, refer to Supported Inference Providers.
Create a Provider
Create a provider that holds the backend credentials you want OpenShell to use.
NVIDIA API Catalog
OpenAI-compatible Provider
Google Vertex AI
Local Endpoint
Anthropic
This reads NVIDIA_API_KEY from your environment.
Set Inference Routing
Point inference.local at that provider and choose the model to use:
To override the default 60-second per-request timeout, add --timeout:
The value is in seconds. When --timeout is omitted or set to 0, the default of 60 seconds applies. Increase --timeout when you expect extended thinking phases so the full response completes before the request deadline.
Inspect and Update the Config
Confirm that the provider and model are set correctly:
Use update when you want to change only one field:
Use the Local Endpoint from a Sandbox
After inference is configured, code inside any sandbox can call https://inference.local directly. The client-supplied model and api_key values are not sent upstream — the privacy router injects the real credentials from the configured provider and rewrites the model before forwarding. Some SDKs require a non-empty API key even though inference.local does not use the sandbox-provided value; pass any placeholder such as unused.
Claude Code
OpenCode
Python (OpenAI SDK)
Python (Anthropic SDK)
--bare skips the OAuth login flow and uses ANTHROPIC_API_KEY directly. The key is stripped by the proxy and never reaches the upstream provider.
Claude Code appends /v1/messages to ANTHROPIC_BASE_URL, so omit the /v1 suffix from the base URL.
Use inference.local when inference should stay private and credentials should not be exposed inside the sandbox. External providers reached directly belong in network_policies instead.
When the upstream runs on the same machine as the gateway, bind it to 0.0.0.0 and point the provider at host.openshell.internal or the host’s LAN IP. 127.0.0.1 and localhost usually fail because the request originates from the gateway or sandbox runtime, not from your shell.
If the gateway runs on a remote host or behind a cloud deployment, host.openshell.internal points to that remote machine, not to your laptop. A locally running Ollama or vLLM process is not reachable from a remote gateway unless you add your own tunnel or shared network path.
Verify from a Sandbox
openshell inference set and openshell inference update verify the resolved upstream endpoint by default before saving the configuration. If the endpoint is not live yet, retry with --no-verify to persist the route without the probe.
To confirm end-to-end connectivity from a sandbox, run:
A successful response confirms the privacy router can reach the configured backend and the model is serving requests.
- Gateway-scoped: Every sandbox using the active gateway sees the same
inference.localbackend. - HTTPS only:
inference.localis intercepted only for HTTPS traffic. - Hot reload: Provider, model, and timeout changes are picked up by running sandboxes within about 5 seconds by default. No sandbox recreation is required.
Next Steps
Explore related topics:
- To follow a complete Ollama-based local setup, refer to Inference Ollama.
- To follow a complete LM Studio-based local setup, refer to Local Inference LM Studio.
- To control external endpoints, refer to Policies.
- To manage provider records, refer to Providers.