Inference Routing


OpenShell handles inference traffic through two paths: external endpoints and inference.local.

| Path | How it works |
| --- | --- |
| External endpoints | Traffic to hosts like api.openai.com or api.anthropic.com is treated like any other outbound request, allowed or denied by network_policies. Refer to Policies. |
| inference.local | A sandbox-local HTTPS endpoint that routes model requests through the gateway. The privacy router strips sandbox-supplied credentials, forwards only approved inference headers, injects the configured backend credentials, and forwards the request to the managed model endpoint. |

How inference.local Works

When code inside a sandbox calls https://inference.local, the privacy router routes the request to the configured backend for that gateway. The configured model is applied to generation requests, provider credentials come from OpenShell rather than from code inside the sandbox, and only approved inference headers are forwarded upstream.

If code calls an external inference host directly, OpenShell evaluates that traffic only through network_policies.
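For example, from inside a sandbox the two paths differ mainly in where credentials come from. The hosts, model names, and key handling below are illustrative, not required usage:

```python
import os
import requests

# Path 1: direct call to an external inference host. This is ordinary outbound
# traffic, so network_policies decide whether it is allowed, and the sandbox
# must supply its own API key (illustrative only).
requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]},
)

# Path 2: the managed local endpoint. No real key is needed; the privacy
# router strips caller-supplied credentials and injects the configured
# provider credentials before forwarding upstream.
requests.post(
    "https://inference.local/v1/chat/completions",
    json={"model": "anything", "messages": [{"role": "user", "content": "Hello"}]},
)
```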

| Property | Detail |
| --- | --- |
| Credentials | No sandbox API keys needed. Credentials come from the configured provider record. The router strips caller-supplied Authorization before forwarding the request. |
| Header forwarding | inference.local forwards only a per-provider header allowlist. OpenAI routes allow openai-organization and x-model-id. Anthropic routes allow anthropic-version and anthropic-beta. NVIDIA routes allow x-model-id. All other caller headers are stripped (sketched after this table). |
| Configuration | One provider and one model define sandbox inference for the active gateway. Every sandbox on that gateway sees the same inference.local backend. |
| Provider support | NVIDIA, any OpenAI-compatible provider, and Anthropic all work through the same endpoint. |
| Streaming reliability | The router tolerates idle gaps of up to 120 seconds between streamed chunks so long reasoning responses are not cut off mid-stream. |
| Hot refresh | OpenShell picks up provider credential changes and inference updates without recreating sandboxes. Changes propagate within about 5 seconds by default. |
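The header-forwarding rule above amounts to a per-provider allowlist plus credential injection. A minimal sketch of that logic, purely illustrative and not the privacy router's actual implementation:

```python
# Illustrative sketch of per-provider header allowlisting; the names and
# structure are assumptions, not the router's real code.
ALLOWED_HEADERS = {
    "openai": {"openai-organization", "x-model-id"},
    "anthropic": {"anthropic-version", "anthropic-beta"},
    "nvidia": {"x-model-id"},
}

def forwarded_headers(provider: str, caller_headers: dict[str, str],
                      backend_credentials: str) -> dict[str, str]:
    allowed = ALLOWED_HEADERS[provider]
    # Keep only allowlisted caller headers; everything else, including any
    # caller-supplied Authorization, is dropped.
    headers = {k: v for k, v in caller_headers.items() if k.lower() in allowed}
    # Inject the configured provider credentials before forwarding upstream.
    headers["Authorization"] = f"Bearer {backend_credentials}"
    return headers
```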

Supported API Patterns

Supported request patterns depend on the provider configured for inference.local.

| Pattern | Method | Path |
| --- | --- | --- |
| Chat Completions | POST | /v1/chat/completions |
| Completions | POST | /v1/completions |
| Responses | POST | /v1/responses |
| Model Discovery | GET | /v1/models |
| Model Discovery | GET | /v1/models/* |

Requests to inference.local that do not match the configured provider’s supported patterns are denied.
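For example, when the configured provider supports model discovery, a sandbox can list models with a plain GET. The OpenAI-style response shape assumed here depends on the configured provider:

```python
import requests

# List the models exposed by the configured backend through inference.local.
resp = requests.get("https://inference.local/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))
```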

Configure Inference Routing

The managed local inference endpoint uses three values:

| Value | Description |
| --- | --- |
| Provider record | The credential backend OpenShell uses to authenticate with the upstream model host. |
| Model ID | The model to use for generation requests. |
| Timeout | Per-request timeout in seconds for upstream inference calls. Defaults to 60 seconds. |

For tested providers and base URLs, refer to Supported Inference Providers.

Create a Provider

Create a provider that holds the backend credentials you want OpenShell to use.

```
$ openshell provider create --name nvidia-prod --type nvidia --from-existing
```

This reads NVIDIA_API_KEY from your environment.

Set Inference Routing

Point inference.local at that provider and choose the model to use:

```
$ openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b
```

To override the default 60-second per-request timeout, add --timeout:

```
$ openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b \
    --timeout 300
```

The value is in seconds. When --timeout is omitted or set to 0, the default of 60 seconds applies. Increase --timeout when you expect extended thinking phases so the full response completes before the request deadline.
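Keep in mind that --timeout governs the gateway's upstream deadline only; an SDK inside the sandbox may enforce its own client-side timeout as well. As a sketch with the OpenAI Python SDK (the full sandbox usage pattern appears below), you might raise both together:

```python
from openai import OpenAI

# Match the client-side timeout to the gateway's --timeout (300 seconds in the
# example above) so the SDK does not abandon a long reasoning request before
# the upstream deadline.
client = OpenAI(
    base_url="https://inference.local/v1",
    api_key="unused",  # placeholder; inference.local injects real credentials
    timeout=300.0,
)
```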

Inspect and Update the Config

Confirm that the provider and model are set correctly:

```
$ openshell inference get
Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Timeout: 300s
  Version: 1
```

Use update when you want to change only one field:

```
$ openshell inference update --model nvidia/nemotron-3-nano-30b-a3b
$ openshell inference update --provider openai-prod
$ openshell inference update --timeout 120
```

Use the Local Endpoint from a Sandbox

After inference is configured, code inside any sandbox can call https://inference.local directly:

```python
from openai import OpenAI

client = OpenAI(base_url="https://inference.local/v1", api_key="unused")

response = client.chat.completions.create(
    model="anything",
    messages=[{"role": "user", "content": "Hello"}],
)
```

The client-supplied model and api_key values are not sent upstream. The privacy router injects the real credentials from the configured provider and rewrites the model before forwarding. Some SDKs require a non-empty API key even though inference.local does not use the sandbox-provided value. In those cases, pass any placeholder such as test or unused.
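For long reasoning responses, streaming works well with the router's tolerance for idle gaps of up to 120 seconds between chunks. Continuing from the client above, a minimal sketch:

```python
# Streaming keeps the connection open while the model thinks; the privacy
# router tolerates idle gaps of up to 120 seconds between streamed chunks.
stream = client.chat.completions.create(
    model="anything",
    messages=[{"role": "user", "content": "Summarize the trade-offs of local inference routing."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```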

Use this endpoint when inference should stay local to the host for privacy and security reasons. External providers that should be reached directly belong in network_policies instead.

When the upstream runs on the same machine as the gateway, bind it to 0.0.0.0 and point the provider at host.openshell.internal or the host’s LAN IP. 127.0.0.1 and localhost usually fail because the request originates from the gateway or sandbox runtime, not from your shell.
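To illustrate the bind-address point with a stand-in server rather than a real model host (the port and handler below are placeholders):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

# Binding to 0.0.0.0 makes the port reachable from other network namespaces
# and hosts (for example via host.openshell.internal); binding to 127.0.0.1
# would limit it to this machine's own loopback interface.
HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```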

If the gateway runs on a remote host or behind a cloud deployment, host.openshell.internal points to that remote machine, not to your laptop. A locally running Ollama or vLLM process is not reachable from a remote gateway unless you add your own tunnel or shared network path.

Verify from a Sandbox

openshell inference set and openshell inference update verify the resolved upstream endpoint by default before saving the configuration. If the endpoint is not live yet, retry with --no-verify to persist the route without the probe.

To confirm end-to-end connectivity from a sandbox, run:

```
$ curl https://inference.local/v1/responses \
    -H "Content-Type: application/json" \
    -d '{
      "instructions": "You are a helpful assistant.",
      "input": "Hello!"
    }'
```

A successful response confirms the privacy router can reach the configured backend and the model is serving requests.
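The same check can be scripted, for example as a health probe from inside a sandbox. The function name and error handling here are illustrative:

```python
import requests

def inference_local_healthy() -> bool:
    """Probe the managed endpoint end to end with a tiny Responses request."""
    try:
        resp = requests.post(
            "https://inference.local/v1/responses",
            json={"instructions": "You are a helpful assistant.", "input": "Hello!"},
            timeout=60,
        )
        return resp.ok
    except requests.RequestException:
        # Connection errors usually mean the route or upstream is not live yet.
        return False
```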

  • Gateway-scoped: Every sandbox using the active gateway sees the same inference.local backend.
  • HTTPS only: inference.local is intercepted only for HTTPS traffic.
  • Hot reload: Provider, model, and timeout changes are picked up by running sandboxes within about 5 seconds by default. No sandbox recreation is required.

Next Steps

Explore related topics: