Configure Inference Routing#
This page covers the managed local inference endpoint (https://inference.local). External inference endpoints go through sandbox network_policies. Refer to Policies for details.
The configuration consists of two values:
| Value | Description |
|---|---|
| Provider record | The credential backend OpenShell uses to authenticate with the upstream model host. |
| Model ID | The model to use for generation requests. |
Step 1: Create a Provider#
Create a provider that holds the backend credentials you want OpenShell to use.
$ openshell provider create --name nvidia-prod --type nvidia --from-existing
This reads NVIDIA_API_KEY from your environment.
$ openshell provider create \
--name my-local-model \
--type openai \
--credential OPENAI_API_KEY=empty-if-not-required \
--config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
Use --config OPENAI_BASE_URL to point at any OpenAI-compatible server reachable from the gateway. For host-backed local inference, use host.openshell.internal or the host’s LAN IP rather than 127.0.0.1 or localhost. Set OPENAI_API_KEY to a dummy value if the server does not require authentication.
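Before registering the provider, you can confirm that a server actually answers at the base URL. The sketch below queries the /v1/models listing that OpenAI-compatible servers such as Ollama and vLLM typically expose, using only the Python standard library; the base URL and the /v1/models assumption are illustrative, not OpenShell requirements:

```python
import json
import urllib.request

def model_ids(listing: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in listing.get("data", [])]

def list_models(base_url: str = "http://host.openshell.internal:11434/v1") -> list:
    """Fetch the model listing from an OpenAI-compatible server."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

If list_models succeeds from the machine where the gateway runs, the OPENAI_BASE_URL value in the provider record should work as well.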
$ openshell provider create --name anthropic-prod --type anthropic --from-existing
This reads ANTHROPIC_API_KEY from your environment.
Step 2: Set Inference Routing#
Point inference.local at that provider and choose the model to use:
$ openshell inference set \
--provider nvidia-prod \
--model nvidia/nemotron-3-nano-30b-a3b
Step 3: Verify the Active Config#
Confirm that the provider and model are set correctly:
$ openshell inference get
Gateway inference:
Provider: nvidia-prod
Model: nvidia/nemotron-3-nano-30b-a3b
Version: 1
Step 4: Update Part of the Config#
Use update when you want to change only one field:
$ openshell inference update --model nvidia/nemotron-3-nano-30b-a3b
Or switch providers without repeating the current model:
$ openshell inference update --provider openai-prod
Use It from a Sandbox#
After inference is configured, code inside any sandbox can call https://inference.local directly:
from openai import OpenAI

# The api_key is a placeholder: the privacy router injects the real
# credentials from the configured provider before forwarding.
client = OpenAI(base_url="https://inference.local/v1", api_key="unused")

response = client.chat.completions.create(
    model="anything",  # rewritten by the router to the configured model
    messages=[{"role": "user", "content": "Hello"}],
)
The client-supplied model and api_key values are not sent upstream. The privacy router injects the real credentials from the configured provider and rewrites the model before forwarding.
Some SDKs require a non-empty API key even though inference.local does not use the sandbox-provided value. In those cases, pass any placeholder such as test or unused.
Use this endpoint when inference should stay local to the host for privacy and security reasons. External providers that should be reached directly belong in network_policies instead.
When the upstream runs on the same machine as the gateway, bind it to 0.0.0.0 and point the provider at host.openshell.internal or the host’s LAN IP. 127.0.0.1 and localhost usually fail because the request originates from the gateway or sandbox runtime, not from your shell.
If the gateway runs on a remote host or behind a cloud deployment, host.openshell.internal points to that remote machine, not to your laptop. A laptop-local Ollama or vLLM process is not reachable from a remote gateway unless you add your own tunnel or shared network path.
Verify the Endpoint from a Sandbox#
openshell inference set and openshell inference update verify the resolved upstream endpoint by default before saving the configuration. If the endpoint is not live yet, retry with --no-verify to persist the route without the probe.
openshell inference get confirms the current saved configuration. To confirm end-to-end connectivity from a sandbox, run:
curl https://inference.local/v1/responses \
-H "Content-Type: application/json" \
-d '{
"instructions": "You are a helpful assistant.",
"input": "Hello!"
}'
A successful response confirms the privacy router can reach the configured backend and the model is serving requests.
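The same probe can be issued from Python using only the standard library. The payload fields mirror the curl example above; build_probe and probe_endpoint are illustrative helper names, and certificate handling inside your sandbox may differ:

```python
import json
import urllib.request

def build_probe(instructions: str, text: str) -> bytes:
    """Serialize a minimal /v1/responses payload, mirroring the curl probe."""
    return json.dumps({"instructions": instructions, "input": text}).encode()

def probe_endpoint(url: str = "https://inference.local/v1/responses") -> int:
    """POST the probe and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=build_probe("You are a helpful assistant.", "Hello!"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A 200 status from probe_endpoint indicates the same thing as a successful curl call: the route is live end to end.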
Gateway-scoped: Every sandbox using the active gateway sees the same inference.local backend.
HTTPS only: inference.local is intercepted only for HTTPS traffic.
Hot reload: Provider and inference changes are picked up within about 5 seconds by default.
Next Steps#
Explore related topics:
To understand the inference routing flow and supported API patterns, refer to About Inference Routing.
To follow a complete Ollama-based local setup, refer to Run Local Inference with Ollama.
To control external endpoints, refer to Policies.
To manage provider records, refer to Manage Providers and Credentials.