---

title: Configure Inference Routing
sidebar-title: Configure Inference Routing
description: Set up the managed local inference endpoint with provider credentials and model configuration.
keywords: Generative AI, Cybersecurity, Inference Routing, Configuration, Privacy, LLM, Provider
position: 2
---

This page covers the managed local inference endpoint (`https://inference.local`). External inference endpoints go through sandbox `network_policies`. Refer to [Policies](/sandboxes/policies) for details.

The configuration consists of three values:

| Value           | Description                                                                          |
| --------------- | ------------------------------------------------------------------------------------ |
| Provider record | The credential backend OpenShell uses to authenticate with the upstream model host.  |
| Model ID        | The model to use for generation requests.                                            |
| Timeout         | Per-request timeout in seconds for upstream inference calls. Defaults to 60 seconds. |

For a list of tested providers and their base URLs, refer to [Supported Inference Providers](/sandboxes/manage-providers#supported-inference-providers).

## Create a Provider

Create a provider that holds the backend credentials you want OpenShell to use.

<Tabs>
  <Tab title="NVIDIA API Catalog">
    ```shell
    openshell provider create --name nvidia-prod --type nvidia --from-existing
    ```

    This reads `NVIDIA_API_KEY` from your environment.
  </Tab>

  <Tab title="OpenAI-compatible Provider">
    Any cloud provider that exposes an OpenAI-compatible API works with the `openai` provider type. You need three values from the provider: the base URL, an API key, and a model name.

    ```shell
    openshell provider create \
        --name my-cloud-provider \
        --type openai \
        --credential OPENAI_API_KEY=<your_api_key> \
        --config OPENAI_BASE_URL=https://api.example.com/v1
    ```

    Replace the base URL and API key with the values from your provider. For supported providers out of the box, refer to [Supported Inference Providers](/sandboxes/manage-providers#supported-inference-providers). For other providers, refer to your provider's documentation for the correct base URL, available models, and API key setup.
  </Tab>

  <Tab title="Local Endpoint">
    ```shell
    openshell provider create \
        --name my-local-model \
        --type openai \
        --credential OPENAI_API_KEY=empty-if-not-required \
        --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
    ```

    Use `--config OPENAI_BASE_URL` to point to any OpenAI-compatible server running where the gateway runs. For host-backed local inference, use `host.openshell.internal` or the host's LAN IP. Avoid `127.0.0.1` and `localhost`. Set `OPENAI_API_KEY` to a dummy value if the server does not require authentication.

    <Tip>
      For a self-contained setup, the Ollama community sandbox bundles Ollama inside the sandbox itself — no host-level provider needed. See [Inference Ollama](/tutorials/inference-ollama) for details.
    </Tip>

    Ollama also supports cloud-hosted models using the `:cloud` tag suffix (e.g., `qwen3.5:cloud`).
  </Tab>

  <Tab title="Anthropic">
    ```shell
    openshell provider create --name anthropic-prod --type anthropic --from-existing
    ```

    This reads `ANTHROPIC_API_KEY` from your environment.
  </Tab>
</Tabs>

## Set Inference Routing

Point `inference.local` at that provider and choose the model to use:

```shell
openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b
```

To override the default 60-second per-request timeout, add `--timeout`:

```shell
openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b \
    --timeout 300
```

The value is in seconds. When `--timeout` is omitted (or set to `0`), the default of 60 seconds applies.

## Verify the Active Config

Confirm that the provider and model are set correctly:

```shell
openshell inference get
Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Timeout: 300s
  Version: 1
```
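If a script needs the active configuration, one option is to parse the `Key: value` lines of the output shown above (a hedged sketch; the field names match this sample output, but check your CLI version for a machine-readable output flag, which would be more robust if available):

```python
def parse_inference_get(output: str) -> dict[str, str]:
    """Parse 'Key: value' lines from `openshell inference get` output
    into a dict, skipping headings and blank lines."""
    config = {}
    for line in output.splitlines():
        line = line.strip()
        # Headings like "Gateway inference:" end with a colon and carry no value.
        if ":" in line and not line.endswith(":"):
            key, _, value = line.partition(":")
            config[key.strip()] = value.strip()
    return config
```

For the sample above, `parse_inference_get(...)["Model"]` would return `nvidia/nemotron-3-nano-30b-a3b`.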

## Update Part of the Config

Use `update` when you want to change only one field:

```shell
openshell inference update --model nvidia/nemotron-3-nano-30b-a3b
```

Or switch providers without repeating the current model:

```shell
openshell inference update --provider openai-prod
```

Or change only the timeout:

```shell
openshell inference update --timeout 120
```

## Use the Local Endpoint from a Sandbox

After inference is configured, code inside any sandbox can call `https://inference.local` directly:

```python
from openai import OpenAI

client = OpenAI(base_url="https://inference.local/v1", api_key="unused")

response = client.chat.completions.create(
    model="anything",
    messages=[{"role": "user", "content": "Hello"}],
)
```

The client-supplied `model` and `api_key` values are not sent upstream. The privacy router injects the real credentials from the configured provider and rewrites the model before forwarding.

Some SDKs require a non-empty API key even though `inference.local` does not use the sandbox-provided value. In those cases, pass any placeholder such as `test` or `unused`.

Use this endpoint when inference should stay local to the host for privacy and security reasons. External providers that should be reached directly belong in `network_policies` instead.

When the upstream runs on the same machine as the gateway, bind it to `0.0.0.0` and point the provider at `host.openshell.internal` or the host's LAN IP. `127.0.0.1` and `localhost` usually fail because the request originates from the gateway or sandbox runtime, not from your shell.
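A quick way to catch this misconfiguration is to check a provider's base URL for loopback hosts before creating the provider (an illustrative stdlib sketch, not part of the OpenShell tooling):

```python
from urllib.parse import urlparse

# Loopback hosts are unreachable from the gateway or sandbox runtime.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}

def is_loopback_base_url(base_url: str) -> bool:
    """Return True if the base URL points at a loopback host,
    which the gateway or sandbox runtime typically cannot reach."""
    host = urlparse(base_url).hostname or ""
    return host.lower() in LOOPBACK_HOSTS
```

For example, `is_loopback_base_url("http://127.0.0.1:11434/v1")` flags the URL, while `http://host.openshell.internal:11434/v1` passes.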

If the gateway runs on a remote host or behind a cloud deployment, `host.openshell.internal` points to that remote machine, not to your laptop. A locally running Ollama or vLLM process is not reachable from a remote gateway unless you add your own tunnel or shared network path. Ollama also supports cloud-hosted models that do not require local hardware.

### Verify the Endpoint from a Sandbox

`openshell inference set` and `openshell inference update` verify the resolved upstream endpoint by default before saving the configuration. If the endpoint is not live yet, retry with `--no-verify` to persist the route without the probe.

`openshell inference get` confirms the current saved configuration. To confirm end-to-end connectivity from a sandbox, run:

```shell
curl https://inference.local/v1/responses \
    -H "Content-Type: application/json" \
    -d '{
      "instructions": "You are a helpful assistant.",
      "input": "Hello!"
    }'
```

A successful response confirms the privacy router can reach the configured backend and the model is serving requests.

* Gateway-scoped: Every sandbox using the active gateway sees the same `inference.local` backend.
* HTTPS only: `inference.local` is intercepted only for HTTPS traffic.
* Hot reload: Provider, model, and timeout changes are picked up by running sandboxes within about 5 seconds by default. No sandbox recreation is required.

## Next Steps

Explore related topics:

* To understand the inference routing flow and supported API patterns, refer to the [inference overview](/inference/about).
* To follow a complete Ollama-based local setup, refer to [Inference Ollama](/tutorials/inference-ollama).
* To follow a complete LM Studio-based local setup, refer to [Local Inference with LM Studio](/tutorials/local-inference-lmstudio).
* To control external endpoints, refer to [Policies](/sandboxes/policies).
* To manage provider records, refer to [Manage Providers](/sandboxes/manage-providers).