---

title: Run Local Inference with Ollama
sidebar-title: Inference with Ollama
slug: tutorials/inference-ollama
description: Run local and cloud models inside an OpenShell sandbox using the Ollama community sandbox, or route sandbox requests to a host-level Ollama server.
keywords: Generative AI, Cybersecurity, Tutorial, Inference Routing, Ollama, Local Inference, Sandbox
---

This tutorial covers two ways to use Ollama with OpenShell:

1. **Ollama sandbox (recommended)** — a self-contained sandbox with Ollama, Claude Code, OpenCode, and Codex pre-installed. One command to start.
2. **Host-level Ollama** — run Ollama on the gateway host and route sandbox inference to it. Useful when you want a single Ollama instance shared across multiple sandboxes.

After completing this tutorial, you will know how to:

* Launch the Ollama community sandbox for a batteries-included experience.
* Use `ollama launch` to start coding agents inside a sandbox.
* Expose a host-level Ollama server to sandboxes through `inference.local`.

## Prerequisites

* A working OpenShell installation. Complete the [Quickstart](/get-started/quickstart) before proceeding.

## Option A: Ollama Community Sandbox (Recommended)

The Ollama community sandbox bundles Ollama, Claude Code, OpenCode, and Codex into a single image. Ollama starts automatically when the sandbox launches.

<Steps toc={true}>
  ### Create the Sandbox

  ```shell
  openshell sandbox create --from ollama
  ```

  This pulls the community sandbox image, applies the bundled policy, and drops you into a shell with Ollama running.

  ### Chat with a Model

  Chat with a local model:

  ```shell
  ollama run qwen3.5
  ```

  Or a cloud model:

  ```shell
  ollama run kimi-k2.5:cloud
  ```
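
  You can also pass a prompt as an argument for a one-shot, non-interactive reply (the model name here is one of the examples above):

  ```shell
  # One-shot prompt; prints the reply and exits instead of opening a chat.
  ollama run qwen3.5 "Summarize what a sandbox is in one sentence."
  ```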

  Or use `ollama launch` to start a coding agent with Ollama as the model backend:

  ```shell
  ollama launch claude
  ollama launch codex
  ollama launch opencode
  ```

  For CI/CD and automated workflows, `ollama launch` supports a headless mode:

  ```shell
  ollama launch claude --yes --model qwen3.5
  ```
</Steps>

### Model Recommendations

| Use case             | Model              | Notes                                                            |
| -------------------- | ------------------ | ---------------------------------------------------------------- |
| Smoke test           | `qwen3.5:0.8b`     | Fast, lightweight, good for verifying setup                      |
| Coding and reasoning | `qwen3.5`          | Strong tool calling support for agentic workflows                |
| Complex tasks        | `nemotron-3-super` | 122B parameter model, needs 48GB+ VRAM                           |
| No local GPU         | `qwen3.5:cloud`    | Runs on Ollama's cloud infrastructure, no `ollama pull` required |

<Note>
  Cloud models use the `:cloud` tag suffix and do not require local hardware.

  ```shell
  ollama run qwen3.5:cloud
  ```
</Note>

### Tool Calling

Agentic workflows (Claude Code, Codex, OpenCode) rely on tool calling. The following models have reliable tool calling support: Qwen 3.5, Nemotron-3-Super, GLM-5, and Kimi-K2.5. Check the [Ollama model library](https://ollama.com/library) for the latest models.
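
To see tool calling in action, you can send a request with a `tools` array to Ollama's OpenAI-compatible endpoint. This is a sketch assuming Ollama's default port 11434; the `get_weather` function is illustrative, not a built-in:

```shell
# Ask a tool-capable model a question it can only answer by calling a tool.
# A model with reliable tool calling should respond with a tool_calls entry
# naming get_weather, rather than plain text.
curl -s http://localhost:11434/v1/chat/completions --json '{
  "model": "qwen3.5",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```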

### Updating Ollama

To update Ollama inside a running sandbox:

```shell
update-ollama
```

Or auto-update on every sandbox start:

```shell
openshell sandbox create --from ollama -e OLLAMA_UPDATE=1
```

## Option B: Host-Level Ollama

Use this approach when you want a single Ollama instance on the gateway host, shared across multiple sandboxes through `inference.local`.

<Note>
  This approach uses Ollama because it is easy to install and run locally, but you can substitute other inference engines such as vLLM, SGLang, TRT-LLM, and NVIDIA NIM by changing the startup command, base URL, and model name.
</Note>

<Steps toc={true}>
  ### Install and Start Ollama

  Install [Ollama](https://ollama.com/) on the gateway host:

  ```shell
  curl -fsSL https://ollama.com/install.sh | sh
  ```

  Start Ollama on all interfaces so it is reachable from sandboxes:

  ```shell
  OLLAMA_HOST=0.0.0.0:11434 ollama serve
  ```

  <Tip>
    If you see `Error: listen tcp 0.0.0.0:11434: bind: address already in use`, Ollama is already running as a system service. Stop it first:

    ```shell
    systemctl stop ollama
    OLLAMA_HOST=0.0.0.0:11434 ollama serve
    ```
  </Tip>
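
  If Ollama runs as a systemd service and you want the binding to persist across restarts, one approach is a systemd drop-in override (a sketch following standard systemd conventions; adjust the path for your distribution):

  ```shell
  # Create a drop-in that sets OLLAMA_HOST for the ollama service.
  sudo mkdir -p /etc/systemd/system/ollama.service.d
  printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | \
      sudo tee /etc/systemd/system/ollama.service.d/override.conf
  # Reload unit files and restart so the new environment takes effect.
  sudo systemctl daemon-reload
  sudo systemctl restart ollama
  ```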

  ### Pull a Model

  In a second terminal, pull and load a model (`ollama run` downloads it on first use):

  ```shell
  ollama run qwen3.5:0.8b
  ```

  Type `/bye` to exit the interactive session. The model stays loaded.

  ### Create a Provider

  Create an OpenAI-compatible provider pointing at the host Ollama:

  ```shell
  openshell provider create \
      --name ollama \
      --type openai \
      --credential OPENAI_API_KEY=empty \
      --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1
  ```

  OpenShell injects `host.openshell.internal` so sandboxes and the gateway can reach the host machine. You can also use the host's LAN IP.
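
  Before routing sandbox traffic, you can confirm the OpenAI-compatible endpoint answers from the host itself (11434 is Ollama's default port):

  ```shell
  # Lists available models; the model you pulled should appear in the output.
  curl -s http://localhost:11434/v1/models
  ```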

  ### Set Inference Routing

  ```shell
  openshell inference set --provider ollama --model qwen3.5:0.8b
  ```

  Confirm:

  ```shell
  openshell inference get
  ```

  ### Verify from a Sandbox

  ```shell
  openshell sandbox create -- \
      curl https://inference.local/v1/chat/completions \
      --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'
  ```

  The response should be JSON from the model.
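
  Responses follow the OpenAI chat-completions schema, with the reply text at `.choices[0].message.content`. A quick way to pull it out, sketched here against a canned response (assuming `python3` is available in the sandbox):

  ```shell
  # Extract the assistant text from a chat-completions response.
  # The canned JSON stands in for real model output.
  response='{"choices":[{"message":{"role":"assistant","content":"Hello!"}}]}'
  echo "$response" | python3 -c \
      'import json,sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
  # Prints: Hello!
  ```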
</Steps>

## Troubleshooting

Common issues and fixes:

* **Ollama not reachable from sandbox** — Ollama must be bound to `0.0.0.0`, not `127.0.0.1`. This applies to host-level Ollama only; the community sandbox handles this automatically.
* **`OPENAI_BASE_URL` wrong** — Use `http://host.openshell.internal:11434/v1`, not `localhost` or `127.0.0.1`.
* **Model not found** — Run `ollama ps` to confirm the model is loaded. Run `ollama pull <model>` if needed.
* **HTTPS vs HTTP** — Code inside sandboxes must call `https://inference.local`, not `http://`.
* **AMD GPU driver issues** — Ollama v0.18+ requires ROCm 7 drivers for AMD GPUs. Update your drivers if you see GPU detection failures.

Useful commands:

```shell
openshell status
openshell inference get
openshell provider get ollama
```

## Next Steps

* To learn more about managed inference, refer to the [inference overview](/inference/about).
* To configure a different self-hosted backend, refer to [Configure](/inference/configure).
* To explore more community sandboxes, refer to [Community Sandboxes](/sandboxes/community-sandboxes).