Adding a Sandbox Provider

Add a provider when NeMo Gym needs to create sandboxes through a new runtime backend, such as a container service, HPC isolation layer, or in-house execution platform. The public AsyncSandbox and Sandbox facades stay the same; the provider owns runtime-specific create, command, file transfer, status, and cleanup behavior.

Provider Contract

Providers implement the SandboxProvider protocol from nemo_gym.sandbox.providers.base. Keep common caller fields on SandboxSpec; put backend-specific options in SandboxSpec.provider_options.

from pathlib import Path

from nemo_gym.sandbox.providers.base import (
    SandboxExecResult,
    SandboxHandle,
    SandboxSpec,
    SandboxStatus,
)


class MySandboxProvider:
    name = "my_provider"

    async def create(self, spec: SandboxSpec) -> SandboxHandle:
        # my_runtime_create is a placeholder for the backend's own allocation call.
        raw = await my_runtime_create(spec)
        return SandboxHandle(sandbox_id=raw.id, provider_name=self.name, raw=raw)

    async def exec(
        self,
        handle: SandboxHandle,
        command: str,
        *,
        cwd: str | None = None,
        env: dict[str, str] | None = None,
        timeout_s: int | float | None = None,
        user: str | int | None = None,
    ) -> SandboxExecResult:
        result = await handle.raw.run(command, cwd=cwd, env=env, timeout_s=timeout_s, user=user)
        return SandboxExecResult(
            stdout=result.stdout,
            stderr=result.stderr,
            return_code=result.return_code,
        )

    async def upload_file(self, handle: SandboxHandle, source_path: Path, target_path: str) -> None:
        await handle.raw.upload(source_path, target_path)

    async def download_file(self, handle: SandboxHandle, source_path: str, target_path: Path) -> None:
        await handle.raw.download(source_path, target_path)

    async def status(self, handle: SandboxHandle) -> SandboxStatus:
        # Placeholder: a real provider would query the runtime and map its state.
        return SandboxStatus.RUNNING

    async def close(self, handle: SandboxHandle) -> None:
        await handle.raw.stop()

    async def aclose(self) -> None:
        # This example holds no provider-scoped resources, so there is nothing to release.
        return None

Provider implementations should preserve the same lifecycle contract as the built-in providers:

  • Return a SandboxHandle from create() only after the sandbox is ready enough to run commands and transfer files.
  • Return command status through SandboxExecResult for process exits, including nonzero exits.
  • Raise SandboxCreateError or SandboxCreateVerificationError for sandbox allocation and readiness failures.
  • Make close() safe to call from cleanup paths.
  • Use aclose() for provider-scoped resources such as SDK clients.
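
The lifecycle rules above can be sketched against a fake runtime. This is a minimal illustration, not how the built-in providers are written: `FakeSandbox`, `SketchProvider`, and the local `SandboxCreateError` and `ExecResult` classes are stand-ins defined here so the snippet runs without NeMo Gym installed. A real provider would use the NeMo Gym types instead.

```python
import asyncio
from dataclasses import dataclass


class SandboxCreateError(Exception):
    """Local stand-in for NeMo Gym's SandboxCreateError."""


@dataclass
class ExecResult:
    """Local stand-in for SandboxExecResult."""

    stdout: str
    stderr: str
    return_code: int


class FakeSandbox:
    """Hypothetical runtime object; a real backend SDK would supply this."""

    def __init__(self, ready: bool = True) -> None:
        self.ready = ready
        self.stopped = False

    async def run(self, command: str) -> ExecResult:
        # Pretend any command containing "fail" exits with status 1.
        code = 1 if "fail" in command else 0
        return ExecResult(stdout="", stderr="", return_code=code)

    async def stop(self) -> None:
        self.stopped = True


class SketchProvider:
    def __init__(self) -> None:
        self.stop_calls = 0

    async def create(self, sandbox: FakeSandbox) -> FakeSandbox:
        # Hand the sandbox back only once it can run commands.
        if not sandbox.ready:
            raise SandboxCreateError("sandbox never became ready")
        return sandbox

    async def exec(self, sandbox: FakeSandbox, command: str) -> ExecResult:
        # A nonzero exit is returned as data, not raised as an exception.
        return await sandbox.run(command)

    async def close(self, sandbox: FakeSandbox) -> None:
        # Safe to call repeatedly from cleanup paths.
        if sandbox.stopped:
            return
        self.stop_calls += 1
        await sandbox.stop()


async def demo() -> tuple[int, int, bool]:
    provider = SketchProvider()
    sandbox = await provider.create(FakeSandbox())
    result = await provider.exec(sandbox, "fail now")
    await provider.close(sandbox)
    await provider.close(sandbox)  # second call is a no-op
    try:
        await provider.create(FakeSandbox(ready=False))
    except SandboxCreateError:
        create_failed = True
    else:
        create_failed = False
    return result.return_code, provider.stop_calls, create_failed


outcome = asyncio.run(demo())
```

In this sketch the failing command comes back with return code 1 rather than an exception, the runtime is stopped only once despite two close() calls, and the sandbox that never became ready surfaces as a create error.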

Provider Config

Provider constructors receive the single provider-specific mapping from a named sandbox block:

sandbox:
  default_metadata:
    sandbox-api: my-runtime
  my_provider:
    connection:
      endpoint: https://sandbox.example
    operations:
      retries: 3

resolve_provider_config("sandbox", global_config_dict) returns only the single provider mapping:

{"my_provider": {"connection": {"endpoint": "https://sandbox.example"}, "operations": {"retries": 3}}}

resolve_provider_metadata("sandbox", global_config_dict) returns the optional default_metadata block. Agents merge those values into SandboxSpec.metadata before calling the provider's create(); where both define the same key, the agent-level value takes precedence.

Each named block must contain exactly one non-reserved provider key. The only reserved key today is default_metadata.
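
A minimal sketch of that rule, assuming the named block has already been loaded into a plain dict. `split_provider_block` and `RESERVED_KEYS` are hypothetical names written for illustration; the real resolve_provider_config may validate differently or raise a different error type.

```python
from collections.abc import Mapping
from typing import Any

# Assumption: mirrors the single reserved key named in the text.
RESERVED_KEYS = frozenset({"default_metadata"})


def split_provider_block(block: Mapping[str, Any]) -> tuple[str, Any]:
    """Return (provider_name, provider_settings) from a named sandbox block."""
    provider_keys = [key for key in block if key not in RESERVED_KEYS]
    if len(provider_keys) != 1:
        raise ValueError(
            f"Expected exactly one provider key, found {len(provider_keys)}: "
            f"{sorted(provider_keys)}"
        )
    name = provider_keys[0]
    return name, block[name]


block = {
    "default_metadata": {"sandbox-api": "my-runtime"},
    "my_provider": {"operations": {"retries": 3}},
}
provider_name, provider_settings = split_provider_block(block)
```

A block with no provider key, or with two, fails before any provider is constructed.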

Provider Options

Use SandboxSpec.provider_options for per-sandbox options that do not belong in the neutral schema. Validate that mapping before allocating runtime resources so bad configs fail early.

from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MyProviderOptions:
    snapshot_id: str | None = None
    labels: dict[str, str] | None = None

    @classmethod
    def from_mapping(cls, options: Mapping[str, Any] | None) -> "MyProviderOptions":
        if options is None:
            return cls()
        allowed = set(cls.__dataclass_fields__)
        unknown = set(options) - allowed
        if unknown:
            raise ValueError(f"Unknown provider option(s): {', '.join(sorted(unknown))}")
        return cls(**dict(options))

Then resolve it inside create():

async def create(self, spec: SandboxSpec) -> SandboxHandle:
    options = MyProviderOptions.from_mapping(spec.provider_options)
    ...

This keeps common fields such as image, env, files, metadata, and resources portable while still giving a provider room for runtime-specific behavior.

Registry

The registry in nemo_gym.sandbox.providers.registry maps provider names from config to provider classes. Lookup order is:

  1. Explicit in-process registration with register_provider().
  2. Built-in providers shipped with NeMo Gym.
  3. Installed Python entry points in the nemo_gym.sandbox_providers group.
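
The lookup order above can be modeled as three mappings consulted in sequence. This is a simplified model, not the code in registry.py: `resolve_provider`, the mapping names, and the three placeholder classes are assumptions made for the example.

```python
from collections.abc import Callable, Mapping


def resolve_provider(
    name: str,
    registered: Mapping[str, type],
    builtin_loaders: Mapping[str, Callable[[], type]],
    entry_point_loaders: Mapping[str, Callable[[], type]],
) -> type:
    """Resolve a provider class: registered first, then built-in, then entry points."""
    if name in registered:
        return registered[name]
    for loaders in (builtin_loaders, entry_point_loaders):
        if name in loaders:
            # Loaders are called only on a match, so imports stay lazy.
            return loaders[name]()
    raise KeyError(f"Unknown sandbox provider: {name!r}")


class Registered: ...
class Builtin: ...
class FromEntryPoint: ...


registered = {"a": Registered}
builtin_loaders = {"a": lambda: Builtin, "b": lambda: Builtin}
entry_point_loaders = {"b": lambda: FromEntryPoint, "c": lambda: FromEntryPoint}

resolved = [
    resolve_provider(name, registered, builtin_loaders, entry_point_loaders)
    for name in ("a", "b", "c")
]
```

Name `a` resolves to the registered class even though a built-in exists, `b` resolves to the built-in ahead of the entry point, and `c` falls through to the entry point.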

External packages and tests can register a provider directly:

from nemo_gym.sandbox.providers.registry import register_provider


register_provider("my_provider", MySandboxProvider)

In-tree built-in providers should use a lazy loader in registry.py so importing nemo_gym.sandbox does not eagerly import optional provider dependencies.

def _load_my_provider() -> ProviderClass:
    from nemo_gym.sandbox.providers.my_provider import MySandboxProvider

    return MySandboxProvider


_BUILTIN_PROVIDER_LOADERS["my_provider"] = _load_my_provider

Out-of-tree packages can publish a provider entry point in their pyproject.toml:

[project.entry-points."nemo_gym.sandbox_providers"]
my_provider = "my_pkg.provider:MyProvider"

Two entry points with the same provider name raise an error because selection would be nondeterministic. An entry point whose name is already taken by a registered or built-in provider is ignored with a warning.
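
These two conflict rules can be sketched as a pure function over (name, target) pairs. `select_entry_points` is a hypothetical helper, not part of NeMo Gym, and the choice to check duplicates before shadowing is an assumption; the actual registry may order or report these checks differently.

```python
import warnings
from collections.abc import Iterable


def select_entry_points(
    entries: Iterable[tuple[str, str]],
    taken_names: set[str],
) -> dict[str, str]:
    """Return usable entry points keyed by provider name.

    entries: (name, "module:attr") pairs discovered from installed packages.
    taken_names: names already claimed by registered or built-in providers.
    """
    selected: dict[str, str] = {}
    for name, target in entries:
        # Assumption: duplicates are rejected before shadowing is considered.
        if name in selected:
            raise ValueError(f"Duplicate sandbox provider entry point: {name!r}")
        selected[name] = target
    for name in sorted(selected.keys() & taken_names):
        warnings.warn(
            f"Ignoring entry point {name!r}: shadowed by a registered or built-in provider"
        )
        del selected[name]
    return selected


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    usable = select_entry_points(
        [("other_provider", "other_pkg.provider:OtherProvider"),
         ("my_provider", "my_pkg.provider:MyProvider")],
        taken_names={"my_provider"},
    )
```

On Python 3.10 or later, the input pairs could be gathered with `importlib.metadata.entry_points(group="nemo_gym.sandbox_providers")`, reading each entry's `name` and `value` attributes.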

After registration, callers select the provider with a single-key provider config:

provider_config = {
    "my_provider": {
        "provider_setting": "value",
    }
}

Provider Pages

Each provider page should use the same shape so users can compare backends quickly:

  • Setup and optional dependencies
  • Named provider config block
  • provider_options accepted by SandboxSpec
  • Resource mapping and isolation properties
  • Minimal gym env start or local first-run example
  • Provider-specific operational notes