Sandbox Orchestration#
Per-problem isolated execution environments for agent evaluations and code execution.
Overview#
NEL provides three levels of isolation:
| Level | What | Implementation |
|---|---|---|
| Cluster | Where the overall eval runs | |
| Benchmark | Per-benchmark environment container | SLURM |
| Problem | Per-problem isolated sandbox | Docker, SLURM, ECS Fargate, Apptainer, or local (see Sandbox Backends) |
The per-problem sandbox is essential for:
- **Agent evaluations** (SWE-bench, OpenHands, terminal-bench): each problem gets an isolated container with its own filesystem, network, and process space.
- **Code execution** (HumanEval, NeMo Skills math): untrusted code runs in a sandbox with resource limits.
- **Multi-turn tool use**: agents in sandboxes need to reach the model server while the orchestrator reaches the agent.
Architecture#
```text
┌────────────────────────────────────────────────┐
│ Orchestrator (eval loop)                       │
│                                                │
│   for problem in dataset:                      │
│       seed = env.seed(idx)                     │
│       sandbox = manager.acquire(spec)          │
│       result = solver.solve(task, sandbox)     │
│       verify = env.verify(resp, exp, sandbox)  │
│       manager.release(sandbox)                 │
└───────────┬───────────────────┬────────────────┘
            │                   │
┌───────────▼──┐     ┌──────────▼──────────┐
│ Model Server │     │ Sandbox (1/problem) │
│ (vLLM/NIM)   │◄────┤  image: per-task    │
│              │     │  bridge network     │
│   :8000      │     │ ┌─────────────────┐ │
└──────────────┘     │ │  Agent or code  │ │
                     │ └─────────────────┘ │
                     └─────────────────────┘
```
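The loop in the diagram, spelled out as a minimal async sketch. Method names follow the pseudocode above and the `SeedResult`/`verify` signatures shown later on this page; the exact NEL APIs may differ.

```python
# Minimal sketch of the orchestrator loop from the diagram above.
async def run_eval(env, solver, manager, num_problems: int):
    for idx in range(num_problems):
        seed = await env.seed(idx)                  # per-problem task + sandbox spec
        sandbox = await manager.acquire(seed.sandbox_spec)
        try:
            response = await solver.solve(seed.prompt, sandbox)
            result = await env.verify(
                response, seed.expected_answer, sandbox=sandbox, **seed.metadata
            )
        finally:
            await manager.release(sandbox)          # sandbox lives through verify()
```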
Design Principles#
Sandbox is infrastructure-only#
The sandbox knows nothing about agents, solvers, or evaluation. Its job:
- Start a container from a given image
- Provide `exec()`, `upload()`, `download()`
- Expose outside endpoints (model URL) inside the container
- Stop and clean up
Whether we exec a Python script or start a multi-turn agent is the solver’s decision.
Async protocol#
All sandbox operations are I/O-bound. The eval loop is async. The protocol is fully async to avoid blocking the event loop when running concurrent sandboxes.
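Concretely, these two principles imply roughly the following shape. This is a sketch, not NEL's exact API: the `start`/`stop` names, the `ExecResult` fields beyond `return_code`, and the parameter defaults are assumptions.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ExecResult:
    return_code: int   # checked by verify() (see the SWE-bench example below)
    stdout: str        # illustrative; NEL's actual result type may differ
    stderr: str

class Sandbox(Protocol):
    """Infrastructure-only and fully async; signatures are illustrative."""

    container_ip: str | None  # set by bridge-network backends

    async def start(self) -> None: ...   # start a container from the spec's image
    async def exec(self, command: str, timeout_sec: float | None = None) -> ExecResult: ...
    async def upload(self, local_path: str, remote_path: str) -> None: ...
    async def download(self, remote_path: str, local_path: str) -> None: ...
    def resolve_outside_endpoint(self, url: str) -> str: ...  # expose host URLs inside the container
    async def stop(self) -> None: ...    # stop and clean up
```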
Bridge network by default#
With `--network host`, concurrent sandboxes that each bind port 3000 collide. A bridge network gives each container its own IP, and the orchestrator reaches the container via `container_ip`.
Verify needs the sandbox#
For agent benchmarks, verification (apply patch, run tests) happens inside the same sandbox the agent modified. The sandbox stays alive through: seed → solve(sandbox) → verify(sandbox) → release.
The Per-Task Image Problem#
SWE-bench is the critical test case. Every problem has a different Docker image:
- `instance_id = "django__django-11099"` → `swebench/sweb.eval.x86_64.django_1776_django-11099:latest`
- `instance_id = "scikit-learn__scikit-learn-13779"` → `swebench/sweb.eval.x86_64.scikit-learn_1776_scikit-learn-13779:latest`
NEL handles this via three mechanisms (in priority order):
1. `SandboxSpec` in `SeedResult`: the environment returns per-problem sandbox requirements directly.
2. `image_template`: config-level template interpolated from metadata: `"swebench/sweb.eval.x86_64.{instance_id}:latest"`.
3. `default_image`: fallback for homogeneous benchmarks.
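In rough form, resolution could look like the sketch below. The `resolve_spec` helper is hypothetical; field names mirror the list above, but the real `SandboxManager` logic may differ.

```python
from nemo_evaluator.sandbox import SandboxSpec

def resolve_spec(seed, config, metadata) -> SandboxSpec:
    # 1. A per-problem spec returned by the environment wins.
    if seed.sandbox_spec is not None:
        return seed.sandbox_spec
    # 2. Otherwise interpolate the config-level template from metadata.
    if config.image_template:
        return SandboxSpec(image=config.image_template.format(**metadata))
    # 3. Fall back to a single image for homogeneous benchmarks.
    return SandboxSpec(image=config.default_image)
```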
Sandbox Backends#
DockerSandbox#
Per-problem Docker container on the orchestrator machine.
- Bridge network by default (each container gets its own IP)
- Async operations via `asyncio.create_subprocess_exec`
- `resolve_outside_endpoint()` rewrites `localhost` → `host.docker.internal`
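For instance, a bridge-network backend can discover a container's IP by shelling out to `docker inspect`. This is a sketch of the pattern, not `DockerSandbox`'s actual implementation:

```python
import asyncio

async def container_ip(name: str) -> str:
    # Query the bridge-network IP assigned to the named container.
    proc = await asyncio.create_subprocess_exec(
        "docker", "inspect", "-f",
        "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}",
        name,
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()
    return out.decode().strip()
```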
SlurmSandbox#
Per-problem container on SLURM nodes via Pyxis/Enroot.
- **Multiplexed**: multiple sandboxes per node via `--container-name`
- Shared filesystem for fast file transfer when available
- SLURM nodes share the job's network (no URL rewriting needed)
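Each sandbox maps to a named Pyxis container. A sketch of the kind of `srun` invocation this implies; the flags shown are standard Pyxis/SLURM options, but the exact command NEL builds is not documented here:

```python
def srun_argv(node: str, slot: int, image: str, command: str) -> list[str]:
    # One named Pyxis container per (node, slot); reusing --container-name
    # lets later exec() calls land in the same container.
    name = f"sandbox-{slot}"
    return [
        "srun", "--overlap", f"--nodelist={node}",
        f"--container-image={image}",
        f"--container-name={name}",
        "bash", "-lc", command,
    ]
```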
ECSFargateSandbox#
Per-problem container on AWS ECS Fargate. No local Docker required.
- **SSH sidecar**: each ECS task includes an SSH sidecar container for `exec()`, `upload()`, `download()` via SSH tunnels
- **Bidirectional tunnels**: forward tunnel (orchestrator → container exec server) and reverse tunnels (container → host proxy/model endpoint)
- **CodeBuild image building**: per-task Docker images built via AWS CodeBuild and pushed to ECR with content-hash tags for caching
- **OutsideEndpoint**: host-side services (LiteLLM proxy, model server) are made reachable inside the container via reverse SSH tunnels
Example configuration:

```yaml
benchmarks:
  - name: harbor://swebench-verified@1.0
    solver:
      type: harbor
      service: model
      agent: openhands
    sandbox:
      type: ecs_fargate
      cluster: harbor-cluster
      subnets: [subnet-abc123]
      security_groups: [sg-def456]
      ecr_repository: 123456789.dkr.ecr.us-west-2.amazonaws.com/harbor-swebench
      ssh_sidecar:
        sshd_port: 2222
        public_key_secret_arn: arn:aws:secretsmanager:...
        private_key_secret_arn: arn:aws:secretsmanager:...
```
ApptainerSandbox#
Per-problem Apptainer/Singularity container for HPC environments where Docker is unavailable. Supports SLURM integration via `salloc`/`srun`.
LocalSandbox#
Async subprocess in a temp directory. No container isolation. For development and testing.
Configuration#
SWE-bench on Docker#

```yaml
benchmarks:
  - name: gym://swebench-server
    sandbox:
      type: docker
      image_template: "swebench/sweb.eval.x86_64.{instance_id}:latest"
      concurrency: 8
      memory: 4g
      network: bridge
```
SWE-bench on SLURM#
```yaml
benchmarks:
  - name: gym://swebench-server
    sandbox:
      type: slurm
      image_template: "swebench/sweb.eval.x86_64.{instance_id}:latest"
      concurrency: 16
      node_pool: sandbox
      slots_per_node: 4

cluster:
  type: slurm
  node_pools:
    sandbox:
      partition: batch
      nodes: 4
```
4 sandbox nodes × 4 slots per node = 16 concurrent sandboxes.
Code execution (HumanEval)#
```yaml
benchmarks:
  - name: humaneval
    sandbox:
      type: docker
      image: python:3.12-slim
      network: none
      memory: 256m
      cpus: 1
      timeout: 30
      concurrency: 16
```
SandboxManager#
The `SandboxManager` provides:
- **Concurrency control**: semaphore-based limiting of concurrent sandboxes
- **Pre-pull**: bulk image pulling before eval starts (via `env.sandbox_specs()`)
- **Emergency cleanup**: `atexit` and signal handlers to stop leaked containers
- **Spec resolution**: resolves `SandboxSpec` from `SeedResult`, `image_template`, or the default image
- **SLURM multiplexing**: round-robin slot allocation across nodes
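A sketch of the concurrency-control core, assuming the async protocol sketched earlier. `_create()` and `_sync_cleanup()` are the extension points listed under "Adding a New Backend"; everything else here is illustrative:

```python
import asyncio

class SandboxManager:
    def __init__(self, config, max_concurrency: int):
        self._config = config
        self._sem = asyncio.Semaphore(max_concurrency)  # concurrency control
        self._live: set = set()                         # tracked for emergency cleanup

    async def acquire(self, spec):
        await self._sem.acquire()          # blocks once the limit is reached
        sandbox = self._create(spec)       # backend-specific branch (docker, slurm, ...)
        await sandbox.start()
        self._live.add(sandbox)
        return sandbox

    async def release(self, sandbox):
        try:
            await sandbox.stop()
        finally:
            self._live.discard(sandbox)
            self._sem.release()
```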
SandboxedAgentSolver#
The `SandboxedAgentSolver` runs agents inside sandboxes. It supports two modes:
- **exec-server**: no entrypoint → the solver execs the agent command via `sandbox.exec()`
- **agent-server**: the entrypoint starts an HTTP agent → the solver talks to it via `container_ip`

The solver infers the mode from `sandbox.spec.entrypoint`.
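In sketch form (the agent-command builder and HTTP helper are hypothetical names, not NEL APIs):

```python
# Sketch: mode selection inside SandboxedAgentSolver.solve().
async def solve(self, task, sandbox):
    if sandbox.spec.entrypoint is None:
        # exec-server mode: no entrypoint, so exec the agent command ourselves.
        result = await sandbox.exec(self._agent_command(task))  # hypothetical helper
        return result.stdout
    # agent-server mode: the entrypoint already started an HTTP agent;
    # reach it directly via the container's bridge-network IP.
    return await self._call_agent(sandbox.container_ip, task)   # hypothetical helper
```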
SWE-bench Integration Example#
SWE-bench is not a built-in environment. Here’s how to integrate it using the sandbox system:
```python
import json
import tempfile

from nemo_evaluator import EvalEnvironment, SeedResult, VerifyResult
from nemo_evaluator.sandbox import SandboxSpec


class SWEBenchEnvironment(EvalEnvironment):
    name = "swebench"

    def __init__(self, dataset_path: str):
        super().__init__()
        with open(dataset_path) as f:
            self._dataset = [json.loads(line) for line in f]

    def _instance_image(self, instance_id: str) -> str:
        safe = instance_id.replace("/", "_").replace("__", "_1776_")
        return f"swebench/sweb.eval.x86_64.{safe}:latest"

    async def sandbox_specs(self) -> list[SandboxSpec]:
        # Lets the SandboxManager pre-pull every per-problem image.
        return [
            SandboxSpec(
                image=self._instance_image(row["instance_id"]),
                workdir="/testbed",
            )
            for row in self._dataset
        ]

    async def seed(self, idx: int) -> SeedResult:
        row = self._dataset[idx]
        return SeedResult(
            prompt=row["problem_statement"],
            expected_answer=row["patch"],
            metadata={"instance_id": row["instance_id"]},
            sandbox_spec=SandboxSpec(
                image=self._instance_image(row["instance_id"]),
                workdir="/testbed",
            ),
        )

    async def verify(self, response, expected, sandbox=None, **meta):
        # `expected` (the gold patch) is unused; the tests decide pass/fail.
        if sandbox is None:
            raise RuntimeError("SWE-bench requires a sandbox for verification")
        # Write the model-generated patch locally and upload it, so that
        # /tmp/patch.diff exists inside the sandbox before we apply it
        # (assumes upload(local_path, remote_path), per the sandbox ops above).
        with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
            f.write(response)
        await sandbox.upload(f.name, "/tmp/patch.diff")
        await sandbox.exec("cd /testbed && git apply /tmp/patch.diff")
        result = await sandbox.exec(
            "cd /testbed && python -m pytest tests/ -x",
            timeout_sec=300,
        )
        passed = result.return_code == 0
        return VerifyResult(
            reward=1.0 if passed else 0.0,
            scoring_details={"method": "swebench_test", "passed": passed},
        )
```
Adding a New Backend#
1. Create `sandbox/my_backend.py` implementing the `Sandbox` protocol.
2. Add a branch in `SandboxManager._create()`.
3. Add a `CustomSandbox` variant or extend the `SandboxConfig` union.
4. Handle emergency cleanup in `SandboxManager._sync_cleanup()`.
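A skeleton for step 1, following the protocol sketch from earlier on this page. The method bodies are placeholders, not a working backend:

```python
# sandbox/my_backend.py - skeleton implementing the Sandbox protocol.
from nemo_evaluator.sandbox import SandboxSpec


class MySandbox:
    def __init__(self, spec: SandboxSpec):
        self.spec = spec
        self.container_ip: str | None = None

    async def start(self) -> None:
        ...  # start a container from self.spec.image

    async def exec(self, command: str, timeout_sec: float | None = None):
        ...  # run the command inside the container, return an exec result

    async def upload(self, local_path: str, remote_path: str) -> None:
        ...

    async def download(self, remote_path: str, local_path: str) -> None:
        ...

    def resolve_outside_endpoint(self, url: str) -> str:
        return url  # rewrite host-side URLs if the backend needs it

    async def stop(self) -> None:
        ...  # stop and clean up; also reachable from SandboxManager._sync_cleanup()
```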