nemo_evaluator.sandbox#

Sandbox implementations used by evaluation harnesses that need isolated container environments for command execution, file transfer, and agent hosting.

This module is designed to keep dependencies optional:

  • The ECS Fargate implementation only imports AWS SDKs (boto3/botocore) when actually used.

  • Transport is SSH-based; no AWS CLI or session-manager-plugin required on the host.

Sandbox Protocol#

All sandbox backends implement the Sandbox protocol, so harnesses can be written backend-agnostically:

  • start() / stop() — lifecycle

  • exec(command) — run a shell command

  • upload(local_path, remote_path) / download(remote_path, local_path) — file transfer

  • is_running — health check

  • Context manager (with sandbox: ...) for automatic cleanup
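To illustrate what "backend-agnostic" means here, the following is a minimal sketch that runs without AWS access. `SandboxLike`, `FakeSandbox`, and `upload_and_cat` are hypothetical names introduced for this example only; `SandboxLike` mirrors just a subset of the documented protocol, and the local `ExecResult` is a stand-in with the same three fields.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ExecResult:
    """Local stand-in mirroring the documented ExecResult fields."""
    stdout: str
    stderr: str
    return_code: int


class SandboxLike(Protocol):
    """Local stand-in for the documented protocol (only the subset used below)."""

    def start(self) -> None: ...
    def stop(self) -> None: ...
    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult: ...
    def upload(self, local_path: Path, remote_path: str) -> None: ...


class FakeSandbox:
    """Hypothetical in-memory backend, for illustration and tests only."""

    def __init__(self) -> None:
        self.running = False
        self.files: dict[str, bytes] = {}

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def exec(self, command: str, timeout_sec: float = 180) -> ExecResult:
        # Treats every command as `cat <path>`; nothing is really executed.
        path = command.removeprefix("cat ").strip()
        if path in self.files:
            return ExecResult(self.files[path].decode(), "", 0)
        return ExecResult("", f"{path}: not found", 1)

    def upload(self, local_path: Path, remote_path: str) -> None:
        self.files[remote_path] = Path(local_path).read_bytes()


def upload_and_cat(sandbox: SandboxLike, local_path: Path, remote_path: str) -> ExecResult:
    """Backend-agnostic step: start, upload a file, read it back, always stop."""
    sandbox.start()
    try:
        sandbox.upload(local_path, remote_path)
        return sandbox.exec(f"cat {remote_path}")
    finally:
        sandbox.stop()
```

Because `upload_and_cat` depends only on the method shapes, any object with those methods should be usable in place of `FakeSandbox`.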

Usage (ECS Fargate — exec-server mode)#

For harnesses that drive execution from the orchestrator (e.g. terminal-bench):

from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox, SshSidecarConfig

cfg = EcsFargateConfig(
    region="us-west-2",
    cluster="my-ecs-cluster",
    subnets=["subnet-abc"],
    security_groups=["sg-xyz"],
    image_template="123456789.dkr.ecr.us-west-2.amazonaws.com/my-repo:{task_id}",
    s3_bucket="my-staging-bucket",
    ssh_sidecar=SshSidecarConfig(
        public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
        private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
        exec_server_port=19542,
    ),
)

with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
    sandbox.start()
    result = sandbox.exec("echo hello")
    print(result.stdout)

Usage (ECS Fargate — agent-server mode)#

For harnesses that host an agent inside the container (e.g. openhands), omit exec_server_port and pass one or more OutsideEndpoint objects to start() via outside_endpoints:

from nemo_evaluator.sandbox import (
    EcsFargateConfig,
    EcsFargateSandbox,
    OutsideEndpoint,
    SshSidecarConfig,
)

cfg = EcsFargateConfig(
    ...
    ssh_sidecar=SshSidecarConfig(
        public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
        private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
    ),
)

model_ep = OutsideEndpoint(url="http://localhost:8080", env_var="MODEL_BASE_URL")

with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
    sandbox.start(outside_endpoints=[model_ep])
    # The agent inside the container can now reach the model via the reverse tunnel.
    # The orchestrator can reach the agent's API via sandbox.local_port.

Prerequisites / Notes#

  • SSH keys must be pre-provisioned in AWS Secrets Manager.

  • If you use S3-based file staging (large uploads / downloads), configure s3_bucket.

  • Docker image building via AWS CodeBuild is available through ImageBuilder.

class nemo_evaluator.sandbox.ExecResult(stdout: str, stderr: str, return_code: int)[source]#

Bases: object

Result of a command executed inside a sandbox.

stdout: str#
stderr: str#
return_code: int#
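
A caller will usually want to treat a non-zero return_code as a failure. The sketch below shows one way to do that. `require_success` and `CommandFailed` are hypothetical helpers, not part of nemo_evaluator.sandbox, and the dataclass is a local stand-in with the same fields so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Local stand-in with the same three fields as the documented class."""
    stdout: str
    stderr: str
    return_code: int


class CommandFailed(RuntimeError):
    """Hypothetical error type; not part of nemo_evaluator.sandbox."""


def require_success(result: ExecResult, command: str) -> str:
    """Return stdout if the command exited with 0, otherwise raise with stderr attached."""
    if result.return_code != 0:
        raise CommandFailed(
            f"{command!r} exited with {result.return_code}: {result.stderr.strip()}"
        )
    return result.stdout
```
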
class nemo_evaluator.sandbox.OutsideEndpoint(url: str, env_var: str)[source]#

Bases: object

An external service endpoint that sandbox processes need to reach.

url: str#

Orchestrator-side URL of the service (e.g. "http://localhost:3825").

env_var: str#

Environment variable to inject into sandbox processes with the resolved URL.

class nemo_evaluator.sandbox.Sandbox(*args, **kwargs)[source]#

Bases: Protocol

Minimal contract every sandbox backend must satisfy.

download(remote_path: str, local_path: Path) None[source]#
exec(
command: str,
timeout_sec: float = 180,
) ExecResult[source]#
property is_running: bool#
resolve_outside_endpoint(url: str) str[source]#

Return the URL that processes inside this sandbox should use to reach the outside service at url (orchestrator-side).

Network-isolated backends (ECS Fargate) remap url to the tunnelled address. Shared-network backends (Apptainer) return url unchanged.

Must be called after start().

start(
*,
force_build: bool = False,
outside_endpoints: list[OutsideEndpoint] | None = None,
) None[source]#
stop() None[source]#
upload(local_path: Path, remote_path: str) None[source]#
class nemo_evaluator.sandbox.EcsFargateConfig(
region: str | None = None,
cluster: str = '',
subnets: list[str] = <factory>,
security_groups: list[str] = <factory>,
assign_public_ip: bool = False,
task_definition: str | None = None,
task_definition_family_prefix: str = 'ecs-sandbox',
image_template: str | None = None,
container_name: str = 'main',
container_port: int | None = None,
cpu: str = '4096',
memory: str = '8192',
ephemeral_storage_gib: int | None = None,
platform_version: str | None = None,
execution_role_arn: str | None = None,
task_role_arn: str | None = None,
extra_env: dict[str, str] | None = None,
log_group: str | None = None,
log_stream_prefix: str | None = None,
max_task_lifetime_sec: int = 14400,
startup_timeout_sec: float = 300.0,
poll_interval_sec: float = 2.0,
run_task_max_retries: int = 30,
ssh_sidecar: ~nemo_evaluator.sandbox.ecs_fargate.SshSidecarConfig | None = None,
s3_bucket: str | None = None,
s3_prefix: str | None = None,
ecr_repository: str | None = None,
environment_dir: str | None = None,
codebuild_project: str | None = None,
codebuild_service_role: str | None = None,
codebuild_compute_type: str = 'BUILD_GENERAL1_MEDIUM',
codebuild_build_timeout: int = 30,
dockerhub_secret_arn: str | None = None,
build_parallelism: int = 50,
)[source]#

Bases: object

Configuration for the ECS Fargate sandbox.

assign_public_ip: bool = False#
build_parallelism: int = 50#
cluster: str = ''#
codebuild_build_timeout: int = 30#
codebuild_compute_type: str = 'BUILD_GENERAL1_MEDIUM'#
codebuild_project: str | None = None#
codebuild_service_role: str | None = None#
container_name: str = 'main'#
container_port: int | None = None#
cpu: str = '4096'#
dockerhub_secret_arn: str | None = None#
ecr_repository: str | None = None#
environment_dir: str | None = None#
ephemeral_storage_gib: int | None = None#
execution_role_arn: str | None = None#
extra_env: dict[str, str] | None = None#
classmethod from_dict(
raw: Mapping[str, Any],
) EcsFargateConfig[source]#
image_template: str | None = None#
log_group: str | None = None#
log_stream_prefix: str | None = None#
max_task_lifetime_sec: int = 14400#
memory: str = '8192'#
platform_version: str | None = None#
poll_interval_sec: float = 2.0#
region: str | None = None#
run_task_max_retries: int = 30#
s3_bucket: str | None = None#
s3_prefix: str | None = None#
ssh_sidecar: SshSidecarConfig | None = None#
startup_timeout_sec: float = 300.0#
task_definition: str | None = None#
task_definition_family_prefix: str = 'ecs-sandbox'#
task_role_arn: str | None = None#
subnets: list[str]#
security_groups: list[str]#
class nemo_evaluator.sandbox.EcsFargateSandbox(
cfg: EcsFargateConfig,
*,
task_id: str,
run_id: str,
)[source]#

Bases: object

ECS Fargate sandbox implementing the Sandbox protocol.

Supports two modes (determined by ssh_sidecar.exec_server_port):

Exec-server mode (exec_server_port is set):

One-way SSH tunnel + embedded HTTP exec server. exec(), upload(), download() work through the exec server.

Agent-server mode (exec_server_port is None):

Two-way SSH tunnel. Consumer accesses the hosted agent via ssh_tunnel / local_port. exec() / upload() / download() raise RuntimeError.

describe_task() dict[str, Any] | None[source]#

Return a summary dict of the ECS task’s current state, or None.

download(
remote_path: str,
local_path: Path,
) None[source]#
exec(
command: str,
timeout_sec: float = 180,
) ExecResult[source]#
property exec_client: ExecClient | None#

The exec client (only available in exec-server mode after start).

property is_running: bool#
property local_port: int | None#

Local port of the SSH forward tunnel (exec server or agent server).

property model_tunnel_port: int | None#

Port the container uses to reach the model (agent-server mode only).

reconnect_tunnel() None[source]#

Re-open the SSH tunnel if it died (e.g. after a network blip).
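
As a hedged usage sketch, a harness might check the tunnel before each command and reconnect only when needed. `exec_with_reconnect` is a hypothetical wrapper; it relies only on the documented `ssh_tunnel` property, `SshTunnel.check_health()`, `reconnect_tunnel()`, and `exec()`. It assumes that a `None` tunnel means the sandbox has not been started, in which case it does not attempt a reconnect.

```python
def exec_with_reconnect(sandbox, command: str):
    """Run `command`, re-opening the SSH tunnel first if its process has died."""
    tunnel = sandbox.ssh_tunnel
    # Assumption: a None tunnel means "not started"; leave that case to exec().
    if tunnel is not None and not tunnel.check_health():
        sandbox.reconnect_tunnel()
    return sandbox.exec(command)
```
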

resolve_outside_endpoint(url: str) str[source]#

Return the URL that processes inside this sandbox should use to reach the outside service at url (orchestrator-side).

Remaps url to the reverse-tunnel address (127.0.0.1:<tunnel-port>). Must be called after start().
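
The remapping described above can be pictured with a small standard-library sketch. `remap_to_tunnel` is a hypothetical function, not the library's implementation. It assumes the scheme, path, and query are preserved and only the host and port change, and it drops any user-info in the original URL.

```python
from urllib.parse import urlsplit, urlunsplit


def remap_to_tunnel(url: str, tunnel_port: int) -> str:
    """Point `url` at 127.0.0.1:<tunnel_port>, keeping scheme, path, and query."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"127.0.0.1:{tunnel_port}"))
```

The resulting string is what a process inside the sandbox would read from the environment variable named by `OutsideEndpoint.env_var`.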

property ssh_tunnel: SshTunnel | None#
start(
*,
force_build: bool = False,
outside_endpoints: list[OutsideEndpoint] | None = None,
) None[source]#
stop() None[source]#
property task_arn: str | None#
property task_ip: str | None#
upload(
local_path: Path,
remote_path: str,
) None[source]#
class nemo_evaluator.sandbox.ExecClient(*, port: int, connect_timeout: float = 30.0)[source]#

Bases: object

HTTP client for the exec server running inside the container.

Communicates through the SSH tunnel (127.0.0.1:<local_port>). Uses only stdlib urllib.request — no extra dependencies.

download(
remote_path: str,
*,
max_retries: int = 3,
) bytes[source]#
exec(
cmd: str,
*,
timeout: int = 300,
) ExecResult[source]#
health() bool[source]#
upload(
remote_path: str,
data: bytes | Path,
*,
mode: str | None = None,
max_retries: int = 3,
) None[source]#
class nemo_evaluator.sandbox.ImageBuilder[source]#

Bases: object

Build Docker images via AWS CodeBuild and push to ECR.

Features:

  • Content-hash based ECR tags for automatic caching.

  • Build deduplication across concurrent tasks (only one build per tag).

  • Semaphore-based concurrency control.

classmethod ensure_image_built(
*,
cfg: EcsFargateConfig,
environment_name: str,
force_build: bool = False,
) str[source]#

Build and push if needed. Returns the full ECR image URL.

Safe to call from many threads — deduplication and a semaphore ensure only one CodeBuild job runs per content-hash tag.
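
The combination of deduplication and a semaphore can be sketched as follows. `DedupBuilder` is a hypothetical class written for this illustration; how ImageBuilder implements this internally is not shown in this reference. The first caller for a tag becomes the owner and runs the build inside a semaphore slot, while later callers for the same tag wait on the owner's future.

```python
import threading
from concurrent.futures import Future
from typing import Callable


class DedupBuilder:
    """Hypothetical helper: at most one build per tag, bounded total parallelism."""

    def __init__(self, build_fn: Callable[[str], str], parallelism: int) -> None:
        self._build_fn = build_fn
        self._slots = threading.Semaphore(parallelism)
        self._lock = threading.Lock()
        self._builds: dict[str, Future] = {}

    def ensure_built(self, tag: str) -> str:
        with self._lock:
            future = self._builds.get(tag)
            owner = future is None
            if owner:
                future = Future()
                self._builds[tag] = future
        if owner:
            try:
                with self._slots:  # caps the number of concurrent builds
                    future.set_result(self._build_fn(tag))
            except Exception as exc:
                future.set_exception(exc)
                with self._lock:
                    self._builds.pop(tag, None)  # allow a later retry
        # Non-owners block here until the owner finishes (or fails).
        return future.result()
```

Completed futures stay in the dictionary, so in this sketch a finished build also acts as an in-process cache for later calls with the same tag.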

static get_ecr_image_tag(
environment_dir: str | Path,
environment_name: str,
) str[source]#

Return a tag of the form <name>__<content_hash[:8]>, which is deterministic and cache-friendly.
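
A tag of this shape could be derived as in the sketch below. The hashing scheme (SHA-256 over sorted relative paths and file bytes) is an assumption made for illustration; the reference does not say which files or which hash function get_ecr_image_tag uses. `content_hash_tag` is a hypothetical name.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def content_hash_tag(environment_dir: str | Path, environment_name: str) -> str:
    """Return '<name>__<first 8 hex chars of a SHA-256 over the directory>'."""
    root = Path(environment_dir)
    digest = hashlib.sha256()
    # Sort so the hash does not depend on filesystem iteration order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{environment_name}__{digest.hexdigest()[:8]}"
```

Because the tag changes whenever any file in the directory changes, an existing tag in the registry can be taken to mean the image is already up to date.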

static image_exists_in_ecr(
ecr_repository: str,
tag: str,
region: str | None = None,
) bool[source]#
class nemo_evaluator.sandbox.SshSidecarConfig(
sshd_port: int = 2222,
ssh_ready_timeout_sec: float = 120.0,
public_key_secret_arn: str = '',
private_key_secret_arn: str = '',
image: str | None = None,
exec_server_port: int | None = None,
)[source]#

Bases: object

SSH sidecar container configuration.

The sidecar runs sshd in an Alpine container alongside the main container, providing SSH-based transport for tunnelling and execution.

Two modes depending on exec_server_port:

  • Exec-server mode (exec_server_port is set): One-way SSH tunnel. The sandbox uploads a zero-dependency HTTP exec server into the main container and forwards a local port to it. exec(), upload(), download() all work.

  • Agent-server mode (exec_server_port is None): Two-way SSH tunnel. A reverse tunnel (-R) makes the OutsideEndpoint reachable inside the task; a forward tunnel (-L) gives the orchestrator access to the agent server. The consumer is responsible for command execution via its own agent API.

exec_server_port: int | None = None#
classmethod from_dict(
raw: Mapping[str, Any],
) SshSidecarConfig[source]#
image: str | None = None#
private_key_secret_arn: str = ''#
public_key_secret_arn: str = ''#
ssh_ready_timeout_sec: float = 120.0#
sshd_port: int = 2222#
class nemo_evaluator.sandbox.SshTunnel(
*,
host: str,
port: int = 2222,
user: str = 'root',
key_file: str,
forward_port: int | None = None,
forwards: list[str] | None = None,
reverses: list[str] | None = None,
local_port_override: int | None = None,
)[source]#

Bases: object

Manages an ssh -N subprocess with -L and/or -R tunnels.

Two usage patterns:

Exec-server mode — forward a single remote port:

tunnel = SshTunnel(host=ip, port=2222, key_file=key,
                   forward_port=19542)
tunnel.open()
# tunnel.local_port → auto-allocated ephemeral port

Agent-server mode — explicit forward + reverse specs:

fwd = _free_port()
tunnel = SshTunnel(host=ip, port=2222, key_file=key,
                   forwards=[f"{fwd}:localhost:8000"],
                   reverses=["11434:model-host:11434"])
tunnel.open()
# tunnel.local_port → fwd
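
To make the forward and reverse spec strings concrete, here is a minimal sketch of how such specs map onto an `ssh -N` command line. `free_port` and `build_ssh_argv` are hypothetical helpers written for this example; the exact options SshTunnel passes to ssh are not documented here, so only the standard `-N`, `-i`, `-p`, `-L`, and `-R` flags are used.

```python
import socket
from typing import Sequence


def free_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def build_ssh_argv(
    host: str,
    port: int,
    user: str,
    key_file: str,
    forwards: Sequence[str] = (),
    reverses: Sequence[str] = (),
) -> list[str]:
    """Assemble an `ssh -N` command line with one -L or -R flag per spec."""
    argv = ["ssh", "-N", "-i", key_file, "-p", str(port)]
    for spec in forwards:
        argv += ["-L", spec]  # local_port:target_host:target_port
    for spec in reverses:
        argv += ["-R", spec]  # remote_port:target_host:target_port
    argv.append(f"{user}@{host}")
    return argv
```

Note that a port returned by `free_port` can in principle be taken by another process before ssh binds it, so callers of a helper like this should be prepared to retry.
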
check_health() bool[source]#

Return True if the SSH process is still alive.

close() None[source]#

Terminate the SSH tunnel subprocess.

property is_open: bool#
property local_port: int#
open(
*,
max_retries: int = 15,
initial_backoff: float = 5.0,
) None[source]#

Start the SSH tunnel with retries for transient connection errors.
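
A retry loop with these two parameters might look like the sketch below. It assumes exponential backoff (the delay doubles after each failure) and that max_retries counts retries after the first attempt; neither detail is confirmed by this reference. `open_with_retries` is a hypothetical function, and `sleep` is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable


def open_with_retries(
    attempt: Callable[[], None],
    *,
    max_retries: int = 15,
    initial_backoff: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `attempt` until it succeeds; return the number of failed tries."""
    backoff = initial_backoff
    for failures in range(max_retries + 1):
        try:
            attempt()
            return failures
        except ConnectionError:
            if failures == max_retries:
                raise  # retries exhausted; surface the last error
            sleep(backoff)
            backoff *= 2  # assumption: exponential growth
    raise AssertionError("unreachable")
```
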

wait_ready(
*,
health_url: str | None = None,
timeout: float = 120.0,
) None[source]#

Wait until the tunnel endpoint is reachable.

If health_url is given, polls GET <url> for HTTP 200. Otherwise just checks that the local port is open.
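
The two readiness checks described above can be approximated with the standard library. This is a sketch under stated assumptions, not the library's code: `port_is_open`, `http_ok`, and `wait_until_ready` are hypothetical names, and the polling `interval` parameter is introduced for the example.

```python
from __future__ import annotations

import http.client
import socket
import time
import urllib.error
import urllib.request


def port_is_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <url> answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_ready(
    port: int,
    *,
    health_url: str | None = None,
    timeout: float = 120.0,
    interval: float = 0.5,
) -> None:
    """Poll until the endpoint is reachable, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        ready = http_ok(health_url) if health_url else port_is_open(port)
        if ready:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint on port {port} not ready after {timeout}s")
        time.sleep(interval)
```

An open port only shows that something is listening on the local end of the tunnel, so the HTTP check is the stronger signal when the service behind the tunnel exposes a health route.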