nemo_evaluator.sandbox#
Sandbox implementations used by evaluation harnesses that need isolated container environments for command execution, file transfer, and agent hosting.
This module is designed to keep dependencies optional:
- The ECS Fargate implementation only imports AWS SDKs (boto3/botocore) when actually used.
- Transport is SSH-based; no AWS CLI or session-manager-plugin is required on the host.
Sandbox Protocol#
All sandbox backends implement the Sandbox protocol, so harnesses can be written backend-agnostically:
- start() / stop(): lifecycle
- exec(command): run a shell command
- upload(local_path, remote_path) / download(remote_path, local_path): file transfer
- is_running: health check
- Context manager (with sandbox: ...) for automatic cleanup
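To illustrate what "backend-agnostic" means here, the following is a minimal sketch that does not import nemo_evaluator. `SandboxLike`, `EchoSandbox`, and `run_all` are hypothetical names invented for this example, and the local `ExecResult` is only a stand-in that mirrors the documented fields. The toy backend echoes commands back instead of running them.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExecResult:
    # Stand-in mirroring the documented fields; not the library class.
    stdout: str
    stderr: str
    return_code: int


class SandboxLike(Protocol):
    # Hypothetical subset of the documented protocol methods.
    def start(self) -> None: ...
    def stop(self) -> None: ...
    def exec(self, command: str) -> ExecResult: ...


class EchoSandbox:
    """Toy in-memory backend that 'runs' a command by echoing it back."""

    def __init__(self) -> None:
        self.is_running = False

    def start(self) -> None:
        self.is_running = True

    def stop(self) -> None:
        self.is_running = False

    def exec(self, command: str) -> ExecResult:
        if not self.is_running:
            raise RuntimeError("sandbox is not running")
        return ExecResult(stdout=command + "\n", stderr="", return_code=0)

    def __enter__(self) -> "EchoSandbox":
        return self

    def __exit__(self, *exc_info) -> None:
        # Cleanup happens even if the body of the with-block raised.
        self.stop()


def run_all(sandbox: SandboxLike, commands: list[str]) -> list[ExecResult]:
    """Harness code that depends only on the protocol, not on a backend."""
    return [sandbox.exec(cmd) for cmd in commands]


with EchoSandbox() as sandbox:
    sandbox.start()
    results = run_all(sandbox, ["echo hello", "ls /"])
```

Because `run_all` is typed against the protocol only, any object with matching methods can be passed in without touching the harness.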
Usage (ECS Fargate — exec-server mode)#
For harnesses that drive execution from the orchestrator (e.g. terminal-bench):
from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox, SshSidecarConfig

cfg = EcsFargateConfig(
    region="us-west-2",
    cluster="my-ecs-cluster",
    subnets=["subnet-abc"],
    security_groups=["sg-xyz"],
    image_template="123456789.dkr.ecr.us-west-2.amazonaws.com/my-repo:{task_id}",
    s3_bucket="my-staging-bucket",
    ssh_sidecar=SshSidecarConfig(
        public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
        private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
        exec_server_port=19542,
    ),
)

with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
    sandbox.start()
    result = sandbox.exec("echo hello")
    print(result.stdout)
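The image_template above contains a {task_id} placeholder. How the library expands it is not documented on this page; the sketch below assumes plain str.format-style substitution with the sandbox's task_id. `resolve_image` is a hypothetical helper.

```python
image_template = "123456789.dkr.ecr.us-west-2.amazonaws.com/my-repo:{task_id}"


def resolve_image(template: str, task_id: str) -> str:
    # Assumption: the placeholder is filled via str.format-style substitution.
    return template.format(task_id=task_id)


image = resolve_image(image_template, "task-001")
print(image)
```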
Usage (ECS Fargate — agent-server mode)#
For harnesses that host an agent inside the container (e.g. openhands),
omit exec_server_port and pass OutsideEndpoint to start():
from nemo_evaluator.sandbox import (
    EcsFargateConfig,
    EcsFargateSandbox,
    OutsideEndpoint,
    SshSidecarConfig,
)

cfg = EcsFargateConfig(
    ...
    ssh_sidecar=SshSidecarConfig(
        public_key_secret_arn="arn:aws:secretsmanager:...:my-pubkey",
        private_key_secret_arn="arn:aws:secretsmanager:...:my-privkey",
    ),
)

model_ep = OutsideEndpoint(url="http://localhost:8080", env_var="MODEL_BASE_URL")

with EcsFargateSandbox(cfg, task_id="task-001", run_id="run-001") as sandbox:
    sandbox.start(outside_endpoints=[model_ep])
    # The agent inside the container can now reach the model via the reverse tunnel.
    # The orchestrator can reach the agent's API via sandbox.local_port.
Prerequisites / Notes#
- SSH keys must be pre-provisioned in AWS Secrets Manager.
- If you use S3-based file staging (large uploads / downloads), configure s3_bucket.
- Docker image building via AWS CodeBuild is available through ImageBuilder.
- class nemo_evaluator.sandbox.ExecResult(stdout: str, stderr: str, return_code: int)[source]#
Bases: object
Result of a command executed inside a sandbox.
- stdout: str#
- stderr: str#
- return_code: int#
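Since exec() reports failure through return_code rather than an exception (as far as this page shows), harness code will often want a small guard. The sketch below uses a local stand-in dataclass with the same three fields; `CommandFailed` and `check` are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class ExecResult:
    # Stand-in with the documented fields; not the library class.
    stdout: str
    stderr: str
    return_code: int


class CommandFailed(RuntimeError):
    """Hypothetical error raised when a sandbox command exits non-zero."""


def check(result: ExecResult, command: str) -> ExecResult:
    # Pass successful results through unchanged and raise on failure.
    if result.return_code != 0:
        raise CommandFailed(
            f"{command!r} exited with {result.return_code}: {result.stderr.strip()}"
        )
    return result


ok = check(ExecResult(stdout="hello\n", stderr="", return_code=0), "echo hello")

try:
    check(ExecResult(stdout="", stderr="no such file\n", return_code=2), "ls /nope")
    failure_message = None
except CommandFailed as exc:
    failure_message = str(exc)
```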
- class nemo_evaluator.sandbox.OutsideEndpoint(url: str, env_var: str)[source]#
Bases: object
An external service endpoint that sandbox processes need to reach.
- url: str#
Orchestrator-side URL of the service (e.g. "http://localhost:3825").
- env_var: str#
Environment variable to inject into sandbox processes with the resolved URL.
- class nemo_evaluator.sandbox.Sandbox(*args, **kwargs)[source]#
Bases: Protocol
Minimal contract every sandbox backend must satisfy.
- exec(
- command: str,
- timeout_sec: float = 180,
- property is_running: bool#
- resolve_outside_endpoint(url: str) str[source]#
Return the URL that processes inside this sandbox should use to reach the outside service at url (orchestrator-side).
Network-isolated backends (ECS Fargate) remap url to the tunnelled address. Shared-network backends (Apptainer) return url unchanged.
Must be called after start().
- start(
- *,
- force_build: bool = False,
- outside_endpoints: list[OutsideEndpoint] | None = None,
- class nemo_evaluator.sandbox.EcsFargateConfig(
- region: str | None = None,
- cluster: str = '',
- subnets: list[str] = <factory>,
- security_groups: list[str] = <factory>,
- assign_public_ip: bool = False,
- task_definition: str | None = None,
- task_definition_family_prefix: str = 'ecs-sandbox',
- image_template: str | None = None,
- container_name: str = 'main',
- container_port: int | None = None,
- cpu: str = '4096',
- memory: str = '8192',
- ephemeral_storage_gib: int | None = None,
- platform_version: str | None = None,
- execution_role_arn: str | None = None,
- task_role_arn: str | None = None,
- extra_env: dict[str, str] | None = None,
- log_group: str | None = None,
- log_stream_prefix: str | None = None,
- max_task_lifetime_sec: int = 14400,
- startup_timeout_sec: float = 300.0,
- poll_interval_sec: float = 2.0,
- run_task_max_retries: int = 30,
- ssh_sidecar: ~nemo_evaluator.sandbox.ecs_fargate.SshSidecarConfig | None = None,
- s3_bucket: str | None = None,
- s3_prefix: str | None = None,
- ecr_repository: str | None = None,
- environment_dir: str | None = None,
- codebuild_project: str | None = None,
- codebuild_service_role: str | None = None,
- codebuild_compute_type: str = 'BUILD_GENERAL1_MEDIUM',
- codebuild_build_timeout: int = 30,
- dockerhub_secret_arn: str | None = None,
- build_parallelism: int = 50,
Bases: object
Configuration for the ECS Fargate sandbox.
- assign_public_ip: bool = False#
- build_parallelism: int = 50#
- cluster: str = ''#
- codebuild_build_timeout: int = 30#
- codebuild_compute_type: str = 'BUILD_GENERAL1_MEDIUM'#
- codebuild_project: str | None = None#
- codebuild_service_role: str | None = None#
- container_name: str = 'main'#
- container_port: int | None = None#
- cpu: str = '4096'#
- dockerhub_secret_arn: str | None = None#
- ecr_repository: str | None = None#
- environment_dir: str | None = None#
- ephemeral_storage_gib: int | None = None#
- execution_role_arn: str | None = None#
- extra_env: dict[str, str] | None = None#
- classmethod from_dict(
- raw: Mapping[str, Any],
- image_template: str | None = None#
- log_group: str | None = None#
- log_stream_prefix: str | None = None#
- max_task_lifetime_sec: int = 14400#
- memory: str = '8192'#
- platform_version: str | None = None#
- poll_interval_sec: float = 2.0#
- region: str | None = None#
- run_task_max_retries: int = 30#
- s3_bucket: str | None = None#
- s3_prefix: str | None = None#
- ssh_sidecar: SshSidecarConfig | None = None#
- startup_timeout_sec: float = 300.0#
- task_definition: str | None = None#
- task_definition_family_prefix: str = 'ecs-sandbox'#
- task_role_arn: str | None = None#
- subnets: list[str]#
- security_groups: list[str]#
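The field names startup_timeout_sec and poll_interval_sec suggest a poll-until-ready loop during task startup. The library's actual loop is not shown on this page, so the following is only a sketch of that general pattern. `wait_until` and `task_is_running` are hypothetical, and the demo uses much shorter intervals than the documented defaults.

```python
import time


def wait_until(predicate, *, timeout_sec: float = 300.0, poll_interval_sec: float = 2.0) -> bool:
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_sec, remaining))


attempts = {"n": 0}


def task_is_running() -> bool:
    # Toy readiness probe: reports ready on the third poll.
    attempts["n"] += 1
    return attempts["n"] >= 3


ready = wait_until(task_is_running, timeout_sec=1.0, poll_interval_sec=0.01)
```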
- class nemo_evaluator.sandbox.EcsFargateSandbox(
- cfg: EcsFargateConfig,
- *,
- task_id: str,
- run_id: str,
Bases: object
ECS Fargate sandbox implementing the Sandbox protocol.
Supports two modes (determined by ssh_sidecar.exec_server_port):
- Exec-server mode (exec_server_port is set): One-way SSH tunnel + embedded HTTP exec server. exec(), upload(), download() work through the exec server.
- Agent-server mode (exec_server_port is None): Two-way SSH tunnel. Consumer accesses the hosted agent via ssh_tunnel/local_port. exec()/upload()/download() raise RuntimeError.
- describe_task() dict[str, Any] | None[source]#
Return a summary dict of the ECS task’s current state, or None.
- exec(
- command: str,
- timeout_sec: float = 180,
- property exec_client: ExecClient | None#
The exec client (only available in exec-server mode after start).
- property is_running: bool#
- property local_port: int | None#
Local port of the SSH forward tunnel (exec server or agent server).
- property model_tunnel_port: int | None#
Port the container uses to reach the model (agent-server mode only).
- resolve_outside_endpoint(url: str) str[source]#
Return the URL that processes inside this sandbox should use to reach the outside service at url (orchestrator-side).
Remaps url to the reverse-tunnel address (127.0.0.1:<tunnel-port>). Must be called after start().
- start(
- *,
- force_build: bool = False,
- outside_endpoints: list[OutsideEndpoint] | None = None,
- property task_arn: str | None#
- property task_ip: str | None#
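The remapping described for resolve_outside_endpoint() can be pictured as a host-and-port rewrite that keeps the rest of the URL. The sketch below is one plausible way to do that with the standard library. `remap_to_tunnel` is a hypothetical helper, the tunnel port 11434 is an arbitrary example value, and the real method may treat paths or schemes differently.

```python
from urllib.parse import urlsplit, urlunsplit


def remap_to_tunnel(url: str, tunnel_port: int) -> str:
    # Replace only host:port; scheme, path, query and fragment are kept.
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"127.0.0.1:{tunnel_port}"))


inside_url = remap_to_tunnel("http://localhost:8080/v1", 11434)

# The resolved URL would then be exposed to sandbox processes under the
# endpoint's env_var (name taken from the usage example on this page).
env = {"MODEL_BASE_URL": inside_url}
```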
- class nemo_evaluator.sandbox.ExecClient(*, port: int, connect_timeout: float = 30.0)[source]#
Bases: object
HTTP client for the exec server running inside the container.
Communicates through the SSH tunnel (127.0.0.1:<local_port>). Uses only stdlib urllib.request, so no extra dependencies are needed.
- exec(
- cmd: str,
- *,
- timeout: int = 300,
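The wire format between ExecClient and the in-container exec server is not documented on this page. To show the general shape of a stdlib-only HTTP exec round trip, the sketch below invents a tiny JSON protocol (POST /exec with {"cmd", "timeout"}) and runs both ends on the loopback interface. `ExecHandler`, `exec_over_http`, the /exec path, and the JSON keys are all assumptions for this example.

```python
import json
import subprocess
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ExecHandler(BaseHTTPRequestHandler):
    """Hypothetical exec server: runs the posted command and returns JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        proc = subprocess.run(
            payload["cmd"],
            shell=True,
            capture_output=True,
            text=True,
            timeout=payload.get("timeout", 300),
        )
        body = json.dumps(
            {"stdout": proc.stdout, "stderr": proc.stderr, "return_code": proc.returncode}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


def exec_over_http(port: int, cmd: str, timeout: int = 300) -> dict:
    """Client side: POST the command to 127.0.0.1:<port> using urllib only."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/exec",
        data=json.dumps({"cmd": cmd, "timeout": timeout}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured HTTP proxy; the target is always loopback.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout + 5) as response:
        return json.loads(response.read())


server = HTTPServer(("127.0.0.1", 0), ExecHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    result = exec_over_http(server.server_port, "echo hello")
finally:
    server.shutdown()
    server.server_close()
```

In the real setup the client would connect to the local end of the SSH tunnel rather than directly to the server, but the request and response flow has the same shape.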
- class nemo_evaluator.sandbox.ImageBuilder[source]#
Bases: object
Build Docker images via AWS CodeBuild and push to ECR.
Features:
- Content-hash based ECR tags for automatic caching.
- Build deduplication across concurrent tasks (only one build per tag).
- Semaphore-based concurrency control.
- classmethod ensure_image_built(
- *,
- cfg: EcsFargateConfig,
- environment_name: str,
- force_build: bool = False,
Build and push if needed. Returns the full ECR image URL.
Safe to call from many threads — deduplication and a semaphore ensure only one CodeBuild job runs per content-hash tag.
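The three features listed for ImageBuilder (content-hash tags, per-tag deduplication, semaphore-limited concurrency) can be sketched generically as follows. This is not the library's implementation. `content_hash_tag`, `DedupBuilder`, and the `build_fn` callback are hypothetical, and the hash covers relative file paths plus contents as one reasonable choice.

```python
import hashlib
import tempfile
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path


def content_hash_tag(environment_dir: str, length: int = 16) -> str:
    """Derive a deterministic tag from relative file paths and contents."""
    digest = hashlib.sha256()
    root = Path(environment_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:length]


class DedupBuilder:
    """Run build_fn at most once per tag, with bounded parallelism."""

    def __init__(self, build_fn, parallelism: int = 50):
        self._build_fn = build_fn
        self._semaphore = threading.Semaphore(parallelism)
        self._lock = threading.Lock()
        self._builds = {}  # tag -> Future shared by all callers

    def ensure_built(self, tag: str) -> str:
        with self._lock:
            future = self._builds.get(tag)
            owner = future is None
            if owner:
                future = Future()
                self._builds[tag] = future
        if owner:
            # Only the first caller for a tag performs the build.
            try:
                with self._semaphore:
                    future.set_result(self._build_fn(tag))
            except BaseException as exc:
                future.set_exception(exc)
        # Every caller, owner or not, waits on the same result.
        return future.result()


with tempfile.TemporaryDirectory() as env_dir:
    Path(env_dir, "Dockerfile").write_text("FROM alpine\n")
    tag = content_hash_tag(env_dir)
    same_tag = content_hash_tag(env_dir)

calls = []
builder = DedupBuilder(lambda t: calls.append(t) or f"my-repo:{t}", parallelism=2)
with ThreadPoolExecutor(max_workers=8) as pool:
    images = list(pool.map(builder.ensure_built, [tag] * 8))
```

In this sketch a failed build stays cached, so later callers for the same tag see the same exception. A production version would probably evict failures so they can be retried.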
- class nemo_evaluator.sandbox.SshSidecarConfig(
- sshd_port: int = 2222,
- ssh_ready_timeout_sec: float = 120.0,
- public_key_secret_arn: str = '',
- private_key_secret_arn: str = '',
- image: str | None = None,
- exec_server_port: int | None = None,
Bases: object
SSH sidecar container configuration.
The sidecar runs sshd in an Alpine container alongside the main container, providing SSH-based transport for tunnelling and execution.
Two modes depending on exec_server_port:
- Exec-server mode (exec_server_port is set): One-way SSH tunnel. The sandbox uploads a zero-dependency HTTP exec server into the main container and forwards a local port to it. exec(), upload(), download() all work.
- Agent-server mode (exec_server_port is None): Two-way SSH tunnel. A reverse tunnel (-R) makes the OutsideEndpoint reachable inside the task; a forward tunnel (-L) gives the orchestrator access to the agent server. The consumer is responsible for command execution via its own agent API.
- exec_server_port: int | None = None#
- classmethod from_dict(
- raw: Mapping[str, Any],
- image: str | None = None#
- private_key_secret_arn: str = ''#
- public_key_secret_arn: str = ''#
- ssh_ready_timeout_sec: float = 120.0#
- sshd_port: int = 2222#
- class nemo_evaluator.sandbox.SshTunnel(
- *,
- host: str,
- port: int = 2222,
- user: str = 'root',
- key_file: str,
- forward_port: int | None = None,
- forwards: list[str] | None = None,
- reverses: list[str] | None = None,
- local_port_override: int | None = None,
Bases: object
Manages an ssh -N subprocess with -L and/or -R tunnels.
Two usage patterns:
Exec-server mode (forward a single remote port):
tunnel = SshTunnel(host=ip, port=2222, key_file=key, forward_port=19542)
tunnel.open()
# tunnel.local_port → auto-allocated ephemeral port
Agent-server mode (explicit forward + reverse specs):
fwd = _free_port()
tunnel = SshTunnel(host=ip, port=2222, key_file=key,
                   forwards=[f"{fwd}:localhost:8000"],
                   reverses=[f"11434:model-host:11434"])
tunnel.open()
# tunnel.local_port → fwd
- property is_open: bool#
- property local_port: int#
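For readers unfamiliar with the underlying mechanics, the sketch below shows how an `ssh -N` command line with `-L` and `-R` specs could be assembled, and how a free local port can be picked by binding to port 0. `free_port` and `build_ssh_command` are hypothetical helpers. The exact options SshTunnel passes to ssh are not documented on this page, and the host and key path are placeholder values.

```python
import socket


def free_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def build_ssh_command(*, host, key_file, port=2222, user="root", forwards=(), reverses=()):
    """Assemble an argv list for a tunnel-only ssh process (no remote command)."""
    cmd = ["ssh", "-N", "-i", key_file, "-p", str(port)]
    for spec in forwards:
        cmd += ["-L", spec]  # local_port:remote_host:remote_port
    for spec in reverses:
        cmd += ["-R", spec]  # remote_port:local_host:local_port
    cmd.append(f"{user}@{host}")
    return cmd


fwd = free_port()
cmd = build_ssh_command(
    host="10.0.0.5",              # placeholder task IP
    key_file="/tmp/id_ed25519",   # placeholder key path
    forwards=[f"{fwd}:localhost:8000"],
    reverses=["11434:model-host:11434"],
)
```

The resulting list could be handed to subprocess.Popen. Another process may claim the port between `free_port()` returning and ssh binding it, so a robust caller would retry on failure.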