nemo_evaluator.sandbox
Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session.
This module is designed to keep dependencies optional:
- The ECS Fargate implementation only imports the AWS SDKs (boto3/botocore) when actually used.
- Using the ECS sandbox also requires the AWS CLI (aws) and session-manager-plugin on the host.
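Optional dependencies of this kind are usually handled by deferring the import until the code path that needs it actually runs. The following is a minimal sketch of that pattern, not the module's own code; `require_optional`, `make_ecs_client`, and the hint text are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra_hint: str) -> ModuleType:
    """Import module_name on first use, with a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is required for this feature. {extra_hint}"
        ) from exc


def make_ecs_client(region: str):
    # Hypothetical helper. boto3 is imported here, not at module import time,
    # so users who never call this function do not need the AWS SDK installed.
    boto3 = require_optional("boto3", "Install boto3 to use the ECS sandbox.")
    return boto3.client("ecs", region_name=region)
```

With this layout, importing the package stays cheap and only callers of the ECS code path see an error when the SDK is absent.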
Usage (ECS Fargate)
Typical usage is:
- configure EcsFargateConfig
- spin_up() a sandbox context
- create an interactive NemoSandboxSession
Example:
from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox

cfg = EcsFargateConfig(
    region="us-west-2",
    cluster="my-ecs-cluster",
    task_definition="my-task-def:1",
    container_name="eval",
    subnets=["subnet-abc"],
    security_groups=["sg-xyz"],
    s3_bucket="my-staging-bucket",
)

with EcsFargateSandbox.spin_up(
    cfg=cfg,
    task_id="task-001",
    trial_name="trial-0001",
    run_id="run-2026-01-12",
) as sandbox:
    session = sandbox.create_session("main")
    session.send_keys(["echo hello", "Enter"], block=True)
    print(session.capture_pane())
Prerequisites / Notes
- The harness host must have the AWS CLI and session-manager-plugin installed.
- If you use S3-based fallbacks (large uploads / long commands), configure s3_bucket.
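The host prerequisites can be checked before any ECS task is launched. The sketch below assumes both tools are found through PATH under the executable names `aws` and `session-manager-plugin`; `missing_host_tools` is a hypothetical helper and is not part of nemo_evaluator.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executable names taken from the prerequisites listed above.
REQUIRED_HOST_TOOLS = ("aws", "session-manager-plugin")


def missing_host_tools(
    tools: Iterable[str] = REQUIRED_HOST_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

A harness could call `missing_host_tools()` at start-up and abort with a clear message if the returned list is non-empty. The `which` parameter exists only so the lookup can be replaced in tests.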
- class nemo_evaluator.sandbox.NemoEvaluatorSandbox
Bases: ABC
Abstract factory for evaluator sandboxes.
Implementations are responsible for provisioning an isolated environment and exposing a tmux-like session API for agents to interact with it.
- abstractmethod copy_to_sandbox(
      *,
      paths: list[Path] | Path,
      container_dir: str | None = None,
      container_filename: str | None = None,
  )
- abstractmethod create_session(
      session_name: str,
      is_active_stream: bool = False,
      as_configured_user: bool = True,
  )
- abstractmethod classmethod spin_up(
      *,
      task_id: str,
      trial_name: str,
      run_id: str,
      pre_upload_paths: Iterable[Path] | None = None,
      upload_dest_dir: str | None = None,
      **kwargs,
  )
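To illustrate the abstract-factory shape, here is a self-contained sketch with an in-memory backend. `SandboxBase`, `FakeSession`, and `InMemorySandbox` are stand-ins with simplified signatures, not the library classes. Treating spin_up as a context manager that tears the sandbox down on exit is an assumption based on the usage example in this page.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from contextlib import contextmanager
from typing import Iterator


class FakeSession:
    """Records keys instead of driving a real terminal."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.keys: list[str] = []

    def send_keys(self, keys: list[str]) -> None:
        self.keys.extend(keys)


class SandboxBase(ABC):
    """Stand-in for the abstract factory; not the library class."""

    @classmethod
    @abstractmethod
    def spin_up(cls, *, task_id: str, trial_name: str, run_id: str, **kwargs):
        """Provision an environment and yield a sandbox bound to it."""

    @abstractmethod
    def create_session(self, session_name: str) -> FakeSession:
        """Open a named interactive session inside the sandbox."""


class InMemorySandbox(SandboxBase):
    def __init__(self, label: str) -> None:
        self.label = label
        self.alive = True

    @classmethod
    @contextmanager
    def spin_up(
        cls, *, task_id: str, trial_name: str, run_id: str, **kwargs
    ) -> Iterator[InMemorySandbox]:
        sandbox = cls(f"{run_id}/{task_id}/{trial_name}")
        try:
            yield sandbox
        finally:
            sandbox.alive = False  # teardown runs even if the body raises

    def create_session(self, session_name: str) -> FakeSession:
        return FakeSession(session_name)
```

A harness written against the base class can then switch backends by swapping the concrete class passed to it, which is the point of keeping provisioning behind spin_up.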
- class nemo_evaluator.sandbox.NemoSandboxCommand(
      command: str,
      min_timeout_sec: float = 0.0,
      max_timeout_sec: float = 180.0,
      block: bool = False,
      append_enter: bool = True,
  )
Bases: object
Command model for driving an interactive terminal, independent of terminal-bench (TB).
Mirrors the fields that terminal-bench agents commonly use, but does not depend on TB.
- append_enter: bool = True
- block: bool = False
- max_timeout_sec: float = 180.0
- min_timeout_sec: float = 0.0
- command: str
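The fields above can be pictured with a small stand-in. This sketch copies the documented field names and defaults into a dataclass called `SandboxCommand`. Whether the real class is a dataclass is an assumption. `to_keys` is a hypothetical helper showing one plausible reading of append_enter, based on the `["echo hello", "Enter"]` key list in the usage example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxCommand:
    """Stand-in with the documented fields and defaults; not the library class."""

    command: str
    min_timeout_sec: float = 0.0
    max_timeout_sec: float = 180.0
    block: bool = False
    append_enter: bool = True


def to_keys(cmd: SandboxCommand) -> list[str]:
    """Hypothetical helper: expand a command into tmux-style key strings."""
    keys = [cmd.command]
    if cmd.append_enter:
        keys.append("Enter")
    return keys
```

Under this reading, `SandboxCommand("q", append_enter=False)` would send a single keystroke to an interactive program without submitting a line.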
- class nemo_evaluator.sandbox.NemoSandboxSession(*args, **kwargs)
Bases: Protocol
Minimal session API used by agents/harnesses (tmux-like).
- copy_to_sandbox(
      paths: list[Path] | Path,
      container_dir: str | None = None,
      container_filename: str | None = None,
  )
- send_command(
      command: NemoSandboxCommand,
  )
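Because the session type is a Protocol, any object with matching methods can stand in for it without inheriting from it, which is handy for tests. The sketch below defines its own `SessionLike` protocol rather than importing the real one. The method names send_keys and capture_pane come from the usage example earlier in this page, while their exact signatures here are assumptions.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SessionLike(Protocol):
    """Stand-in protocol; signatures are assumed, not taken from the library."""

    def send_keys(self, keys: list[str], block: bool = False) -> None: ...

    def capture_pane(self) -> str: ...


class RecordingSession:
    """Satisfies SessionLike structurally, without subclassing it."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def send_keys(self, keys: list[str], block: bool = False) -> None:
        self.sent.extend(keys)

    def capture_pane(self) -> str:
        # Returns what was typed; a real session would return terminal output.
        return "\n".join(self.sent)


def run_echo(session: SessionLike) -> str:
    """Harness code that depends only on the protocol."""
    session.send_keys(["echo hello", "Enter"], block=True)
    return session.capture_pane()
```

Note that `isinstance` checks against a runtime-checkable protocol only verify that the methods exist, not that their signatures match.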
- class nemo_evaluator.sandbox.EcsFargateConfig(
      region: str | None,
      cluster: str,
      task_definition: str,
      container_name: str,
      subnets: list[str],
      security_groups: list[str],
      assign_public_ip: bool = False,
      image_template: str | None = None,
      register_task_definition_per_task: bool = True,
      cpu: str = '8192',
      memory: str = '32768',
      execution_role_arn: str | None = None,
      task_role_arn: str | None = None,
      log_group: str | None = None,
      log_stream_prefix: str = 'nemo-evaluator',
      max_task_lifetime_sec: int = 10800,
      run_task_max_retries: int = 30,
      s3_bucket: str | None = None,
      s3_prefix: str = 'nemo-evaluator',
      ecs_exec_timeout_sec: int = 180,
  )
Bases: object
- assign_public_ip: bool = False
- cpu: str = '8192'
- ecs_exec_timeout_sec: int = 180
- execution_role_arn: str | None = None
- image_template: str | None = None
- log_group: str | None = None
- log_stream_prefix: str = 'nemo-evaluator'
- max_task_lifetime_sec: int = 10800
- memory: str = '32768'
- register_task_definition_per_task: bool = True
- run_task_max_retries: int = 30
- s3_bucket: str | None = None
- s3_prefix: str = 'nemo-evaluator'
- task_role_arn: str | None = None
- region: str | None
- cluster: str
- task_definition: str
- container_name: str
- subnets: list[str]
- security_groups: list[str]
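The string defaults for cpu and memory are easier to read once converted. The sketch below assumes both values are forwarded unchanged to the ECS task definition, where task-level CPU is expressed in CPU units (1024 per vCPU) and memory in MiB. `describe_task_size` and `lifetime_hours` are hypothetical helpers.

```python
def describe_task_size(cpu: str, memory: str) -> tuple[float, float]:
    """Convert ECS task-size strings to (vCPUs, GiB)."""
    vcpus = int(cpu) / 1024    # ECS counts 1024 CPU units per vCPU
    gib = int(memory) / 1024   # ECS task memory is given in MiB
    return vcpus, gib


def lifetime_hours(max_task_lifetime_sec: int) -> float:
    """Express the task lifetime cap in hours."""
    return max_task_lifetime_sec / 3600
```

Under that assumption the defaults correspond to 8 vCPUs, 32 GiB of memory, and a 3-hour lifetime cap.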
- class nemo_evaluator.sandbox.EcsFargateSandbox(
      *,
      cfg: EcsFargateConfig,
      task_arn: str,
      run_id: str,
      task_id: str,
      trial_name: str,
  )
Bases: NemoEvaluatorSandbox
Sandbox backed by ECS Fargate + ECS Exec.
No inbound connectivity is required. File transfer is done by uploading a tar archive to S3 and downloading it from inside the container using the Python standard library.
- copy_to_sandbox(
      *,
      paths: list[Path] | Path,
      container_dir: str | None = None,
      container_filename: str | None = None,
  )
Copy local files/dirs into the remote container via an S3-staged tarball.
Flow:
- Tar+gzip the provided paths in memory on the harness host
- Upload to S3 and generate a presigned URL
- Download inside the container and extract into container_dir
Security note: extraction uses tarfile’s safety filter on Python 3.12+, and a manual path traversal check on Python 3.10-3.11.
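The first and last steps of the flow above can be sketched with the standard library alone. This is an illustrative approximation, not the sandbox's implementation: the S3 upload and presigned-URL download are omitted, and `pack_paths` and `safe_extract` are hypothetical names. Unlike the version split described in the security note, the sketch detects extraction-filter support with `hasattr` rather than checking the interpreter version.

```python
import io
import tarfile
from pathlib import Path


def pack_paths(paths: list[Path]) -> bytes:
    """Tar+gzip the given files/dirs into an in-memory archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=path.name)
    return buf.getvalue()


def safe_extract(archive: bytes, dest: Path) -> None:
    """Extract an archive into dest, refusing entries that escape it."""
    dest = dest.resolve()
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Interpreters that ship tarfile extraction filters.
            tar.extractall(dest, filter="data")
            return
        # Fallback for older interpreters: reject member names that resolve
        # outside dest. This checks names only; link targets are not inspected.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
```

Building the archive in memory avoids temporary files on the harness host, at the cost of holding the whole payload in RAM, so very large uploads may call for a streamed variant.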
- create_session(
      session_name: str,
      is_active_stream: bool = False,
      as_configured_user: bool = True,
  )
Create (and start) a tmux-backed sandbox session.
- classmethod spin_up(
      *,
      cfg: EcsFargateConfig,
      task_id: str,
      trial_name: str,
      run_id: str,
      pre_upload_paths: Iterable[Path] | None = None,
      upload_dest_dir: str | None = None,
  )