nemo_evaluator.sandbox

Sandbox implementations used by evaluation harnesses that need a tmux-like interactive session.

This module is designed to keep dependencies optional:

  • The ECS Fargate implementation only imports AWS SDKs (boto3/botocore) when actually used.

  • Using the ECS sandbox also requires the AWS CLI (aws) and session-manager-plugin on the host.

Usage (ECS Fargate)

Typical usage:

from nemo_evaluator.sandbox import EcsFargateConfig, EcsFargateSandbox

cfg = EcsFargateConfig(
    region="us-west-2",
    cluster="my-ecs-cluster",
    task_definition="my-task-def:1",
    container_name="eval",
    subnets=["subnet-abc"],
    security_groups=["sg-xyz"],
    s3_bucket="my-staging-bucket",
)

with EcsFargateSandbox.spin_up(
    cfg=cfg,
    task_id="task-001",
    trial_name="trial-0001",
    run_id="run-2026-01-12",
) as sandbox:
    session = sandbox.create_session("main")
    session.send_keys(["echo hello", "Enter"], block=True)
    print(session.capture_pane())

Prerequisites / Notes

  • The harness host must have AWS CLI and session-manager-plugin installed.

  • If you use S3-based fallbacks (large uploads / long commands), configure s3_bucket.
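The host prerequisites above can be checked before any AWS resources are provisioned. The following is a minimal sketch, not part of nemo_evaluator: the helper name `missing_host_tools` is hypothetical, and it only assumes that both tools are expected on PATH under the executable names `aws` and `session-manager-plugin`.

```python
import shutil

# Executables the ECS sandbox expects on the harness host (per the notes above).
REQUIRED_TOOLS = ("aws", "session-manager-plugin")


def missing_host_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of required executables that are not found on PATH.

    `which` is injectable so the check can be exercised without the real tools.
    """
    return [name for name in tools if which(name) is None]
```

Calling `missing_host_tools()` with no arguments inspects the real PATH; an empty list means both tools were found.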

class nemo_evaluator.sandbox.NemoEvaluatorSandbox

Bases: ABC

Abstract factory for evaluator sandboxes.

Implementations are responsible for provisioning an isolated environment and exposing a tmux-like session API for agents to interact with it.

abstractmethod copy_to_sandbox(
    *,
    paths: list[Path] | Path,
    container_dir: str | None = None,
    container_filename: str | None = None,
) -> None

abstractmethod create_session(
    session_name: str,
    is_active_stream: bool = False,
    as_configured_user: bool = True,
) -> NemoSandboxSession

abstractmethod classmethod spin_up(
    *,
    task_id: str,
    trial_name: str,
    run_id: str,
    pre_upload_paths: Iterable[Path] | None = None,
    upload_dest_dir: str | None = None,
    **kwargs,
) -> ContextManager[NemoEvaluatorSandbox]

abstractmethod stop() -> None

class nemo_evaluator.sandbox.NemoSandboxCommand(
    command: str,
    min_timeout_sec: float = 0.0,
    max_timeout_sec: float = 180.0,
    block: bool = False,
    append_enter: bool = True,
)

Bases: object

Command model for driving an interactive terminal, independent of terminal-bench (TB).

Mirrors the fields terminal-bench agents commonly use (but does not depend on TB).

append_enter: bool = True
block: bool = False
max_timeout_sec: float = 180.0
min_timeout_sec: float = 0.0
command: str
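To illustrate how these fields relate to the `send_keys` parameters documented for sessions, here is a minimal sketch. `CommandSketch` is a local stand-in that mirrors the documented fields so the example runs without the package, and the mapping in `to_send_keys_kwargs` (in particular, that `append_enter=True` adds an `"Enter"` key) is an assumption based on the usage example, not the library's confirmed behavior.

```python
from dataclasses import dataclass


@dataclass
class CommandSketch:
    """Local stand-in mirroring the documented NemoSandboxCommand fields."""

    command: str
    min_timeout_sec: float = 0.0
    max_timeout_sec: float = 180.0
    block: bool = False
    append_enter: bool = True


def to_send_keys_kwargs(cmd: CommandSketch) -> dict:
    """Assumed translation of a command into send_keys() keyword arguments."""
    # Assumption: append_enter means an "Enter" key follows the command text.
    keys = [cmd.command, "Enter"] if cmd.append_enter else [cmd.command]
    return {
        "keys": keys,
        "block": cmd.block,
        "min_timeout_sec": cmd.min_timeout_sec,
        "max_timeout_sec": cmd.max_timeout_sec,
    }
```

Under this assumption, `CommandSketch("echo hello", block=True)` corresponds to the `send_keys(["echo hello", "Enter"], block=True)` call in the usage example.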
class nemo_evaluator.sandbox.NemoSandboxSession(*args, **kwargs)

Bases: Protocol

Minimal session API used by agents/harnesses (tmux-like).

capture_pane(capture_entire: bool = False) -> str

copy_to_sandbox(
    paths: list[Path] | Path,
    container_dir: str | None = None,
    container_filename: str | None = None,
) -> None

get_asciinema_timestamp() -> float

get_incremental_output() -> str

is_session_alive() -> bool

send_command(
    command: NemoSandboxCommand,
) -> None

send_keys(
    keys: str | list[str],
    block: bool = False,
    min_timeout_sec: float = 0.0,
    max_timeout_sec: float = 180.0,
) -> None

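Because the session API is a Protocol, harness code can be exercised against a local test double instead of a real container. The sketch below is hypothetical and not part of nemo_evaluator: `FakeSession` records keystrokes in memory, and the semantics it gives to `"Enter"`, `capture_entire`, and the asciinema timestamp are assumptions made for illustration.

```python
import time


class FakeSession:
    """In-memory stand-in exposing the documented session method names."""

    def __init__(self):
        self._lines = []    # text "typed" into the session so far
        self._cursor = 0    # first line not yet returned incrementally
        self._started = time.monotonic()

    def send_keys(self, keys, block=False, min_timeout_sec=0.0, max_timeout_sec=180.0):
        # Accept a single string or a list, as the documented signature allows.
        if isinstance(keys, str):
            keys = [keys]
        # Assumption: "Enter" is a key press, not text to record.
        self._lines.extend(k for k in keys if k != "Enter")

    def send_command(self, command):
        # Works with any object exposing the documented command fields.
        keys = [command.command, "Enter"] if command.append_enter else [command.command]
        self.send_keys(keys, block=command.block,
                       min_timeout_sec=command.min_timeout_sec,
                       max_timeout_sec=command.max_timeout_sec)

    def capture_pane(self, capture_entire=False):
        # The fake has no scrollback window, so the flag changes nothing here.
        return "\n".join(self._lines)

    def get_incremental_output(self):
        # Return only what was added since the previous call.
        new = self._lines[self._cursor:]
        self._cursor = len(self._lines)
        return "\n".join(new)

    def is_session_alive(self):
        return True

    def get_asciinema_timestamp(self):
        # Assumed meaning: seconds elapsed since the session started.
        return time.monotonic() - self._started

    def copy_to_sandbox(self, paths, container_dir=None, container_filename=None):
        # No-op: there is no container behind the fake.
        return None
```

A fake like this only checks that calling code uses the method names and arguments consistently; it says nothing about tmux behavior or timing.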
exception nemo_evaluator.sandbox.AwsCliMissingError

Bases: RuntimeError

exception nemo_evaluator.sandbox.EcsExecError

Bases: RuntimeError

class nemo_evaluator.sandbox.EcsFargateConfig(
    region: str | None,
    cluster: str,
    task_definition: str,
    container_name: str,
    subnets: list[str],
    security_groups: list[str],
    assign_public_ip: bool = False,
    image_template: str | None = None,
    register_task_definition_per_task: bool = True,
    cpu: str = '8192',
    memory: str = '32768',
    execution_role_arn: str | None = None,
    task_role_arn: str | None = None,
    log_group: str | None = None,
    log_stream_prefix: str = 'nemo-evaluator',
    max_task_lifetime_sec: int = 10800,
    run_task_max_retries: int = 30,
    s3_bucket: str | None = None,
    s3_prefix: str = 'nemo-evaluator',
    ecs_exec_timeout_sec: int = 180,
)

Bases: object

assign_public_ip: bool = False
cpu: str = '8192'
ecs_exec_timeout_sec: int = 180
execution_role_arn: str | None = None
image_template: str | None = None
log_group: str | None = None
log_stream_prefix: str = 'nemo-evaluator'
max_task_lifetime_sec: int = 10800
memory: str = '32768'
register_task_definition_per_task: bool = True
run_task_max_retries: int = 30
s3_bucket: str | None = None
s3_prefix: str = 'nemo-evaluator'
task_role_arn: str | None = None
region: str | None
cluster: str
task_definition: str
container_name: str
subnets: list[str]
security_groups: list[str]
class nemo_evaluator.sandbox.EcsFargateSandbox(
    *,
    cfg: EcsFargateConfig,
    task_arn: str,
    run_id: str,
    task_id: str,
    trial_name: str,
)

Bases: NemoEvaluatorSandbox

Sandbox backed by ECS Fargate + ECS Exec.

No inbound connectivity is required. Files are transferred by uploading a tarball to S3 and downloading it from inside the container using only the Python standard library.

copy_to_sandbox(
    *,
    paths: list[Path] | Path,
    container_dir: str | None = None,
    container_filename: str | None = None,
) -> None

Copy local files/dirs into the remote container via an S3-staged tarball.

Flow:

  • Tar+gzip the provided paths in memory on the harness host.

  • Upload the archive to S3 and generate a presigned URL.

  • Download the archive inside the container and extract it into container_dir.

Security note: extraction uses tarfile’s safety filter on Python 3.12+, and a manual path traversal check on Python 3.10-3.11.
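The packing and guarded extraction steps can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not the library's implementation: the function names `pack_paths` and `safe_extract` are hypothetical, the S3 upload and presigned-URL download are omitted, and the manual check for older Python versions only inspects member names (it does not examine symlink or hard-link targets).

```python
import io
import sys
import tarfile
from pathlib import Path


def pack_paths(paths):
    """Tar+gzip the given files/dirs into an in-memory bytes object."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for p in paths:
            p = Path(p)
            # Store each path under its base name so extraction is relative.
            tar.add(p, arcname=p.name)
    return buf.getvalue()


def safe_extract(data, dest_dir):
    """Extract a tar.gz payload into dest_dir, refusing paths that escape it."""
    dest = Path(dest_dir).resolve()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        if sys.version_info >= (3, 12):
            # Built-in extraction filter rejects unsafe members.
            tar.extractall(dest, filter="data")
        else:
            # Simplified manual path traversal check on member names.
            for member in tar.getmembers():
                target = (dest / member.name).resolve()
                if target != dest and dest not in target.parents:
                    raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extractall(dest)
```

In the documented flow, the bytes returned by the packing step would be uploaded to S3 and fetched inside the container before extraction; here the two functions are simply called back to back on one machine.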

create_session(
    session_name: str,
    is_active_stream: bool = False,
    as_configured_user: bool = True,
) -> EcsFargateTmuxSession

Create (and start) a tmux-backed sandbox session.

classmethod spin_up(
    *,
    cfg: EcsFargateConfig,
    task_id: str,
    trial_name: str,
    run_id: str,
    pre_upload_paths: Iterable[Path] | None = None,
    upload_dest_dir: str | None = None,
) -> ContextManager[EcsFargateSandbox]

stop() -> None

Best-effort teardown for local session objects (the remote task is stopped by the context manager).
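The division of responsibility between stop() and the context manager can be pictured with a small stand-alone sketch. Everything here is hypothetical: `SandboxSketch` and `spin_up_sketch` are stand-ins, no AWS calls are made, and the assumption that the context manager calls stop() before stopping the remote task is for illustration only.

```python
from contextlib import contextmanager


class SandboxSketch:
    """Stand-in sandbox that records teardown steps in a shared log."""

    def __init__(self, log):
        self.log = log

    def stop(self):
        # Local cleanup only; the remote task is not touched here.
        self.log.append("stop: local sessions")


@contextmanager
def spin_up_sketch(log):
    """Stand-in for a spin_up-style context manager."""
    log.append("start: remote task")
    sandbox = SandboxSketch(log)
    try:
        yield sandbox
    finally:
        # Runs on normal exit and on exceptions raised inside the with block.
        sandbox.stop()  # assumption: local teardown happens first
        log.append("stop: remote task")
```

The point of the pattern is that remote cleanup lives in the `finally` block, so it still runs when the body of the `with` statement raises.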