# nemo_automodel.components.speculative.eagle.remote.server

Remote EAGLE-3 target server.

Runs the frozen target model and, for each training request, produces the
draft-vocab supervision (aux hidden states, `target_probs`, `position_mask`)
and ships it back to the training client. The supervision computation reuses the
co-located building blocks verbatim -- `HFEagle3TargetModel.generate_batch`
for the forward + aux capture and `_compute_target_distribution` for the
draft-vocab projection -- so a remote run is numerically identical to a
co-located one.

The HTTP request handling is split from the `http.server` plumbing
(`TargetModelServer` holds the pure logic) so it can be unit-tested on
CPU with the NCCL data plane disabled, using the wire-format path instead.
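
The split between pure logic and `http.server` plumbing can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: `ToyServerLogic`, `make_handler`, the route paths, and the JSON payloads are assumptions, not the module's actual implementation. The point is that bytes-in/bytes-out methods can be exercised without opening a socket.

```python
import json
from http.server import BaseHTTPRequestHandler


class ToyServerLogic:
    """Pure request logic: bytes in, bytes out, no sockets involved."""

    def __init__(self) -> None:
        self.closed = False

    def handle_model_info(self, _raw: bytes) -> bytes:
        # Hypothetical payload; the real server's fields are not shown here.
        return json.dumps({"hidden_size": 8}).encode("utf-8")

    def handle_disconnect(self, _raw: bytes) -> bytes:
        self.closed = True
        return json.dumps({"ok": True}).encode("utf-8")


def make_handler(logic: ToyServerLogic) -> type:
    """Bind a logic object into an http.server request handler class."""
    # Hypothetical route paths, chosen only for this sketch.
    routes = {
        "/model_info": logic.handle_model_info,
        "/disconnect": logic.handle_disconnect,
    }

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            route = routes.get(self.path)
            if route is None:
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = route(self.rfile.read(length))
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


# Unit-test style usage: call the logic directly, with no HTTP server running.
logic = ToyServerLogic()
info = json.loads(logic.handle_model_info(b""))
logic.handle_disconnect(b"")
handler_cls = make_handler(logic)
```

Because the handler class only routes and frames bytes, tests can target the logic object alone and leave the transport to a thin integration test.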

## Module Contents

### Classes

| Name                                                                                                | Description                                                                    |
| --------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| [`TargetModelServer`](#nemo_automodel-components-speculative-eagle-remote-server-TargetModelServer) | Request-handling logic for the remote target server (HTTP-transport agnostic). |

### Functions

| Name                                                                                                        | Description                                                                     |
| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`_make_request_handler`](#nemo_automodel-components-speculative-eagle-remote-server-_make_request_handler) | -                                                                               |
| [`compute_supervision`](#nemo_automodel-components-speculative-eagle-remote-server-compute_supervision)     | Produce the precomputed draft-vocab supervision for one batch.                  |
| [`serve`](#nemo_automodel-components-speculative-eagle-remote-server-serve)                                 | Run the blocking HTTP server until the client disconnects or Ctrl-C is pressed. |

### Data

[`logger`](#nemo_automodel-components-speculative-eagle-remote-server-logger)

### API

```python
class nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer(
    target_wrapper,
    nccl_port: int,
    host: str = '0.0.0.0'
)
```

Request-handling logic for the remote target server (HTTP-transport agnostic).

#### Parameters

- `target_wrapper`: A loaded `HFEagle3TargetModel` (or any object exposing the same
  `generate_batch` / `get_input_embeddings` surface).
- `nccl_port`: TCP rendezvous port for the NCCL data plane.
- `host`: Bind/advertise address (rendezvous master for NCCL).

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer._infer_device() -> torch.device
```

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.close() -> None
```

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.flush_nccl_send() -> None
```

Send the pending supervision tensors over NCCL (after the HTTP flush).

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_disconnect(
    _raw: bytes
) -> bytes
```

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_generate(
    raw: bytes,
    client_wants_nccl: bool
) -> tuple[bytes, bool]
```

Run the target and serialize the supervision.

Returns `(body, used_nccl)`. When NCCL is used, the body is the JSON
metadata only and the tensors are queued for `flush_nccl_send`
(sent *after* the HTTP response is flushed, to avoid a recv deadlock).
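
The ordering described above (flush the HTTP response first, then send the queued tensors) can be sketched as follows. This is a toy model under stated assumptions: `RecordingStream`, `ToyTargetLogic`, `respond`, and the JSON shapes are hypothetical, and a plain list append stands in for the real NCCL send. The likely reason for the ordering is that the client needs the metadata before it can post matching receives, though the source only states that it avoids a recv deadlock.

```python
import json


class RecordingStream:
    """Stand-in for the HTTP response stream; records when it is flushed."""

    def __init__(self, events: list) -> None:
        self.events = events
        self.data = b""

    def write(self, chunk: bytes) -> None:
        self.data += chunk

    def flush(self) -> None:
        self.events.append("http_flush")


class ToyTargetLogic:
    """Hypothetical logic object that queues bulk data instead of sending it."""

    def __init__(self, events: list) -> None:
        self.events = events
        self._pending = None

    def handle_generate(self, raw: bytes, client_wants_nccl: bool) -> tuple:
        values = json.loads(raw)["values"]
        tensors = {"target_probs": values}  # stand-in for real tensors
        if client_wants_nccl:
            # Queue the payload; the response body carries metadata only.
            self._pending = tensors
            meta = {"keys": sorted(tensors)}
            return json.dumps(meta).encode("utf-8"), True
        # Wire-format path: the payload travels in the response body itself.
        return json.dumps(tensors).encode("utf-8"), False

    def flush_nccl_send(self) -> None:
        if self._pending is None:
            return
        self._pending = None
        self.events.append("bulk_send")  # a real server would send over NCCL here


def respond(logic, stream, raw: bytes, client_wants_nccl: bool) -> bool:
    body, used_nccl = logic.handle_generate(raw, client_wants_nccl)
    stream.write(body)
    stream.flush()  # the client can now read the metadata
    if used_nccl:
        logic.flush_nccl_send()  # only afterwards start the bulk transfer
    return used_nccl


events = []
stream = RecordingStream(events)
used = respond(ToyTargetLogic(events), stream, b'{"values": [0.25, 0.75]}', True)

wire_events = []
wire_stream = RecordingStream(wire_events)
wire_used = respond(ToyTargetLogic(wire_events), wire_stream, b'{"values": [0.25, 0.75]}', False)
```

In the first call the recorded order is HTTP flush followed by the bulk send; in the wire-format call no bulk send happens and the values arrive in the body.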

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_init_nccl(
    raw: bytes
) -> bytes
```

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_input_embeddings(
    _raw: bytes
) -> bytes
```

Return the target input-embedding weight (used once to seed the draft).

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_model_info(
    _raw: bytes
) -> bytes
```

```python
nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer.handle_set_vocab_mapping(
    raw: bytes
) -> bytes
```

```python
nemo_automodel.components.speculative.eagle.remote.server._make_request_handler(
    server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer
)
```

```python
nemo_automodel.components.speculative.eagle.remote.server.compute_supervision(
    target_wrapper,
    selected_token_ids: torch.Tensor,
    selected_token_mask: torch.Tensor,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    loss_mask: torch.Tensor
) -> dict[str, torch.Tensor]
```

Produce the precomputed draft-vocab supervision for one batch.

Mirrors the co-located path exactly: `generate_batch` runs the target and
returns shifted `logits` / `input_ids` / `loss_mask` plus the aux hidden states;
`_compute_target_distribution` then projects the shifted logits onto the
draft vocab. Returns tensors keyed by `protocol.SUPERVISION_KEYS`.
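
As a rough illustration of what projecting logits onto a draft vocab can mean, the sketch below works on a single position in pure Python. It is an assumption, not the behavior of `_compute_target_distribution`: it supposes the projection keeps only the logits of the draft-vocab token ids and renormalizes them with a softmax, and that a position counts as usable when the target's top token is inside the draft vocab. The function name `project_to_draft_vocab` and the sample values are hypothetical.

```python
import math


def project_to_draft_vocab(logits: list, draft_token_ids: list) -> tuple:
    """Restrict full-vocab logits to the draft vocab and renormalize.

    Returns (probs, in_draft): probs is a softmax over the selected logits,
    in_draft says whether the full-vocab argmax token is in the draft vocab.
    """
    draft_logits = [logits[i] for i in draft_token_ids]
    # Numerically stable softmax over the reduced vocabulary.
    peak = max(draft_logits)
    exps = [math.exp(x - peak) for x in draft_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Assumed mask rule: keep the position only if the top token is representable.
    top_token = max(range(len(logits)), key=logits.__getitem__)
    in_draft = top_token in set(draft_token_ids)
    return probs, in_draft


full_logits = [2.0, 0.5, 1.0, 3.0]  # toy target logits over a 4-token vocab
probs_a, keep_a = project_to_draft_vocab(full_logits, [0, 2, 3])
probs_b, keep_b = project_to_draft_vocab(full_logits, [0, 1, 2])
```

Here the first draft vocab contains the target's top token (id 3), so the position is kept; the second does not, so under the assumed rule it would be masked out.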

```python
nemo_automodel.components.speculative.eagle.remote.server.serve(
    server_logic: nemo_automodel.components.speculative.eagle.remote.server.TargetModelServer,
    host: str,
    port: int
) -> None
```

Run the blocking HTTP server until the client disconnects or Ctrl-C is pressed.
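
One way such a blocking loop can be structured is sketched below. This is a hedged illustration, not the module's `serve` implementation: `serve_until_done`, `should_stop`, and `FakeHTTPD` are hypothetical names. The loop relies only on `handle_request()` and `server_close()`, both of which exist on the standard library's `http.server.HTTPServer`, so a fake object can replace the real server in tests.

```python
from typing import Callable


def serve_until_done(httpd, should_stop: Callable[[], bool]) -> int:
    """Handle one request at a time until should_stop() is true or Ctrl-C."""
    handled = 0
    try:
        while not should_stop():
            httpd.handle_request()  # blocks until one request is processed
            handled += 1
    except KeyboardInterrupt:
        pass  # Ctrl-C ends the loop quietly
    finally:
        httpd.server_close()  # always release the listening socket
    return handled


class FakeHTTPD:
    """Test double that pretends the client disconnects after N requests."""

    def __init__(self, disconnect_after: int) -> None:
        self.remaining = disconnect_after
        self.closed = False

    def handle_request(self) -> None:
        self.remaining -= 1

    def server_close(self) -> None:
        self.closed = True


fake = FakeHTTPD(disconnect_after=3)
handled = serve_until_done(fake, should_stop=lambda: fake.remaining <= 0)
```

With a real server, `httpd` would be an `HTTPServer((host, port), handler_cls)` instance and `should_stop` would check a flag that the disconnect handler sets.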

```python
nemo_automodel.components.speculative.eagle.remote.server.logger = logging.getLogger(__name__)
```