core.inference.apis.llm#

Sync high-level inference API for Megatron (MegatronLLM).

Module Contents#

Classes#

MegatronLLM

Sync high-level inference API for Megatron.

API#

class core.inference.apis.llm.MegatronLLM(
*,
model,
tokenizer,
inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None,
use_coordinator: bool = False,
coordinator_host: Optional[str] = None,
coordinator_port: Optional[int] = None,
)#

Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase

Sync high-level inference API for Megatron.

See :class:_MegatronLLMBase for execution modes (direct vs coordinator), caller responsibilities, and the model.eval() contract.

On top of the base, this class provides:

  • :meth:generate accepts one prompt or a batch and always returns a list[DynamicInferenceRequest] (single-prompt input returns a one-element list – a deliberate asymmetry vs the async API).

  • Sync lifecycle controls: :meth:pause / :meth:unpause / :meth:suspend / :meth:resume / :meth:shutdown / :meth:wait_for_shutdown.

  • Context-manager protocol: with MegatronLLM(...) as llm:; exit calls :meth:shutdown.

.. note::

serve() (online HTTP serving) is async-only by design; use :class:MegatronAsyncLLM for serving.

Initialization

generate(
prompts: Union[str, List[int], List[str], List[List[int]]],
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) -> List[megatron.core.inference.inference_request.DynamicInferenceRequest]#

Run inference for one prompt or a batch.

Returns list[DynamicInferenceRequest] in input order. Single-prompt input returns a one-element list – the always-list shape is the deliberate sync-vs-async asymmetry.
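The four accepted prompt shapes and the always-list return can be pictured with a small normalization helper. This is a sketch only: `normalize_prompts` is a hypothetical name, not part of Megatron, and treating an empty list as an empty batch is an assumption made here to resolve the ambiguity.

```python
from typing import List, Union

Prompt = Union[str, List[int]]


def normalize_prompts(
    prompts: Union[str, List[int], List[str], List[List[int]]],
) -> List[Prompt]:
    """Return a batch (a list of prompts) for any of the four accepted shapes.

    Hypothetical helper for illustration; not the Megatron implementation.
    """
    if isinstance(prompts, str):
        return [prompts]  # one text prompt -> one-element batch
    if not isinstance(prompts, list):
        raise TypeError(f"unsupported prompt type: {type(prompts).__name__}")
    if not prompts:
        return []  # assumption: an empty list means an empty batch
    if all(isinstance(p, int) for p in prompts):
        return [prompts]  # one pre-tokenized prompt -> one-element batch
    return list(prompts)  # already a batch of strings or token-ID lists


single = normalize_prompts("Hello")
batch = normalize_prompts(["Hello", "World"])
```

Because the output is always a list, calling code can index or iterate over the result the same way whether it passed one prompt or many.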

No concurrency guard: the sync API assumes a single caller and relies on Python’s GIL rather than on its own locking. If you need to call generate from multiple threads, serialize the calls externally.

Raises:

RuntimeError – if called on a non-primary rank in coordinator mode.
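One way to serialize generate calls externally is to put a lock around them. The sketch below assumes only that the wrapped object exposes `generate(prompts, sampling_params=None)`; `SerializedLLM` and `FakeLLM` are hypothetical names used for illustration, and `FakeLLM` stands in for a real `MegatronLLM` so the example runs without a model.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedLLM:
    """Wrap an object with a generate() method so calls run one at a time."""

    def __init__(self, llm):
        self._llm = llm
        self._lock = threading.Lock()

    def generate(self, prompts, sampling_params=None):
        # Only one thread at a time may enter the wrapped generate().
        with self._lock:
            return self._llm.generate(prompts, sampling_params=sampling_params)


class FakeLLM:
    """Hypothetical stand-in that records how many calls overlap."""

    def __init__(self):
        self.active = 0
        self.max_active = 0

    def generate(self, prompts, sampling_params=None):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate work
        self.active -= 1
        return [f"echo:{prompts}"]


fake = FakeLLM()
shared = SerializedLLM(fake)
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order in its results.
    results = list(pool.map(shared.generate, ["a", "b", "c", "d"]))
```

With the lock in place, `fake.max_active` stays at 1 even though four worker threads submit requests.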

pause() -> None#

Transition the engine to PAUSED. Coordinator mode only.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

unpause() -> None#

Transition the engine from PAUSED back to RUNNING.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

suspend() -> None#

Transition the engine to SUSPENDED (offloads GPU buffers).

The caller must call pause() first; this method does not enforce that ordering.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

resume() -> None#

Transition the engine from SUSPENDED to RESUMED.

Raises:

RuntimeError – in direct mode (use_coordinator=False).
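The four lifecycle calls above can be pictured as a small state machine. The sketch below is a toy model, not the Megatron engine: `LifecycleSketch` and `State` are hypothetical names, the initial RUNNING state is an assumption, and the pause, suspend, resume, unpause ordering shown is one plausible sequence rather than a documented requirement. Like the documented suspend(), the toy does not validate the previous state.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUSPENDED = "SUSPENDED"
    RESUMED = "RESUMED"


class LifecycleSketch:
    """Toy model of the documented transitions (hypothetical, for illustration)."""

    def __init__(self, use_coordinator: bool = False):
        self.use_coordinator = use_coordinator
        self.state = State.RUNNING  # assumption: a fresh engine is RUNNING

    def _require_coordinator(self, name: str) -> None:
        # The documented methods raise RuntimeError in direct mode.
        if not self.use_coordinator:
            raise RuntimeError(f"{name}() requires use_coordinator=True")

    def pause(self) -> None:
        self._require_coordinator("pause")
        self.state = State.PAUSED

    def unpause(self) -> None:
        self._require_coordinator("unpause")
        self.state = State.RUNNING

    def suspend(self) -> None:
        self._require_coordinator("suspend")
        self.state = State.SUSPENDED  # the caller is expected to pause() first

    def resume(self) -> None:
        self._require_coordinator("resume")
        self.state = State.RESUMED


engine = LifecycleSketch(use_coordinator=True)
history = []
for step in (engine.pause, engine.suspend, engine.resume, engine.unpause):
    step()
    history.append(engine.state)
```

In direct mode (`use_coordinator=False`) every one of these calls raises RuntimeError, so code that may run in either mode should check the mode or catch the exception.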

shutdown() -> None#

Tear down the engine and runtime. Idempotent. In direct mode this is a no-op.

wait_for_shutdown() -> None#

Block until the engine loop terminates. In direct mode this is a no-op.

__enter__() -> core.inference.apis.llm.MegatronLLM#
__exit__(exc_type, exc, tb) -> None#
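The context-manager protocol and the idempotent shutdown() described above can be sketched with a stand-in class. `ManagedSketch` is a hypothetical name and its counters exist only for illustration. The sketch assumes that `__exit__` calls shutdown() and returns None, as the signatures suggest, so an exception raised inside the with block still propagates.

```python
class ManagedSketch:
    """Stand-in showing the with-statement contract (not the Megatron class)."""

    def __init__(self):
        self.closed = False
        self.teardowns = 0  # counts real teardown work, not calls

    def shutdown(self) -> None:
        if self.closed:
            return  # idempotent: a second call does nothing
        self.closed = True
        self.teardowns += 1

    def __enter__(self) -> "ManagedSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None does not suppress exceptions from the with body.
        self.shutdown()


with ManagedSketch() as llm:
    open_inside_block = not llm.closed

llm.shutdown()  # safe to call again after the block has exited
```

Using the with form means shutdown() runs even when the body raises, which is the main reason to prefer it over calling shutdown() by hand.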