`core.inference.apis.llm`#

Sync high-level inference API for Megatron (MegatronLLM).

Module Contents#

Classes#

MegatronLLM

Sync high-level inference API for Megatron.

API#

class core.inference.apis.llm.MegatronLLM( *, model, tokenizer, inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None, use_coordinator: bool = False, coordinator_host: Optional[str] = None, coordinator_port: Optional[int] = None, )#

Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase

Sync high-level inference API for Megatron.

See :class:_MegatronLLMBase for execution modes (direct vs coordinator), caller responsibilities, and the model.eval() contract.

On top of the base, this class provides:

- :meth:generate accepting one prompt or a batch; always returns a list[DynamicInferenceRequest] (single-prompt input returns a one-element list – deliberate asymmetry vs the async API).
- Sync lifecycle controls: :meth:pause / :meth:unpause / :meth:suspend / :meth:resume / :meth:shutdown / :meth:wait_for_shutdown.
- Context-manager protocol: with MegatronLLM(...) as llm:; exit calls :meth:shutdown.

.. note::

   serve() (online HTTP serving) is async-only by design; use :class:MegatronAsyncLLM for serving.

Initialization

generate( prompts: Union[str, List[int], List[str], List[List[int]]], sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None, ) → List[megatron.core.inference.inference_request.DynamicInferenceRequest]#

Run inference for one prompt or a batch.

Returns list[DynamicInferenceRequest] in input order. Single-prompt input returns a one-element list – the always-list shape is the deliberate sync-vs-async asymmetry.
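
The four accepted prompt shapes and the always-list return can be summarized with a small sketch. The helpers below (is_single_prompt, expected_result_count) are hypothetical names, not part of the library; they only show how a caller might predict the length of the returned list. Treating an empty list as an empty batch is an assumption, since the text above does not say how that input is handled.

```python
from typing import List, Union

# Mirrors the documented type of the `prompts` argument.
Prompts = Union[str, List[int], List[str], List[List[int]]]


def is_single_prompt(prompts: Prompts) -> bool:
    """Return True for one text prompt or one token-ID prompt (hypothetical helper)."""
    if isinstance(prompts, str):
        return True
    # A non-empty flat list of ints is one tokenized prompt.
    # Assumption: an empty list counts as an empty batch, not a single prompt.
    if isinstance(prompts, list) and prompts and all(isinstance(t, int) for t in prompts):
        return True
    return False


def expected_result_count(prompts: Prompts) -> int:
    """Length of the list generate() is documented to return for this input."""
    # A single prompt still yields a one-element list, never a bare request.
    return 1 if is_single_prompt(prompts) else len(prompts)
```

Because the result is always a list, a caller passing a single prompt would index element 0 rather than expect a bare DynamicInferenceRequest.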

No concurrency guard: the sync API assumes a single caller and does not lock internally. If multiple threads need to call generate, the callers must serialize those calls externally.

Raises: RuntimeError – if called on a non-primary rank in coordinator mode.
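
One way to serialize generate externally is a lock held around each call. The wrapper below is a minimal sketch; SerializedLLM is a hypothetical name, not a library class, and it assumes only that the wrapped object exposes generate(prompts, sampling_params) as documented above.

```python
import threading
from typing import Any, List, Optional


class SerializedLLM:
    """Hypothetical wrapper that lets several threads share one sync LLM object."""

    def __init__(self, llm: Any) -> None:
        self._llm = llm
        self._lock = threading.Lock()

    def generate(self, prompts: Any, sampling_params: Optional[Any] = None) -> List[Any]:
        # Only one thread at a time may enter the underlying generate();
        # other callers block here until the lock is released.
        with self._lock:
            return self._llm.generate(prompts, sampling_params)
```

Each thread then calls the wrapper instead of the underlying object. Requests are processed one call at a time, so batching prompts into a single generate call is likely the better choice when throughput matters.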

pause() → None#

Transition the engine to PAUSED. Coordinator mode only.

Raises: RuntimeError – in direct mode (use_coordinator=False).

unpause() → None#

Transition the engine from PAUSED back to RUNNING.

Raises: RuntimeError – in direct mode (use_coordinator=False).

suspend() → None#

Transition the engine to SUSPENDED (offloads GPU buffers).

The caller must pause() first; this method does not enforce that.

Raises: RuntimeError – in direct mode (use_coordinator=False).

resume() → None#

Transition the engine from SUSPENDED to RESUMED.

Raises: RuntimeError – in direct mode (use_coordinator=False).
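
Since suspend() does not enforce the pause-first requirement, a caller may want to fix the call order in one place. The context manager below is a sketch of one plausible ordering (pause, suspend, work, resume, unpause). gpu_buffers_offloaded is a hypothetical helper, and it assumes that unpause() is the step that returns a resumed engine to RUNNING, which the descriptions above do not state explicitly.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def gpu_buffers_offloaded(llm: Any) -> Iterator[None]:
    """Hypothetical helper: run a block while the engine is paused and suspended."""
    # pause() must come first; suspend() itself does not check for it.
    llm.pause()
    try:
        llm.suspend()
        try:
            # The caller's block runs here, with GPU buffers offloaded.
            yield
        finally:
            # Undo suspend even if the caller's block raised.
            llm.resume()
    finally:
        # Assumption: unpause() is what brings the engine back to RUNNING.
        llm.unpause()
```

In direct mode the first call, pause(), raises RuntimeError as documented, so the helper exits before touching any other state.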

shutdown() → None#

Tear down the engine and runtime. Idempotent. No-op in direct mode.

wait_for_shutdown() → None#

Block until the engine loop terminates. No-op in direct mode.

__enter__() → core.inference.apis.llm.MegatronLLM#

__exit__(exc_type, exc, tb) → None#
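
The context-manager contract (exit calls shutdown, and shutdown is idempotent) can be illustrated with a stand-in class. StandInLLM below is hypothetical and is not the library's implementation; it only mirrors the documented signatures to show what a with block guarantees.

```python
class StandInLLM:
    """Hypothetical stand-in mirroring the documented __enter__/__exit__/shutdown shape."""

    def __init__(self) -> None:
        self.closed = False
        self.shutdown_calls = 0

    def shutdown(self) -> None:
        self.shutdown_calls += 1
        if self.closed:
            # Idempotent: repeated calls do nothing further.
            return
        self.closed = True

    def __enter__(self) -> "StandInLLM":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Returning None means an exception raised inside the block still propagates.
        self.shutdown()
```

With this shape, `with StandInLLM() as llm:` runs shutdown on exit whether or not the body raised, and an explicit shutdown() afterwards is harmless.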
