core.inference.apis.llm
Sync high-level inference API for Megatron (MegatronLLM).
Module Contents
Classes
MegatronLLM | Sync high-level inference API for Megatron.
API
- class core.inference.apis.llm.MegatronLLM(
- *,
- model,
- tokenizer,
- inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None,
- use_coordinator: bool = False,
- coordinator_host: Optional[str] = None,
- coordinator_port: Optional[int] = None,
- )
Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase
Sync high-level inference API for Megatron.
See _MegatronLLMBase for execution modes (direct vs coordinator), caller responsibilities, and the model.eval() contract.
On top of the base, this class provides:
- generate(), accepting one prompt or a batch; always returns a list[DynamicInferenceRequest] (single-prompt input returns a one-element list; this is a deliberate asymmetry vs the async API).
- Sync lifecycle controls: pause() / unpause() / suspend() / resume() / shutdown() / wait_for_shutdown().
- Context-manager protocol: with MegatronLLM(...) as llm:; exit calls shutdown().
.. note::
serve() (online HTTP serving) is async-only by design; use MegatronAsyncLLM for serving.
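A minimal usage sketch of the context-manager protocol and the always-list return shape described above. To stay runnable without a model or GPU, it uses a hypothetical stand-in class, _StubLLM, that mimics only the documented surface (generate, shutdown, __enter__, __exit__). With a real setup you would construct MegatronLLM(model=..., tokenizer=...) instead. The stub returns plain strings, not DynamicInferenceRequest objects.

```python
from typing import List, Union


class _StubLLM:
    """Hypothetical stand-in mimicking the documented MegatronLLM surface."""

    def __init__(self) -> None:
        self.closed = False

    def generate(self, prompts: Union[str, List[str]]) -> List[str]:
        # Mirror the documented contract: always return a list, in input order.
        batch = [prompts] if isinstance(prompts, str) else list(prompts)
        return [f"<output for {p!r}>" for p in batch]

    def shutdown(self) -> None:
        # Idempotent: repeated calls leave the state unchanged.
        self.closed = True

    def __enter__(self) -> "_StubLLM":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Documented behavior: leaving the with-block calls shutdown().
        self.shutdown()


with _StubLLM() as llm:
    single = llm.generate("Hello")            # one prompt, one-element list
    batch = llm.generate(["Hello", "World"])  # batch, one result per prompt
```

Because the result is always a list, callers that pass a single prompt index the first element (single[0]) rather than expecting a bare object.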
Initialization
- generate(
- prompts: Union[str, List[int], List[str], List[List[int]]],
- sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
- )
Run inference for one prompt or a batch.
Returns list[DynamicInferenceRequest] in input order. Single-prompt input returns a one-element list; the always-list shape is the deliberate sync-vs-async asymmetry.
No concurrency guard: the sync API is intended for a single caller at a time and takes no internal lock. If you need to call generate concurrently from multiple threads, serialize the calls externally.
- Raises: RuntimeError – if called on a non-primary rank in coordinator mode.
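One way to serialize generate calls externally is to wrap the instance in a lock. The sketch below is an illustration, not part of the Megatron API: SerializedLLM and _StubLLM are hypothetical names, and the stub only records whether two calls ever overlapped.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedLLM:
    """Hypothetical wrapper: only one thread may be inside generate() at a time."""

    def __init__(self, llm) -> None:
        self._llm = llm
        self._lock = threading.Lock()

    def generate(self, prompts, sampling_params=None):
        with self._lock:
            return self._llm.generate(prompts, sampling_params=sampling_params)


class _StubLLM:
    """Stand-in for the real engine; tracks the peak number of overlapping calls."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0
        self._guard = threading.Lock()

    def generate(self, prompts, sampling_params=None):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # simulate work so overlapping calls would be visible
        with self._guard:
            self.active -= 1
        return [prompts] if isinstance(prompts, str) else list(prompts)


stub = _StubLLM()
safe_llm = SerializedLLM(stub)

# Four threads call generate(); the wrapper's lock admits them one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe_llm.generate, ["a", "b", "c", "d"]))
```

A plain lock keeps the calls strictly sequential, so throughput is that of a single caller; batching prompts into one generate call is usually the better fit when the prompts are known up front.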
- pause() -> None
Transition the engine to PAUSED. Coordinator mode only.
- Raises: RuntimeError – in direct mode (use_coordinator=False).
- unpause() -> None
Transition the engine from PAUSED back to RUNNING.
- Raises: RuntimeError – in direct mode (use_coordinator=False).
- suspend() -> None
Transition the engine to SUSPENDED (offloads GPU buffers).
The caller must pause() first; this method does not enforce that.
- Raises: RuntimeError – in direct mode (use_coordinator=False).
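Since suspend() does not check that the engine is already paused, a small helper can keep the two calls together. This is a sketch under stated assumptions: pause_then_suspend and _StubLLM are hypothetical names, and the stub only imitates the documented rule that lifecycle calls raise RuntimeError in direct mode.

```python
class _StubLLM:
    """Hypothetical stand-in that records lifecycle calls in order."""

    def __init__(self, use_coordinator: bool) -> None:
        self.use_coordinator = use_coordinator
        self.calls = []

    def _lifecycle(self, name: str) -> None:
        # Documented rule: lifecycle controls raise in direct mode.
        if not self.use_coordinator:
            raise RuntimeError(f"{name}() requires use_coordinator=True")
        self.calls.append(name)

    def pause(self) -> None:
        self._lifecycle("pause")

    def suspend(self) -> None:
        self._lifecycle("suspend")


def pause_then_suspend(llm) -> None:
    """Call pause() before suspend(), the order the caller is responsible for."""
    llm.pause()
    llm.suspend()


# Coordinator mode: both transitions run, in the required order.
coordinated = _StubLLM(use_coordinator=True)
pause_then_suspend(coordinated)

# Direct mode: the first lifecycle call raises, so nothing is recorded.
direct = _StubLLM(use_coordinator=False)
try:
    pause_then_suspend(direct)
    direct_failed = False
except RuntimeError:
    direct_failed = True
```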
- resume() -> None
Transition the engine from SUSPENDED to RESUMED.
- Raises: RuntimeError – in direct mode (use_coordinator=False).
- shutdown() -> None
Tear down the engine and runtime. Idempotent. In direct mode this is a no-op.
- wait_for_shutdown() -> None
Block until the engine loop terminates. In direct mode this is a no-op.
- __enter__() -> core.inference.apis.llm.MegatronLLM
- __exit__(exc_type, exc, tb) -> None