core.inference.apis.async_llm
Async high-level inference API for Megatron (MegatronAsyncLLM).
Module Contents
Classes
MegatronAsyncLLM | Async high-level inference API for Megatron.
API
- class core.inference.apis.async_llm.MegatronAsyncLLM(
      *,
      model,
      tokenizer,
      inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None,
      use_coordinator: bool = False,
      coordinator_host: Optional[str] = None,
      coordinator_port: Optional[int] = None,
  )
Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase
Async high-level inference API for Megatron.
Asyncio-native wrapper over the shared engine and runtime managed by _MegatronLLMBase; see that class for caller responsibilities and the model.eval() contract. Requires use_coordinator=True; direct mode is rejected at __init__ (see Known Limitations in the package README).
On top of the base, this class provides:
- async generate accepting single or batched prompts.
- async lifecycle controls: pause / unpause / suspend / resume / shutdown / wait_for_shutdown.
- serve for OpenAI-compatible HTTP serving on the primary rank.
- async with context-manager protocol; exit calls shutdown.
Initialization
- async generate(
      prompts: Union[str, List[int], List[str], List[List[int]]],
      sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
  )
Run inference for one prompt or a batch of prompts.
Single input (str or list[int]) returns a single DynamicInferenceRequest; batched input (list[str] or list[list[int]]) returns list[DynamicInferenceRequest] in input order.
Raises:
RuntimeError – if called on a non-primary rank.
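The single-versus-batched return shape described above can be illustrated with a small stand-in. This is a sketch only: StubAsyncLLM and is_batched are hypothetical names invented for illustration, the stub echoes prompts instead of running a model, and the real method returns DynamicInferenceRequest objects rather than dicts. How the real class treats an empty list is not stated here; the sketch treats it as a single empty prompt.

```python
import asyncio
from typing import List, Union

Prompts = Union[str, List[int], List[str], List[List[int]]]


def is_batched(prompts: Prompts) -> bool:
    """Return True for list[str] / list[list[int]], False for str / list[int]."""
    if isinstance(prompts, str):
        return False
    # A list of ints is one tokenized prompt; a list of strs or lists is a batch.
    # Assumption: an empty list counts as a single (empty) prompt.
    return len(prompts) > 0 and isinstance(prompts[0], (str, list))


class StubAsyncLLM:
    """Hypothetical stand-in for MegatronAsyncLLM; echoes prompts back."""

    async def generate(self, prompts: Prompts, sampling_params=None):
        if is_batched(prompts):
            # Batched input: one result per prompt, in input order.
            return [await self._one(p) for p in prompts]
        # Single input: one result, not wrapped in a list.
        return await self._one(prompts)

    async def _one(self, prompt):
        await asyncio.sleep(0)  # yield to the event loop once per prompt
        return {"prompt": prompt}


async def demo():
    llm = StubAsyncLLM()
    single = await llm.generate("hello")
    batch = await llm.generate(["a", "b", "c"])
    tokens = await llm.generate([1, 2, 3])
    return single, batch, tokens
```

With the real class, the same three call shapes would apply, but only on the primary rank.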
- async pause() -> None
Transition the engine to PAUSED.
Raises:
RuntimeError – in direct mode (use_coordinator=False).
- async unpause() -> None
Transition the engine from PAUSED back to RUNNING.
Raises:
RuntimeError – in direct mode (use_coordinator=False).
- async suspend() -> None
Transition the engine to SUSPENDED (offloads GPU buffers).
The caller must pause() first; this method does not enforce that.
Raises:
RuntimeError – in direct mode (use_coordinator=False).
- async resume() -> None
Transition the engine from SUSPENDED to RESUMED.
Raises:
RuntimeError – in direct mode (use_coordinator=False).
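Because suspend() does not check that the engine is already paused, callers may want a small helper that fixes the call order. The following is a minimal sketch under stated assumptions: RecordingLLM is a hypothetical stand-in that only records calls and models the direct-mode RuntimeError, and run_while_suspended is an illustrative helper, not part of the library. Whether unpause() is needed after resume() is not specified on this page, so the sketch stops at resume().

```python
import asyncio


class RecordingLLM:
    """Hypothetical stand-in for MegatronAsyncLLM that records lifecycle calls."""

    def __init__(self, use_coordinator: bool = True):
        self.use_coordinator = use_coordinator
        self.calls = []

    async def _transition(self, name: str) -> None:
        # Mirrors the documented behaviour: lifecycle calls raise in direct mode.
        if not self.use_coordinator:
            raise RuntimeError(f"{name}() is unavailable in direct mode")
        self.calls.append(name)

    async def pause(self) -> None:
        await self._transition("pause")

    async def suspend(self) -> None:
        await self._transition("suspend")

    async def resume(self) -> None:
        await self._transition("resume")


async def run_while_suspended(llm, work):
    """Pause, suspend, run `work`, then resume even if `work` raises."""
    await llm.pause()    # required by the caller contract; suspend() won't check
    await llm.suspend()  # GPU buffers are offloaded while suspended
    try:
        return await work()
    finally:
        await llm.resume()


async def demo():
    llm = RecordingLLM()

    async def other_gpu_work():
        return "done"

    result = await run_while_suspended(llm, other_gpu_work)
    return result, llm.calls
```

Putting resume() in a finally block keeps the engine from being left suspended when the intervening work fails.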
- async shutdown() -> None
Stop the engine, tear down the coordinator, and join the runtime thread.
Idempotent. No-op in direct mode.
- async serve(
      serve_config: megatron.core.inference.apis.serve_config.ServeConfig,
      *,
      blocking: bool = True,
  )
Start the OpenAI-compatible HTTP frontend.
Coordinator mode only. The HTTP frontend runs only on the primary rank (global rank 0); other ranks no-op the HTTP setup but still respect blocking (so all ranks return together).
With blocking=True (default), this awaits the engine loop until shutdown is called, which suits standalone serving scripts. With blocking=False, this returns once the HTTP frontend is up (primary) or immediately (workers); the engine loop continues in the background runtime, and the user can call generate / shutdown afterward.
Raises:
ValueError – if use_coordinator=False (HTTP serving requires the coordinator path).
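The difference between the two blocking modes can be sketched with a stand-in whose serve() waits on an asyncio.Event. Everything here is an assumption made for illustration: StubServer is hypothetical, no HTTP server is started, and a plain object() stands in for ServeConfig because its fields are not shown on this page.

```python
import asyncio


class StubServer:
    """Hypothetical stand-in for MegatronAsyncLLM; no real HTTP or engine."""

    def __init__(self):
        self._stopped = asyncio.Event()
        self.http_up = False

    async def serve(self, serve_config, *, blocking: bool = True) -> None:
        self.http_up = True  # pretend the HTTP frontend is now listening
        if blocking:
            # Blocking mode: stay here until shutdown() is called.
            await self._stopped.wait()

    async def generate(self, prompts, sampling_params=None):
        return f"echo:{prompts}"

    async def shutdown(self) -> None:
        self.http_up = False
        self._stopped.set()


async def demo_nonblocking():
    llm = StubServer()
    await llm.serve(object(), blocking=False)  # returns once "frontend" is up
    out = await llm.generate("hi")             # caller keeps using the object
    await llm.shutdown()
    return out, llm.http_up


async def demo_blocking():
    llm = StubServer()
    task = asyncio.create_task(llm.serve(object(), blocking=True))
    await asyncio.sleep(0)          # let serve() start and reach its wait
    still_serving = not task.done()
    await llm.shutdown()            # releases the blocked serve() call
    await task
    return still_serving, task.done()
```

In a standalone serving script the blocking form would typically be the last awaited call, with shutdown triggered from elsewhere.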
- async wait_for_shutdown() -> None
Block until the engine’s background loop task terminates.
No-op in direct mode.
- async __aenter__() -> core.inference.apis.async_llm.MegatronAsyncLLM
- async __aexit__(exc_type, exc, tb) -> None
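The async with protocol above, where exit calls shutdown, might look like the following in miniature. StubLLM is a hypothetical stand-in, not the real class; it only counts shutdown calls to show that exit triggers shutdown, that a None return from __aexit__ lets exceptions propagate, and that a repeated shutdown is harmless.

```python
import asyncio


class StubLLM:
    """Hypothetical stand-in showing the async context-manager contract."""

    def __init__(self):
        self.shutdown_calls = 0
        self.closed = False

    async def shutdown(self) -> None:
        self.shutdown_calls += 1
        if self.closed:
            return  # idempotent: later calls do nothing further
        self.closed = True

    async def __aenter__(self) -> "StubLLM":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Returning None means exceptions from the body are not suppressed.
        await self.shutdown()


async def demo_clean_exit():
    async with StubLLM() as llm:
        pass
    await llm.shutdown()  # explicit second call is safe
    return llm.closed, llm.shutdown_calls


async def demo_error_exit():
    llm = StubLLM()
    try:
        async with llm:
            raise ValueError("boom")
    except ValueError:
        pass
    return llm.closed
```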