`core.inference.apis.async_llm`#

Async high-level inference API for Megatron (MegatronAsyncLLM).

Module Contents#

Classes#

MegatronAsyncLLM

Async high-level inference API for Megatron.

API#

class core.inference.apis.async_llm.MegatronAsyncLLM( *, model, tokenizer, inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None, use_coordinator: bool = False, coordinator_host: Optional[str] = None, coordinator_port: Optional[int] = None, )#

Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase

Async high-level inference API for Megatron.

Asyncio-native wrapper over the shared engine + runtime managed by _MegatronLLMBase; see that class for caller responsibilities and the model.eval() contract. Requires use_coordinator=True; direct mode is rejected at __init__ (see Known Limitations in the package README).

On top of the base, this class provides:

async generate accepting single or batched prompts.
async lifecycle controls: pause / unpause / suspend / resume / shutdown / wait_for_shutdown.
serve for OpenAI-compatible HTTP serving on the primary rank.
async with context-manager protocol; exit calls shutdown.
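As a minimal sketch of the context-manager protocol listed above: the helper below takes an already-constructed instance and relies on `async with` to call shutdown on exit. The helper name `generate_in_context` is hypothetical, and constructing `MegatronAsyncLLM` (model, tokenizer, use_coordinator=True) is assumed to have happened elsewhere.

```python
import asyncio


async def generate_in_context(llm, prompts, sampling_params=None):
    """Generate inside `async with`, so shutdown() runs when the block exits."""
    # Assumption: `llm` is an already-constructed MegatronAsyncLLM created with
    # use_coordinator=True; building the model/tokenizer is out of scope here.
    async with llm as active:
        return await active.generate(prompts, sampling_params)


# Typical driver from synchronous code (not executed here):
#   results = asyncio.run(generate_in_context(llm, ["Hello", "World"]))
```

Because exit calls shutdown, the instance should be treated as finished once the block returns.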

Initialization

async generate( prompts: Union[str, List[int], List[str], List[List[int]]], sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None, ) → Union[megatron.core.inference.inference_request.DynamicInferenceRequest, List[megatron.core.inference.inference_request.DynamicInferenceRequest]]#

Run inference for one prompt or a batch of prompts.

Single input (str or list[int]) returns a single DynamicInferenceRequest; batched input (list[str] or list[list[int]]) returns list[DynamicInferenceRequest] in input order.

Raises:: RuntimeError – if called on a non-primary rank.
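The single-versus-batched rule above can be sketched as follows. Both helper names (`is_batched`, `generate_as_list`) are hypothetical and not part of the library; the sketch assumes that a `DynamicInferenceRequest` is not itself a `list`, and it treats an empty list as an empty batch, which this page does not specify.

```python
import asyncio
from typing import Any, List


def is_batched(prompts: Any) -> bool:
    """True for list[str] / list[list[int]]; False for str / list[int]."""
    if isinstance(prompts, str):
        return False
    # Assumption: an empty list counts as an (empty) batch in this sketch;
    # the documentation above does not say how the real API treats it.
    return not prompts or isinstance(prompts[0], (str, list))


async def generate_as_list(llm, prompts, sampling_params=None) -> List[Any]:
    """Call llm.generate and always return a list, preserving input order."""
    result = await llm.generate(prompts, sampling_params)
    # Single input yields one request object; wrap it so callers see one shape.
    return result if isinstance(result, list) else [result]
```

Normalizing to a list keeps downstream code from branching on the input shape; call it only on the primary rank, since generate raises RuntimeError elsewhere.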

async pause() → None#

Transition the engine to PAUSED.

Raises:: RuntimeError – in direct mode (use_coordinator=False).

async unpause() → None#

Transition the engine from PAUSED back to RUNNING.

Raises:: RuntimeError – in direct mode (use_coordinator=False).

async suspend() → None#

Transition the engine to SUSPENDED (offloads GPU buffers).

The caller must pause() first; this method does not enforce that.

Raises:: RuntimeError – in direct mode (use_coordinator=False).

async resume() → None#

Transition the engine from SUSPENDED to RESUMED.

Raises:: RuntimeError – in direct mode (use_coordinator=False).
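Since suspend() does not enforce the pause-first requirement, a caller may want to bundle the ordering into one place. The sketch below is one way to do that; the `suspended` helper and its `unpause_after` flag are hypothetical. Whether unpause() is needed after resume() is not stated on this page, so the flag is off by default and left to the caller.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def suspended(llm, unpause_after: bool = False):
    """Pause, then suspend `llm` for the duration of the `async with` body."""
    await llm.pause()    # suspend() does not check this, so do it explicitly
    await llm.suspend()  # offloads GPU buffers
    try:
        yield llm
    finally:
        await llm.resume()
        if unpause_after:
            # Assumption: the page does not say whether unpause() is required
            # after resume(); this sketch only calls it when asked to.
            await llm.unpause()
```

Usage would look like `async with suspended(llm): ...`, with the body doing whatever work needs the freed GPU memory.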

async shutdown() → None#

Stop the engine, tear down the coordinator, and join the runtime thread.

Idempotent. No-op in direct mode.

async serve( serve_config: megatron.core.inference.apis.serve_config.ServeConfig, *, blocking: bool = True, ) → None#

Start the OpenAI-compatible HTTP frontend.

Coordinator mode only. The HTTP frontend runs only on the primary rank (global rank 0); other ranks no-op the HTTP setup but still respect blocking (so all ranks return together).

With blocking=True (default), this awaits the engine loop until shutdown is called, which suits standalone serving scripts. With blocking=False, this returns once the HTTP frontend is up (primary) or immediately (workers); the engine loop continues in the background runtime, and the user can call generate / shutdown afterward.

Raises:: ValueError – if use_coordinator=False (HTTP serving requires the coordinator path).
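A minimal sketch of the blocking=False path, under stated assumptions: every rank calls serve, and only the primary rank issues a generate call afterward. The helper name `serve_and_warm_up` and the `is_primary` parameter are hypothetical; the caller is assumed to build `serve_config` and to know its own rank (for example via `torch.distributed.get_rank() == 0`).

```python
import asyncio


async def serve_and_warm_up(llm, serve_config, warmup_prompt, is_primary: bool):
    """Start the HTTP frontend without blocking, then warm up on the primary."""
    # All ranks call serve(); non-primary ranks skip the HTTP setup internally.
    await llm.serve(serve_config, blocking=False)
    if not is_primary:
        # generate() raises RuntimeError on non-primary ranks, so skip it.
        return None
    # The engine loop keeps running in the background runtime, so generate()
    # can still be called directly while the HTTP frontend is up.
    return await llm.generate(warmup_prompt)
```

Shutdown is deliberately left to the caller, since this page does not describe how shutdown should be coordinated across ranks.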

async wait_for_shutdown() → None#

Block until the engine’s background loop task terminates.

No-op in direct mode.
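wait_for_shutdown takes no timeout argument, so a caller who wants a bounded wait has to add one. The sketch below uses `asyncio.wait`, which leaves the waiter running on timeout instead of cancelling it; the helper name `wait_for_shutdown_or_timeout` is hypothetical.

```python
import asyncio


async def wait_for_shutdown_or_timeout(llm, timeout_s: float) -> bool:
    """Return True if the engine loop ended within `timeout_s` seconds."""
    waiter = asyncio.ensure_future(llm.wait_for_shutdown())
    # asyncio.wait does not cancel pending tasks on timeout, so the
    # underlying wait is left undisturbed if the deadline passes.
    done, _pending = await asyncio.wait({waiter}, timeout=timeout_s)
    if waiter in done:
        waiter.result()  # re-raise anything wait_for_shutdown() raised
        return True
    return False
```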

async __aenter__() → core.inference.apis.async_llm.MegatronAsyncLLM#

async __aexit__(exc_type, exc, tb) → None#
