core.inference.apis.async_llm

Async high-level inference API for Megatron (MegatronAsyncLLM).

Module Contents

Classes

MegatronAsyncLLM

Async high-level inference API for Megatron.

API

class core.inference.apis.async_llm.MegatronAsyncLLM(
*,
model,
tokenizer,
inference_config: Optional[megatron.core.inference.config.InferenceConfig] = None,
use_coordinator: bool = False,
coordinator_host: Optional[str] = None,
coordinator_port: Optional[int] = None,
)

Bases: megatron.core.inference.apis._llm_base._MegatronLLMBase

Async high-level inference API for Megatron.

Asyncio-native wrapper over the shared engine and runtime managed by _MegatronLLMBase; see that class for caller responsibilities and the model.eval() contract. Requires use_coordinator=True; direct mode is rejected at __init__ (see Known Limitations in the package README).

On top of the base, this class provides:

  • async generate accepting single or batched prompts.

  • async lifecycle controls: pause / unpause / suspend / resume / shutdown / wait_for_shutdown.

  • serve for OpenAI-compatible HTTP serving on the primary rank.

  • async with context-manager protocol; exit calls shutdown.
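The context-manager behaviour listed above can be sketched as follows. Constructing a real MegatronAsyncLLM needs a loaded model, a tokenizer, and a coordinator, so this snippet uses a hypothetical stand-in, `StubAsyncLLM`, that only mirrors the documented method names. It illustrates the protocol and is not the library's implementation.

```python
import asyncio


class StubAsyncLLM:
    """Hypothetical stand-in mirroring the documented async surface."""

    def __init__(self):
        self.calls = []

    async def generate(self, prompts, sampling_params=None):
        self.calls.append("generate")
        # The real method returns DynamicInferenceRequest objects;
        # this stub simply echoes the prompt back.
        return prompts

    async def shutdown(self):
        self.calls.append("shutdown")

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Documented behaviour: leaving the context calls shutdown().
        await self.shutdown()


async def main():
    async with StubAsyncLLM() as llm:
        await llm.generate("Hello")
    return llm.calls


calls = asyncio.run(main())
```

With the real class, the same shape would presumably be an `async with` over a MegatronAsyncLLM constructed with `use_coordinator=True`, with `generate` awaited inside the block.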

Initialization

async generate(
prompts: Union[str, List[int], List[str], List[List[int]]],
sampling_params: Optional[megatron.core.inference.sampling_params.SamplingParams] = None,
) → Union[megatron.core.inference.inference_request.DynamicInferenceRequest, List[megatron.core.inference.inference_request.DynamicInferenceRequest]]

Run inference for one prompt or a batch of prompts.

Single input (str or list[int]) returns a single DynamicInferenceRequest; batched input (list[str] or list[list[int]]) returns list[DynamicInferenceRequest] in input order.

Raises:

RuntimeError – if called on a non-primary rank.
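The single-versus-batched rule can be made concrete with a small classifier over the four accepted input shapes. This is a sketch only: the helper name `is_batched` is hypothetical, and the documentation does not say how the real method tells the shapes apart or how it treats an empty list.

```python
from typing import List, Union

Prompts = Union[str, List[int], List[str], List[List[int]]]


def is_batched(prompts: Prompts) -> bool:
    """Return True when `prompts` is a batch (list[str] or list[list[int]])."""
    if isinstance(prompts, str):
        return False  # a single text prompt
    if not isinstance(prompts, list):
        raise TypeError(f"unsupported prompt type: {type(prompts).__name__}")
    if not prompts:
        # An empty list is ambiguous; the documentation does not say how
        # it is handled, so this sketch rejects it.
        raise ValueError("empty prompt list is ambiguous")
    first = prompts[0]
    if isinstance(first, int):
        return False  # a single pre-tokenized prompt
    if isinstance(first, (str, list)):
        return True  # a batch of text prompts or of token-ID prompts
    raise TypeError(f"unsupported element type: {type(first).__name__}")
```

A caller could use such a check to decide whether to expect one DynamicInferenceRequest or a list of them from `generate`.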

async pause() → None

Transition the engine to PAUSED.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

async unpause() → None

Transition the engine from PAUSED back to RUNNING.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

async suspend() → None

Transition the engine to SUSPENDED (offloads GPU buffers).

The caller must pause() first; this method does not enforce that.

Raises:

RuntimeError – in direct mode (use_coordinator=False).
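Because suspend() does not enforce the pause-first requirement, a caller may want to wrap the two calls in one helper. The following is a minimal sketch; `pause_then_suspend` and `RecordingLLM` are hypothetical names, and the stand-in class exists only so the snippet runs without Megatron.

```python
import asyncio


async def pause_then_suspend(llm) -> None:
    """Pause first, then suspend, as the suspend() docs require of the caller."""
    await llm.pause()
    await llm.suspend()


class RecordingLLM:
    """Hypothetical stand-in that records the order of lifecycle calls."""

    def __init__(self):
        self.calls = []

    async def pause(self):
        self.calls.append("pause")

    async def suspend(self):
        self.calls.append("suspend")


recorder = RecordingLLM()
asyncio.run(pause_then_suspend(recorder))
```

Any object exposing awaitable `pause` and `suspend` methods, including a MegatronAsyncLLM in coordinator mode, could be passed to the helper.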

async resume() → None

Transition the engine from SUSPENDED to RESUMED.

Raises:

RuntimeError – in direct mode (use_coordinator=False).

async shutdown() → None

Stop the engine, tear down the coordinator, and join the runtime thread.

Idempotent. No-op in direct mode.

async serve(
serve_config: megatron.core.inference.apis.serve_config.ServeConfig,
*,
blocking: bool = True,
) → None

Start the OpenAI-compatible HTTP frontend.

Coordinator mode only. The HTTP frontend runs only on the primary rank (global rank 0); other ranks no-op the HTTP setup but still respect blocking (so all ranks return together).

With blocking=True (default), this awaits the engine loop until shutdown is called, which suits standalone serving scripts. With blocking=False, this returns once the HTTP frontend is up (primary) or immediately (workers); the engine loop continues in the background runtime, and the user can call generate / shutdown afterward.

Raises:

ValueError – if use_coordinator=False (HTTP serving requires the coordinator path).
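The non-blocking pattern described above might look like the sketch below. The names `serve_in_background`, `StubLLM`, and the `is_primary` flag are hypothetical, and `serve_config` is passed through untouched because the fields of ServeConfig are not documented here. Whether every rank should call shutdown at this point is also an assumption that depends on the deployment.

```python
import asyncio


async def serve_in_background(llm, serve_config, prompt, is_primary):
    """Start the HTTP frontend without blocking, run one request, then stop."""
    # Returns once the frontend is up (primary) or immediately (workers).
    await llm.serve(serve_config, blocking=False)
    result = None
    try:
        if is_primary:
            # generate() raises RuntimeError on non-primary ranks.
            result = await llm.generate(prompt)
    finally:
        await llm.shutdown()
    return result


class StubLLM:
    """Hypothetical stand-in so the sketch runs without Megatron."""

    def __init__(self):
        self.calls = []

    async def serve(self, serve_config, *, blocking=True):
        self.calls.append(("serve", blocking))

    async def generate(self, prompts, sampling_params=None):
        self.calls.append(("generate", prompts))
        return prompts

    async def shutdown(self):
        self.calls.append(("shutdown",))


stub = StubLLM()
result = asyncio.run(
    serve_in_background(stub, serve_config=None, prompt="Hello", is_primary=True)
)
```

The try/finally block ensures shutdown is still called if the local request fails; since shutdown is documented as idempotent, a later second call should be harmless.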

async wait_for_shutdown() → None

Block until the engine’s background loop task terminates.

No-op in direct mode.

async __aenter__() → core.inference.apis.async_llm.MegatronAsyncLLM
async __aexit__(exc_type, exc, tb) → None