nemo_deploy.nlp.megatronllm_deployable_ray#

Module Contents#

Classes#

ModelWorker

Ray actor that loads and runs inference on a shard of the model.

MegatronRayDeployable

A Ray Serve deployment for distributed Megatron LLM models.

Data#

API#

nemo_deploy.nlp.megatronllm_deployable_ray.LOGGER = 'getLogger(...)'#
nemo_deploy.nlp.megatronllm_deployable_ray.app = 'FastAPI(...)'#
class nemo_deploy.nlp.megatronllm_deployable_ray.ModelWorker(
nemo_checkpoint_filepath: str,
rank: int,
world_size: int,
tensor_model_parallel_size: int,
pipeline_model_parallel_size: int,
context_parallel_size: int,
expert_model_parallel_size: int,
master_port: str,
replica_id: int = 0,
enable_cuda_graphs: bool = False,
enable_flash_decode: bool = False,
legacy_ckpt: bool = False,
)[source]#

Ray actor that loads and runs inference on a shard of the model.

Each ModelWorker is responsible for a specific rank in the model parallel setup.

Initialization

infer(
inputs: Dict[str, Any],
) → Dict[str, Any][source]#

Run inference on the model shard.
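
ModelWorker instances are normally created by MegatronRayDeployable, but a minimal sketch of direct use might look like the following. This is illustrative only: the checkpoint path, master_port, GPU placement, and the keys of the inputs dict are assumptions rather than a documented contract.

```python
import ray

from nemo_deploy.nlp.megatronllm_deployable_ray import ModelWorker

ray.init()

# Wrap the class as a Ray actor pinned to one GPU. If ModelWorker is already
# decorated with @ray.remote in the module, use
# ModelWorker.options(num_gpus=1).remote(...) instead.
WorkerActor = ray.remote(num_gpus=1)(ModelWorker)

# Single-rank setup: world_size 1 and every parallelism size set to 1.
# The checkpoint path and master_port are placeholders.
worker = WorkerActor.remote(
    nemo_checkpoint_filepath="/path/to/model.nemo",
    rank=0,
    world_size=1,
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    context_parallel_size=1,
    expert_model_parallel_size=1,
    master_port="29500",
)

# The inputs dict keys ("prompts", "max_length") are assumed for illustration.
result = ray.get(worker.infer.remote({"prompts": ["Hello"], "max_length": 32}))
print(result)
```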

class nemo_deploy.nlp.megatronllm_deployable_ray.MegatronRayDeployable(
nemo_checkpoint_filepath: str,
num_gpus: int = 1,
num_nodes: int = 1,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
context_parallel_size: int = 1,
expert_model_parallel_size: int = 1,
model_id: str = 'nemo-model',
enable_cuda_graphs: bool = False,
enable_flash_decode: bool = False,
legacy_ckpt: bool = False,
)#

A Ray Serve deployment for distributed Megatron LLM models.

This class coordinates model parallelism across multiple GPUs and nodes, with each shard handled by a separate Ray actor.

Initialization

Initialize the distributed Megatron LLM model deployment.

Parameters:
  • nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.

  • num_gpus (int) – Number of GPUs to use per replica.

  • num_nodes (int) – Number of nodes to use for deployment.

  • tensor_model_parallel_size (int) – Size of tensor model parallelism.

  • pipeline_model_parallel_size (int) – Size of pipeline model parallelism.

  • context_parallel_size (int) – Size of context parallelism.

  • expert_model_parallel_size (int) – Size of expert model parallelism.

  • model_id (str) – Identifier for the model in API responses.

  • enable_cuda_graphs (bool) – Whether to enable CUDA graphs for faster inference.

  • enable_flash_decode (bool) – Whether to enable Flash Attention decode.

  • max_batch_size (int) – Maximum batch size for request batching.

  • batch_wait_timeout_s (float) – Maximum time to wait for batching requests.

  • legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.
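
A minimal deployment sketch, assuming MegatronRayDeployable is decorated as a Ray Serve deployment as described above (so that .bind() is available). The checkpoint path and parallelism values below are placeholders, not recommended settings.

```python
import ray
from ray import serve

from nemo_deploy.nlp.megatronllm_deployable_ray import MegatronRayDeployable

ray.init()

# Bind the deployment with a 2-way tensor-parallel configuration on 2 GPUs.
# The checkpoint path is a placeholder; this assumes the class is decorated
# with @serve.deployment.
app = MegatronRayDeployable.bind(
    nemo_checkpoint_filepath="/path/to/model.nemo",
    num_gpus=2,
    tensor_model_parallel_size=2,
    model_id="my-nemo-model",
)

# Start serving; requests are then handled by the HTTP endpoints below.
serve.run(app)
```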

async completions(request: Dict[Any, Any])#

Handle text completion requests.
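
A hypothetical client call for this endpoint. The host, port, route, and payload fields (model, prompt, max_tokens) assume an OpenAI-style completions schema served on localhost:8000 and are not taken from the module itself.

```python
import requests

# Assumed OpenAI-style completions request; adjust URL and fields as needed.
resp = requests.post(
    "http://localhost:8000/v1/completions/",
    json={
        "model": "nemo-model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
)
print(resp.json())
```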

async chat_completions(request: Dict[Any, Any])#

Handle chat completion requests.
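
A hypothetical chat request, again assuming an OpenAI-style route and message schema; the URL and field names are illustrative assumptions.

```python
import requests

# Assumed OpenAI-style chat completions request.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions/",
    json={
        "model": "nemo-model",
        "messages": [{"role": "user", "content": "Summarize Ray Serve in one sentence."}],
        "max_tokens": 64,
    },
)
print(resp.json())
```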

async list_models()#

List available models.

async health_check()#

Health check endpoint.
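
Both of these are simple GET endpoints. The routes below (/v1/models and /v1/health) are assumptions based on common OpenAI-style layouts, not confirmed paths from the module.

```python
import requests

# Assumed routes for listing models and checking deployment health.
print(requests.get("http://localhost:8000/v1/models").json())
print(requests.get("http://localhost:8000/v1/health").json())
```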