nemo_deploy.multimodal.megatron_multimodal_deployable_ray#

Module Contents#

Classes#

ModelWorker

Ray actor that loads and runs inference on a shard of the multimodal model.

MegatronMultimodalRayDeployable

A Ray Serve deployment for distributed Megatron multimodal models.

Data#

API#

nemo_deploy.multimodal.megatron_multimodal_deployable_ray.LOGGER = 'getLogger(...)'#
nemo_deploy.multimodal.megatron_multimodal_deployable_ray.app = 'FastAPI(...)'#
class nemo_deploy.multimodal.megatron_multimodal_deployable_ray.ModelWorker(
megatron_checkpoint_filepath: str,
rank: int,
world_size: int,
tensor_model_parallel_size: int,
pipeline_model_parallel_size: int,
master_port: str,
master_addr: Optional[str] = None,
replica_id: int = 0,
**model_config_kwargs,
)#

Ray actor that loads and runs inference on a shard of the multimodal model.

Each ModelWorker is responsible for a specific rank in the model parallel setup.

Initialization

infer(
inputs: Dict[str, Any],
) → Dict[str, Any]#

Run inference on the model shard.

class nemo_deploy.multimodal.megatron_multimodal_deployable_ray.MegatronMultimodalRayDeployable(
megatron_checkpoint_filepath: str,
num_gpus: int = 1,
tensor_model_parallel_size: int = 1,
pipeline_model_parallel_size: int = 1,
model_id: str = 'megatron-model',
**model_config_kwargs,
)#

A Ray Serve deployment for distributed Megatron multimodal models.

This class coordinates model parallelism across multiple GPUs and nodes, with each shard handled by a separate Ray actor.

Initialization

Initialize the distributed Megatron multimodal model deployment.

Parameters:
  • megatron_checkpoint_filepath (str) – Path to the Megatron checkpoint directory.

  • num_gpus (int) – Number of GPUs to use for the deployment.

  • tensor_model_parallel_size (int) – Size of tensor model parallelism.

  • pipeline_model_parallel_size (int) – Size of pipeline model parallelism.

  • model_id (str) – Identifier for the model in API responses.

  • **model_config_kwargs – Additional model configuration arguments.
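As a client-side sketch, a chat request for this deployment can be assembled as plain dictionaries. The field names follow the OpenAI chat format suggested by the `chat_completions` docstring below; whether the server accepts every field shown (e.g. `max_tokens`) is an assumption, not part of this module's documented API.

```python
import json


def build_chat_request(model_id, prompt, image_url, max_tokens=128):
    """Assemble an OpenAI-style chat completion payload with one image.

    Illustrative only: field names are assumptions based on the OpenAI
    chat format, not this deployment's verified schema.
    """
    return {
        "model": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # OpenAI-style image entry; the server normalizes
                    # this to {"type": "image", "image": ...} internally.
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


payload = build_chat_request(
    "megatron-model", "Describe this picture.", "https://example.com/cat.png"
)
body = json.dumps(payload)  # ready to POST to the Serve endpoint
```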

async chat_completions(request: Dict[Any, Any])#

Handle multimodal chat completion requests.

Supports two image content formats (normalized internally to format 1):

  1. {"type": "image", "image": "url_or_base64"}

  2. {"type": "image_url", "image_url": {"url": "url_or_base64"}} (OpenAI-style, converted to format 1)
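The normalization between the two formats can be sketched as a small helper. This is illustrative only, assuming the behavior described above, not the module's actual implementation:

```python
from typing import Any, Dict


def normalize_image_content(item: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an OpenAI-style image_url entry (format 2) to format 1.

    Sketch of the normalization described in the docstring; entries
    already in format 1 (or text entries) pass through unchanged.
    """
    if item.get("type") == "image_url":
        return {"type": "image", "image": item["image_url"]["url"]}
    return item


# Both formats normalize to the same structure:
a = normalize_image_content({"type": "image", "image": "https://example.com/cat.png"})
b = normalize_image_content(
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
)
```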

async completions(request: Dict[Any, Any])#

Handle multimodal completion requests.
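By analogy with the OpenAI completions format, a request payload for this endpoint might look as follows. The exact fields accepted are an assumption, not this module's documented schema:

```python
# Hypothetical payload for the completions endpoint; field names follow
# the OpenAI completions format and are assumptions, not the module's spec.
completion_payload = {
    "model": "megatron-model",  # must match the deployment's model_id
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
```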

async list_models()#

List available models.

async health_check()#

Health check endpoint.