nemo_deploy.nlp.megatronllm_deployable_ray#
Module Contents#
Classes#
| ModelWorker | Ray actor that loads and runs inference on a shard of the model. |
| MegatronRayDeployable | A Ray Serve deployment for distributed Megatron LLM models. |
Data#
API#
- nemo_deploy.nlp.megatronllm_deployable_ray.LOGGER = 'getLogger(...)'#
- nemo_deploy.nlp.megatronllm_deployable_ray.app = 'FastAPI(...)'#
- class nemo_deploy.nlp.megatronllm_deployable_ray.ModelWorker(
- nemo_checkpoint_filepath: str,
- rank: int,
- world_size: int,
- tensor_model_parallel_size: int,
- pipeline_model_parallel_size: int,
- context_parallel_size: int,
- expert_model_parallel_size: int,
- master_port: str,
- replica_id: int = 0,
- enable_cuda_graphs: bool = False,
- enable_flash_decode: bool = False,
- legacy_ckpt: bool = False,
)#
Ray actor that loads and runs inference on a shard of the model.
Each ModelWorker is responsible for a specific rank in the model parallel setup.
Initialization
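The sketch below is illustrative only: it assumes `ModelWorker` can be wrapped as a Ray actor and launched once per model-parallel rank. In normal use, `MegatronRayDeployable` creates and coordinates these workers internally, so you rarely instantiate them yourself.

```python
import ray

from nemo_deploy.nlp.megatronllm_deployable_ray import ModelWorker

ray.init()

# Sketch: one worker per model-parallel rank. If ModelWorker is not already a
# @ray.remote class, wrap it first; MegatronRayDeployable normally handles this.
RemoteWorker = ray.remote(num_gpus=1)(ModelWorker)

world_size = 2
workers = [
    RemoteWorker.remote(
        nemo_checkpoint_filepath="/path/to/model.nemo",  # illustrative path
        rank=rank,
        world_size=world_size,
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        context_parallel_size=1,
        expert_model_parallel_size=1,
        master_port="29500",
    )
    for rank in range(world_size)
]
```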
- class nemo_deploy.nlp.megatronllm_deployable_ray.MegatronRayDeployable(
- nemo_checkpoint_filepath: str,
- num_gpus: int = 1,
- num_nodes: int = 1,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- model_id: str = 'nemo-model',
- enable_cuda_graphs: bool = False,
- enable_flash_decode: bool = False,
- legacy_ckpt: bool = False,
)#
A Ray Serve deployment for distributed Megatron LLM models.
This class coordinates model parallelism across multiple GPUs and nodes, with each shard handled by a separate Ray actor.
Initialization
Initialize the distributed Megatron LLM model deployment.
- Parameters:
nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.
num_gpus (int) – Number of GPUs to use per replica.
num_nodes (int) – Number of nodes to use for deployment.
tensor_model_parallel_size (int) – Size of tensor model parallelism.
pipeline_model_parallel_size (int) – Size of pipeline model parallelism.
context_parallel_size (int) – Size of context parallelism.
expert_model_parallel_size (int) – Size of expert model parallelism.
model_id (str) – Identifier for the model in API responses.
enable_cuda_graphs (bool) – Whether to enable CUDA graphs for faster inference.
enable_flash_decode (bool) – Whether to enable Flash Attention decode.
max_batch_size (int) – Maximum batch size for request batching.
batch_wait_timeout_s (float) – Maximum time to wait for batching requests.
legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.
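A minimal deployment sketch, assuming the class is exposed as a Ray Serve deployment (as the summary above states). The resource values, checkpoint path, and route configuration are placeholders; adjust them to your cluster.

```python
from ray import serve

from nemo_deploy.nlp.megatronllm_deployable_ray import MegatronRayDeployable

# Bind the deployment with illustrative arguments and start serving.
app = MegatronRayDeployable.options(num_replicas=1).bind(
    nemo_checkpoint_filepath="/path/to/model.nemo",  # illustrative path
    num_gpus=2,
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    model_id="nemo-model",
)
serve.run(app)
```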
- async completions(request: Dict[Any, Any])#
Handle text completion requests.
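An illustrative client request for this endpoint. The host, port, and route (`/v1/completions`) are assumptions based on the OpenAI-style schema the handler mirrors; the actual route depends on how Ray Serve was configured.

```python
import requests

# Hypothetical completions request; adjust URL and fields to your deployment.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "nemo-model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.7,
    },
)
print(response.json())
```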
- async chat_completions(request: Dict[Any, Any])#
Handle chat completion requests.
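A corresponding sketch for the chat endpoint, using the OpenAI-style `messages` format. The route (`/v1/chat/completions`) and field names are assumptions here.

```python
import requests

# Hypothetical chat-completions request; adjust URL and fields to your deployment.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nemo-model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize Ray Serve in one sentence."},
        ],
        "max_tokens": 64,
    },
)
print(response.json())
```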
- async list_models()#
List available models.
- async health_check()#
Health check endpoint.