nemo_deploy.nlp.megatronllm_deployable_ray#
Module Contents#
Classes#
| ModelWorker | Ray actor that loads and runs inference on a shard of the model. |
| MegatronRayDeployable | A Ray Serve deployment for distributed Megatron LLM models. |
Data#
API#
- nemo_deploy.nlp.megatronllm_deployable_ray.LOGGER = 'getLogger(...)'#
- nemo_deploy.nlp.megatronllm_deployable_ray.app = 'FastAPI(...)'#
- class nemo_deploy.nlp.megatronllm_deployable_ray.ModelWorker(
- nemo_checkpoint_filepath: str,
- rank: int,
- world_size: int,
- tensor_model_parallel_size: int,
- pipeline_model_parallel_size: int,
- context_parallel_size: int,
- expert_model_parallel_size: int,
- master_port: str,
- replica_id: int = 0,
- enable_cuda_graphs: bool = False,
- enable_flash_decode: bool = False,
- legacy_ckpt: bool = False,
)#
Ray actor that loads and runs inference on a shard of the model.
Each ModelWorker is responsible for a specific rank in the model parallel setup.
Initialization
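The sketch below is illustrative only: it assumes `ModelWorker` can be wrapped as a Ray actor and launched once per model-parallel rank. In normal use, `MegatronRayDeployable` creates and coordinates these workers internally, so you rarely instantiate them yourself.

```python
import ray

from nemo_deploy.nlp.megatronllm_deployable_ray import ModelWorker

ray.init()

# Sketch: one worker per model-parallel rank. If ModelWorker is not already a
# @ray.remote class, wrap it first; MegatronRayDeployable normally handles this.
RemoteWorker = ray.remote(num_gpus=1)(ModelWorker)

world_size = 2
workers = [
    RemoteWorker.remote(
        nemo_checkpoint_filepath="/path/to/model.nemo",  # illustrative path
        rank=rank,
        world_size=world_size,
        tensor_model_parallel_size=2,
        pipeline_model_parallel_size=1,
        context_parallel_size=1,
        expert_model_parallel_size=1,
        master_port="29500",
    )
    for rank in range(world_size)
]
```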
- class nemo_deploy.nlp.megatronllm_deployable_ray.MegatronRayDeployable(
- nemo_checkpoint_filepath: str,
- num_gpus: int = 1,
- num_nodes: int = 1,
- tensor_model_parallel_size: int = 1,
- pipeline_model_parallel_size: int = 1,
- context_parallel_size: int = 1,
- expert_model_parallel_size: int = 1,
- model_id: str = 'nemo-model',
- enable_cuda_graphs: bool = False,
- enable_flash_decode: bool = False,
- legacy_ckpt: bool = False,
)#
A Ray Serve deployment for distributed Megatron LLM models.
This class coordinates model parallelism across multiple GPUs and nodes, with each shard handled by a separate Ray actor.
Initialization
Initialize the distributed Megatron LLM model deployment.
- Parameters:
nemo_checkpoint_filepath (str) – Path to the .nemo checkpoint file.
num_gpus (int) – Number of GPUs to use per replica.
num_nodes (int) – Number of nodes to use for deployment.
tensor_model_parallel_size (int) – Size of tensor model parallelism.
pipeline_model_parallel_size (int) – Size of pipeline model parallelism.
context_parallel_size (int) – Size of context parallelism.
expert_model_parallel_size (int) – Size of expert model parallelism.
model_id (str) – Identifier for the model in API responses.
enable_cuda_graphs (bool) – Whether to enable CUDA graphs for faster inference.
enable_flash_decode (bool) – Whether to enable Flash Attention decode.
max_batch_size (int) – Maximum batch size for request batching.
batch_wait_timeout_s (float) – Maximum time to wait for batching requests.
legacy_ckpt (bool) – Whether to use legacy checkpoint format. Defaults to False.
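A minimal deployment sketch, assuming the class is exposed as a Ray Serve deployment (as the summary above states). The resource values, checkpoint path, and route configuration are placeholders; adjust them to your cluster.

```python
from ray import serve

from nemo_deploy.nlp.megatronllm_deployable_ray import MegatronRayDeployable

# Bind the deployment with illustrative arguments and start serving.
app = MegatronRayDeployable.options(num_replicas=1).bind(
    nemo_checkpoint_filepath="/path/to/model.nemo",  # illustrative path
    num_gpus=2,
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=1,
    model_id="nemo-model",
)
serve.run(app)
```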
- async completions(request: Dict[Any, Any])#
Handle text completion requests.
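An illustrative client request for this endpoint. The host, port, and route (`/v1/completions`) are assumptions based on the OpenAI-style schema the handler mirrors; the actual route depends on how Ray Serve was configured.

```python
import requests

# Hypothetical completions request; adjust URL and fields to your deployment.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "nemo-model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.7,
    },
)
print(response.json())
```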
- async chat_completions(request: Dict[Any, Any])#
Handle chat completion requests.
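A corresponding sketch for the chat endpoint, using the OpenAI-style `messages` format. The route (`/v1/chat/completions`) and field names are assumptions here.

```python
import requests

# Hypothetical chat-completions request; adjust URL and fields to your deployment.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nemo-model",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize Ray Serve in one sentence."},
        ],
        "max_tokens": 64,
    },
)
print(response.json())
```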
- async list_models()#
List available models.
- async health_check()#
Health check endpoint.