Deploy Megatron-Bridge LLMs with Ray Serve
This section demonstrates how to deploy Megatron-Bridge LLMs using Ray Serve. Ray Serve provides scalable, flexible deployment for NeMo models, including automatic scaling, load balancing, and multi-replica deployments with support for advanced parallelism strategies.
Note: The examples below are single-node. For multi-node clusters managed by SLURM, you can deploy across nodes using the ray.sub helper described in the section “Multi-node on SLURM using ray.sub”.
Quick Example
Follow the steps on the Generate A Megatron-Bridge Checkpoint page to generate a Megatron-Bridge Llama checkpoint.
In a terminal, go to the folder where the hf_llama31_8B_mbridge checkpoint is located. Pull and run the Docker container image using the commands shown below. Change the `vr` tag to the version of the container you want to use:

```
docker pull nvcr.io/nvidia/nemo:vr
docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
```
Deploy the Megatron-Bridge model to Ray:
```
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \
    --model_format megatron \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 1 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0"
```
In a separate terminal, access the running container as follows:
```
docker exec -it nemo-fw bash
```
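Optionally, verify that the Serve application is healthy before querying it. This is a quick sketch, assuming the ray CLI shipped with the container's Ray installation can reach the Ray instance started by the deployment script:

```
# Show the status of running Ray Serve applications and their replicas
serve status
```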
Test the deployed model:
```
python scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
```
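You can also query the endpoint directly over HTTP. The sketch below assumes the deployment exposes an OpenAI-compatible completions route at `/v1/completions/` on the same host and port; the exact route and payload fields may vary across container versions:

```
# Send a completion request to the deployed model over HTTP
curl -X POST http://0.0.0.0:1024/v1/completions/ \
    -H "Content-Type: application/json" \
    -d '{"model": "llama", "prompt": "What is machine learning?", "max_tokens": 64}'
```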
Detailed Deployment Guide
Deploying Megatron-Bridge models with Ray Serve follows the same process as deploying NeMo 2.0 models, with two differences:
- Use the `--megatron_checkpoint` argument to specify your Megatron-Bridge checkpoint file.
- Set `--model_format megatron` to indicate the model type.
All other deployment steps, parameters, and Ray Serve features remain the same as for NeMo 2.0 LLMs. For a comprehensive walkthrough of advanced options, scaling, and troubleshooting, refer to the Deploy NeMo 2.0 LLMs with Ray Serve documentation.
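For example, the replica and parallelism flags from the Quick Example carry over unchanged. The sketch below shows a hypothetical multi-replica configuration for a node with four GPUs; the checkpoint path and sizes are illustrative, and whether `--num_gpus` counts total GPUs or GPUs per replica is an assumption here, so check the script's `--help` output for your container version:

```
# Hypothetical: two replicas with tensor parallelism of 2,
# assuming --num_gpus is the total GPU count for the deployment
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \
    --model_format megatron \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
```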