# Deploy Megatron-Bridge LLMs with Ray Serve
This section demonstrates how to deploy Megatron-Bridge LLM models using Ray Serve. Ray Serve provides scalable, flexible deployment for Megatron-Bridge LLMs, offering automatic scaling, load balancing, and multi-replica deployment with support for advanced parallelism strategies.
Note: Single-node examples are shown below. For multi-node clusters managed by SLURM, you can deploy across nodes using the ray.sub helper described in the section “Multi-node on SLURM using ray.sub”.
## Quick Example
Follow the steps on the Generate A Megatron-Bridge Checkpoint page to generate a Megatron-Bridge Llama checkpoint.
In a terminal, go to the folder where the hf_llama31_8B_mbridge checkpoint is located. Pull and run the Docker container image, replacing `vr` with your desired version:

```shell
docker pull nvcr.io/nvidia/nemo:vr
docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
```
Deploy the Megatron-Bridge LLM with Ray Serve:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 1 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0"
```
In a separate terminal, access the running container as follows:
```shell
docker exec -it nemo-fw bash
```
Test the deployed model:
```shell
python scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
```
## Detailed Deployment Guide

### Deploy a Megatron-Bridge LLM
Follow these steps to deploy your Megatron-Bridge model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1"
```
Available Parameters:
- `--megatron_checkpoint`: Path to the Megatron-Bridge checkpoint directory (required).
- `--num_gpus`: Number of GPUs to use per node. Default is 1.
- `--tensor_model_parallel_size`: Size of the tensor model parallelism. Default is 1.
- `--pipeline_model_parallel_size`: Size of the pipeline model parallelism. Default is 1.
- `--expert_model_parallel_size`: Size of the expert model parallelism. Default is 1.
- `--context_parallel_size`: Size of the context parallelism. Default is 1.
- `--model_id`: Identifier for the model in the API responses. Default is `nemo-model`.
- `--host`: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- `--port`: Port number to use for the Ray Serve server. Default is 1024.
- `--num_cpus`: Number of CPUs to allocate for the Ray cluster. If unset, all available CPUs are used.
- `--num_cpus_per_replica`: Number of CPUs per model replica. Default is 8.
- `--include_dashboard`: Whether to include the Ray dashboard for monitoring.
- `--cuda_visible_devices`: Comma-separated list of CUDA visible devices. Default is "0,1".
- `--enable_cuda_graphs`: Whether to enable CUDA graphs for faster inference.
- `--enable_flash_decode`: Whether to enable Flash Attention decode.
- `--num_replicas`: Number of replicas for the deployment. Default is 1.
To use a different model, modify the `--megatron_checkpoint` parameter with the path to your Megatron-Bridge checkpoint directory.
### Configure Model Parallelism
Megatron-Bridge models support advanced parallelism strategies for large model deployment:
Tensor Model Parallelism: Splits each layer's weights across multiple GPUs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
```
Pipeline Model Parallelism: Distributes model layers sequentially across GPUs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3"
```
Combined Parallelism: Uses both tensor and pipeline parallelism:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 8 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
```
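When combining strategies, the tensor and pipeline sizes must multiply to the GPU count used per replica. As an illustration (a hypothetical helper, not part of the deployment scripts), this sketch enumerates the (TP, PP) pairs that exactly use a given number of GPUs:

```python
# Illustrative helper (not part of Export-Deploy): list (tensor, pipeline)
# parallel sizes whose product exactly equals the available GPU count.
def parallelism_options(num_gpus: int) -> list[tuple[int, int]]:
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]

# For 8 GPUs, valid splits include TP=2 x PP=4, as in the command above.
print(parallelism_options(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Which split performs best depends on the model and interconnect; tensor parallelism is typically preferred within a node, pipeline parallelism across nodes.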
### Deploy Multiple Replicas
Deploy multiple replicas of your Megatron-Bridge model for increased throughput:
Single GPU per replica:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
```
Multiple GPUs per replica with tensor parallelism:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_replicas 2 \
    --num_gpus 8 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
```
Important GPU Configuration Notes:

- GPUs per replica = total GPUs ÷ `--num_replicas`.
- Each replica needs `--tensor_model_parallel_size` × `--pipeline_model_parallel_size` × `--context_parallel_size` GPUs.
- Ensure `--cuda_visible_devices` lists all GPUs that will be used.
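This arithmetic can be sanity-checked before launching. The following is a minimal sketch with hypothetical helper names, not part of the deployment scripts:

```python
# Illustrative pre-flight check (hypothetical helpers, not part of
# Export-Deploy): each replica needs TP x PP x CP GPUs, and all replicas
# together must account for the total GPU count given via --num_gpus.
def gpus_per_replica(tp: int, pp: int, cp: int = 1) -> int:
    return tp * pp * cp

def fits(total_gpus: int, num_replicas: int, tp: int, pp: int, cp: int = 1) -> bool:
    return total_gpus == num_replicas * gpus_per_replica(tp, pp, cp)

# The two multi-replica examples above:
assert fits(total_gpus=4, num_replicas=4, tp=1, pp=1)  # 4 replicas, 1 GPU each
assert fits(total_gpus=8, num_replicas=2, tp=4, pp=1)  # 2 replicas, TP=4 each
```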
### Optimize Performance
Enable performance optimizations for faster inference:
Flash Attention Decode: Optimizes attention computation:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --enable_flash_decode \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
```
Flash Attention Decode and CUDA Graphs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --enable_cuda_graphs \
    --enable_flash_decode \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --cuda_visible_devices "0,1,2,3"
```
### Test Ray Deployment
Use the query_ray_deployment.py script to test your deployed Megatron-Bridge model:
Basic testing:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
```
The script will test multiple endpoints:

- Health check endpoint: `/v1/health`
- Models list endpoint: `/v1/models`
- Text completions endpoint: `/v1/completions/`
Available parameters for testing:
- `--host`: Host address of the Ray Serve server. Default is 0.0.0.0.
- `--port`: Port number of the Ray Serve server. Default is 1024.
- `--model_id`: Identifier for the model in the API responses. Default is `nemo-model`.
### Configure Advanced Deployments
For more advanced deployment scenarios:
Custom Resource Allocation:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --num_cpus 32 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2,3"
```
### API Endpoints
Once deployed, your Megatron-Bridge model will be available through OpenAI-compatible API endpoints:
- Health Check: `GET /v1/health`
- List Models: `GET /v1/models`
- Text Completions: `POST /v1/completions/`
- Chat Completions: `POST /v1/chat/completions/`
Example API request:
```shell
curl -X POST http://localhost:1024/v1/completions/ \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "prompt": "The capital of France is",
        "max_tokens": 50,
        "temperature": 0.7
    }'
```
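The same request can be sent from Python with only the standard library. This sketch assumes the deployment from this guide is listening on localhost:1024; adjust host and port to match your setup:

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "llama",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:1024/v1/completions/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.loads(resp.read()))
except OSError as exc:  # e.g. connection refused when no server is running
    print(f"Request failed: {exc}")
```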
### Troubleshooting
- Out of Memory Errors: Reduce `--num_replicas` or adjust parallelism settings.
- Port Already in Use: Change the `--port` parameter.
- Ray Cluster Issues: Ensure no other Ray processes are running: `ray stop`.
- GPU Allocation: Verify `--cuda_visible_devices` matches your available GPUs.
- Parallelism Configuration Errors: Ensure total parallelism per replica matches available GPUs per replica.
- CUDA Device Mismatch: Make sure the number of devices in `--cuda_visible_devices` equals the total number of GPUs.
- Checkpoint Loading Issues: Verify the Megatron-Bridge checkpoint directory path is correct and contains the expected `iter_*` subdirectory (e.g. `iter_0000000`).
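For the checkpoint issue in particular, a quick pre-flight check can confirm the expected layout. This is an illustrative helper (not part of Export-Deploy); the path shown is the one used in this guide's examples:

```python
from pathlib import Path

# Illustrative check (not part of Export-Deploy): a valid Megatron-Bridge
# checkpoint root should contain at least one iter_* subdirectory.
def find_iter_dirs(checkpoint_root: str) -> list[str]:
    root = Path(checkpoint_root)
    return sorted(p.name for p in root.glob("iter_*") if p.is_dir())

dirs = find_iter_dirs("/opt/checkpoints/hf_llama31_8B_mbridge")
print(dirs or "No iter_* subdirectories found -- check the checkpoint path.")
```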
For more information on Ray Serve, visit the Ray Serve documentation.
## Multi-node on SLURM using ray.sub
Use scripts/deploy/utils/ray.sub to bring up a Ray cluster across multiple SLURM nodes and run your in-framework Megatron-Bridge deployment automatically. This script configures the Ray head and workers, handles ports, and can optionally run a driver command once the cluster is online.
- Script location: `scripts/deploy/utils/ray.sub`
- Upstream reference: See the NeMo RL cluster setup doc for background on this pattern: NVIDIA-NeMo RL cluster guide
### Prerequisites
- A SLURM cluster with container support for `srun --container-image` and `--container-mounts`.
- A container image that includes Export-Deploy at `/opt/Export-Deploy` and the needed dependencies.
- A Megatron-Bridge checkpoint directory (e.g. `.../hf_llama31_8B_mbridge/iter_0000000/`) accessible on the cluster filesystem.
### Quick start (2 nodes, 16 GPUs total)
Set environment variables to parameterize `ray.sub` (these are read by the script at submission time):
```shell
export CONTAINER=nvcr.io/nvidia/nemo:vr
export MOUNTS="${PWD}/:/opt/checkpoints/"
# Optional tuning
export GPUS_PER_NODE=8  # default 8; set to your node GPU count
# Driver command to run after the cluster is ready (multi-node Megatron-Bridge deployment)
export COMMAND="python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --model_id llama --num_replicas 16 --num_gpus 16"
```
Submit the job (you can override SBATCH directives on the command line):
```shell
sbatch --nodes=2 --account <ACCOUNT> --partition <PARTITION> \
    --job-name nemo-ray --time 01:00:00 \
    /opt/Export-Deploy/scripts/deploy/utils/ray.sub
```
The script will:
- Start a Ray head on node 0 and one Ray worker per remaining node.
- Wait until all nodes register their resources.
- Launch the `COMMAND` on the head node (driver) once the cluster is healthy.
Attaching and monitoring:
- Logs: `$SLURM_SUBMIT_DIR/<jobid>-logs/` contains `ray-head.log`, `ray-worker-<n>.log`, and (if set) synced Ray logs.
- Interactive shell: the job creates `<jobid>-attach.sh`. For the head: `bash <jobid>-attach.sh`. For worker i: `bash <jobid>-attach.sh i`.
- Ray status: once attached to the head container, run `ray status`.
Query the deployment (from within the head container):
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama --host 0.0.0.0 --port 1024
```
### Notes
- Set `--num_gpus` in the deploy command to the total GPUs across all nodes; adjust `--num_replicas` and model parallel sizes per your topology.
- If your cluster uses GRES, `ray.sub` auto-detects and sets `--gres=gpu:<GPUS_PER_NODE>`; ensure `GPUS_PER_NODE` matches the node's GPU count.
- You can leave `--cuda_visible_devices` unset for multi-node runs; per-node visibility is managed by Ray workers.
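The first note is simple arithmetic. As an illustration (hypothetical helper, not part of the deployment scripts), the quick start's 2 nodes with 8 GPUs each give `--num_gpus 16`:

```python
# Illustrative arithmetic for the multi-node deploy command: --num_gpus should
# be nodes x GPUS_PER_NODE (the quick start uses 2 nodes x 8 GPUs = 16).
def total_gpus(num_nodes: int, gpus_per_node: int = 8) -> int:
    return num_nodes * gpus_per_node

assert total_gpus(2) == 16  # matches the 2-node, 16-GPU quick start
```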