Use Ray Serve for Multi-Instance Evaluations#
This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using Ray Serve to enable multi-instance evaluation across available GPUs.
Introduction#
Deployment with Ray Serve provides support for multiple replicas of your model across available GPUs, enabling higher throughput and better resource utilization during evaluation. This approach is particularly beneficial when you need to process large evaluation datasets efficiently.
Note
Multi-instance evaluation with Ray is currently supported only on a single node with model parallelism. Support for multi-node will be added in upcoming releases.
Key Benefits of Ray Deployment#
Multiple Model Replicas: Deploy multiple instances of your model to handle concurrent requests.
Automatic Load Balancing: Ray automatically distributes requests across available replicas.
Scalable Architecture: Easily scale up or down based on your hardware resources.
Resource Optimization: Better utilization of available GPUs.
Deploy Models Using Ray Serve#
To deploy your model using Ray, use the deploy_ray_inframework.py script from NeMo Export-Deploy:
# --port: Ray server port
# --num_gpus: total GPUs available
# --num_replicas: number of model replicas
# --tensor_model_parallel_size: tensor parallelism per replica
# --pipeline_model_parallel_size: pipeline parallelism per replica
# --context_parallel_size: context parallelism per replica
python \
    /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint "meta-llama/Llama-3.1-8B" \
    --model_id "megatron_model" \
    --port 8080 \
    --num_gpus 4 \
    --num_replicas 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --context_parallel_size 1
Note
Adjust num_replicas based on the number of instances/replicas needed. Ensure that the total num_gpus equals num_replicas times the model parallelism per replica (i.e., tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size). In the command above, 2 replicas × (2 × 1 × 1) = 4 GPUs.
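Once the server reports ready, you can sanity-check the deployment with a quick request. The snippet below is a minimal sketch that assumes the server exposes the OpenAI-compatible /v1/completions/ route used by the evaluation examples later in this guide; the prompt and max_tokens values are illustrative.
import requests

# Minimal sketch: send one completion request to the Ray-served endpoint
# (assumes an OpenAI-compatible completions schema).
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={"model": "megatron_model", "prompt": "The capital of France is", "max_tokens": 8},
)
print(response.status_code, response.json())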
Run Evaluations on Ray-Deployed Models#
Once your model is deployed with Ray, you can run evaluations using the same evaluation API as with PyTriton deployment. It is recommended to use the check_endpoint function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.
To evaluate on generation benchmarks, use the code snippet below:
from nemo_evaluator.api import check_endpoint, evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure the evaluation itself
eval_params = ConfigParams(top_p=0, temperature=0, limit_samples=2, parallelism=1)
eval_config = EvaluationConfig(type="mmlu", params=eval_params, output_dir="results")

if __name__ == "__main__":
    # Verify the endpoint is responsive before starting the evaluation
    check_endpoint(
        endpoint_url=eval_target.api_endpoint.url,
        endpoint_type=eval_target.api_endpoint.type,
        model_name=eval_target.api_endpoint.model_id,
    )
    evaluate(target_cfg=eval_target, eval_cfg=eval_config)
To evaluate the chat endpoint, update the URL by replacing /v1/completions/ with /v1/chat/completions/. Additionally, set the type field to "chat" in both ApiEndpoint and EvaluationConfig to indicate a chat benchmark, as shown in the sketch below.
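For example, a minimal sketch of those changes, reusing eval_params from the snippet above with the same model and port, might look like this:
# Chat variant of the configuration above (illustrative sketch)
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/chat/completions/",  # chat route instead of /v1/completions/
    type="chat",
    model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_config = EvaluationConfig(type="chat", params=eval_params, output_dir="results")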
To evaluate log-probability benchmarks (e.g., arc_challenge), run the following code snippet after deployment.
Make sure to open a new terminal within the same container to execute it.
from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

# Point the evaluation at the deployed completions endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url=completions_url, type=EndpointType.COMPLETIONS, model_id=model_name
    )
)

eval_config = EvaluationConfig(
    type="arc_challenge",
    output_dir="/results/",
    params=ConfigParams(
        limit_samples=10,
        temperature=0,
        top_p=0,
        parallelism=1,
        # Log-probability benchmarks need the model's tokenizer
        extra={
            "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    ),
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)
print(results)
Note that in the example above, you must provide a path to the tokenizer, since log-probability benchmarks require the model's tokenizer to compute token-level log-probabilities:
extra={
    "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
    "tokenizer_backend": "huggingface",
},
Tip
To get a performance boost from multiple replicas in Ray, increase the parallelism value in the ConfigParams of your EvaluationConfig. With parallelism=1, requests are sent one at a time, so only one replica is kept busy and you won’t see any speed improvement. Try setting it to a higher value, such as 4 or 8, as in the sketch below.
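As a minimal sketch, the generation-benchmark configuration from earlier could be updated like this so that concurrent requests are spread across both replicas:
# Issue up to 8 requests concurrently so Ray can load-balance across replicas
eval_params = ConfigParams(top_p=0, temperature=0, limit_samples=2, parallelism=8)
eval_config = EvaluationConfig(type="mmlu", params=eval_params, output_dir="results")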