Evaluate TensorRT-LLM checkpoints with NeMo Framework#

This guide provides step-by-step instructions for evaluating TensorRT-LLM (TRTLLM) checkpoints or models inside NeMo Framework.

This guide focuses on benchmarks within the lm-evaluation-harness that depend on text generation. For a detailed comparison between generation-based and log-probability-based benchmarks, refer to Evaluation Techniques.

Note

Support for evaluating TRTLLM models on log-probability-based benchmarks is planned for a future release.

Deploy TRTLLM Checkpoints#

This section outlines the steps to deploy TRTLLM checkpoints using Python commands.

TRTLLM checkpoint deployment uses Ray Serve as the serving backend and exposes an OpenAI API (OAI)-compatible endpoint, similar to deployments of checkpoints trained with the Megatron Core backend. An example deployment command is shown below.

python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_trtllm.py \
  --trt_llm_path '/workspace/checkpoints/llama3_8b_trtllm/' \
  --model_id 'megatron_model' \
  --port 8080 \
  --tensor_parallelism_size 1 \
  --num_gpus 1 \
  --num_replicas 1
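
Once the server is up, you can sanity-check the OAI-compatible endpoint with a single completion request before running a full evaluation. The sketch below uses only the Python standard library; the URL, model ID, and response fields are assumptions based on the deploy command above, not a guaranteed API, so adjust them to match your deployment.

```python
import json
import urllib.request

# Assumed endpoint URL and model ID, matching the deploy command above.
COMPLETIONS_URL = "http://0.0.0.0:8080/v1/completions/"


def build_completion_request(prompt: str, model_id: str = "megatron_model") -> dict:
    """Build an OpenAI-style /v1/completions request payload."""
    return {
        "model": model_id,
        "prompt": prompt,
        "max_tokens": 32,
        "temperature": 0,
    }


def query_endpoint(url: str, payload: dict) -> dict:
    """POST the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example (requires the deployed server to be running):
# result = query_endpoint(COMPLETIONS_URL, build_completion_request("Hello,"))
# print(result)
```

If the request returns a valid JSON response, the endpoint is ready for the evaluation workflow described in the next section.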

Evaluate TRTLLM Checkpoints#

This section outlines the steps to evaluate TRTLLM checkpoints using Python commands. This method is quick and easy, making it ideal for interactive evaluations.

Once deployment is successful, you can run evaluations using the same evaluation API described in other sections.

Before starting the evaluation, we recommend using the check_endpoint function to verify that the endpoint is responsive and ready to accept requests.

from nemo_evaluator.api import check_endpoint, evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(top_p=0, temperature=0, limit_samples=2, parallelism=1)
eval_config = EvaluationConfig(type="mmlu", params=eval_params, output_dir="results")

if __name__ == "__main__":
    check_endpoint(
        endpoint_url=eval_target.api_endpoint.url,
        endpoint_type=eval_target.api_endpoint.type,
        model_name=eval_target.api_endpoint.model_id,
    )
    evaluate(target_cfg=eval_target, eval_cfg=eval_config)