Use PyTriton Server for Evaluations#

This guide explains how to deploy and evaluate NeMo Framework models, trained with the Megatron-Core backend, using PyTriton to serve the model.

Introduction#

Deployment with the PyTriton serving backend provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility exposed via a FastAPI interface. It supports model parallelism across single-node and multi-node configurations, enabling deployment of large models that cannot fit on a single device.

Key Benefits of PyTriton Deployment#

  • Multi-Node Support: Deploy large models across multiple nodes using pipeline, tensor, context, or expert parallelism.

  • Automatic Request Batching: PyTriton automatically groups incoming requests into batches for efficient inference.

Deploy Models Using PyTriton#

The deployment scripts are available in the /opt/Export-Deploy/scripts/deploy/nlp/ directory. The example command below uses a Hugging Face Llama 3 8B checkpoint that has been converted to the NeMo format. To evaluate a checkpoint saved during pretraining or fine-tuning, provide the path to that checkpoint using the --nemo_checkpoint flag in the command below.

python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
  --nemo_checkpoint "/workspace/llama3_8b_nemo2" \
  --triton_model_name "megatron_model" \
  --server_port 8080 \
  --num_gpus 1 \
  --max_batch_size 4 \
  --inference_max_seq_length 4096
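
Once the server is up, you can send a quick smoke-test request before wiring up an evaluation. The sketch below is illustrative, not part of the deployment script: it assumes the OpenAI-compatible /v1/completions/ route shown later in this guide and the standard OpenAI payload fields (model, prompt, max_tokens, temperature); adjust it if your deployment exposes different routes or fields.

# Minimal smoke test against the OpenAI-compatible completions route.
# Payload fields assume standard OpenAI API compatibility; adjust as needed.
import requests

response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",  # must match --triton_model_name
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0,
    },
    timeout=300,
)
response.raise_for_status()
print(response.json())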

When working with a larger model, you can use model parallelism to distribute it across the available devices. In the example below, we deploy Llama-3_3-Nemotron-Super-49B-v1 (converted to the NeMo format) across 8 devices using tensor parallelism:

python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \
  --nemo_checkpoint "/workspace/Llama-3_3-Nemotron-Super-49B-v1" \
  --triton_model_name "megatron_model" \
  --server_port 8080 \
  --num_gpus 8 \
  --tensor_model_parallel_size 8 \
  --max_batch_size 4 \
  --inference_max_seq_length 4096

Make sure to adjust these parameters to match your available resources and model architecture.
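
The relationship between the parallelism flags and the device count is worth double-checking before launch. The snippet below is a hypothetical pre-launch sanity check, not part of deploy_inframework_triton.py; it assumes the usual Megatron-Core convention that the tensor-, pipeline-, and context-parallel sizes multiply to the total number of GPUs when a single model replica is deployed.

# Hypothetical sanity check (not part of deploy_inframework_triton.py).
# Assumes world size = TP x PP x CP for a single model replica.
num_gpus = 8
tensor_model_parallel_size = 8
pipeline_model_parallel_size = 1
context_parallel_size = 1

model_parallel_size = (
    tensor_model_parallel_size
    * pipeline_model_parallel_size
    * context_parallel_size
)
assert num_gpus == model_parallel_size, (
    f"num_gpus ({num_gpus}) must match the product of the parallel sizes "
    f"({model_parallel_size}) for a single model replica"
)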

Run Evaluations on PyTriton-Deployed Models#

The entry point for evaluation is the evaluate function. To run evaluations on the deployed model, execute the following code in a new terminal opened within the same container. For longer evaluations, it is advisable to run both the deploy and evaluate commands in tmux sessions so that the processes are not terminated unexpectedly, aborting the runs. It is also recommended to call the check_endpoint function to verify that the endpoint is responsive and ready to accept requests before starting the evaluation.

from nemo_evaluator.api import check_endpoint, evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EvaluationConfig,
    EvaluationTarget,
)

# Configure the evaluation target
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model",
)
eval_target = EvaluationTarget(api_endpoint=api_endpoint)
eval_params = ConfigParams(top_p=0, temperature=0, limit_samples=2, parallelism=1)
eval_config = EvaluationConfig(type="mmlu", params=eval_params, output_dir="results")

if __name__ == "__main__":
    check_endpoint(
        endpoint_url=eval_target.api_endpoint.url,
        endpoint_type=eval_target.api_endpoint.type,
        model_name=eval_target.api_endpoint.model_id,
    )
    evaluate(target_cfg=eval_target, eval_cfg=eval_config)

To evaluate the chat endpoint, update the url by replacing /v1/completions/ with /v1/chat/completions/. Additionally, set the type field to "chat" in both ApiEndpoint and EvaluationConfig to indicate a chat benchmark.
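
For example, reusing the objects defined above, a chat-endpoint variant of the configuration could look like the following sketch. It simply mirrors the changes described in the previous paragraph; adapt the benchmark type and output directory to the chat evaluation you actually want to run.

# Sketch of the chat-endpoint variant of the configuration above.
chat_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/chat/completions/",  # chat route instead of /v1/completions/
    type="chat",                                     # endpoint type switched to "chat"
    model_id="megatron_model",
)
chat_target = EvaluationTarget(api_endpoint=chat_endpoint)
chat_config = EvaluationConfig(
    type="chat",  # chat benchmark type, per the note above
    params=eval_params,
    output_dir="results_chat",
)
evaluate(target_cfg=chat_target, eval_cfg=chat_config)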

To evaluate log-probability benchmarks (e.g., arc_challenge), run the following code snippet after deployment, again from a new terminal opened within the same container.

from nemo_evaluator.api import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

model_name = "megatron_model"
completions_url = "http://0.0.0.0:8080/v1/completions/"

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url=completions_url, type=EndpointType.COMPLETIONS, model_id=model_name
    )
)
eval_config = EvaluationConfig(
    type="arc_challenge",
    output_dir="/results/",
    params=ConfigParams(
        limit_samples=10,
        temperature=0,
        top_p=0,
        parallelism=1,
        extra={
            "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },
    ),
)

results = evaluate(target_cfg=target_config, eval_cfg=eval_config)

print(results)

Note that in the example above, you must provide a path to the tokenizer via the extra parameters:

        extra={
            "tokenizer": "/workspace/llama3_8b_nemo2/context/nemo_tokenizer",
            "tokenizer_backend": "huggingface",
        },

Please refer to the deploy_inframework_triton.py script and the evaluate function to review all available argument options; the commands provided here are only examples and do not include all arguments or their default values. For more detailed information on the arguments used in the ApiEndpoint and ConfigParams classes for evaluation, see the api_dataclasses submodule.

Tip

If you encounter a TimeoutError on the evaluation client side, increase the request_timeout parameter in the ConfigParams class to a larger value such as 1000 or 1200 seconds (the default is 300).
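
For example, assuming request_timeout is passed alongside the other ConfigParams fields shown earlier, the change is a one-line addition:

# Raise the per-request timeout for long-running evaluations (default is 300 seconds).
eval_params = ConfigParams(
    top_p=0,
    temperature=0,
    limit_samples=2,
    parallelism=1,
    request_timeout=1200,  # seconds
)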