Overview#

The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish.

The Eval library ("NeMo Eval") is a comprehensive evaluation module within the NeMo Framework for LLMs. It offers streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses.


🚀 Features#

  • Multi-Backend Deployment: Supports the PyTriton deployment backend, as well as multi-instance evaluations using the Ray Serve backend

  • Comprehensive Evaluation: Includes state-of-the-art evaluation harnesses for academic benchmarks, reasoning benchmarks, code generation, and safety testing

  • Adapter System: Features a flexible architecture with chained interceptors for customizable request and response processing

  • Production-Ready: Supports high-performance inference with CUDA graphs and flash decoding

  • Multi-GPU and Multi-Node Support: Enables distributed inference across multiple GPUs and compute nodes

  • OpenAI-Compatible API: Provides RESTful endpoints aligned with OpenAI API specifications (see the example below)
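
As a quick illustration of the OpenAI-compatible API, the sketch below queries a deployed model's completions endpoint over plain HTTP. This is a minimal sketch, not part of the library itself: it assumes a server is already running on port 8080 and serving the model as megatron_model (as in the Quick Start below), and that the request body follows the standard OpenAI completions fields.

import requests

# Query the OpenAI-compatible completions endpoint of a deployed model.
# Assumes a server on port 8080 serving "megatron_model" (see the Quick Start).
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
)
print(response.json())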

🔧 Install NeMo Eval#

Prerequisites#

  • Python 3.10 or higher

  • CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)

  • NeMo Framework container (recommended)

Use pip#

For quick exploration of NeMo Eval, we recommend installing our pip package:

pip install torch==2.7.0 setuptools pybind11 wheel_stub  # Required for Transformer Engine (TE)
pip install --no-build-isolation nemo-eval
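
To sanity-check the installation, you can import the package and print its version. This is a minimal sketch; it assumes the version string defined in package_info.py (see the project structure below) is re-exported at the package root:

import nemo_eval

# Print the installed version (assumes __version__ is exposed at the package root)
print(nemo_eval.__version__)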

Use Docker#

For optimal performance and user experience, use the latest version of the NeMo Framework container. Please fetch the most recent $TAG and run the following command to start a container:

docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nvcr.io/nvidia/nemo:${TAG}

Use uv#

To install NeMo Eval with uv, please refer to our Contribution guide.

🚀 Quick Start#

1. Deploy a Model#

from nemo_eval.api import deploy

# Deploy a NeMo checkpoint
deploy(
    nemo_checkpoint="/path/to/your/checkpoint",
    serving_backend="pytriton",  # or "ray"
    server_port=8080,
    num_gpus=1,
    max_input_len=4096,
    max_batch_size=8
)

2. Evaluate the Model#

from nvidia_eval_commons.core.evaluate import evaluate
from nvidia_eval_commons.api.api_dataclasses import ApiEndpoint, EvaluationConfig, EvaluationTarget

# Configure evaluation
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k", output_dir="results")

# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
print(results)

📊 Support Matrix#

| Checkpoint Type | Inference Backend | Deployment Server | Evaluation Harnesses Supported |
|---|---|---|---|
| NeMo FW checkpoint via Megatron Core backend | Megatron Core in-framework inference engine | PyTriton (single- and multi-node model parallelism), Ray (single-node model parallelism with multi-instance evals) | lm-evaluation-harness, simple-evals, BigCode, BFCL, safety-harness, garak |
| Automodel checkpoint (HF checkpoint) | vLLM | Ray | lm-evaluation-harness, simple-evals, BigCode, BFCL, safety-harness, garak |

πŸ—οΈ Architecture#

Core Components#

1. Deployment Layer#

  • PyTriton Backend: Provides high-performance inference through the NVIDIA Triton Inference Server, with OpenAI API compatibility via a FastAPI interface. Supports model parallelism across single-node and multi-node configurations. Note: Multi-instance evaluation is not supported.

  • Ray Backend: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon.

2. Evaluation Layer#

  • NVIDIA Eval Factory: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory, bundled in the NeMo Framework container. The lm-evaluation-harness is pre-installed by default, and additional tools listed in the support matrix can be added as needed. For more information, see the documentation.

  • Adapter System: A flexible request/response processing pipeline built from chained interceptors:

    • Available Interceptors: Modular components for request/response processing

      • SystemMessageInterceptor: Customize system prompts

      • RequestLoggingInterceptor: Log incoming requests

      • ResponseLoggingInterceptor: Log outgoing responses

      • ResponseReasoningInterceptor: Process reasoning outputs

      • EndpointInterceptor: Route requests to the actual model

📖 Usage Examples#

Basic Deployment with PyTriton as the Serving Backend#

from nemo_eval.api import deploy

# Deploy model
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    server_port=8080,
    num_gpus=1,
    max_input_len=8192,
    max_batch_size=4
)

Basic Evaluation#

from nvidia_eval_commons.core.evaluate import evaluate
from nvidia_eval_commons.api.api_dataclasses import ApiEndpoint, ConfigParams, EvaluationConfig, EvaluationTarget
# Configure Endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    type="completions",
    model_id="megatron_model"
)
# Evaluation target configuration
target = EvaluationTarget(api_endpoint=api_endpoint)
# Configure EvaluationConfig with type, number of samples to evaluate on, etc.
config = EvaluationConfig(
    type="gsm8k",
    output_dir="results",
    params=ConfigParams(limit_samples=10),
)

# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)

Use Adapters#

The example below demonstrates how to configure an Adapter to provide a custom system prompt. Requests and responses are processed through interceptors, which are automatically selected based on the parameters defined in AdapterConfig.

from nvidia_eval_commons.core.evaluate import evaluate
from nvidia_eval_commons.api.api_dataclasses import EvaluationTarget
from nemo_eval.utils.api import AdapterConfig

# Configure adapter for reasoning
adapter_config = AdapterConfig(
    interceptors=[
        dict(name="reasoning", config={"end_reasoning_token": "</think>"}),
        dict(name="system_message", config={"system_message": "Detailed thinking on"}),
        dict(name="request_logging", config={"max_requests": 5}),
        dict(name="response_logging", config={"max_responses": 5}),
    ]
)

target = EvaluationTarget(
    api_endpoint={
        "url": "http://0.0.0.0:8080/v1/chat/completions/",
        "model_id": "megatron_model",
        "type": "chat",
        "adapter_config": adapter_config
    }
)

# Run evaluation with adapter (config is the EvaluationConfig from the Basic Evaluation example above)
results = evaluate(
    target_cfg=target,
    eval_cfg=config,
)

Deploy with Multiple GPUs#

# Deploy with tensor parallelism or pipeline parallelism
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    num_gpus=4,
    tensor_parallelism_size=4,
    pipeline_parallelism_size=1,
    max_input_len=8192,
    max_batch_size=8
)

Deploy with Ray#

# Deploy using Ray Serve
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="ray",
    num_gpus=2,
    num_replicas=2,
    num_cpus_per_replica=8,
    server_port=8080,
    include_dashboard=True,
    cuda_visible_devices="0,1"
)

πŸ“ Project Structure#

Eval/
├── src/nemo_eval/            # Main package
│   ├── api.py                # Main API functions
│   ├── package_info.py       # Package metadata
│   ├── adapters/             # Adapter system
│   │   ├── server.py         # Adapter server
│   │   ├── utils.py          # Adapter utilities
│   │   └── interceptors/     # Request/response interceptors
│   └── utils/                # Utility modules
│       ├── api.py            # API configuration classes
│       ├── base.py           # Base utilities
│       └── ray_deploy.py     # Ray deployment utilities
├── tests/                    # Test suite
│   ├── unit_tests/           # Unit tests
│   └── functional_tests/     # Functional tests
├── tutorials/                # Tutorial notebooks
├── scripts/                  # Reference nemo-run scripts
├── docs/                     # Documentation
├── docker/                   # Docker configuration
└── external/                 # External dependencies

🤝 Contributing#

We welcome contributions! Please see our Contributing Guide for details on development setup, testing, and code style guidelines.

📄 License#

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

📞 Support#