Overview#

NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training framework for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of training (both pretraining and post-training) workloads from a single GPU to thousand-node clusters for both πŸ€—Hugging Face/PyTorch and Megatron models, and it includes a suite of libraries and recipe collections to help users train models from end to end. The Eval library (β€œNeMo Eval”) is a comprehensive evaluation module within NeMo Framework that provides seamless deployment and evaluation of models trained with NeMo Framework via state-of-the-art evaluation harnesses.


πŸš€ Features#

  • Multi-Backend Deployment: Support for both PyTriton and Ray Serve deployment backends.

  • Comprehensive Evaluation: State-of-the-art evaluation harnesses including reasoning benchmarks, code generation, safety testing.

  • Adapter System: Flexible adapter architecture using a chain of interceptors for customizing request/response processing.

  • Production Ready: Optimized for high-performance inference with CUDA graphs and flash decoding.

  • Multi-GPU and Multi-Node Support: Distributed inference across multiple devices and nodes.

  • OpenAI-Compatible API: RESTful endpoints compatible with OpenAI API standards.

πŸ”§ Install NeMo Eval#

Prerequisites#

  • Python 3.10 or higher

  • CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)

  • NeMo Framework container (recommended)

Use pip#

For quick exploration of NeMo Eval, we recommend installing our pip package:

pip install nemo-eval

Use Docker#

For optimal performance and user experience, use the latest version of the NeMo Framework container. Please fetch the most recent $TAG and run the following command to start a container:

docker run --rm -it -w /workdir -v $(pwd):/workdir \
  --entrypoint bash \
  --gpus all \
  nvcr.io/nvidia/nemo:${TAG}

Use uv#

To install NeMo Eval with uv, please refer to our Contribution guide.

πŸš€ Quick Start#

1. Deploy a Model#

from nemo_eval.api import deploy

# Deploy a NeMo checkpoint
deploy(
    nemo_checkpoint="/path/to/your/checkpoint",
    serving_backend="pytriton",  # or "ray"
    server_port=8080,
    num_gpus=1,
    max_input_len=4096,
    max_batch_size=8
)
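
Once the server is up, you can sanity-check the deployment by sending a request to its OpenAI-compatible completions endpoint before running an evaluation. This is a minimal sketch using the requests library; the URL and model name match the defaults used in the next step, and the payload fields follow the standard OpenAI completions schema, so adjust them to your deployment.

import requests

# Send a single completion request to the OpenAI-compatible endpoint.
# The model name ("megatron_model") is the default used throughout this guide.
response = requests.post(
    "http://0.0.0.0:8080/v1/completions/",
    json={
        "model": "megatron_model",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
)
print(response.json())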

2. Evaluate the Model#

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationTarget, EvaluationConfig, ApiEndpoint

# Configure evaluation
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    model_id="megatron_model"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k")

# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
print(results)

πŸ“Š Support Matrix#

| Checkpoint Type | Inference Backend | Deployment Server | Evaluation Harnesses Supported |
|---|---|---|---|
| NeMo FW checkpoint via Megatron Core backend | Megatron Core in-framework inference engine | PyTriton (single- and multi-node model parallelism), Ray (single-node model parallelism with multi-instance evals) | lm-evaluation-harness, simple-evals, BigCode, BFCL, safety-harness, garak |

πŸ—οΈ Architecture#

Core Components#

1. Deployment Layer#

  • PyTriton Backend: Delivers high-performance inference via NVIDIA Triton Inference Server, with OpenAI API compatibility through a FastAPI interface. Supports model parallelism across both single- and multi-node setups. Note: Multi-instance evaluation is not supported.

  • Ray Backend: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon.

2. Evaluation Layer#

  • NVIDIA Eval Factory: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory; bundled in the NeMo Framework container. The lm-evaluation-harness is pre-installed by default, while additional tools listed in the support matrix can be added as needed. For more information, see the docs.

  • Adapter System: Flexible request/response processing pipeline with Interceptors that provide modular processing

    • Available Interceptors: Modular components for request/response processing

      • SystemMessageInterceptor: Customize system prompts

      • RequestLoggingInterceptor: Log incoming requests

      • ResponseLoggingInterceptor: Log outgoing responses

      • ResponseReasoningInterceptor: Process reasoning outputs

      • EndpointInterceptor: Route requests to the actual model

πŸ“– Usage Examples#

Basic Deployment with PyTriton as the Serving Backend#

from nemo_eval.api import deploy

# Deploy model
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    server_port=8080,
    num_gpus=1,
    max_input_len=8192,
    max_batch_size=4
)

Basic Evaluation#

from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationTarget, EvaluationConfig, ApiEndpoint, ConfigParams

# Configure the endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
)

# Evaluation target configuration
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure EvaluationConfig with the task type, number of samples to evaluate, etc.
config = EvaluationConfig(
    type="gsm8k",
    params=ConfigParams(limit_samples=10),
)

# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)

Use Adapters#

The example below demonstrates how to configure an Adapter to provide a custom system prompt. Requests and responses are processed through interceptors, which are automatically selected based on the parameters defined in AdapterConfig.

from nemo_eval.utils.api import AdapterConfig

# Configure adapter for reasoning
adapter_config = AdapterConfig(
    api_url="http://0.0.0.0:8080/v1/completions/",
    use_reasoning=True,
    end_reasoning_token="</think>",
    custom_system_prompt="You are a helpful assistant that thinks step by step.",
    max_logged_requests=5,
    max_logged_responses=5
)

# Run evaluation with adapter
results = evaluate(
    target_cfg=target,
    eval_cfg=config,
    adapter_cfg=adapter_config
)
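
Broadly, each AdapterConfig field switches on one of the interceptors listed in the architecture section: custom_system_prompt enables the system-message interceptor, use_reasoning and end_reasoning_token control reasoning post-processing, and max_logged_requests/max_logged_responses bound the request and response logging interceptors.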

Deploy with Multiple GPUs#

# Deploy with tensor and/or pipeline parallelism: here the 4 GPUs are split as
# tensor_parallelism_size=4 x pipeline_parallelism_size=1
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    num_gpus=4,
    tensor_parallelism_size=4,
    pipeline_parallelism_size=1,
    max_input_len=8192,
    max_batch_size=8
)

Deploy with Ray#

# Deploy using Ray Serve
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="ray",
    num_gpus=2,
    num_replicas=2,
    num_cpus_per_replica=8,
    server_port=8080,
    include_dashboard=True,
    cuda_visible_devices="0,1"
)
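
With num_replicas=2, Ray Serve runs two replicas of the model behind the same endpoint, which is what enables the multi-instance evaluation noted in the support matrix; cuda_visible_devices restricts the deployment to the listed GPUs.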

πŸ“ Project Structure#

Eval/
β”œβ”€β”€ src/nemo_eval/           # Main package
β”‚   β”œβ”€β”€ api.py               # Main API functions
β”‚   β”œβ”€β”€ package_info.py      # Package metadata
β”‚   β”œβ”€β”€ adapters/            # Adapter system
β”‚   β”‚   β”œβ”€β”€ server.py        # Adapter server
β”‚   β”‚   β”œβ”€β”€ utils.py         # Adapter utilities
β”‚   β”‚   └── interceptors/    # Request/response interceptors
β”‚   └── utils/               # Utility modules
β”‚       β”œβ”€β”€ api.py           # API configuration classes
β”‚       β”œβ”€β”€ base.py          # Base utilities
β”‚       └── ray_deploy.py    # Ray deployment utilities
β”œβ”€β”€ tests/                   # Test suite
β”‚   β”œβ”€β”€ unit_tests/          # Unit tests
β”‚   └── functional_tests/    # Functional tests
β”œβ”€β”€ tutorials/               # Tutorial notebooks
β”œβ”€β”€ scripts/                 # Reference nemo-run scripts
β”œβ”€β”€ docs/                    # Documentation
β”œβ”€β”€ docker/                  # Docker configuration
└── external/                # External dependencies

🀝 Contributing#

We welcome contributions! Please see our Contributing Guide for details on development setup, testing, and code style guidelines.

πŸ“„ License#

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

πŸ“ž Support#