NeMo Eval#
Overview#
NeMo Framework is NVIDIA's GPU-accelerated, end-to-end training framework for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of training (both pretraining and post-training) workloads from a single GPU to thousand-node clusters for both Hugging Face/PyTorch and Megatron models, and it includes a suite of libraries and recipe collections to help users train models end to end. The Eval library ("NeMo Eval") is the comprehensive evaluation module of NeMo Framework for LLMs: it provides seamless deployment and evaluation of models trained with NeMo Framework via state-of-the-art evaluation harnesses.
Features#
Multi-Backend Deployment: Support for both PyTriton and Ray Serve deployment backends.
Comprehensive Evaluation: State-of-the-art evaluation harnesses covering reasoning benchmarks, code generation, and safety testing.
Adapter System: Flexible adapter architecture using a chain of interceptors for customizing request/response processing.
Production Ready: Optimized for high-performance inference with CUDA graphs and flash decoding.
Multi-GPU and Multi-Node Support: Distributed inference across multiple devices and nodes.
OpenAI-Compatible API: RESTful endpoints compatible with OpenAI API standards.
Install NeMo Eval#
Prerequisites#
Python 3.10 or higher
CUDA-compatible GPU(s) (tested on RTX A6000, A100, H100)
NeMo Framework container (recommended)
Use pip#
For quick exploration of NeMo Eval, we recommend installing our pip package:
pip install nemo-eval
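To confirm the installation succeeded, you can import the two entry points used throughout this guide; this is only a minimal sanity check and assumes nothing beyond the deploy and evaluate functions shown below:
import importlib

# Minimal post-install check: the two entry points used in this guide should import cleanly.
from nemo_eval.api import deploy, evaluate

print("nemo-eval import OK:", callable(deploy), callable(evaluate))
print("package found at:", importlib.import_module("nemo_eval").__file__)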
Use Docker#
For optimal performance and user experience, use the latest version of the NeMo Framework container. Please fetch the most recent $TAG and run the following command to start a container:
docker run --rm -it -w /workdir -v $(pwd):/workdir \
--entrypoint bash \
--gpus all \
nvcr.io/nvidia/nemo:${TAG}
Use uv#
To install NeMo Eval with uv, please refer to our Contribution guide.
Quick Start#
1. Deploy a Model#
from nemo_eval.api import deploy
# Deploy a NeMo checkpoint
deploy(
    nemo_checkpoint="/path/to/your/checkpoint",
    serving_backend="pytriton",  # or "ray"
    server_port=8080,
    num_gpus=1,
    max_input_len=4096,
    max_batch_size=8,
)
2. Evaluate the Model#
from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationTarget, EvaluationConfig, ApiEndpoint
# Configure evaluation
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    model_id="megatron_model",
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="gsm8k")
# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
print(results)
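Because the deployment exposes an OpenAI-compatible completions endpoint, you can also sanity-check the server directly before launching a full benchmark. The sketch below assumes the endpoint accepts a standard OpenAI-style completions payload addressed to the model_id used above; the field names follow the OpenAI schema and are not taken from NeMo Eval itself:
import requests

# Send a single OpenAI-style completion request to the deployed model (assumed schema).
payload = {
    "model": "megatron_model",            # model_id configured at deployment
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.0,
}
response = requests.post("http://0.0.0.0:8080/v1/completions/", json=payload, timeout=120)
print(response.json())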
Support Matrix#
Checkpoint Type | Inference Backend | Deployment Server | Evaluation Harnesses Supported
---|---|---|---
NeMo FW checkpoint via Megatron Core backend | Megatron Core in-framework inference engine | PyTriton (single- and multi-node model parallelism), Ray (single-node model parallelism with multi-instance evals) | lm-evaluation-harness, simple-evals, BigCode, BFCL, safety-harness, garak
Architecture#
Core Components#
1. Deployment Layer#
PyTriton Backend: Delivers high-performance inference via NVIDIA Triton Inference Server, with OpenAI API compatibility through a FastAPI interface. Supports model parallelism across both single- and multi-node setups. Note: Multi-instance evaluation is not supported.
Ray Backend: Enables multi-instance evaluation with model parallelism on a single node using Ray Serve, while maintaining OpenAI API compatibility. Multi-node support is coming soon.
2. Evaluation Layer#
NVIDIA Eval Factory: Provides standardized benchmark evaluations using packages from NVIDIA Eval Factory, which are bundled in the NeMo Framework container. The lm-evaluation-harness is pre-installed by default, while the additional tools listed in the support matrix can be added as needed. For more information, see the docs.
Adapter System: Flexible request/response processing pipeline built from interceptors that provide modular processing.
Available Interceptors: Modular components for request/response processing
SystemMessageInterceptor: Customize system prompts
RequestLoggingInterceptor: Log incoming requests
ResponseLoggingInterceptor: Log outgoing responses
ResponseReasoningInterceptor: Process reasoning outputs
EndpointInterceptor: Route requests to the actual model
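Conceptually, the adapter composes these interceptors into a chain: each link can transform the request before forwarding it and post-process the response on the way back, with the endpoint interceptor as the terminal link. The sketch below illustrates the chain-of-interceptors pattern in plain Python; the function names and signatures are simplified for illustration and are not NeMo Eval's actual interceptor API:
from typing import Callable, Dict

Request = Dict[str, object]
Response = Dict[str, object]
Handler = Callable[[Request], Response]


def system_message_interceptor(next_handler: Handler, system_prompt: str) -> Handler:
    """Attach a custom system prompt before forwarding the request."""
    def handle(request: Request) -> Response:
        return next_handler({**request, "system": system_prompt})
    return handle


def request_logging_interceptor(next_handler: Handler) -> Handler:
    """Log the request, then forward it unchanged."""
    def handle(request: Request) -> Response:
        print(f"request: {request}")
        return next_handler(request)
    return handle


def endpoint_interceptor(request: Request) -> Response:
    """Terminal link in the chain: call the actual model endpoint (stubbed here)."""
    return {"text": f"echo of {request.get('prompt', '')}"}


# Compose the chain: system message -> request logging -> endpoint.
handler = system_message_interceptor(
    request_logging_interceptor(endpoint_interceptor),
    system_prompt="You are a helpful assistant.",
)
print(handler({"prompt": "2 + 2 = ?"}))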
Usage Examples#
Basic Deployment with PyTriton as the Serving Backend#
from nemo_eval.api import deploy
# Deploy model
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    server_port=8080,
    num_gpus=1,
    max_input_len=8192,
    max_batch_size=4,
)
Basic Evaluation#
from nemo_eval.api import evaluate
from nemo_eval.utils.api import EvaluationTarget, EvaluationConfig, ApiEndpoint, ConfigParams
# Configure Endpoint
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
)
# Evaluation target configuration
target = EvaluationTarget(api_endpoint=api_endpoint)
# Configure EvaluationConfig with type, number of samples to evaluate on, etc.
config = EvaluationConfig(
    type="gsm8k",
    params=ConfigParams(
        limit_samples=10,
    ),
)
# Run evaluation
results = evaluate(target_cfg=target, eval_cfg=config)
Use Adapters#
The example below demonstrates how to configure an Adapter to provide a custom system prompt. Requests and responses are processed through interceptors, which are automatically selected based on the parameters defined in AdapterConfig.
from nemo_eval.utils.api import AdapterConfig
# Configure adapter for reasoning
adapter_config = AdapterConfig(
    api_url="http://0.0.0.0:8080/v1/completions/",
    use_reasoning=True,
    end_reasoning_token="</think>",
    custom_system_prompt="You are a helpful assistant that thinks step by step.",
    max_logged_requests=5,
    max_logged_responses=5,
)
# Run evaluation with adapter
results = evaluate(
    target_cfg=target,
    eval_cfg=config,
    adapter_cfg=adapter_config,
)
Deploy with Multiple GPUs#
# Deploy with tensor parallelism or pipeline parallelism
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    num_gpus=4,
    tensor_parallelism_size=4,
    pipeline_parallelism_size=1,
    max_input_len=8192,
    max_batch_size=8,
)
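For a mixed split, the same call applies. In Megatron-style parallelism the tensor and pipeline parallel sizes generally multiply to the number of GPUs in use (tensor_parallelism_size x pipeline_parallelism_size = num_gpus); the hypothetical 2 x 2 layout below is a sketch under that assumption, using only the parameters already shown above:
# Hypothetical 4-GPU layout: 2-way tensor parallel x 2-way pipeline parallel (2 x 2 = 4 GPUs)
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="pytriton",
    num_gpus=4,
    tensor_parallelism_size=2,
    pipeline_parallelism_size=2,
    max_input_len=8192,
    max_batch_size=8,
)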
Deploy with Ray#
# Deploy using Ray Serve
deploy(
    nemo_checkpoint="/path/to/checkpoint",
    serving_backend="ray",
    num_gpus=2,
    num_replicas=2,
    num_cpus_per_replica=8,
    server_port=8080,
    include_dashboard=True,
    cuda_visible_devices="0,1",
)
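Evaluating against a Ray deployment follows the same flow as with PyTriton; only the endpoint URL needs to point at the Ray server's port. The sketch below reuses the configuration classes from the earlier examples and assumes the Ray server exposes the same /v1/completions/ path as the PyTriton example:
from nemo_eval.api import evaluate
from nemo_eval.utils.api import ApiEndpoint, ConfigParams, EvaluationConfig, EvaluationTarget

# Point the evaluation target at the Ray Serve endpoint started above (assumed path).
api_endpoint = ApiEndpoint(
    url="http://0.0.0.0:8080/v1/completions/",
    model_id="megatron_model",
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(
    type="gsm8k",
    params=ConfigParams(limit_samples=10),
)
results = evaluate(target_cfg=target, eval_cfg=config)
print(results)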
Project Structure#
Eval/
├── src/nemo_eval/           # Main package
│   ├── api.py               # Main API functions
│   ├── package_info.py      # Package metadata
│   ├── adapters/            # Adapter system
│   │   ├── server.py        # Adapter server
│   │   ├── utils.py         # Adapter utilities
│   │   └── interceptors/    # Request/response interceptors
│   └── utils/               # Utility modules
│       ├── api.py           # API configuration classes
│       ├── base.py          # Base utilities
│       └── ray_deploy.py    # Ray deployment utilities
├── tests/                   # Test suite
│   ├── unit_tests/          # Unit tests
│   └── functional_tests/    # Functional tests
├── tutorials/               # Tutorial notebooks
├── scripts/                 # Reference nemo-run scripts
├── docs/                    # Documentation
├── docker/                  # Docker configuration
└── external/                # External dependencies
Contributing#
We welcome contributions! Please see our Contributing Guide for details on development setup, testing, and code style guidelines.
License#
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Support#
Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: NeMo Documentation