Serve and Deploy Models#
Deploy and serve models with NeMo Evaluator’s flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements.
Overview#
NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs.
Key Concepts#
Model-Evaluation Separation: Models are served via OpenAI-compatible APIs while evaluations run in containers (see the example request below)
Deployment Responsibility: Choose who manages the model serving infrastructure
Multi-Backend Support: Deploy locally, on HPC clusters, or in the cloud
Universal Adapters: Request/response processing works across all deployment types
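In practice, "OpenAI-compatible" means the evaluation containers send standard completions (or chat completions) requests to whatever serves the model. The sketch below shows the request shape a served model must accept; the URL, port, and model name are placeholders, not values defined by NeMo Evaluator.
# Placeholder endpoint and model name -- substitute your deployment's values
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "The capital of France is", "max_tokens": 16, "temperature": 0}'
Any server that answers this request format, whether launched by the launcher or managed by you, can be evaluated the same way.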
Deployment Strategy Guide#
Launcher-Orchestrated Deployment (Recommended)#
Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration:
# Launcher deploys model AND runs evaluation
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct \
-o deployment.checkpoint_path=/shared/models/llama-3.1-8b
When to use:
You want automated deployment lifecycle management
You need multi-backend execution (local, Slurm, Lepton)
You prefer integrated monitoring and cleanup
You want the simplest path from model to results
Supported deployment types: vLLM, NIM, SGLang, or no deployment (existing endpoints)
See also
For a detailed YAML configuration reference for each deployment type, see the Configuration guide in the NeMo Evaluator Launcher documentation.
Bring-Your-Own-Endpoint#
You handle model deployment, NeMo Evaluator handles evaluation:
Launcher users with existing endpoints:
# Point launcher to your deployed model
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://localhost:8080/v1/completions
Core library users:
from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig
api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="mmlu_pro", output_dir="./results")
evaluate(target_cfg=target, eval_cfg=config)
When to use:
You have existing model serving infrastructure
You need custom deployment configurations
You want to deploy once and run many evaluations
You have specific security or compliance requirements
Self-hosted: deploy with vLLM, Ray Serve, or another serving framework.
Hosted: use NVIDIA Build, OpenAI, or any other service that exposes an OpenAI-compatible API.
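For example, a self-hosted vLLM server that exposes an OpenAI-compatible API can be started roughly as follows. This is a sketch: the model, parallelism, and port are illustrative, and the exact flags may vary by vLLM version, so check the vLLM documentation for your installation.
# Illustrative vLLM launch; adjust model, parallelism, and port for your setup
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name my-model \
  --tensor-parallel-size 1 \
  --port 8080
Once the server is up, point the launcher or core library at http://localhost:8080/v1/completions as shown above.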
Available Deployment Types#
The launcher supports multiple deployment types through Hydra configuration:
vLLM Deployment
deployment:
  type: vllm
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1
NIM Deployment
deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
SGLang Deployment
deployment:
  type: sglang
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1
No Deployment
deployment:
  type: none  # Use existing endpoint
Execution Backend Integration#
Local Backend
# Evaluates against existing endpoints only (no deployment)
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: my-model
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k
Slurm Backend
# Deploys model on Slurm and runs evaluation
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
execution:
  account: my-account
  output_dir: /shared/results
  partition: gpu
  walltime: "02:00:00"
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gpqa_diamond
Lepton Backend
# Deploys model on Lepton and runs evaluation
defaults:
  - execution: lepton/default
  - deployment: vllm
  - _self_
deployment:
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  lepton_config:
    resource_shape: gpu.1xh200
execution:
  output_dir: ./results
evaluation:
  tasks:
    - name: mmlu_pro
    - name: ifeval
Bring-Your-Own-Endpoint Options#
Choose from these approaches when managing your own deployment:
Manual Deployment#
vLLM: High-performance serving with PagedAttention optimization
Custom serving: Any OpenAI-compatible endpoint (you can verify compatibility as shown below)
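Before running an evaluation against a self-managed endpoint, it can be worth confirming that the endpoint is reachable and reports the expected model name. A quick check against the standard OpenAI-compatible model listing route might look like this (the URL is a placeholder):
# List the models the endpoint serves (placeholder URL)
curl -s http://localhost:8080/v1/models
The id values returned here are what you should pass as target.api_endpoint.model_id.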
Hosted Services#
NVIDIA Build: Ready-to-use hosted models with OpenAI-compatible APIs (see the example request below)
OpenAI API: Direct integration with OpenAI’s models
Other providers: Any service providing OpenAI-compatible endpoints
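As a sketch, a hosted model on NVIDIA Build can be exercised with the same OpenAI-compatible request format before you point an evaluation at it. The URL, model name, and NVIDIA_API_KEY environment variable below are illustrative assumptions; check your provider's documentation for the exact endpoint and authentication scheme.
# Illustrative request to a hosted OpenAI-compatible API (NVIDIA Build shown);
# NVIDIA_API_KEY is an assumed environment variable holding your key
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
The same URL and model name can then be supplied via target.api_endpoint.url and target.api_endpoint.model_id.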
Enterprise Integration#
Kubernetes deployments: Container orchestration in production environments
Existing MLOps pipelines: Integration with current model serving infrastructure
Custom infrastructure: Specialized deployment requirements
Usage Examples#
With Launcher#
# Point to any existing endpoint
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://your-endpoint:8080/v1/completions \
-o target.api_endpoint.model_id=your-model-name
With Core Library#
from nemo_evaluator import (
    evaluate,
    ApiEndpoint,
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams
)
# Configure any endpoint
api_endpoint = ApiEndpoint(
    url="http://your-endpoint:8080/v1/completions",
    model_id="your-model-name"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="results",
    params=ConfigParams(limit_samples=100)
)
evaluate(target_cfg=target, eval_cfg=config)
Evaluation Adapters#
Adapters provide advanced request/response processing for all deployment types:
Learn how to enable adapters and configure interceptor chains for any deployment.
Strip intermediate reasoning tokens before scoring across all model types.
Enforce standard system prompts for consistent evaluation across endpoints.
Configure logging, caching, reasoning, and custom request processing.