Serve and Deploy Models#
Deploy and serve models with NeMo Evaluator’s flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements.
Overview#
NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs.
Key Concepts#
Model-Evaluation Separation: Models are served via OpenAI-compatible APIs while evaluations run in containers (see the example request below)
Deployment Responsibility: Choose who manages the model serving infrastructure
Multi-Backend Support: Deploy locally, on HPC clusters, or in the cloud
Universal Adapters: Request/response processing works across all deployment types
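In practice, "OpenAI-compatible" means the evaluation containers send standard completions (or chat completions) requests to whatever serves the model. The sketch below shows the request shape a served model must accept; the URL, port, and model name are placeholders, not values defined by NeMo Evaluator.
# Placeholder endpoint and model name -- substitute your deployment's values
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "The capital of France is", "max_tokens": 16, "temperature": 0}'
Any server that answers this request format, whether launched by the launcher or managed by you, can be evaluated the same way.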
Deployment Strategy Guide#
Launcher-Orchestrated Deployment (Recommended)#
Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestration:
# Launcher deploys model AND runs evaluation
nv-eval run \
--config-dir examples \
--config-name slurm_llama_3_1_8b_instruct \
-o deployment.checkpoint_path=/shared/models/llama-3.1-8b
When to use:
You want automated deployment lifecycle management
You need multi-backend execution (local, Slurm, Lepton)
You prefer integrated monitoring and cleanup
You want the simplest path from model to results
Supported deployment types: vLLM, NIM, SGLang, or no deployment (existing endpoints)
See also
For a detailed YAML configuration reference for each deployment type, see the Configuration guide in the NeMo Evaluator Launcher documentation.
Bring-Your-Own-Endpoint#
You handle model deployment, NeMo Evaluator handles evaluation:
Launcher users with existing endpoints:
# Point launcher to your deployed model
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://localhost:8080/v1/completions
Core library users:
from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig
api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="mmlu_pro", output_dir="./results")
evaluate(target_cfg=target, eval_cfg=config)
When to use:
You have existing model serving infrastructure
You need custom deployment configurations
You want to deploy once and run many evaluations
You have specific security or compliance requirements
Self-hosted: deploy with vLLM, Ray Serve, or another serving framework.
Hosted: use NVIDIA Build, OpenAI, or any other service that exposes an OpenAI-compatible API.
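For example, a self-hosted vLLM server that exposes an OpenAI-compatible API can be started roughly as follows. This is a sketch: the model, parallelism, and port are illustrative, and the exact flags may vary by vLLM version, so check the vLLM documentation for your installation.
# Illustrative vLLM launch; adjust model, parallelism, and port for your setup
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name my-model \
  --tensor-parallel-size 1 \
  --port 8080
Once the server is up, point the launcher or core library at http://localhost:8080/v1/completions as shown above.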
Available Deployment Types#
The launcher supports multiple deployment types through Hydra configuration:
vLLM Deployment
deployment:
  type: vllm
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1
NIM Deployment
deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
SGLang Deployment
deployment:
  type: sglang
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1
No Deployment
deployment:
  type: none  # Use existing endpoint
Execution Backend Integration#
Local Backend
# Evaluates against existing endpoints only (no deployment)
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: my-model
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k
Slurm Backend
# Deploys model on Slurm and runs evaluation
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
execution:
  account: my-account
  output_dir: /shared/results
  partition: gpu
  walltime: "02:00:00"
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gpqa_diamond
Lepton Backend
# Deploys model on Lepton and runs evaluation
defaults:
  - execution: lepton/default
  - deployment: vllm
  - _self_
deployment:
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  lepton_config:
    resource_shape: gpu.1xh200
execution:
  output_dir: ./results
evaluation:
  tasks:
    - name: mmlu_pro
    - name: ifeval
Bring-Your-Own-Endpoint Options#
Choose from these approaches when managing your own deployment:
Manual Deployment#
vLLM: High-performance serving with PagedAttention optimization
Custom serving: Any OpenAI-compatible endpoint (you can verify compatibility as shown below)
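Before running an evaluation against a self-managed endpoint, it can be worth confirming that the endpoint is reachable and reports the expected model name. A quick check against the standard OpenAI-compatible model listing route might look like this (the URL is a placeholder):
# List the models the endpoint serves (placeholder URL)
curl -s http://localhost:8080/v1/models
The id values returned here are what you should pass as target.api_endpoint.model_id.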
Hosted Services#
NVIDIA Build: Ready-to-use hosted models with OpenAI-compatible APIs (see the example request below)
OpenAI API: Direct integration with OpenAI’s models
Other providers: Any service providing OpenAI-compatible endpoints
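As a sketch, a hosted model on NVIDIA Build can be exercised with the same OpenAI-compatible request format before you point an evaluation at it. The URL, model name, and NVIDIA_API_KEY environment variable below are illustrative assumptions; check your provider's documentation for the exact endpoint and authentication scheme.
# Illustrative request to a hosted OpenAI-compatible API (NVIDIA Build shown);
# NVIDIA_API_KEY is an assumed environment variable holding your key
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'
The same URL and model name can then be supplied via target.api_endpoint.url and target.api_endpoint.model_id.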
Enterprise Integration#
Kubernetes deployments: Container orchestration in production environments
Existing MLOps pipelines: Integration with current model serving infrastructure
Custom infrastructure: Specialized deployment requirements
Usage Examples#
With Launcher#
# Point to any existing endpoint
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://your-endpoint:8080/v1/completions \
-o target.api_endpoint.model_id=your-model-name
With Core Library#
from nemo_evaluator import (
    evaluate,
    ApiEndpoint,
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams
)
# Configure any endpoint
api_endpoint = ApiEndpoint(
    url="http://your-endpoint:8080/v1/completions",
    model_id="your-model-name"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="results",
    params=ConfigParams(limit_samples=100)
)
evaluate(target_cfg=target, eval_cfg=config)
Evaluation Adapters#
Adapters provide advanced request/response processing for all deployment types:
Learn how to enable adapters and configure interceptor chains for any deployment.
Strip intermediate reasoning tokens before scoring across all model types.
Enforce standard system prompts for consistent evaluation across endpoints.
Configure logging, caching, reasoning, and custom request processing.