Serve and Deploy Models#

Deploy and serve models with NeMo Evaluator’s flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements.

Overview#

NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs.

Key Concepts#

  • Model-Evaluation Separation: Models are served through OpenAI-compatible APIs, while evaluations run in containers (you can sanity-check any endpoint directly, as shown below)

  • Deployment Responsibility: Choose who manages the model serving infrastructure

  • Multi-Backend Support: Deploy locally, on HPC clusters, or in the cloud

  • Universal Interceptors: Request/response processing works across all deployment types
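
Because the only contract between serving and evaluation is an OpenAI-compatible API, you can sanity-check an endpoint with a plain HTTP request before pointing an evaluation at it. A minimal sketch, assuming a local completions endpoint and a served model named my-model (both placeholders):

# Verify the endpoint speaks the OpenAI completions format before evaluating
curl -s http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "Hello", "max_tokens": 8}'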

Deployment Strategy Guide#

Bring-Your-Own-Endpoint#

You handle model deployment; NeMo Evaluator handles evaluation:

Launcher users with existing endpoints:

# Point launcher to your deployed model
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/completions

Core library users:

from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig

api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(type="mmlu_pro", output_dir="./results")
evaluate(target_cfg=target, eval_cfg=config)

When to use:

  • You have existing model serving infrastructure

  • You need custom deployment configurations

  • You want to deploy once and run many evaluations (see the sketch after this list)

  • You have specific security or compliance requirements
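
For the deploy-once, run-many pattern, the core library lets you reuse a single target across several evaluation configs. A minimal sketch, using the same classes and task names shown elsewhere in this guide:

from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig

# One deployed endpoint, reused for every evaluation run
api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)

# Run several benchmarks against the same deployment
for task in ["mmlu_pro", "gsm8k"]:
    config = EvaluationConfig(type=task, output_dir=f"./results/{task}")
    evaluate(target_cfg=target, eval_cfg=config)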

Manual Deployment

Deploy using vLLM, Ray Serve, or other serving frameworks.

Hosted Services

Use NVIDIA Build, OpenAI, or other hosted model APIs.


Available Deployment Types#

The launcher supports multiple deployment types through Hydra configuration:

vLLM Deployment

deployment:
  type: vllm
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1

NIM Deployment

deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct

SGLang Deployment

deployment:
  type: sglang
  checkpoint_path: /path/to/model  # Or HuggingFace model ID
  served_model_name: my-model
  tensor_parallel_size: 8
  data_parallel_size: 1

No Deployment

deployment:
  type: none  # Use existing endpoint
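
Deployment fields can also be overridden at launch time with the same -o syntax used for endpoint overrides in this guide. A hedged sketch, assuming a config of your own (the directory ./my-configs and name my_vllm_config are hypothetical) that composes the vllm deployment, and that deployment.* keys accept -o overrides like any other Hydra value:

# Override deployment settings at launch time instead of editing the YAML
nemo-evaluator-launcher run \
    --config-dir ./my-configs \
    --config-name my_vllm_config \
    -o deployment.checkpoint_path=/path/to/model \
    -o deployment.tensor_parallel_size=4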

Execution Backend Integration#

Local Backend

# Evaluates against existing endpoints only (no deployment)
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results
  
target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: my-model
  
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k

Slurm Backend

# Deploys model on Slurm and runs evaluation
defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  
execution:
  account: my-account
  output_dir: /shared/results
  partition: gpu
  walltime: "02:00:00"
  
evaluation:
  tasks:
    - name: mmlu_pro
    - name: gpqa_diamond

Lepton Backend

# Deploys model on Lepton and runs evaluation
defaults:
  - execution: lepton/default
  - deployment: vllm
  - _self_

deployment:
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  lepton_config:
    resource_shape: gpu.1xh200
  
execution:
  output_dir: ./results
  
evaluation:
  tasks:
    - name: mmlu_pro
    - name: ifeval
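
To run any of these backend configs, save the YAML in a config directory of your own and point the launcher at it. A minimal sketch, assuming the hypothetical directory ./my-configs and file name slurm_llama_3_1_8b.yaml:

# Launch the Slurm-backed evaluation defined above
nemo-evaluator-launcher run \
    --config-dir ./my-configs \
    --config-name slurm_llama_3_1_8b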

Bring-Your-Own-Endpoint Options#

Choose from these approaches when managing your own deployment:

Manual Deployment#

  • vLLM: High-performance serving with PagedAttention optimization

  • Custom serving: Any OpenAI-compatible endpoint

Hosted Services#

  • NVIDIA Build: Ready-to-use hosted models with OpenAI-compatible APIs (see the example below)

  • OpenAI API: Direct integration with OpenAI’s models

  • Other providers: Any service providing OpenAI-compatible endpoints
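
For example, evaluating against a hosted NVIDIA Build model only requires overriding the endpoint URL and model ID. A hedged sketch, reusing the example config from this guide and assuming the hosted chat completions URL below; how the API key is supplied (typically an environment variable) depends on your configuration:

# Evaluate against a hosted NVIDIA Build model
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct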

Enterprise Integration#

  • Kubernetes deployments: Container orchestration in production environments

  • Existing MLOps pipelines: Integration with current model serving infrastructure

  • Custom infrastructure: Specialized deployment requirements

Usage Examples#

With Launcher#

# Point to any existing endpoint
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://your-endpoint:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name

With Core Library#

from nemo_evaluator import (
    evaluate,
    ApiEndpoint,
    EvaluationConfig,
    EvaluationTarget,
    ConfigParams
)

# Configure any endpoint
api_endpoint = ApiEndpoint(
    url="http://your-endpoint:8080/v1/completions",
    model_id="your-model-name"
)
target = EvaluationTarget(api_endpoint=api_endpoint)
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="results",
    params=ConfigParams(limit_samples=100)
)

evaluate(target_cfg=target, eval_cfg=config)

Evaluation Adapters#

Advanced request/response processing for all deployment types:

Usage & Configuration

Learn how to enable adapters and configure interceptor chains for any deployment.

Reasoning Cleanup

Strip intermediate reasoning tokens before scoring across all model types.

Custom System Prompt (Chat)

Enforce standard system prompts for consistent evaluation across endpoints.

Advanced Interceptors

Configure logging, caching, reasoning, and custom request processing.

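As a rough illustration of what these pages cover, adapter behavior is configured on the evaluation target. The YAML below is a hedged sketch only: it assumes the settings live under target.api_endpoint.adapter_config and that fields such as use_reasoning, end_reasoning_token, and custom_system_prompt exist under these names; consult the Usage & Configuration page above for the exact schema your version supports.

# Hypothetical field names -- verify against the adapter documentation
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    model_id: my-model
    adapter_config:
      use_reasoning: true                # strip intermediate reasoning before scoring
      end_reasoning_token: "</think>"    # token that closes the reasoning block
      custom_system_prompt: "You are a helpful assistant."  # enforce a standard system prompt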