Serve and Deploy Models#

Deploy and serve models with NeMo Evaluator’s flexible deployment options. Select a deployment strategy that matches your workflow, infrastructure, and requirements.

Overview#

NeMo Evaluator keeps model serving separate from evaluation execution, giving you flexible architectures and scalable workflows. Choose who manages deployment based on your needs.

Key Concepts#

  • Model-Evaluation Separation: Models are served through OpenAI-compatible APIs, while evaluations run in containers (see the sketch after this list)

  • Deployment Responsibility: Choose who manages the model serving infrastructure

  • Multi-Backend Support: Deploy locally, on HPC clusters, or in the cloud
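
Because evaluation containers reach the model only through this HTTP contract, any endpoint that answers a standard OpenAI-style chat completions request can be evaluated. The following is a minimal sketch of such a request using the requests library; the URL and model name are placeholders for whatever your server actually exposes:

import requests

# Probe an OpenAI-compatible endpoint with a standard chat completions
# request before pointing an evaluation at it. The URL and model name
# below are hypothetical placeholders.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 32,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])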

Deployment Strategy Guide#

Bring-Your-Own-Endpoint#

You handle model deployment; NeMo Evaluator handles the evaluation:

Launcher users with existing endpoints:

# Point launcher to your deployed model
nemo-evaluator-launcher run \
    --config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
    -o target.api_endpoint.url=http://localhost:8080/v1/chat/completions

Core library users:

from nemo_evaluator import evaluate, ApiEndpoint, EvaluationTarget, EvaluationConfig

# Point the evaluation at an already-running OpenAI-compatible endpoint
api_endpoint = ApiEndpoint(url="http://localhost:8080/v1/completions")
target = EvaluationTarget(api_endpoint=api_endpoint)

# Run the GSM8K benchmark and write results to ./results
config = EvaluationConfig(type="gsm8k", output_dir="./results")
evaluate(target_cfg=target, eval_cfg=config)

When to use:

  • You have existing model serving infrastructure

  • You need custom deployment configurations

  • You want to deploy once and run many evaluations

  • You have specific security or compliance requirements

Available Deployment Types#

The launcher supports multiple deployment types through Hydra configuration:

vLLM Deployment

deployment:
  type: vllm
  image: vllm/vllm-openai:latest
  hf_model_handle: hf-model/handle  # HuggingFace ID
  checkpoint_path: null             # or provide a path to the stored checkpoint
  served_model_name: your-model-name
  port: 8000

NIM Deployment

deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  port: 8000

SGLang Deployment

deployment:
  type: sglang
  image: lmsysorg/sglang:latest
  hf_model_handle: hf-model/handle  # HuggingFace ID
  checkpoint_path: null             # or provide a path to the stored checkpoint
  served_model_name: your-model-name
  port: 8000

No Deployment

deployment:
  type: none  # Use an existing endpoint; set target.api_endpoint.url to point to it, as in the launcher example above

Bring-Your-Own-Endpoint Options#

Choose from these approaches when bringing your own endpoint:

Hosted Services#

  • NVIDIA Build: Ready-to-use hosted models with OpenAI-compatible APIs

  • OpenAI API: Direct integration with OpenAI’s models

  • Other providers: Any service providing OpenAI-compatible endpoints (see the sketch after this list)
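
The request shape is the same for hosted services; only the base URL and credentials change. As a minimal sketch, the snippet below uses the official openai Python client against a hypothetical hosted endpoint, with the base URL, model name, and API key environment variable as placeholders you would swap for your provider's values:

import os

from openai import OpenAI

# Hypothetical hosted OpenAI-compatible service: substitute your provider's
# base URL, model name, and API key environment variable.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Answer in one word: what is 2 + 2?"}],
    max_tokens=16,
)
print(completion.choices[0].message.content)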

Enterprise Integration#

  • Kubernetes deployments: Container orchestration in production environments

  • Existing MLOps pipelines: Integration with current model serving infrastructure

  • Custom infrastructure: Specialized deployment requirements