Bring-Your-Own-Endpoint#
Deploy and manage model serving yourself, then point NeMo Evaluator to your endpoint. This approach gives you full control over deployment infrastructure while still leveraging NeMo Evaluator’s evaluation capabilities.
Overview#
With bring-your-own-endpoint, you:
Handle model deployment and serving independently
Provide an OpenAI-compatible API endpoint
Use either the launcher or core library for evaluations
Maintain full control over infrastructure and scaling
When to Use This Approach#
Choose bring-your-own-endpoint when you:
Have existing model serving infrastructure
Need custom deployment configurations
Want to deploy once and run many evaluations
Have specific security or compliance requirements
Use enterprise Kubernetes or MLOps pipelines
Deployment Approaches#
Choose the approach that best fits your infrastructure and requirements:
Manual deployment: Deploy using vLLM, TensorRT-LLM, or custom serving frameworks for full control.
Hosted services: Use NVIDIA Build, OpenAI API, or other cloud providers for instant availability.
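For self-hosted deployments, vLLM's OpenAI-compatible server is one common starting point; for example, running vllm serve <model> --port 8080 exposes the /v1/completions and /v1/chat/completions routes this page relies on (the model name and port are placeholders). See the pages linked under Next Steps for detailed instructions.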
Quick Examples#
Using Launcher with Existing Endpoint#
# Point launcher to your deployed model
nemo-evaluator-launcher run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://your-endpoint:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name \
    -o deployment.type=none  # No launcher deployment
Using Core Library#
from nemo_evaluator import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, evaluate
)

# Configure your endpoint
api_endpoint = ApiEndpoint(
    url="http://your-endpoint:8080/v1/completions",
    model_id="your-model-name"
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Run evaluation
config = EvaluationConfig(type="mmlu_pro", output_dir="results")
results = evaluate(eval_cfg=config, target_cfg=target)
Endpoint Requirements#
Your endpoint must provide OpenAI-compatible APIs:
Required Endpoints#
Completions: /v1/completions (POST) - For text completion tasks
Chat Completions: /v1/chat/completions (POST) - For conversational tasks
Health Check: /v1/triton_health (GET) - For monitoring (recommended)
Request/Response Format#
Requests and responses must follow the OpenAI API specification so that the evaluation frameworks can build requests and parse model outputs.
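As a quick compatibility check, you can send a hand-built completions request and confirm the response carries the usual OpenAI-style choices list. This is a minimal sketch, assuming the placeholder endpoint URL and model name used elsewhere on this page, plus the requests package; adjust it to your deployment.

# Minimal sanity check against an OpenAI-compatible completions endpoint.
# The URL, model name, and prompt below are placeholders.
import requests

payload = {
    "model": "your-model-name",
    "prompt": "The capital of France is",
    "max_tokens": 8,
    "temperature": 0.0,
}
response = requests.post(
    "http://your-endpoint:8080/v1/completions",
    json=payload,
    timeout=30,
)
response.raise_for_status()
# OpenAI-compatible responses return generated text under choices[0].text
print(response.json()["choices"][0]["text"])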
Configuration Management#
Basic Configuration#
# config/bring_your_own.yaml
deployment:
  type: none  # No launcher deployment

target:
  api_endpoint:
    url: http://your-endpoint:8080/v1/completions
    model_id: your-model-name
    api_key: ${API_KEY}  # Optional

evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k
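With the file saved, run it the same way as the quick launcher example above, for instance nemo-evaluator-launcher run --config-dir config --config-name bring_your_own (the directory and file names here are illustrative; use your own).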
With Adapters#
target:
  api_endpoint:
    url: http://your-endpoint:8080/v1/completions
    model_id: your-model-name
    adapter_config:
      # Caching for efficiency
      use_caching: true
      caching_dir: ./cache
      # Request logging for debugging
      use_request_logging: true
      max_logged_requests: 10
      # Custom processing
      use_reasoning: true
      start_reasoning_token: "<think>"
      end_reasoning_token: "</think>"
Key Benefits#
Infrastructure Control#
Custom configurations: Tailor deployment to your specific needs
Resource optimization: Optimize for your hardware and workloads
Security compliance: Meet your organization’s security requirements
Cost management: Control costs through efficient resource usage
Operational Flexibility#
Deploy once, evaluate many: Reuse a single deployment across multiple evaluations (see the sketch after this list)
Integration ready: Works with existing infrastructure and workflows
Technology choice: Use any serving framework or cloud provider
Scaling control: Scale according to your requirements
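For example, the core-library snippet shown earlier extends naturally to running several benchmarks against one long-lived endpoint. This is an illustrative sketch; the endpoint URL, model name, task list, and output paths are placeholders.

from nemo_evaluator import (
    ApiEndpoint, EvaluationConfig, EvaluationTarget, evaluate
)

# One endpoint, reused across benchmarks (URL and model name are placeholders)
target = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        url="http://your-endpoint:8080/v1/completions",
        model_id="your-model-name",
    )
)

# Run several benchmarks without redeploying the model
for task in ["mmlu_pro", "gsm8k"]:
    config = EvaluationConfig(type=task, output_dir=f"results/{task}")
    evaluate(eval_cfg=config, target_cfg=target)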
Getting Started#
Choose your approach: Select from manual deployment, hosted services, or enterprise integration
Deploy your model: Set up your OpenAI-compatible endpoint
Configure NeMo Evaluator: Point to your endpoint with proper configuration
Run evaluations: Use launcher or core library to run benchmarks
Monitor and optimize: Track performance and optimize as needed
Next Steps#
Manual Deployment: Learn Manual Deployment techniques
Hosted Services: Explore Hosted Services options
Configure Adapters: Set up Evaluation Adapters for custom processing