Manual Deployment#
Deploy models yourself using popular serving frameworks, then point NeMo Evaluator to your endpoints. This approach gives you full control over deployment infrastructure and serving configuration.
Overview#
Manual deployment involves:
Setting up model serving using frameworks like vLLM, TensorRT-LLM, or custom solutions
Configuring OpenAI-compatible endpoints
Managing infrastructure, scaling, and monitoring yourself
Using either the launcher or core library to run evaluations against your endpoints
Note
This guide focuses on NeMo Evaluator configuration. For installation and deployment instructions for a specific serving framework, refer to that framework's official documentation (for example, the vLLM or TensorRT-LLM docs).
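For reference, starting an OpenAI-compatible server with vLLM typically looks like the sketch below. The model name, port, and flags are illustrative and may differ between vLLM versions, so defer to the vLLM documentation for the authoritative command.
# Illustrative only: serve a model behind an OpenAI-compatible API with vLLM
# (model name and port are placeholders; flags vary by vLLM version)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8080
Once the server is up, the endpoint URLs used throughout this guide (http://localhost:8080/v1/completions and http://localhost:8080/v1/chat/completions) become available.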
Using Manual Deployments with NeMo Evaluator#
With Launcher#
Once your manual deployment is running, use the launcher to evaluate:
# Basic evaluation against manual deployment
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name
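If your endpoint enforces authentication, export the key and reference the environment variable in an override. The target.api_endpoint.api_key_name override below is an assumption based on the api_key_name field shown in the YAML configuration further down; adjust it if your launcher version uses a different key.
# Assumed override name; mirrors the api_key_name field in the YAML config below
export API_KEY=your-actual-key
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name \
    -o target.api_endpoint.api_key_name=API_KEY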
Configuration File Approach#
# config/manual_deployment.yaml
defaults:
  - execution: local
  - deployment: none  # No deployment by launcher
  - _self_

target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: llama-3.1-8b
    # Optional authentication (name of environment variable holding API key)
    api_key_name: API_KEY

execution:
  output_dir: ./results

evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 100
    - name: gsm8k
      overrides:
        config.params.limit_samples: 50
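To run with this configuration file, set the API key (if your endpoint needs one) and point the launcher at the config directory; the paths below assume the file was saved as config/manual_deployment.yaml.
export API_KEY=your-actual-key
nv-eval run \
    --config-dir config \
    --config-name manual_deployment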
With Core Library#
Direct API usage for manual deployments:
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)

# Configure your manual deployment endpoint
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY",  # Name of environment variable holding API key
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4,
    ),
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
With Adapter Configuration#
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure adapter with interceptors
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./cache",
                "reuse_cached_responses": True,
            },
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 10},
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 10},
        ),
    ]
)

# Configure endpoint with adapter
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY",
    adapter_config=adapter_config,
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4,
    ),
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
Prerequisites#
Before using a manually deployed endpoint with NeMo Evaluator, ensure:
Your model endpoint is running and accessible
The endpoint supports OpenAI-compatible API format
You have any required API keys or authentication credentials
Your endpoint supports the required generation parameters (see below)
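A quick sanity check covering these points might look like the following; the health path is framework-dependent (see Troubleshooting below), and the API_KEY check only matters if your endpoint enforces authentication.
# Is the endpoint reachable? (health path varies by serving framework)
curl -sf http://localhost:8080/health && echo "endpoint reachable"

# Is the API key set in the environment (only needed for authenticated endpoints)?
test -n "$API_KEY" && echo "API_KEY is set"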
Endpoint Requirements#
Your endpoint must support the following generation parameters for compatibility with NeMo Evaluator:
temperature: Controls randomness in generation (0.0 to 1.0)
top_p: Nucleus sampling threshold (0.0 to 1.0)
max_tokens: Maximum number of tokens to generate
Testing Your Endpoint#
Before running evaluations, verify your endpoint is working as expected.
Test Completions Endpoint
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
Test Chat Completions Endpoint
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
Note
Each evaluation task requires a specific endpoint type. Verify your endpoint supports the correct type for your chosen tasks. Use nemo-evaluator-launcher ls tasks to see which endpoint type each task requires.
OpenAI API Compatibility#
Your endpoint must implement the OpenAI API format:
Completions Endpoint Format
Request: POST /v1/completions
{
  "model": "model-name",
  "prompt": "string",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9
}
Response:
{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "text": "generated text",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
Chat Completions Endpoint Format
Request: POST /v1/chat/completions
{
  "model": "model-name",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
Response:
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 10,
    "total_tokens": 25
  }
}
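As a quick format check, you can extract the generated text from each response and confirm it sits where NeMo Evaluator expects it; this sketch assumes jq is installed and the endpoint is unauthenticated.
# Completions: generated text should be at .choices[0].text
curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "prompt": "Hello", "max_tokens": 16}' \
  | jq '.choices[0].text'

# Chat completions: the reply should be at .choices[0].message.content
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}' \
  | jq '.choices[0].message.content'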
Troubleshooting#
Connection Issues#
If you encounter connection errors:
Verify the endpoint is running and accessible. Check the health endpoint (path varies by framework):
# For vLLM, SGLang, NIM
curl http://localhost:8080/health

# For NeMo/Triton deployments
curl http://localhost:8080/v1/triton_health
Check that the URL in your configuration matches your deployment:
Include the full path (e.g., /v1/completions or /v1/chat/completions)
Verify the port number matches your server configuration
Ensure no firewall rules are blocking connections
Test with a simple curl command before running full evaluations
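If curl cannot connect at all, it can help to confirm that something is actually listening on the expected port; ss is one option (lsof -i :8080 works as an alternative where ss is unavailable).
# Check for a listener on port 8080 (adjust to your configured port)
ss -ltn | grep 8080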
Authentication Errors#
If you see authentication failures:
Verify the environment variable has a value:
echo $API_KEY
Ensure the api_key_name in your YAML configuration matches the environment variable name
Check that your endpoint expects the authentication method you are sending (for example, a Bearer token in the Authorization header)
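If the variable is set and the names match, you can confirm that the server actually accepts the key by sending an authenticated request and checking the HTTP status code (typically 200 when accepted, 401 or 403 when rejected).
export API_KEY=your-actual-key

curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"model": "your-model-name", "prompt": "ping", "max_tokens": 1}'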
Timeout Errors#
If requests are timing out:
Increase the timeout in your configuration:
evaluation:
  overrides:
    config.params.request_timeout: 300  # 5 minutes
Reduce parallelism to avoid overwhelming your endpoint:
evaluation:
  overrides:
    config.params.parallelism: 1
Check your endpoint’s logs for performance issues
Next Steps#
Hosted services: Compare with hosted services for managed solutions
Adapter system: Learn more about adapter configuration for advanced request/response handling
Configuration reference: See Evaluation Configuration Parameters for comprehensive evaluation parameter options