Manual Deployment#
Deploy models yourself using popular serving frameworks, then point NeMo Evaluator to your endpoints. This approach gives you full control over deployment infrastructure and serving configuration.
Overview#
Manual deployment involves:
Setting up model serving using frameworks like vLLM, TensorRT-LLM, or custom solutions
Configuring OpenAI-compatible endpoints
Managing infrastructure, scaling, and monitoring yourself
Using either the launcher or core library to run evaluations against your endpoints
Note
This guide focuses on NeMo Evaluator configuration. For installation and deployment instructions for a specific serving framework, refer to that framework's official documentation (for example, the vLLM or TensorRT-LLM docs).
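For reference, starting an OpenAI-compatible server with vLLM typically looks like the sketch below. The model name, port, and flags are illustrative and may differ between vLLM versions, so defer to the vLLM documentation for the authoritative command.
# Illustrative only: serve a model behind an OpenAI-compatible API with vLLM
# (model name and port are placeholders; flags vary by vLLM version)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8080
Once the server is up, the endpoint URLs used throughout this guide (http://localhost:8080/v1/completions and http://localhost:8080/v1/chat/completions) become available.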
Using Manual Deployments with NeMo Evaluator#
With Launcher#
Once your manual deployment is running, use the launcher to evaluate:
# Basic evaluation against manual deployment
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name
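If your endpoint enforces authentication, export the key and reference the environment variable in an override. The target.api_endpoint.api_key_name override below is an assumption based on the api_key_name field shown in the YAML configuration further down; adjust it if your launcher version uses a different key.
# Assumed override name; mirrors the api_key_name field in the YAML config below
export API_KEY=your-actual-key
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/completions \
    -o target.api_endpoint.model_id=your-model-name \
    -o target.api_endpoint.api_key_name=API_KEY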
Configuration File Approach#
# config/manual_deployment.yaml
defaults:
  - execution: local
  - deployment: none  # No deployment by launcher
  - _self_

target:
  api_endpoint:
    url: http://localhost:8080/v1/completions
    model_id: llama-3.1-8b
    # Optional authentication (name of environment variable holding API key)
    api_key_name: API_KEY

execution:
  output_dir: ./results

evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 100
    - name: gsm8k
      overrides:
        config.params.limit_samples: 50
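To run with this configuration file, set the API key (if your endpoint needs one) and point the launcher at the config directory; the paths below assume the file was saved as config/manual_deployment.yaml.
export API_KEY=your-actual-key
nv-eval run \
    --config-dir config \
    --config-name manual_deployment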
With Core Library#
Direct API usage for manual deployments:
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)

# Configure your manual deployment endpoint
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY",  # Name of environment variable holding API key
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4,
    ),
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
With Adapter Configuration#
from nemo_evaluator import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
    evaluate,
)
from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig

# Configure adapter with interceptors
adapter_config = AdapterConfig(
    interceptors=[
        InterceptorConfig(
            name="caching",
            config={
                "cache_dir": "./cache",
                "reuse_cached_responses": True,
            },
        ),
        InterceptorConfig(
            name="request_logging",
            config={"max_requests": 10},
        ),
        InterceptorConfig(
            name="response_logging",
            config={"max_responses": 10},
        ),
    ]
)

# Configure endpoint with adapter
api_endpoint = ApiEndpoint(
    url="http://localhost:8080/v1/completions",
    type=EndpointType.COMPLETIONS,
    model_id="llama-3.1-8b",
    api_key="API_KEY",
    adapter_config=adapter_config,
)
target = EvaluationTarget(api_endpoint=api_endpoint)

# Configure evaluation
config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=100,
        parallelism=4,
    ),
)

# Run evaluation
results = evaluate(eval_cfg=config, target_cfg=target)
print(f"Results: {results}")
Prerequisites#
Before using a manually deployed endpoint with NeMo Evaluator, ensure:
Your model endpoint is running and accessible
The endpoint supports OpenAI-compatible API format
You have any required API keys or authentication credentials
Your endpoint supports the required generation parameters (see below)
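A quick sanity check covering these points might look like the following; the health path is framework-dependent (see Troubleshooting below), and the API_KEY check only matters if your endpoint enforces authentication.
# Is the endpoint reachable? (health path varies by serving framework)
curl -sf http://localhost:8080/health && echo "endpoint reachable"

# Is the API key set in the environment (only needed for authenticated endpoints)?
test -n "$API_KEY" && echo "API_KEY is set"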
Endpoint Requirements#
Your endpoint must support the following generation parameters for compatibility with NeMo Evaluator:
temperature: Controls randomness in generation (0.0 to 1.0)
top_p: Nucleus sampling threshold (0.0 to 1.0)
max_tokens: Maximum number of tokens to generate
Testing Your Endpoint#
Before running evaluations, verify your endpoint is working as expected.
Test Completions Endpoint
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "prompt": "What is machine learning?",
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
Test Chat Completions Endpoint
# Basic test (no authentication)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'

# With authentication
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {
        "role": "user",
        "content": "What is machine learning?"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256,
    "stream": false
  }'
Note
Each evaluation task requires a specific endpoint type. Verify your endpoint supports the correct type for your chosen tasks. Use nemo-evaluator-launcher ls tasks to see which endpoint type each task requires.
OpenAI API Compatibility#
Your endpoint must implement the OpenAI API format:
Completions Endpoint Format
Request: POST /v1/completions
{
  "model": "model-name",
  "prompt": "string",
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9
}
Response:
{
  "id": "cmpl-xxx",
  "object": "text_completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "text": "generated text",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
Chat Completions Endpoint Format
Request: POST /v1/chat/completions
{
  "model": "model-name",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
Response:
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "model-name",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 10,
    "total_tokens": 25
  }
}
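As a quick format check, you can extract the generated text from each response and confirm it sits where NeMo Evaluator expects it; this sketch assumes jq is installed and the endpoint is unauthenticated.
# Completions: generated text should be at .choices[0].text
curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "prompt": "Hello", "max_tokens": 16}' \
  | jq '.choices[0].text'

# Chat completions: the reply should be at .choices[0].message.content
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model-name", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}' \
  | jq '.choices[0].message.content'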
Troubleshooting#
Connection Issues#
If you encounter connection errors:
Verify the endpoint is running and accessible. Check the health endpoint (path varies by framework):
# For vLLM, SGLang, NIM
curl http://localhost:8080/health

# For NeMo/Triton deployments
curl http://localhost:8080/v1/triton_health
Check that the URL in your configuration matches your deployment:
Include the full path (e.g., /v1/completions or /v1/chat/completions)
Verify the port number matches your server configuration
Ensure no firewall rules are blocking connections
Test with a simple curl command before running full evaluations
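If curl cannot connect at all, it can help to confirm that something is actually listening on the expected port; ss is one option (lsof -i :8080 works as an alternative where ss is unavailable).
# Check for a listener on port 8080 (adjust to your configured port)
ss -ltn | grep 8080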
Authentication Errors#
If you see authentication failures:
Verify the environment variable has a value:
echo $API_KEY
Ensure the api_key_name in your YAML configuration matches the environment variable name
Check that your endpoint expects the authentication method you are sending (for example, a Bearer token in the Authorization header)
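If the variable is set and the names match, you can confirm that the server actually accepts the key by sending an authenticated request and checking the HTTP status code (typically 200 when accepted, 401 or 403 when rejected).
export API_KEY=your-actual-key

curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"model": "your-model-name", "prompt": "ping", "max_tokens": 1}'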
Timeout Errors#
If requests are timing out:
Increase the timeout in your configuration:
evaluation:
  overrides:
    config.params.request_timeout: 300  # 5 minutes
Reduce parallelism to avoid overwhelming your endpoint:
evaluation:
  overrides:
    config.params.parallelism: 1
Check your endpoint’s logs for performance issues
Next Steps#
Hosted services: Compare with hosted services for managed solutions
Adapter system: Learn more about adapter configuration for advanced request/response handling
Configuration reference: See Evaluation Configuration Parameters for comprehensive evaluation parameter options