nat.eval.llm_validator#
LLM Endpoint Validator for NeMo Agent Toolkit evaluation.
This module provides functionality to validate LLM endpoints before running evaluation workflows. This helps catch deployment issues early (e.g., models not deployed after training cancellation) and provides actionable error messages.
The validation uses the NeMo Agent Toolkit WorkflowBuilder to instantiate LLMs in a framework-agnostic way,
then tests them with a minimal ainvoke() call. This approach works for all LLM types
(OpenAI, NIM, AWS Bedrock, vLLM, etc.) and respects the auth and config system.
Note: Validation invokes actual LLM endpoints with minimal test prompts. This may incur small API costs for cloud-hosted models.
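For example, a pre-flight check before kicking off an evaluation might look like this (a minimal sketch; `load_config` is a hypothetical placeholder for however the NAT `Config` object is obtained):

```python
import asyncio

from nat.eval.llm_validator import validate_llm_endpoints


async def preflight(config) -> None:
    # Instantiates each configured LLM via the WorkflowBuilder and sends a
    # minimal test prompt; a 404 (model not deployed) raises RuntimeError.
    await validate_llm_endpoints(config)


# asyncio.run(preflight(load_config("eval_config.yml")))  # hypothetical loader
```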
Attributes#

| Attribute | Value |
|---|---|
| logger | |
| VALIDATION_TIMEOUT_SECONDS | 30 |
| MAX_ERROR_MESSAGE_LENGTH | 500 |
| CONCURRENT_VALIDATION_BATCH_SIZE | 5 |
| VALIDATION_PROMPT | 'test' |
Functions#
| Function | Description |
|---|---|
| _is_404_error | Detect if an exception represents a 404 (model not found) error. |
| _get_llm_endpoint_info | Extract endpoint and model information from an LLM config. |
| _truncate_error_message | Truncate error messages to prevent memory issues with large stack traces. |
| _validate_single_llm | Validate a single LLM endpoint. |
| validate_llm_endpoints | Validate that all LLM endpoints in the config are accessible. |
Module Contents#
- logger#
- VALIDATION_TIMEOUT_SECONDS = 30#
- MAX_ERROR_MESSAGE_LENGTH = 500#
- CONCURRENT_VALIDATION_BATCH_SIZE = 5#
- VALIDATION_PROMPT = 'test'#
- _is_404_error(exception: Exception) → bool#
Detect if an exception represents a 404 (model not found) error.
This handles various 404 error formats from different LLM providers:
- OpenAI SDK: openai.NotFoundError
- HTTP responses: HTTP 404 or status code 404
- LangChain wrappers: various wrapped 404s
- Args:
exception: The exception to check.
- Returns:
True if this is a 404 error, False otherwise.
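A sketch of the kind of checks this implies (illustrative only; the real implementation may inspect additional provider-specific attributes):

```python
def _is_404_error_sketch(exception: Exception) -> bool:
    """Illustrative only: detect 404 / model-not-found errors across providers."""
    # OpenAI SDK: a dedicated exception class for missing models.
    try:
        import openai
        if isinstance(exception, openai.NotFoundError):
            return True
    except ImportError:
        pass

    # HTTP-style errors often carry a status_code attribute.
    if getattr(exception, "status_code", None) == 404:
        return True

    # LangChain and other wrappers: fall back to scanning the message text.
    text = str(exception).lower()
    return "404" in text or "not found" in text
```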
- _get_llm_endpoint_info(llm_config: nat.data_models.llm.LLMBaseConfig)#
Extract endpoint and model information from an LLM config.
- Args:
llm_config: The LLM configuration object.
- Returns:
Tuple of (base_url, model_name), either may be None.
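Provider configs expose endpoint details under differing field names, so this kind of extraction can be sketched with optional attribute lookups (illustrative only; the field names base_url, model_name, and model are assumptions, not the confirmed implementation):

```python
def _get_llm_endpoint_info_sketch(llm_config):
    """Illustrative only: best-effort extraction of (base_url, model_name)."""
    # Field names vary by provider, so missing attributes simply yield None.
    base_url = getattr(llm_config, "base_url", None)
    model_name = getattr(llm_config, "model_name", None) or getattr(llm_config, "model", None)
    return base_url, model_name
```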
- _truncate_error_message(message: str, max_length: int) → str#
Truncate error messages to prevent memory issues with large stack traces.
Keeps both the start and end of the message to preserve context from both the error description (start) and the stack trace (end).
- Args:
message: The error message to truncate.
max_length: Maximum length to keep.
- Returns:
Truncated message with ellipsis if needed.
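A sketch of the head-and-tail strategy described above (illustrative; the exact split point and marker text are assumptions, and the default of 500 simply mirrors MAX_ERROR_MESSAGE_LENGTH):

```python
def _truncate_error_message_sketch(message: str, max_length: int = 500) -> str:
    """Illustrative only: keep the start (error description) and end (stack trace tail)."""
    if len(message) <= max_length:
        return message
    half = max_length // 2
    return f"{message[:half]} ... [truncated] ... {message[-half:]}"
```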
- async _validate_single_llm(builder: nat.builder.workflow_builder.WorkflowBuilder, llm_name: str, llm_config: nat.data_models.llm.LLMBaseConfig)#
Validate a single LLM endpoint.
- Args:
builder: The WorkflowBuilder instance.
llm_name: Name of the LLM to validate.
llm_config: Configuration for the LLM.
- Returns:
Tuple of (error_type, error_message):
- error_type: "404" for model not found, "warning" for non-critical, None for success
- error_message: Description of the error, or None if successful
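A minimal sketch of what such a probe might look like, written as if it lived inside this module so it can refer to the documented constants and helpers directly (illustrative only; get_llm() is a hypothetical accessor, and how the framework-agnostic client is actually obtained from the WorkflowBuilder is an assumption):

```python
import asyncio


async def _validate_single_llm_sketch(builder, llm_name, llm_config):
    """Illustrative only: send a tiny prompt to one LLM and classify the outcome."""
    try:
        # Hypothetical accessor; the real WorkflowBuilder API may differ.
        llm = await builder.get_llm(llm_name)
        # Minimal test call, bounded by the module-level timeout constant.
        await asyncio.wait_for(llm.ainvoke(VALIDATION_PROMPT), timeout=VALIDATION_TIMEOUT_SECONDS)
        return None, None  # success
    except Exception as exc:
        message = _truncate_error_message(str(exc), MAX_ERROR_MESSAGE_LENGTH)
        return ("404" if _is_404_error(exc) else "warning"), message
```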
- async validate_llm_endpoints(config: nat.data_models.config.Config) → None#
Validate that all LLM endpoints in the config are accessible.
This function uses NAT’s WorkflowBuilder to instantiate each configured LLM and tests it with a minimal ainvoke() call. This approach is framework-agnostic and works for all LLM types (OpenAI, NIM, AWS Bedrock, vLLM, etc.).
The validation distinguishes between critical errors (404s indicating the model is not deployed) and non-critical errors (auth issues, rate limits, etc.):
- 404 errors: fail fast with detailed troubleshooting guidance
- Other errors: log a warning but continue (to avoid false positives)
LLMs are validated in parallel batches to improve performance while respecting rate limits (a sketch of this batching pattern appears after this entry). Each validation has a timeout to prevent hanging.
Note: This function invokes actual LLM endpoints, which may incur small API costs.
- Args:
config: The NAT configuration object containing LLM definitions.
- Raises:
RuntimeError: If any LLM endpoint has a 404 error (model not deployed).
ValueError: If config.llms is not properly structured.
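As a rough illustration of the parallel-batch behavior described above, a loop along these lines could drive the per-LLM checks (illustrative only; it reuses the documented module members CONCURRENT_VALIDATION_BATCH_SIZE and _validate_single_llm, but the real control flow and error aggregation may differ):

```python
import asyncio


async def _validate_in_batches_sketch(builder, llm_configs: dict) -> dict:
    """Illustrative only: validate LLMs in parallel batches."""
    results = {}
    names = list(llm_configs)
    for start in range(0, len(names), CONCURRENT_VALIDATION_BATCH_SIZE):
        batch = names[start:start + CONCURRENT_VALIDATION_BATCH_SIZE]
        # Each batch runs concurrently; _validate_single_llm returns
        # (error_type, error_message) for one endpoint.
        outcomes = await asyncio.gather(
            *(_validate_single_llm(builder, name, llm_configs[name]) for name in batch)
        )
        results.update(dict(zip(batch, outcomes)))
    return results
```

A caller could then raise RuntimeError when any result carries the "404" error type, matching the fail-fast behavior described above.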