nemo_evaluator.api.api_dataclasses#

NeMo Evaluator Core operates on strictly defined inputs and outputs, which are modeled as pydantic dataclasses. Whether you use the Python API or the CLI, the reference below serves as a map of the configuration options and the output format.
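To make the map concrete: the classes below nest as target → api_endpoint and config → params. A hypothetical sketch of that shape as a plain dictionary rather than the actual pydantic models (all concrete values — URL, model id, task name, parameter values — are placeholders, not defaults):

```python
# Hypothetical evaluation setup mirroring the dataclass hierarchy below.
# Every concrete value here is a placeholder for illustration.
evaluation = {
    "target": {                          # EvaluationTarget
        "api_endpoint": {                # ApiEndpoint
            "url": "http://localhost:8000/v1/chat/completions",
            "model_id": "my-model",
            "type": "chat",              # an EndpointType member value
            "api_key_name": "MY_API_KEY",
            "stream": False,
        }
    },
    "config": {                          # EvaluationConfig
        "type": "mmlu",                  # placeholder task type
        "output_dir": "/tmp/results",
        "params": {                      # ConfigParams
            "limit_samples": 100,
            "temperature": 0.7,
            "top_p": 0.9,
            "parallelism": 4,
        },
    },
}
```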

Modeling Target#

ApiEndpoint

API endpoint configuration containing information on endpoint placement, the targeted model name, and the adapter used before prompting the endpoint.

EndpointType

EndpointType is used to determine the appropriate URL, payload structure, or native harness inference class

EvaluationTarget

Target configuration for API endpoints.

Modeling Evaluation#

EvaluationConfig

Configuration for evaluation runs.

ConfigParams

Parameters for evaluation execution.

Modeling Result#

EvaluationResult

EvaluationResult bundles per-task and per-group results.

GroupResult

Some tasks can be grouped or logically split.

MetricResult

Defines a mapping from metric name to its scores.

Score

Atomic class that contains the value of a particular metric and its corresponding stats

ScoreStats

Stats for a score.

TaskResult

Defines the set of metrics calculated for a particular task.

pydantic model nemo_evaluator.api.api_dataclasses.ApiEndpoint[source]#

Bases: BaseModel

API endpoint configuration containing information on endpoint placement, the targeted model name, and the adapter used before prompting the endpoint.

Config:
  • use_enum_values: bool = True

  • extra: str = forbid

Validators:
  • handle_api_key_deprecation » all fields
field adapter_config: AdapterConfig | None = None#

Adapter configuration

Validated by:
  • handle_api_key_deprecation
field api_key_name: str | None = None#

Name of the environment variable that stores API key for the model

Validated by:
  • handle_api_key_deprecation
field model_id: str | None = None#

Name of the model

Validated by:
  • handle_api_key_deprecation
field stream: bool | None = None#

Whether responses should be streamed

Validated by:
  • handle_api_key_deprecation
field type: EndpointType | None = None#

The type of the target

Validated by:
  • handle_api_key_deprecation
field url: str | None = None#

URL of the model

Validated by:
  • handle_api_key_deprecation
validator handle_api_key_deprecation  »  all fields[source]#

Handle deprecation of api_key in favor of api_key_name.

api_key: str | None#

Read-only data descriptor used to emit a runtime deprecation warning before accessing a deprecated field.

msg#

The deprecation message to be emitted.

wrapped_property#

The property instance if the deprecated field is a computed field, or None.

field_name#

The name of the field being deprecated.
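The descriptor described above can be sketched as follows. This is an illustration under assumptions, not the library's code: the names DeprecatedField and Endpoint are made up, and falling back to the replacement field api_key_name on access is a guess based on the deprecation handling described above.

```python
import warnings

class DeprecatedField:
    """Sketch of a read-only data descriptor that warns on access.

    Attributes mirror the documented ones: ``msg``, ``wrapped_property``,
    and ``field_name``. Hypothetical illustration only.
    """

    def __init__(self, field_name, msg, wrapped_property=None):
        self.field_name = field_name          # name of the deprecated field
        self.msg = msg                        # deprecation message to emit
        self.wrapped_property = wrapped_property  # property if computed field

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        warnings.warn(self.msg, DeprecationWarning, stacklevel=2)
        if self.wrapped_property is not None:
            return self.wrapped_property.__get__(obj, objtype)
        # Assumed fallback: read the replacement field instead
        return getattr(obj, "api_key_name", None)

    def __set__(self, obj, value):
        # Defining __set__ makes this a read-only *data* descriptor
        raise AttributeError(f"{self.field_name} is read-only")


class Endpoint:
    api_key = DeprecatedField("api_key", "api_key is deprecated; use api_key_name")

    def __init__(self, api_key_name):
        self.api_key_name = api_key_name
```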

pydantic model nemo_evaluator.api.api_dataclasses.ConfigParams[source]#

Bases: BaseModel

Parameters for evaluation execution.

Config:
  • extra: str = forbid

field extra: Dict[str, Any] | None [Optional]#

Framework specific parameters to be used for evaluation

field limit_samples: int | float | None = None#

Limit number of evaluation samples

field max_new_tokens: int | None = None#

Max tokens to generate

field max_retries: int | None = None#

Number of REST request retries

field parallelism: int | None = None#

Parallelism to be used

field request_timeout: int | None = None#

REST response timeout

field task: str | None = None#

Name of the task

field temperature: float | None = None#

Float value between 0 and 1. A temperature of 0 indicates greedy decoding, where the token with the highest probability is chosen. Temperature cannot currently be set to 0.0

field top_p: float | None = None#

Float value between 0 and 1; limits sampling to the smallest set of top tokens whose cumulative probability reaches this value. top_p=0 means the model will only consider the single most likely token for the next prediction

pydantic model nemo_evaluator.api.api_dataclasses.EndpointModelConfig[source]#

Bases: BaseModel

Supporting model configuration.

field adapter_config: AdapterConfig | None = None#

Adapter configuration

field api_key_name: str | None = None#

Name of the environment variable that stores the API key for the model

field extra: Dict[str, Any] | None = None#

Extra

field is_base_url: bool | None = False#

Whether the URL is a base URL

field max_new_tokens: int | None = None#

Max new tokens

field max_retries: int | None = None#

Max retries

field model_id: str [Required]#

Name of the model

field parallelism: int | None = None#

Parallelism

field request_timeout: int | None = None#

Request timeout

field stream: bool | None = None#

Whether responses should be streamed

field temperature: float | None = None#

Temperature

field top_p: float | None = None#

Top p

field type: EndpointType | None = None#

The type of the target

field url: str [Required]#

URL of the model

enum nemo_evaluator.api.api_dataclasses.EndpointType(value)[source]#

Bases: str, Enum

EndpointType is used to determine the appropriate URL, payload structure, or native harness inference class

Member Type:

str

Valid values are as follows:

UNDEFINED = <EndpointType.UNDEFINED: 'undefined'>#
CHAT = <EndpointType.CHAT: 'chat'>#
COMPLETIONS = <EndpointType.COMPLETIONS: 'completions'>#
VLM = <EndpointType.VLM: 'vlm'>#
EMBEDDING = <EndpointType.EMBEDDING: 'embedding'>#
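Since the member type is str, enum members compare equal to their plain string values, which is convenient when configurations arrive as YAML or JSON. A stdlib re-creation of the same shape, with values taken from the listing above:

```python
from enum import Enum

class EndpointType(str, Enum):
    # Mirrors the member values listed above
    UNDEFINED = "undefined"
    CHAT = "chat"
    COMPLETIONS = "completions"
    VLM = "vlm"
    EMBEDDING = "embedding"

# The str mixin makes members compare equal to plain strings,
# and the value-based constructor round-trips serialized configs.
assert EndpointType.CHAT == "chat"
assert EndpointType("embedding") is EndpointType.EMBEDDING
```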
pydantic model nemo_evaluator.api.api_dataclasses.Evaluation[source]#

Bases: BaseModel

Config:
  • extra: str = forbid

field command: str [Required]#

Jinja template of the command to be executed

field config: EvaluationConfig [Required]#
field framework_name: str [Required]#

Name of the framework

field pkg_name: str [Required]#

Name of the package

field target: EvaluationTarget [Required]#
render_command()[source]#
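render_command() renders the Jinja command template against the evaluation's configuration. As a rough illustration of the idea only — not the library's implementation, which uses real Jinja — a minimal {{ name }} substitution can be sketched with the standard library:

```python
import re

def render_command(command, context):
    """Toy stand-in for Jinja rendering: replaces {{ name }} placeholders
    with values looked up in ``context``. Real Jinja supports far more
    (filters, attribute access, control flow)."""
    def replace(match):
        return str(context[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, command)

# Hypothetical command template; names are placeholders
cmd = render_command(
    "eval --task {{ task }} --model {{ model_id }}",
    {"task": "mmlu", "model_id": "my-model"},
)
```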
pydantic model nemo_evaluator.api.api_dataclasses.EvaluationConfig[source]#

Bases: BaseModel

Configuration for evaluation runs.

Config:
  • extra: str = forbid

field output_dir: str | None = None#

Directory to output the results

field params: ConfigParams | None = None#

Parameters to be used for evaluation

field supported_endpoint_types: list[str] | None = None#

Supported endpoint types like chat or completions

field type: str | None = None#

Type of the task

class nemo_evaluator.api.api_dataclasses.EvaluationMetadata[source]#

Bases: dict

Holds various evaluation metadata that does not influence the evaluation itself.

pydantic model nemo_evaluator.api.api_dataclasses.EvaluationResult[source]#

Bases: BaseModel

EvaluationResult bundles per-task and per-group results.

field groups: Dict[str, GroupResult] | None [Optional]#

The results at the group level

field tasks: Dict[str, TaskResult] | None [Optional]#

The results at the task level

pydantic model nemo_evaluator.api.api_dataclasses.EvaluationTarget[source]#

Bases: BaseModel

Target configuration for API endpoints.

Config:
  • extra: str = forbid

field api_endpoint: ApiEndpoint | None = None#

API endpoint to be used for evaluation

pydantic model nemo_evaluator.api.api_dataclasses.GroupResult[source]#

Bases: BaseModel

Some tasks can be grouped or logically split. This class defines results at the group level.

field groups: Dict[str, GroupResult] | None = None#

The results for the subgroups.

field metrics: Dict[str, MetricResult] [Optional]#

The value for all the metrics computed for the group.

pydantic model nemo_evaluator.api.api_dataclasses.MetricResult[source]#

Bases: BaseModel

Defines a mapping from metric name to its scores.

field scores: Dict[str, Score] [Optional]#

Mapping from metric name to scores.

pydantic model nemo_evaluator.api.api_dataclasses.Score[source]#

Bases: BaseModel

Atomic class that contains the value of a particular metric and its corresponding stats

field stats: ScoreStats [Required]#

Statistics associated with this metric

field value: float [Required]#

The value/score produced on this metric

pydantic model nemo_evaluator.api.api_dataclasses.ScoreStats[source]#

Bases: BaseModel

Stats for a score.

field count: int | None = None#

The number of values used for computing the score.

field max: float | None = None#

The maximum of all values used for computing the score.

field mean: float | None = None#

The mean of all values used for computing the score.

field min: float | None = None#

The minimum of all values used for computing the score.

field stddev: float | None = None#

This is the population standard deviation, not the sample standard deviation.

field stderr: float | None = None#

The standard error.

field sum: float | None = None#

The sum of all values used for computing the score.

field sum_squared: float | None = None#

The sum of the square of all values used for computing the score.

field variance: float | None = None#

This is the population variance, not the sample variance.
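The fields above can all be derived from the raw per-sample values with the standard library. A hedged sketch, using the population standard deviation and variance as the reference specifies (the stderr convention here — population stddev over √n — is an assumption, not documented behavior):

```python
import math
import statistics

def score_stats(values):
    """Compute the documented ScoreStats fields from raw per-sample values.
    Uses *population* stddev/variance, as specified above."""
    n = len(values)
    stddev = statistics.pstdev(values)
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "sum": sum(values),
        "sum_squared": sum(v * v for v in values),
        "stddev": stddev,
        "variance": statistics.pvariance(values),
        "stderr": stddev / math.sqrt(n),  # assumed convention
    }
```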

pydantic model nemo_evaluator.api.api_dataclasses.TaskResult[source]#

Bases: BaseModel

Defines the set of metrics calculated for a particular task.

field metrics: Dict[str, MetricResult] [Optional]#

The value for all the metrics computed for the task
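Taken together, the result classes serialize to a nested tasks → metrics → scores tree. A hypothetical traversal over such a structure (the task, metric, and score names are placeholders, not real output):

```python
# Hypothetical serialized EvaluationResult; names and values are made up.
result = {
    "tasks": {
        "mmlu": {                            # TaskResult
            "metrics": {
                "accuracy": {                # MetricResult
                    "scores": {
                        "acc": {             # Score
                            "value": 0.72,
                            "stats": {"count": 100, "stderr": 0.045},
                        }
                    }
                }
            }
        }
    },
    "groups": {},
}

# Walk every (task, metric, score) triple and collect the score values
flat = {
    (task, metric, score): s["value"]
    for task, t in result["tasks"].items()
    for metric, m in t["metrics"].items()
    for score, s in m["scores"].items()
}
```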