nemo_evaluator.api.api_dataclasses#

NeMo Evaluator Core operates on strictly defined inputs and outputs, which are modeled with pydantic dataclasses. Whether you use the Python API or the CLI, the reference below serves as a map of the configuration options and the output format.

Modeling Target#

ApiEndpoint

API endpoint configuration containing information on endpoint placement, the targeted model name, and the adapter used before prompting the endpoint.

EndpointType

EndpointType is used to determine the appropriate URL, payload structure, or native harness inference class.

EvaluationTarget

Target configuration for API endpoints.

Modeling Evaluation#

EvaluationConfig

Configuration for evaluation runs.

ConfigParams

Parameters for evaluation execution.

Modeling Result#

EvaluationResult

EvaluationResult bundles per-task and per-group results.

GroupResult

Results at the group level, for tasks that can be grouped or logically split.

MetricResult

Defines a mapping from metric name to its scores.

Score

Atomic class that contains the value of a particular metric and its corresponding stats.

ScoreStats

Stats for a score.

TaskResult

Defines the set of metrics calculated for a particular task.

pydantic model nemo_evaluator.api.api_dataclasses.ApiEndpoint[source]#

Bases: BaseModel

API endpoint configuration containing information on endpoint placement, the targeted model name, and the adapter used before prompting the endpoint.

Config:
  • use_enum_values: bool = True

field adapter_config: AdapterConfig | None = None#

Adapter configuration

field api_key: str | None = None#

Name of the environment variable that stores the API key for the model

field model_id: str | None = None#

Name of the model

field stream: bool | None = None#

Whether responses should be streamed

field type: EndpointType | None = None#

The type of the target

field url: str | None = None#

URL of the model
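Since ApiEndpoint is a pydantic model, it accepts and serializes to a plain mapping over the fields above. A minimal sketch of that serialized shape, with the URL, model name, and environment-variable name as illustrative placeholders:

```python
# Serialized form of an ApiEndpoint, mirroring the documented fields.
# The URL, model name, and env-var name are illustrative placeholders.
endpoint = {
    "url": "http://localhost:8000/v1/chat/completions",  # hypothetical endpoint
    "model_id": "my-model",        # name of the model to target
    "type": "chat",                # one of the EndpointType values
    "api_key": "MY_API_KEY_ENV",   # env variable holding the key, not the key itself
    "stream": False,               # whether responses should be streamed
    "adapter_config": None,        # optional AdapterConfig
}
```

Note that `api_key` names an environment variable rather than containing the secret itself, so configurations like this can be stored or logged safely.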

pydantic model nemo_evaluator.api.api_dataclasses.ConfigParams[source]#

Bases: BaseModel

Parameters for evaluation execution.

field extra: Dict[str, Any] | None [Optional]#

Framework-specific parameters to be used for evaluation

field limit_samples: int | float | None = None#

Limit the number of evaluation samples

field max_new_tokens: int | None = None#

Max tokens to generate

field max_retries: int | None = None#

Number of REST request retries

field parallelism: int | None = None#

Parallelism to be used

field request_timeout: int | None = None#

REST response timeout

field task: str | None = None#

Name of the task

field temperature: float | None = None#

Float value between 0 and 1. A temperature of 0 indicates greedy decoding, where the token with the highest probability is chosen. Temperature cannot currently be set to 0.0.

field top_p: float | None = None#

Float value between 0 and 1; limits sampling to the top tokens within a cumulative probability mass. top_p=0 means the model will only consider the single most likely token for the next prediction.
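The sampling constraints documented above can be summarized in a small check. This helper is not part of nemo_evaluator; it is only a sketch of the stated rules (temperature in (0, 1], since 0.0 is not currently allowed, and top_p in [0, 1]):

```python
from typing import Optional

def check_sampling(temperature: Optional[float], top_p: Optional[float]) -> None:
    """Illustrative validation of the documented sampling constraints."""
    if temperature is not None:
        # Temperature must lie in (0, 1]; 0.0 is currently not allowed.
        if not (0.0 < temperature <= 1.0):
            raise ValueError(f"temperature must be in (0, 1], got {temperature}")
    if top_p is not None:
        # top_p lies in [0, 1]; 0 means only the single most likely token.
        if not (0.0 <= top_p <= 1.0):
            raise ValueError(f"top_p must be in [0, 1], got {top_p}")

check_sampling(temperature=0.7, top_p=0.95)  # a valid combination
```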

enum nemo_evaluator.api.api_dataclasses.EndpointType(value)[source]#

Bases: str, Enum

EndpointType is used to determine the appropriate URL, payload structure, or native harness inference class.

Member Type:

str

Valid values are as follows:

UNDEFINED = <EndpointType.UNDEFINED: 'undefined'>#
CHAT = <EndpointType.CHAT: 'chat'>#
COMPLETIONS = <EndpointType.COMPLETIONS: 'completions'>#
VLM = <EndpointType.VLM: 'vlm'>#
EMBEDDING = <EndpointType.EMBEDDING: 'embedding'>#
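Because EndpointType mixes in str, its members are also plain strings: they compare equal to their raw values and can be looked up by value. A self-contained sketch of the same pattern with the stdlib enum module:

```python
from enum import Enum

# Sketch of the str + Enum pattern EndpointType uses: members double as
# plain strings, so they compare equal to their values and serialize cleanly.
class EndpointType(str, Enum):
    UNDEFINED = "undefined"
    CHAT = "chat"
    COMPLETIONS = "completions"
    VLM = "vlm"
    EMBEDDING = "embedding"

assert EndpointType.CHAT == "chat"               # str mixin: equality with raw value
assert EndpointType("vlm") is EndpointType.VLM   # lookup by value
```

This is why configuration files can spell the endpoint type as a bare string such as `"chat"` and still round-trip through the enum.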
pydantic model nemo_evaluator.api.api_dataclasses.Evaluation[source]#

Bases: BaseModel

field command: str [Required]#

Jinja template of the command to be executed

field config: EvaluationConfig [Required]#
field framework_name: str [Required]#

Name of the framework

field pkg_name: str [Required]#

Name of the package

field target: EvaluationTarget [Required]#
render_command()[source]#
pydantic model nemo_evaluator.api.api_dataclasses.EvaluationConfig[source]#

Bases: BaseModel

Configuration for evaluation runs.

field output_dir: str | None = None#

Directory to output the results

field params: ConfigParams | None = None#

Parameters to be used for evaluation

field supported_endpoint_types: list[str] | None = None#

Supported endpoint types like chat or completions

field type: str | None = None#

Type of the task
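Putting the fields above together, a serialized EvaluationConfig is a mapping with an embedded ConfigParams mapping. The task type, output directory, and parameter values below are illustrative placeholders, not defaults:

```python
# A serialized EvaluationConfig, mirroring the documented fields.
# Task type, output directory, and parameter values are placeholders.
config = {
    "type": "my_task",                    # hypothetical task type
    "output_dir": "/tmp/eval_results",
    "supported_endpoint_types": ["chat", "completions"],
    "params": {                           # ConfigParams, documented above
        "limit_samples": 100,
        "parallelism": 4,
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
        "extra": {},                      # framework-specific parameters
    },
}
```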

class nemo_evaluator.api.api_dataclasses.EvaluationMetadata[source]#

Bases: dict

Holds various evaluation metadata that does not influence the evaluation itself.

pydantic model nemo_evaluator.api.api_dataclasses.EvaluationResult[source]#

Bases: BaseModel

EvaluationResult bundles per-task and per-group results.

field groups: Dict[str, GroupResult] | None [Optional]#

The results at the group-level

field tasks: Dict[str, TaskResult] | None [Optional]#

The results at the task-level

pydantic model nemo_evaluator.api.api_dataclasses.EvaluationTarget[source]#

Bases: BaseModel

Target configuration for API endpoints.

field api_endpoint: ApiEndpoint | None = None#

API endpoint to be used for evaluation
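EvaluationTarget simply wraps an ApiEndpoint. A minimal serialized form, with the URL and model name as placeholders:

```python
# An EvaluationTarget is a thin wrapper around an ApiEndpoint.
# URL and model name are illustrative placeholders.
target = {
    "api_endpoint": {
        "url": "http://localhost:8000/v1/chat/completions",
        "model_id": "my-model",
        "type": "chat",
    }
}
```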

pydantic model nemo_evaluator.api.api_dataclasses.GroupResult[source]#

Bases: BaseModel

Some tasks can be grouped or logically split. This class defines results at the group level.

field groups: Dict[str, GroupResult] | None = None#

The results for the subgroups.

field metrics: Dict[str, MetricResult] [Optional]#

The value for all the metrics computed for the group.

pydantic model nemo_evaluator.api.api_dataclasses.MetricResult[source]#

Bases: BaseModel

Defines mapping from metric name to its scores.

field scores: Dict[str, Score] [Optional]#

Mapping from metric name to scores.

pydantic model nemo_evaluator.api.api_dataclasses.Score[source]#

Bases: BaseModel

Atomic class that contains the value of a particular metric and its corresponding stats.

field stats: ScoreStats [Required]#

Statistics associated with this metric

field value: float [Required]#

The value/score produced on this metric

pydantic model nemo_evaluator.api.api_dataclasses.ScoreStats[source]#

Bases: BaseModel

Stats for a score.

field count: int | None = None#

The number of values used for computing the score.

field max: float | None = None#

The maximum of all values used for computing the score.

field mean: float | None = None#

The mean of all values used for computing the score.

field min: float | None = None#

The minimum of all values used for computing the score.

field stddev: float | None = None#

This is the population standard deviation, not the sample standard deviation.

field stderr: float | None = None#

The standard error.

field sum: float | None = None#

The sum of all values used for computing the score.

field sum_squared: float | None = None#

The sum of the square of all values used for computing the score.

field variance: float | None = None#

This is the population variance, not the sample variance.

pydantic model nemo_evaluator.api.api_dataclasses.TaskResult[source]#

Bases: BaseModel

Defines the set of metrics calculated for a particular task.

field metrics: Dict[str, MetricResult] [Optional]#

The value for all the metrics computed for the task
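The result models nest as tasks → metrics → scores, with each Score carrying a value and a ScoreStats. A sketch of computing the documented ScoreStats fields from per-sample values with the stdlib statistics module (note the docs specify population variance and standard deviation), then assembling the nested result shape; the task and metric names are placeholders:

```python
import statistics

# Per-sample metric values for one hypothetical task.
values = [0.8, 0.6, 1.0, 0.4, 0.7]

# The documented ScoreStats fields, computed with stdlib statistics.
stats = {
    "count": len(values),
    "sum": sum(values),
    "sum_squared": sum(v * v for v in values),
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "variance": statistics.pvariance(values),  # population variance, per the docs
    "stddev": statistics.pstdev(values),       # population stddev, per the docs
    # One common definition of standard error: population stddev / sqrt(n).
    "stderr": statistics.pstdev(values) / len(values) ** 0.5,
}

# The nested result shape: EvaluationResult -> TaskResult -> MetricResult -> Score.
result = {
    "tasks": {
        "my_task": {                      # TaskResult (task name is a placeholder)
            "metrics": {
                "accuracy": {             # MetricResult (metric name is a placeholder)
                    "scores": {
                        "accuracy": {"value": stats["mean"], "stats": stats},
                    },
                },
            },
        },
    },
    "groups": {},                         # optional GroupResult entries
}
```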