Provider Response Codecs and Model Pricing

Use this guide when subscribers, exporters, or diagnostics need a provider-neutral view of raw LLM responses.

What You Build

You will attach a response codec to a managed LLM wrapper so NeMo Relay can decode provider responses into AnnotatedLLMResponse data for LLM end events.

Response codecs are observability-only:

They do not rewrite the value returned to the application.
They do not run response middleware.
They attach normalized response data to lifecycle events for subscribers and exporters.
Decode failures are non-fatal. The LLM call still returns the provider response, and Relay emits the end event without an annotation.
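
The contract above can be pictured with a small, library-free sketch. The helper name `annotate_safely` and the tuple it returns are hypothetical and are not part of the NeMo Relay API. The sketch only shows that the raw response passes through untouched and that a decode error yields no annotation.

```python
from typing import Any, Callable, Optional, Tuple


def annotate_safely(
    raw_response: Any,
    decode: Callable[[Any], dict],
) -> Tuple[Any, Optional[dict]]:
    """Return the untouched raw response plus an optional annotation.

    Hypothetical helper that mimics the observability-only contract.
    It is not Relay's wrapper code.
    """
    try:
        annotation = decode(raw_response)
    except Exception:
        # Decode failures are non-fatal: no annotation, same response.
        annotation = None
    return raw_response, annotation


def strict_decode(response: dict) -> dict:
    # Raises KeyError when the payload does not match the expected shape.
    return {"id": response["id"], "model": response["model"]}


ok = {"id": "r-1", "model": "demo-model"}
bad = {"unexpected": True}

returned_ok, annotation_ok = annotate_safely(ok, strict_decode)
returned_bad, annotation_bad = annotate_safely(bad, strict_decode)
```

In both cases the caller receives the same object it would have received without a codec. Only the annotation differs.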

Before You Start

You need:

A managed LLM boundary from Wrap LLM Calls.
A raw provider response that is JSON-compatible.
A built-in response codec or a custom response codec for the provider response shape.
A subscriber or exporter that consumes annotated_response from LLM end events.

What Response Codecs Decode

Response codecs normalize provider output into fields that subscribers can inspect consistently:

Field	Purpose
`id`	Provider response identifier.
`model`	Model that served the request, when the provider returns it.
`message`	Primary assistant message content.
`tool_calls`	Tool calls requested by the model.
`finish_reason`	Normalized completion reason, such as `complete`, `length`, `tool_use`, or `content_filter`.
`usage`	Token accounting, including cache-read and cache-write counts when available. It can also include normalized `cost` when the provider reports cost or Relay can estimate it from known model pricing.
`api_specific`	Provider-specific fields that do not fit the common model.
`extra`	Additional unmodeled response fields.

Use these annotations for observability, export, and debugging. Keep business logic that changes the caller-visible response in the framework or provider adapter, not in the response codec.
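
As an illustration of the `finish_reason` row, a custom codec might map provider stop values onto the normalized names. The provider values on the left are the ones OpenAI Chat and Anthropic Messages return. The pairing itself, the `FINISH_REASON_MAP` name, and the choice to return None for unknown values are assumptions for this sketch, not the built-in codecs' behavior.

```python
from typing import Optional

# Provider stop value -> normalized completion reason (illustrative pairing).
FINISH_REASON_MAP = {
    "stop": "complete",          # OpenAI Chat
    "end_turn": "complete",      # Anthropic Messages
    "length": "length",          # OpenAI Chat
    "max_tokens": "length",      # Anthropic Messages
    "tool_calls": "tool_use",    # OpenAI Chat
    "tool_use": "tool_use",      # Anthropic Messages
    "content_filter": "content_filter",
}


def normalize_finish_reason(provider_value: Optional[str]) -> Optional[str]:
    """Map a provider stop value to a normalized reason, or None if unknown."""
    if provider_value is None:
        return None
    return FINISH_REASON_MAP.get(provider_value)
```

Returning None for an unrecognized value keeps the annotation honest. The raw provider value can still be preserved under `api_specific` or `extra`.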

Cost Estimation

Response codecs should keep reporting provider usage fields without rewriting the caller-visible response. If a provider or framework reports cost, map it to Usage.cost with source: "provider_reported". Otherwise Relay can layer cost estimation onto AnnotatedLlmResponse.usage.cost when all required inputs are available:

The decoded response includes model.
The managed LLM call name identifies the provider or route, such as openai, anthropic, or azure/openai, when provider-specific model pricing is needed.
The decoded response includes prompt and/or completion token usage.
Relay has an explicit model pricing entry for that model or alias.

Model pricing estimates carry pricing_provider, pricing_model, pricing_as_of, pricing_source, and currency metadata so stale model pricing can be audited without failing response decoding. Normalized cost uses currency-neutral amount fields such as total, input, output, cache_read, and cache_write. Unknown model pricing and missing token data are non-fatal: Relay omits the cost field and still exports token metrics and response annotations.
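
The arithmetic behind a flat per-token estimate can be sketched as follows. This is a minimal illustration, not Relay's estimator: the function name `estimate_flat_cost` is hypothetical, the rates are made-up numbers rather than real provider prices, and cache-read and cache-write tokens are left out for brevity.

```python
from typing import Optional


def estimate_flat_cost(usage: dict, rates: dict) -> Optional[dict]:
    """Estimate cost from token counts and per-million-token rates.

    Returns None when no token data is available, so the caller can
    omit the cost field instead of reporting a misleading zero.
    """
    prompt = usage.get("prompt_tokens")
    completion = usage.get("completion_tokens")
    if prompt is None and completion is None:
        return None

    cost = {"currency": "USD", "source": "model_pricing"}
    total = 0.0
    if prompt is not None:
        cost["input"] = prompt * rates["input_per_million"] / 1_000_000
        total += cost["input"]
    if completion is not None:
        cost["output"] = completion * rates["output_per_million"] / 1_000_000
        total += cost["output"]
    cost["total"] = total
    return cost


# Hypothetical rates, chosen only to make the arithmetic easy to follow.
demo_rates = {"input_per_million": 1.0, "output_per_million": 2.0}
demo_cost = estimate_flat_cost(
    {"prompt_tokens": 1000, "completion_tokens": 500}, demo_rates
)
```

With these made-up rates, 1000 prompt tokens cost 0.001 and 500 completion tokens cost 0.001, for a total of 0.002 in the catalog currency.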

Relay resolves model pricing through an active PricingResolver source chain. Provider or framework-reported cost remains authoritative. The resolver runs only when Usage.cost is missing. Relay does not ship provider price data by default: estimates require a configured inline, file, or embedding-provided model pricing source. With no configured source, every model is treated as unknown for model pricing.

Model pricing is runtime state, not a CLI-only feature. Any host that initializes Relay plugins can activate the built-in pricing component before it runs managed LLM calls. This includes application code, eval harnesses, custom agents, and framework integrations. The CLI commands below are a file-management convenience for the local gateway. Embedded hosts can pass the same component config directly through the plugin APIs.

Source precedence is deployment controlled:

Project or application overrides.
User/global device model pricing.
Enterprise-managed sources, such as a remotely synced file or a service backed by a database.

The built-in pricing plugin component accepts inline catalogs or JSON catalog files in precedence order. When Relay discovers plugins.toml files, it loads system config first, project config next, and user config last. For the pricing component, higher-priority sources are prepended instead of replacing lower-priority sources, so a user override can win for one model while enterprise or fleet model pricing remains available for everything else:

[[components]]
kind = "pricing"
enabled = true

[[components.config.sources]]
type = "file"
path = "/etc/nemo-relay/pricing.json"

[[components.config.sources]]
type = "inline"
[components.config.sources.catalog]
version = 1
entries = []
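
The effect of prepending can be sketched with a first-match-wins scan over an ordered source list. The function `resolve_entry`, the `label` key, and the model names and rates are hypothetical and exist only for this illustration. Relay's PricingResolver may be structured differently.

```python
from typing import Optional


def resolve_entry(sources: list, provider: str, model: str) -> Optional[dict]:
    """Return the first matching entry, scanning sources in precedence order."""
    for source in sources:
        for entry in source["entries"]:
            if entry["provider"] == provider and entry["model_id"] == model:
                return {"source": source["label"], "entry": entry}
    return None


# Lower-priority fleet catalog covering two models.
fleet = {
    "label": "file:/etc/nemo-relay/pricing.json",
    "entries": [
        {"provider": "openai", "model_id": "model-a",
         "rates": {"input_per_million": 1.0}},
        {"provider": "openai", "model_id": "model-b",
         "rates": {"input_per_million": 3.0}},
    ],
}
# Higher-priority override for a single model.
user_override = {
    "label": "inline:0",
    "entries": [
        {"provider": "openai", "model_id": "model-a",
         "rates": {"input_per_million": 0.5}},
    ],
}

# Prepending puts the override ahead of the fleet catalog without removing it.
sources = [user_override, fleet]
```

Here `model-a` resolves from the inline override, while `model-b` still resolves from the fleet file because the lower-priority source stays in the chain.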

Each catalog entry declares:

provider and canonical model_id.
aliases for dated or provider-specific model IDs.
currency, defaulting to USD.
unit, defaulting to per_token. Relay estimates only per_token entries in this version. per_request, per_second, and gpu_hour are representable for future source integrations but are not estimated.
rates per one million input, output, cache-read, and cache-write tokens for flat per_token entries.
rate_schedule for data-driven threshold-based model pricing, such as models whose full-request input/output rates change after a prompt-token threshold.
prompt_cache.read_accounting, which tells Relay whether cache-read tokens are already included in prompt tokens.
pricing_as_of and pricing_source for auditability.

Relay validates catalogs at startup and rejects duplicate canonical IDs or aliases within the same normalized provider/model key. The same model ID can appear under distinct providers, such as openai/gpt-4o-mini and azure/openai/gpt-4o-mini. Adding a model should be a catalog/source update plus tests. It should not require adding another Rust match arm.
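
A duplicate check of this kind can be sketched in a few lines. The function `find_duplicate_keys` is hypothetical, and the normalization rule used here (trim and lowercase) is an assumption. The normalization that Relay actually applies is not described in this guide.

```python
def find_duplicate_keys(entries: list) -> list:
    """Return provider/model keys that appear more than once.

    Each entry contributes its canonical model_id and every alias,
    scoped by its provider.
    """
    seen = set()
    duplicates = []
    for entry in entries:
        provider = entry["provider"].strip().lower()
        names = [entry["model_id"], *entry.get("aliases", [])]
        for name in names:
            key = f"{provider}/{name.strip().lower()}"
            if key in seen:
                duplicates.append(key)
            seen.add(key)
    return duplicates


# The same model ID under distinct providers is allowed.
distinct_providers = [
    {"provider": "openai", "model_id": "gpt-4o-mini"},
    {"provider": "azure/openai", "model_id": "gpt-4o-mini"},
]
# An alias that collides with another entry's canonical ID is rejected.
alias_clash = [
    {"provider": "openai", "model_id": "gpt-4o-mini"},
    {"provider": "openai", "model_id": "other-model", "aliases": ["gpt-4o-mini"]},
]
```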

Use the CLI to validate catalog files and manage file-backed model pricing sources:

$ nemo-relay model-pricing validate /path/to/pricing.json
$ nemo-relay model-pricing init --project
$ nemo-relay model-pricing add-source /path/to/pricing.json --project
$ nemo-relay model-pricing resolve gpt-4o-mini --provider openai --prompt-tokens 1000 --completion-tokens 500

model-pricing init creates or enables the pricing plugin component in the selected plugins.toml. The initialized component has an empty sources list. Use model-pricing add-source or an inline config edit to provide model pricing data.

model-pricing add-source validates the referenced JSON catalog before updating plugins.toml. It creates the pricing component if needed and prepends the new file source by default, making it the highest-priority source in that scope. Use --append when the file should be a lower-priority fallback. Both commands default to user config at $XDG_CONFIG_HOME/nemo-relay/plugins.toml. Pass --project for .nemo-relay/plugins.toml or --global for /etc/nemo-relay/plugins.toml.

model-pricing resolve uses the same discovered config path as the gateway. It reports the winning catalog source, the matched provider/model, and, when token counts are supplied, the estimated total cost. The source line is either file:<path> or inline:<index>, which makes overlapping project/user/fleet entries debuggable. The command is a read-only diagnostic: it does not mutate configuration.

nemo-relay doctor also validates enabled model pricing sources and reports missing, unreadable, or invalid catalogs before the gateway starts.

Model lookup is provider-aware and route-aware. Relay uses the managed LLM call name as the provider/route and first tries provider-scoped keys for the full model and terminal model name, then falls back to model-only suffixes. For example, a call named azure/openai with response model = "gpt-4o-mini" tries azure/openai/gpt-4o-mini before generic gpt-4o-mini. If the model string is itself routed, such as azure/openai/gpt-4o-mini, Relay can infer azure/openai for the terminal model before trying slash-delimited model-only suffixes. This keeps route-specific enterprise model pricing authoritative when configured while still allowing generic model pricing to apply to routed names.
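
One possible reading of this lookup order is sketched below. The function `lookup_candidates` is hypothetical, and the exact key order inside Relay may differ. In particular, this sketch does not model route inference from a routed model string as a separate step.

```python
def lookup_candidates(call_name: str, model: str) -> list:
    """List catalog keys to try, most specific first."""
    terminal = model.rsplit("/", 1)[-1]
    # Provider-scoped keys: full model string, then terminal model name.
    candidates = [f"{call_name}/{model}", f"{call_name}/{terminal}"]
    # Model-only suffixes, longest first: "a/b/c" -> "a/b/c", "b/c", "c".
    parts = model.split("/")
    for start in range(len(parts)):
        candidates.append("/".join(parts[start:]))
    # Drop repeats while keeping first-seen order.
    return list(dict.fromkeys(candidates))


plain = lookup_candidates("azure/openai", "gpt-4o-mini")
routed = lookup_candidates("gateway", "azure/openai/gpt-4o-mini")
```

For the example in the text, `plain` lists the provider-scoped key `azure/openai/gpt-4o-mini` before the generic `gpt-4o-mini`, so route-specific pricing wins when it is configured.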

For threshold-based model pricing, use rate_schedule.type = "prompt_token_threshold". Relay selects exactly one tier from prompt_tokens and applies that tier to the full request. It does not price only the overflow tokens at the higher rate. This matches providers that publish “short context” and “long context” prices for the entire request/session. If prompt_tokens is missing for a thresholded entry, Relay omits the estimate instead of guessing.

{
  "provider": "google",
  "model_id": "gemini-3.1-pro-preview",
  "aliases": ["gemini-3.1-pro-preview-customtools"],
  "pricing_as_of": "2026-06-05",
  "pricing_source": "https://ai.google.dev/gemini-api/docs/pricing",
  "rate_schedule": {
    "type": "prompt_token_threshold",
    "applies_to": "full_request",
    "tiers": [
      {
        "max_prompt_tokens": 200000,
        "rates": {
          "input_per_million": 2.0,
          "output_per_million": 12.0,
          "cache_read_per_million": 0.2
        }
      },
      {
        "min_prompt_tokens": 200001,
        "rates": {
          "input_per_million": 4.0,
          "output_per_million": 18.0,
          "cache_read_per_million": 0.4
        }
      }
    ]
  },
  "prompt_cache": {
    "read_accounting": "included_in_prompt_tokens"
  }
}
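
Tier selection for an entry like the one above can be sketched as follows. The functions `select_tier` and `estimate_tiered` are hypothetical, the inclusive bounds are an assumption, and cache-read accounting is omitted to keep the arithmetic visible. The rates are copied from the example entry.

```python
from typing import Optional

TIERS = [
    {"max_prompt_tokens": 200000,
     "rates": {"input_per_million": 2.0, "output_per_million": 12.0}},
    {"min_prompt_tokens": 200001,
     "rates": {"input_per_million": 4.0, "output_per_million": 18.0}},
]


def select_tier(tiers: list, prompt_tokens: Optional[int]) -> Optional[dict]:
    """Pick exactly one tier from prompt_tokens, or None when it is unknown."""
    if prompt_tokens is None:
        return None  # No guess for thresholded entries.
    for tier in tiers:
        low = tier.get("min_prompt_tokens", 0)
        high = tier.get("max_prompt_tokens")
        if prompt_tokens >= low and (high is None or prompt_tokens <= high):
            return tier
    return None


def estimate_tiered(tiers: list, prompt_tokens: Optional[int],
                    completion_tokens: int) -> Optional[dict]:
    """Apply the selected tier's rates to the full request."""
    tier = select_tier(tiers, prompt_tokens)
    if tier is None:
        return None
    rates = tier["rates"]
    input_cost = prompt_tokens * rates["input_per_million"] / 1_000_000
    output_cost = completion_tokens * rates["output_per_million"] / 1_000_000
    return {"input": input_cost, "output": output_cost,
            "total": input_cost + output_cost}


short_context = estimate_tiered(TIERS, 100_000, 1_000)
long_context = estimate_tiered(TIERS, 250_000, 1_000)
unknown_prompt = estimate_tiered(TIERS, None, 1_000)
```

With 250,000 prompt tokens the whole prompt is priced at the long-context rate, so the input amount is 1.0. A marginal scheme that priced only the overflow at the higher rate would give 0.6, which is not what the text describes.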

Database-backed or remote model pricing should be implemented as a source that returns a validated PricingCatalog snapshot to Relay. Keep database queries, service auth, refresh cadence, and caching outside the LLM response hot path. A fleet deployment can refresh /etc/nemo-relay/pricing.json from an IT-managed service, or embed a custom Rust PricingSource that reads from a database and installs a PricingResolver snapshot during process startup.

External model pricing catalogs should be converted into Relay catalog JSON out-of-band and then loaded through a file source, unless the embedding application installs a custom Rust PricingSource directly.

Embedded applications and eval harnesses can initialize the built-in pricing component for model pricing directly:

Python

import nemo_relay

# Declare the built-in pricing component with one file-backed catalog source.
config = nemo_relay.plugin.PluginConfig(
    components=[
        nemo_relay.plugin.ComponentSpec(
            kind="pricing",
            config={
                "sources": [
                    {"type": "file", "path": "./pricing.json"},
                ],
            },
        )
    ]
)

# Fail fast on configuration errors before initializing plugins.
report = nemo_relay.plugin.validate(config)
if any(diagnostic["level"] == "error" for diagnostic in report["diagnostics"]):
    raise RuntimeError(report["diagnostics"])

# Run this inside an async function or another context that allows await.
await nemo_relay.plugin.initialize(config)

Initialize model pricing once during process or harness startup, before the managed LLM calls whose responses should be cost-annotated. In tests or reusable harnesses, clear plugin configuration during teardown if later cases need a different resolver.

Built-in response codecs attach estimated cost directly to AnnotatedLlmResponse.usage.cost when model pricing is known. Managed LLM wrappers also enrich decoded custom response-codec output when the custom codec returns model and usage but omits usage.cost. Existing cost values are preserved, so provider-reported cost remains authoritative in the annotation.

Observability exporters prefer codec-normalized usage and cost, then fall back to raw payload fields and model-pricing estimates, subject to each exporter’s currency and reported-cost policy. When cost is available, each exporter projects it per the exporter field mapping below.

Token and Cost Field Semantics

This section is the stable reference for the token and cost fields on LLM end events, carried on AnnotatedLlmResponse.usage.

Granularity

Every token and cost value is per LLM call (one provider completion) unless it is an explicit aggregate:

AnnotatedLlmResponse.usage — The single LLM end event it is attached to.
OpenTelemetry and OpenInference attributes — The single LLM span they appear on; spans are never summed across calls.
ATIF steps — An exported LLM call typically yields a user start step (no metrics) and an agent end step that carries the call’s metrics.
ATIF final_metrics.total_* — The only aggregate: a per-trajectory sum of the metric fields present on its steps. It excludes embedded subagent trajectories and can be partial.

No exporter emits a running cross-call total other than ATIF final_metrics.
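
The per-trajectory aggregate can be pictured as a sum over whichever metric fields are present on each step. This is a sketch under assumptions: the function `sum_final_metrics` and the step layout (a `metrics` dict on agent steps) are illustrative, and only the field names come from the exporter mapping in this guide.

```python
# Step metric field -> final_metrics aggregate field.
STEP_TO_TOTAL = {
    "prompt_tokens": "total_prompt_tokens",
    "completion_tokens": "total_completion_tokens",
    "cached_tokens": "total_cached_tokens",
    "cost_usd": "total_cost_usd",
}


def sum_final_metrics(steps: list) -> dict:
    """Sum the metric fields that are present; absent fields stay unknown."""
    totals = {}
    for step in steps:
        metrics = step.get("metrics") or {}
        for field, total_field in STEP_TO_TOTAL.items():
            value = metrics.get(field)
            if value is None:
                continue  # Missing is not zero, so skip it.
            totals[total_field] = totals.get(total_field, 0) + value
    return totals


steps = [
    {"source": "user"},  # start step, no metrics
    {"source": "agent",
     "metrics": {"prompt_tokens": 8, "completion_tokens": 5, "cost_usd": 0.25}},
    {"source": "user"},
    {"source": "agent",
     "metrics": {"prompt_tokens": 20, "completion_tokens": 7}},  # no cost known
]
final_metrics = sum_final_metrics(steps)
```

The second call has no cost, so `total_cost_usd` covers only the first call. This is the sense in which the aggregate can be partial.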

Usage Fields

All normalized fields are optional. A provider can omit a field, while a codec can compute one (such as Anthropic’s total_tokens) and configured pricing can synthesize cost. Usage has no catch-all field, so provider usage fields that Relay does not model are dropped.

Field	Meaning
`prompt_tokens`	Input/prompt tokens.
`completion_tokens`	Output/completion tokens, passed through unmodified. For OpenAI Responses this is the provider’s `output_tokens` (which per OpenAI already includes any reasoning tokens); the reasoning count is reported separately under `api_specific` and is not added on top.
`total_tokens`	Provider-reported, or computed as `prompt + completion` by some codecs (such as Anthropic) when the provider omits it.
`cache_read_tokens`	Prompt-cache read tokens, when the provider reports prompt caching.
`cache_write_tokens`	Prompt-cache write tokens (Anthropic-style providers).
`cost`	Normalized `CostEstimate`, when reported by the provider or estimable from configured pricing.

Built-in codecs normalize provider field names as follows:

Normalized field	OpenAI Chat	OpenAI Responses	Anthropic Messages
`prompt_tokens`	`prompt_tokens`	`input_tokens`	`input_tokens`
`completion_tokens`	`completion_tokens`	`output_tokens`	`output_tokens`
`total_tokens`	`total_tokens`	`total_tokens`	computed
`cache_read_tokens`	`prompt_tokens_details.cached_tokens`	`input_tokens_details.cached_tokens`	`cache_read_input_tokens`
`cache_write_tokens`	—	—	`cache_creation_input_tokens`

Built-in codecs preserve only modeled provider-specific usage details under api_specific; other usage fields are dropped. For example, OpenAI Responses reasoning token counts are kept under api_specific (output_tokens_details), but OpenAI Chat completion_tokens_details is not.
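
The field mapping in the table above can be sketched as a single function. The function `normalize_usage` and the `shape` labels are hypothetical, and the built-in codecs are not implemented this way. The sketch only restates the table, including the computed Anthropic total.

```python
def normalize_usage(shape: str, usage: dict) -> dict:
    """Map provider usage fields to normalized names; drop unknown values."""
    if shape == "openai_chat":
        details = usage.get("prompt_tokens_details") or {}
        normalized = {
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
            "total_tokens": usage.get("total_tokens"),
            "cache_read_tokens": details.get("cached_tokens"),
        }
    elif shape == "openai_responses":
        details = usage.get("input_tokens_details") or {}
        normalized = {
            "prompt_tokens": usage.get("input_tokens"),
            "completion_tokens": usage.get("output_tokens"),
            "total_tokens": usage.get("total_tokens"),
            "cache_read_tokens": details.get("cached_tokens"),
        }
    elif shape == "anthropic_messages":
        prompt = usage.get("input_tokens")
        completion = usage.get("output_tokens")
        both_known = prompt is not None and completion is not None
        normalized = {
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            # Computed, because the provider does not report a total.
            "total_tokens": prompt + completion if both_known else None,
            "cache_read_tokens": usage.get("cache_read_input_tokens"),
            "cache_write_tokens": usage.get("cache_creation_input_tokens"),
        }
    else:
        raise ValueError(f"unknown response shape: {shape}")
    # Keep explicit zeros; drop only values that are genuinely unknown.
    return {key: value for key, value in normalized.items() if value is not None}


chat = normalize_usage("openai_chat", {
    "prompt_tokens": 8, "completion_tokens": 5, "total_tokens": 13,
    "prompt_tokens_details": {"cached_tokens": 2},
})
anthropic = normalize_usage("anthropic_messages", {
    "input_tokens": 10, "output_tokens": 4, "cache_read_input_tokens": 0,
})
```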

Cost Fields

CostEstimate carries cost amounts (in currency) plus pricing provenance. Refer to Cost Estimation above for resolution order and pricing setup.

Field	Meaning
`total`	Optional total cost in `currency`. When absent, some exporters derive a total from the component amounts.
`currency`	ISO 4217 code; defaults to `USD`.
`input` / `output` / `cache_read` / `cache_write`	Per-category amounts in `currency`.
`source`	`provider_reported` (authoritative) or `model_pricing` (estimated).
`pricing_provider` / `pricing_model` / `pricing_as_of` / `pricing_source`	Estimate provenance, for auditing stale pricing.

Missing is not zero: an absent cost or token field means unknown, while an explicit 0 is a reported value and is preserved. Relay does not convert currencies.
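
A consumer that respects this rule checks for absence explicitly instead of relying on truthiness. The sketch below shows one way an exporter might derive a total from component amounts. The function `derive_total` is hypothetical, and how real exporters derive totals is not specified here.

```python
from typing import Optional

COMPONENTS = ("input", "output", "cache_read", "cache_write")


def derive_total(cost: dict) -> Optional[float]:
    """Return the reported total, else a sum of known components, else None."""
    if cost.get("total") is not None:
        # An explicit 0 is a reported value and is preserved.
        return cost["total"]
    known = [cost[name] for name in COMPONENTS if cost.get(name) is not None]
    if not known:
        return None  # Unknown, not zero.
    return sum(known)
```

Writing `if cost.get("total"):` instead would treat a reported 0 as missing, which is exactly the confusion the rule above warns against.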

Exporter Field Mapping

Each exporter projects usage/cost differently. Projections do not change the canonical fields above.

Value	ATOF	ATIF step / `final_metrics`	OpenInference	OpenTelemetry
Prompt tokens	full `usage` preserved	`prompt_tokens` / `total_prompt_tokens`	`llm.token_count.prompt`	not emitted
Completion tokens	preserved	`completion_tokens` / `total_completion_tokens`	`llm.token_count.completion`	not emitted
Total tokens	preserved	no first-class field	`llm.token_count.total`	not emitted
Cache read / write	preserved	summed into `cached_tokens` / `total_cached_tokens`	`llm.token_count.prompt_details.cache_read` / `…cache_write`	not emitted
Cost	full `cost` preserved	`cost_usd` / `total_cost_usd` (USD only)	`llm.cost.total` (USD only)	`nemo_relay.llm.cost.total` + `nemo_relay.llm.cost.currency` (any currency)

OpenTelemetry carries cost in any currency, while ATIF and OpenInference report cost only when it is USD-denominated and otherwise omit it. ATIF derives metrics from codec-normalized usage where available and fills missing supported fields from the raw payload. metrics.extra holds only unmapped keys from the raw usage/token_usage object (for example reasoning token counts, or a raw total_tokens), and only when the step already has a recognized metric. Normalized-only or total-only values are not projected.
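
The currency policy for the cost row can be sketched as two small projections. The attribute names come from the mapping table. The function names are hypothetical, and treating a missing currency as USD follows the documented default rather than any exporter source code.

```python
from typing import Optional


def project_openinference(cost: Optional[dict]) -> dict:
    """Emit llm.cost.total only for USD-denominated cost."""
    if cost is None or cost.get("total") is None:
        return {}
    if cost.get("currency", "USD") != "USD":
        return {}  # Relay does not convert currencies, so omit the value.
    return {"llm.cost.total": cost["total"]}


def project_opentelemetry(cost: Optional[dict]) -> dict:
    """Emit total and currency together, whatever the currency is."""
    if cost is None or cost.get("total") is None:
        return {}
    return {
        "nemo_relay.llm.cost.total": cost["total"],
        "nemo_relay.llm.cost.currency": cost.get("currency", "USD"),
    }


usd_cost = {"total": 0.5, "currency": "USD"}
eur_cost = {"total": 0.5, "currency": "EUR"}
```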

Stability

The Usage and CostEstimate field names and meanings, and the exporter mappings above, are stable as of ATOF 0.1 (ATIF schema ATIF-v1.7, pricing catalog version: 1). Future NeMo Relay releases can add new optional fields to the serialized JSON/ATOF shapes. Renames or removals are breaking changes and are called out in release notes.

The Rust Usage and CostEstimate structs and the CostSource enum are exhaustive, so adding a field or variant is a source-breaking change for Rust consumers.

The following behaviors are intentional in this release but can change later:

OpenTelemetry emits cost only, not token counts.
ATIF and OpenInference report cost only in USD.
Reasoning tokens are not a first-class Usage field.
Bindings expose usage/cost as snake_case JSON rather than typed objects.

Built-in Response Codecs

The built-in provider codecs also implement response decoding:

OpenAIChatCodec
OpenAIResponsesCodec
AnthropicMessagesCodec

Choose the codec that matches the actual provider response shape. For example, do not use OpenAIChatCodec for an OpenAI Responses API payload only because both came from an OpenAI-compatible provider.

Attach a Built-in Response Codec

The examples below attach built-in response codecs for supported provider response shapes.

Python

import nemo_relay
from nemo_relay import LLMRequest
from nemo_relay.codecs import OpenAIChatCodec

# Stand-in provider call that returns an OpenAI Chat-shaped response.
async def invoke_provider(request: LLMRequest):
    return {
        "id": "chatcmpl-demo",
        "model": request.content["model"],
        "choices": [
            {
                "finish_reason": "stop",
                "message": {"role": "assistant", "content": "Hello from the provider."},
            }
        ],
        "usage": {"prompt_tokens": 8, "completion_tokens": 5, "total_tokens": 13},
    }

codec = OpenAIChatCodec()
# Run this inside an async function or another context that allows await.
response = await nemo_relay.llm.execute(
    "openai-chat",
    LLMRequest({}, {"model": "gpt-4o-mini", "messages": []}),
    invoke_provider,
    model_name="gpt-4o-mini",
    response_codec=codec,
)

Read Annotated Responses

Subscribers can inspect annotated_response on LLM end events. The exact event category fields are binding-provided, so defensive checks should confirm the annotation exists before reading it.

Python

import nemo_relay

def on_event(event):
    annotated = getattr(event, "annotated_response", None)
    if annotated is None:
        return

    print("model", annotated.model)
    print("text", annotated.response_text())
    print("usage", annotated.usage)
    print("cost", (annotated.usage or {}).get("cost"))

nemo_relay.subscribers.register("response-debugger", on_event)

Custom Response Codecs

Use a custom response codec when the provider or framework response does not match a built-in shape.

In Python, a custom response codec can route to built-in codecs and return their native AnnotatedLLMResponse values:

from nemo_relay.codecs import OpenAIChatCodec, OpenAIResponsesCodec

class OpenAIRoutingResponseCodec:
    def __init__(self):
        self.chat = OpenAIChatCodec()
        self.responses = OpenAIResponsesCodec()

    def decode_response(self, response):
        if response.get("object") == "response":
            return self.responses.decode_response(response)
        return self.chat.decode_response(response)

In Node.js, implement decodeResponse and return the normalized response JSON shape:

import type { JsonValue, LlmResponseCodec } from 'nemo-relay-node/typed';

const frameworkResponseCodec: LlmResponseCodec = {
  decodeResponse(response: JsonValue): JsonValue {
    const raw = response as {
      id?: string;
      model_name?: string;
      text?: string;
      stop_reason?: string;
      token_usage?: {
        input?: number;
        output?: number;
      };
    };

    return {
      id: raw.id ?? null,
      model: raw.model_name ?? null,
      message: raw.text ?? '',
      finish_reason: raw.stop_reason === 'max_tokens' ? 'length' : 'complete',
      usage: {
        prompt_tokens: raw.token_usage?.input ?? null,
        completion_tokens: raw.token_usage?.output ?? null,
        total_tokens:
          raw.token_usage?.input === undefined || raw.token_usage?.output === undefined
            ? null
            : raw.token_usage.input + raw.token_usage.output,
      },
      provider_stop_reason: raw.stop_reason ?? null,
    };
  },
};

In Rust, implement LlmResponseCodec directly:

use nemo_relay::codec::request::MessageContent;
use nemo_relay::codec::response::{AnnotatedLlmResponse, FinishReason, Usage};
use nemo_relay::codec::traits::LlmResponseCodec;
use nemo_relay::error::{FlowError, Result};
use serde::Deserialize;
use serde_json::{Map, Value as Json};

#[derive(Deserialize)]
struct FrameworkResponse {
    id: Option<String>,
    model_name: Option<String>,
    text: Option<String>,
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
}

struct FrameworkResponseCodec;

impl LlmResponseCodec for FrameworkResponseCodec {
    fn decode_response(&self, response: &Json) -> Result<AnnotatedLlmResponse> {
        let raw: FrameworkResponse = serde_json::from_value(response.clone())
            .map_err(|error| FlowError::Internal(error.to_string()))?;
        let total_tokens = match (raw.input_tokens, raw.output_tokens) {
            (Some(input), Some(output)) => Some(input + output),
            _ => None,
        };

        Ok(AnnotatedLlmResponse {
            id: raw.id,
            model: raw.model_name,
            message: raw.text.map(MessageContent::Text),
            tool_calls: None,
            finish_reason: Some(FinishReason::Complete),
            usage: Some(Usage {
                prompt_tokens: raw.input_tokens,
                completion_tokens: raw.output_tokens,
                total_tokens,
                cache_read_tokens: None,
                cache_write_tokens: None,
                cost: None,
            }),
            api_specific: None,
            extra: Map::new(),
        })
    }
}

Streaming Responses

Streaming LLM wrappers decode the aggregated response produced by the stream finalizer. The response codec does not see each token or chunk. Use stream collectors for chunk-level behavior, and use response codecs for the final normalized end-event annotation.
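
To make the division of labor concrete, the sketch below folds OpenAI Chat-style stream chunks into one final response of the shape a chat response codec would decode. The function `finalize_chat_stream` is hypothetical and is not Relay's stream finalizer API. It assumes each chunk carries `choices[].delta.content` and that usage, when present, arrives on a late chunk.

```python
def finalize_chat_stream(chunks: list) -> dict:
    """Aggregate streamed chunks into a single chat-completion-shaped dict."""
    text_parts = []
    finish_reason = None
    usage = None
    response_id = None
    model = None
    for chunk in chunks:
        response_id = response_id or chunk.get("id")
        model = model or chunk.get("model")
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
            finish_reason = choice.get("finish_reason") or finish_reason
        usage = chunk.get("usage") or usage
    return {
        "id": response_id,
        "model": model,
        "choices": [{
            "finish_reason": finish_reason,
            "message": {"role": "assistant", "content": "".join(text_parts)},
        }],
        "usage": usage,
    }


chunks = [
    {"id": "chatcmpl-demo", "model": "demo-model",
     "choices": [{"delta": {"content": "Hello"}, "finish_reason": None}]},
    {"id": "chatcmpl-demo", "model": "demo-model",
     "choices": [{"delta": {"content": " world"}, "finish_reason": "stop"}]},
    {"id": "chatcmpl-demo", "model": "demo-model", "choices": [],
     "usage": {"prompt_tokens": 8, "completion_tokens": 2, "total_tokens": 10}},
]
final_response = finalize_chat_stream(chunks)
```

The codec then runs once on `final_response`, never on the individual chunks.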

Validation Checklist

Use this checklist to confirm the implementation preserves the expected runtime contract.

The response codec matches the actual provider response shape.
decode_response returns a normalized response with safe, JSON-compatible fields.
The provider response returned to the application is unchanged.
Subscribers see annotated_response only on LLM end events where decode succeeds.
Decode errors are tested and do not break the LLM call.
Streaming finalizers produce the same shape the response codec expects.

Common Issues

Check these symptoms first when the workflow does not behave as expected.

No annotation appears: The response codec returned an error or the raw provider response did not match the codec.
Returned response changed unexpectedly: Response codecs are not the right place to mutate caller-visible output.
Tool calls are missing: The codec did not map the provider’s tool-call structure into tool_calls.
Usage is inconsistent across providers: Normalize known token fields and preserve provider-specific usage details in api_specific or extra.

Next Steps

Use these links to continue from this workflow into the next related task.

Use Provider Codecs for request-side provider codecs and full request/response examples.
Use Wrap LLM Calls to add the managed LLM boundary first.
Use Observability after annotations are visible in local subscribers.