Provider Response Codecs

Use this guide when subscribers, exporters, or diagnostics need a provider-neutral view of raw LLM responses.

What You Build

You will attach a response codec to a managed LLM wrapper so NeMo Relay can decode provider responses into AnnotatedLLMResponse data for LLM end events.

Response codecs are observability-only:

  • They do not rewrite the value returned to the application.
  • They do not run response middleware.
  • They attach normalized response data to lifecycle events for subscribers and exporters.
  • Decode failures are non-fatal; the LLM call still returns the provider response and the end event is emitted without an annotation.
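
The non-fatal decode contract in the list above can be pictured with a minimal sketch. The `build_end_event` helper and the event dictionary shape are hypothetical illustrations, not Relay APIs; only the behavior (the response passes through untouched, and the annotation is omitted on failure) comes from this guide.

```python
from typing import Any, Callable, Optional


def build_end_event(
    raw_response: dict[str, Any],
    decode: Optional[Callable[[dict[str, Any]], dict[str, Any]]],
) -> dict[str, Any]:
    """Hypothetical helper: build an LLM end event without touching the response."""
    event: dict[str, Any] = {"kind": "llm_end", "response": raw_response}
    if decode is None:
        return event
    try:
        event["annotated_response"] = decode(raw_response)
    except Exception:
        # Decode failures are non-fatal: the event is emitted without an annotation.
        pass
    return event


def failing_decode(response: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for a codec that does not recognize the provider shape."""
    raise ValueError("unexpected provider shape")


raw = {"id": "resp-1", "text": "hello"}
event = build_end_event(raw, failing_decode)
```

In this sketch the caller still receives `raw` as-is, and subscribers simply see an end event with no `annotated_response` key.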

Before You Start

You need:

  • A managed LLM boundary from Wrap LLM Calls.
  • A raw provider response that is JSON-compatible.
  • A built-in response codec or a custom response codec for the provider response shape.
  • A subscriber or exporter that consumes annotated_response from LLM end events.

What Response Codecs Decode

Response codecs normalize provider output into fields that subscribers can inspect consistently:

| Field | Purpose |
| --- | --- |
| id | Provider response identifier. |
| model | Model that served the request, when the provider returns it. |
| message | Primary assistant message content. |
| tool_calls | Tool calls requested by the model. |
| finish_reason | Normalized completion reason, such as complete, length, tool_use, or content_filter. |
| usage | Token accounting, including cache-read and cache-write counts when available. May also include normalized cost when the provider reports cost or Relay can estimate it from known model pricing. |
| api_specific | Provider-specific fields that do not fit the common model. |
| extra | Additional unmodeled response fields. |

Use these annotations for observability, export, and debugging. Keep business logic that changes the caller-visible response in the framework or provider adapter, not in the response codec.
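
As an illustration of what "normalized" means for finish_reason, the sketch below maps a few well-known provider stop values onto the normalized names listed in the table. The mapping table and the `normalize_finish_reason` name are assumptions made for this example; the built-in codecs may map values differently.

```python
from typing import Optional

# Hypothetical mapping table; not taken from the Relay source.
_FINISH_REASONS = {
    "stop": "complete",           # OpenAI Chat Completions
    "end_turn": "complete",       # Anthropic Messages
    "stop_sequence": "complete",  # Anthropic Messages
    "length": "length",           # OpenAI Chat Completions
    "max_tokens": "length",       # Anthropic Messages
    "tool_calls": "tool_use",     # OpenAI Chat Completions
    "tool_use": "tool_use",       # Anthropic Messages
    "content_filter": "content_filter",
}


def normalize_finish_reason(provider_reason: Optional[str]) -> Optional[str]:
    """Return a normalized reason, or None when the value is absent or unknown."""
    if provider_reason is None:
        return None
    return _FINISH_REASONS.get(provider_reason)
```

Returning None for unknown values, rather than guessing, keeps the annotation honest; the original provider value can still be preserved in api_specific or extra.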

Cost Estimation

Response codecs should keep reporting provider usage fields without rewriting the caller-visible response. If a provider or framework reports cost, map it to Usage.cost with source: "provider_reported". Otherwise Relay can layer cost estimation onto AnnotatedLlmResponse.usage.cost when all required inputs are available:

  • The decoded response includes model.
  • The managed LLM call name identifies the provider or route, such as openai, anthropic, or azure/openai, when provider-specific pricing is needed.
  • The decoded response includes prompt and/or completion token usage.
  • Relay has an explicit pricing entry for that model or alias.

Pricing estimates carry pricing_provider, pricing_model, pricing_as_of, pricing_source, and currency metadata so stale pricing can be audited without failing response decoding. Normalized cost uses currency-neutral amount fields such as total, input, output, cache_read, and cache_write. Unknown model pricing and missing token data are non-fatal: Relay omits the cost field and still exports token metrics and response annotations.
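
The gating rules and the per-million arithmetic can be sketched in a few lines. Everything here is illustrative: `estimate_cost`, the `CATALOG` layout, and the example rates are assumptions for this sketch, not Relay internals. Only the field names (total, input, output, currency, and the pricing_* metadata) come from the text above.

```python
from typing import Any, Optional

# Hypothetical catalog keyed by "provider/model"; rates are per one million tokens.
CATALOG: dict[str, dict[str, Any]] = {
    "example/demo-model": {
        "currency": "USD",
        "input_per_million": 1.0,
        "output_per_million": 2.0,
        "pricing_as_of": "2026-01-01",
        "pricing_source": "internal-example",
    }
}


def estimate_cost(
    provider: str,
    model: Optional[str],
    usage: Optional[dict[str, Any]],
    catalog: dict[str, dict[str, Any]] = CATALOG,
) -> Optional[dict[str, Any]]:
    """Return a cost annotation, or None when a required input is missing."""
    if not model or not usage:
        return None
    prompt = usage.get("prompt_tokens")
    completion = usage.get("completion_tokens")
    if prompt is None and completion is None:
        return None  # No token data: omit cost, keep the other annotations.
    entry = catalog.get(f"{provider}/{model}")
    if entry is None:
        return None  # Unknown model pricing is non-fatal.
    input_cost = (prompt or 0) * entry["input_per_million"] / 1_000_000
    output_cost = (completion or 0) * entry["output_per_million"] / 1_000_000
    return {
        "total": input_cost + output_cost,
        "input": input_cost,
        "output": output_cost,
        "currency": entry["currency"],
        "pricing_provider": provider,
        "pricing_model": model,
        "pricing_as_of": entry["pricing_as_of"],
        "pricing_source": entry["pricing_source"],
    }
```

Every early return maps to one of the "required inputs" bullets: a missing model, missing token counts, or a missing pricing entry all yield None instead of raising.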

Relay resolves pricing through an active PricingResolver source chain. Provider or framework-reported cost remains authoritative; the resolver is used only when Usage.cost is missing. Relay does not ship provider price data by default: estimates require a configured inline, file, or embedding-provided pricing source. With no configured source, every model is treated as unknown for pricing.

Pricing is runtime state, not a CLI-only feature. Any host that initializes Relay plugins can activate the built-in pricing component before it runs managed LLM calls. This includes application code, eval harnesses, custom agents, framework integrations, and third-party patches. The CLI commands below are a file-management convenience for the local gateway; embedded hosts can pass the same component config directly through the plugin APIs.

Source precedence is deployment controlled:

  1. Project or application overrides.
  2. User/global device pricing.
  3. Enterprise-managed sources, such as a remotely synced file or a service backed by a database.

The built-in pricing plugin component accepts inline catalogs or JSON catalog files in precedence order. In discovered plugins.toml config, system config loads first, project config loads next, and user config loads last. For the pricing component, higher-priority sources are prepended instead of replacing lower-priority sources, so a user override can win for one model while enterprise or fleet pricing remains available for everything else:

[[components]]
kind = "pricing"
enabled = true

[[components.config.sources]]
type = "file"
path = "/etc/nemo-relay/pricing.json"

[[components.config.sources]]
type = "inline"
[components.config.sources.catalog]
version = 1
entries = []
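
The prepend-and-first-match behavior described above can be modeled with a short sketch. The `build_chain` and `resolve` helpers, the source dictionaries, and the example keys and rates are hypothetical; the sketch only assumes that later-loaded config layers are prepended and that the first source containing a key wins.

```python
from typing import Any, Optional


def build_chain(layers: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Layers arrive in load order; each layer's sources are prepended to the chain."""
    chain: list[dict[str, Any]] = []
    for layer_sources in layers:
        chain = list(layer_sources) + chain
    return chain


def resolve(chain: list[dict[str, Any]], key: str) -> Optional[tuple[str, dict[str, Any]]]:
    """Return (source name, entry) from the first source that knows the key."""
    for source in chain:
        entry = source["entries"].get(key)
        if entry is not None:
            return source["name"], entry
    return None


system_layer = [{
    "name": "file:/etc/nemo-relay/pricing.json",
    "entries": {
        "example/model-a": {"input_per_million": 1.0},
        "example/model-b": {"input_per_million": 3.0},
    },
}]
user_layer = [{
    "name": "inline:0",
    "entries": {"example/model-a": {"input_per_million": 0.5}},
}]

chain = build_chain([system_layer, user_layer])
```

With this chain, `example/model-a` resolves from the user source while `example/model-b` still falls through to the system file, which is the "override one model, keep the rest" behavior the paragraph describes.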

Each catalog entry declares:

  • provider and canonical model_id.
  • aliases for dated or provider-specific model IDs.
  • currency, defaulting to USD.
  • unit, defaulting to per_token. Relay estimates only per_token entries in this version; per_request, per_second, and gpu_hour are representable for future source integrations but are not estimated.
  • rates per one million input, output, cache-read, and cache-write tokens for flat per_token entries.
  • rate_schedule for data-driven threshold pricing, such as models whose full-request input/output rates change after a prompt-token threshold.
  • prompt_cache.read_accounting, which tells Relay whether cache-read tokens are already included in prompt tokens.
  • pricing_as_of and pricing_source for auditability.

Relay validates catalogs at startup and rejects duplicate canonical IDs or aliases within the same normalized provider/model key. The same model ID can appear under distinct providers, such as openai/gpt-4o-mini and azure/openai/gpt-4o-mini. Adding a model should be a catalog/source update plus tests; it should not require adding another Rust match arm.
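
A duplicate check of this kind might look like the following sketch. The `validate_catalog` function and its normalization rule (trim and lowercase) are assumptions for illustration; Relay's actual normalization is not specified here.

```python
from typing import Any


def validate_catalog(entries: list[dict[str, Any]]) -> list[str]:
    """Return error strings for canonical IDs or aliases declared more than once."""
    seen: dict[str, int] = {}
    errors: list[str] = []
    for index, entry in enumerate(entries):
        provider = entry["provider"].strip().lower()
        names = [entry["model_id"], *entry.get("aliases", [])]
        for name in names:
            # Assumed normalization: provider-scoped, trimmed, lowercase key.
            key = f"{provider}/{name.strip().lower()}"
            if key in seen:
                errors.append(f"entry {index}: {key} already declared by entry {seen[key]}")
            else:
                seen[key] = index
    return errors


entries = [
    {"provider": "openai", "model_id": "gpt-4o-mini"},
    {"provider": "azure/openai", "model_id": "gpt-4o-mini"},          # distinct provider: allowed
    {"provider": "openai", "model_id": "other", "aliases": ["GPT-4o-mini"]},  # alias collides
]
errors = validate_catalog(entries)
```

Because keys are scoped by provider, the openai and azure/openai entries coexist, while the colliding alias in the third entry is reported.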

Use the CLI to validate catalog files and manage file-backed pricing sources:

$ nemo-relay pricing validate /path/to/pricing.json
$ nemo-relay pricing init --project
$ nemo-relay pricing add-source /path/to/pricing.json --project
$ nemo-relay pricing resolve gpt-4o-mini --provider openai --prompt-tokens 1000 --completion-tokens 500

pricing init creates or enables the pricing plugin component in the selected plugins.toml. The initialized component has an empty sources list; use pricing add-source or an inline config edit to provide pricing data.

pricing add-source validates the referenced JSON catalog before updating plugins.toml. It creates the pricing component if needed and prepends the new file source by default, making it the highest-priority source in that scope. Use --append when the file should be a lower-priority fallback. Both commands default to user config at $XDG_CONFIG_HOME/nemo-relay/plugins.toml; pass --project for .nemo-relay/plugins.toml or --global for /etc/nemo-relay/plugins.toml.

pricing resolve uses the same discovered config path as the gateway. It reports the winning catalog source, matched provider/model, and, when token counts are supplied, the estimated total cost. The source line is one of file:<path> or inline:<index>, which makes overlapping project/user/fleet entries debuggable. This is a dry diagnostic command; it does not mutate configuration.

nemo-relay doctor also validates enabled pricing sources and reports missing, unreadable, or invalid catalogs before the gateway starts.

Model lookup is provider-aware and route-aware. Relay uses the managed LLM call name as the provider/route and first tries provider-scoped keys for the full model and terminal model name, then falls back to model-only suffixes. For example, a call named azure/openai with response model = "gpt-4o-mini" tries azure/openai/gpt-4o-mini before generic gpt-4o-mini. If the model string is itself routed, such as azure/openai/gpt-4o-mini, Relay can infer azure/openai for the terminal model before trying slash-delimited model-only suffixes. This keeps route-specific enterprise pricing authoritative when configured while still allowing generic model pricing to apply to routed names.
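
One way to read the lookup order above is as an ordered list of candidate keys. The sketch below is an interpretation under stated assumptions, not Relay's implementation: `lookup_candidates` is a hypothetical name, and the exact ordering between inferred-route keys and model-only suffixes is inferred from the paragraph.

```python
def lookup_candidates(call_name: str, model: str) -> list[str]:
    """Return pricing keys to try, most specific first, without duplicates."""
    parts = model.split("/")
    terminal = parts[-1]
    candidates: list[str] = []
    if call_name:
        candidates.append(f"{call_name}/{model}")     # provider-scoped, full model
        candidates.append(f"{call_name}/{terminal}")  # provider-scoped, terminal name
    if len(parts) > 1:
        inferred_route = "/".join(parts[:-1])         # route inferred from the model string
        candidates.append(f"{inferred_route}/{terminal}")
    for start in range(len(parts)):                   # model-only suffixes, longest first
        candidates.append("/".join(parts[start:]))
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(candidates))
```

For a call named azure/openai with model gpt-4o-mini, this yields azure/openai/gpt-4o-mini followed by gpt-4o-mini, matching the example in the text.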

For threshold pricing, use rate_schedule.type = "prompt_token_threshold". Relay selects exactly one tier from prompt_tokens and applies that tier to the full request; it does not price only the overflow tokens at the higher rate. This matches providers that publish “short context” and “long context” prices for the entire request/session. If prompt_tokens is missing for a thresholded entry, Relay omits the estimate instead of guessing.

{
  "provider": "google",
  "model_id": "gemini-3.1-pro-preview",
  "aliases": ["gemini-3.1-pro-preview-customtools"],
  "pricing_as_of": "2026-06-05",
  "pricing_source": "https://ai.google.dev/gemini-api/docs/pricing",
  "rate_schedule": {
    "type": "prompt_token_threshold",
    "applies_to": "full_request",
    "tiers": [
      {
        "max_prompt_tokens": 200000,
        "rates": {
          "input_per_million": 2.0,
          "output_per_million": 12.0,
          "cache_read_per_million": 0.2
        }
      },
      {
        "min_prompt_tokens": 200001,
        "rates": {
          "input_per_million": 4.0,
          "output_per_million": 18.0,
          "cache_read_per_million": 0.4
        }
      }
    ]
  },
  "prompt_cache": {
    "read_accounting": "included_in_prompt_tokens"
  }
}
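
To make the full-request rule concrete, the sketch below applies the tiers from the catalog entry above. The `select_tier` and `estimate_threshold_cost` helpers are hypothetical, and the handling of included_in_prompt_tokens (cached tokens billed at the cache-read rate, the remainder at the input rate) is an assumption about what that setting means.

```python
from typing import Any, Optional

TIERS = [
    {"max_prompt_tokens": 200_000,
     "rates": {"input_per_million": 2.0, "output_per_million": 12.0, "cache_read_per_million": 0.2}},
    {"min_prompt_tokens": 200_001,
     "rates": {"input_per_million": 4.0, "output_per_million": 18.0, "cache_read_per_million": 0.4}},
]


def select_tier(tiers: list[dict[str, Any]], prompt_tokens: Optional[int]) -> Optional[dict[str, Any]]:
    """Pick exactly one tier from prompt_tokens; None when it cannot be determined."""
    if prompt_tokens is None:
        return None
    for tier in tiers:
        low = tier.get("min_prompt_tokens", 0)
        high = tier.get("max_prompt_tokens")
        if prompt_tokens >= low and (high is None or prompt_tokens <= high):
            return tier
    return None


def estimate_threshold_cost(
    tiers: list[dict[str, Any]],
    prompt_tokens: Optional[int],
    completion_tokens: int,
    cache_read_tokens: int = 0,
) -> Optional[float]:
    """Price the whole request at the selected tier; omit the estimate if no tier applies."""
    tier = select_tier(tiers, prompt_tokens)
    if tier is None:
        return None
    rates = tier["rates"]
    # Assumption: cache reads are already counted inside prompt_tokens,
    # so only the uncached remainder is billed at the input rate.
    uncached = prompt_tokens - cache_read_tokens
    return (
        uncached * rates["input_per_million"]
        + cache_read_tokens * rates["cache_read_per_million"]
        + completion_tokens * rates["output_per_million"]
    ) / 1_000_000
```

A request with 300,000 prompt tokens is priced entirely at the second tier's rates; none of those tokens are billed at the first tier's lower rate.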

Database-backed or remote pricing should be implemented as a source that returns a validated PricingCatalog snapshot to Relay. Keep database queries, service auth, refresh cadence, and caching outside the LLM response hot path. A fleet deployment can refresh /etc/nemo-relay/pricing.json from an IT-managed service, or embed a custom Rust PricingSource that reads from a database and installs a PricingResolver snapshot during process startup.
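
The "snapshot outside the hot path" idea can be sketched independently of Relay's Rust types. `PricingSnapshotHolder` and its loader callback are hypothetical stand-ins for a custom source; the point is only that slow I/O happens in `refresh`, while lookups read an already-built, read-only snapshot.

```python
import threading
from types import MappingProxyType
from typing import Any, Callable, Mapping, Optional


class PricingSnapshotHolder:
    """Hypothetical holder: refresh does the slow work, lookup never does."""

    def __init__(self, loader: Callable[[], Mapping[str, Any]]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._snapshot: Mapping[str, Any] = MappingProxyType({})

    def refresh(self) -> None:
        """Call at startup or from a background job, not per LLM response."""
        catalog = dict(self._loader())  # Database or service access happens here.
        with self._lock:                # Serialize concurrent refreshes.
            self._snapshot = MappingProxyType(catalog)

    def lookup(self, key: str) -> Optional[Any]:
        """Hot-path read against the current snapshot; performs no I/O."""
        return self._snapshot.get(key)
```

A host would call `refresh()` during startup and on whatever cadence its pricing service requires, and pass only the resulting snapshot to the resolver.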

External pricing catalogs should be converted into Relay catalog JSON out-of-band and then loaded through a file source, unless the embedding application installs a custom Rust PricingSource directly.

Embedded applications and eval harnesses can initialize the built-in pricing component directly:

import nemo_relay

# Declare the built-in pricing component with one file-backed source.
config = nemo_relay.plugin.PluginConfig(
    components=[
        nemo_relay.plugin.ComponentSpec(
            kind="pricing",
            config={
                "sources": [
                    {"type": "file", "path": "./pricing.json"},
                ],
            },
        )
    ]
)

# Fail fast if validation reports any error-level diagnostic.
report = nemo_relay.plugin.validate(config)
if any(diagnostic["level"] == "error" for diagnostic in report["diagnostics"]):
    raise RuntimeError(report["diagnostics"])

# Run this inside a coroutine; a bare top-level await works only in an async REPL or notebook.
await nemo_relay.plugin.initialize(config)

Initialize pricing once during process or harness startup, before the managed LLM calls whose responses should be cost-annotated. In tests or reusable harnesses, clear plugin configuration during teardown if later cases need a different resolver.

Built-in response codecs attach estimated cost directly to AnnotatedLlmResponse.usage.cost when pricing is known. Managed LLM wrappers also enrich decoded custom response-codec output when the custom codec returns model and usage but omits usage.cost. Existing cost values are preserved, so provider-reported cost remains authoritative in the annotation.

Observability exporters prefer an explicit cost in the raw payload, then normalized Usage.cost, then a derived estimate from model pricing. When cost is available, ATIF step metrics and final metrics include cost_usd, OpenInference includes the USD-denominated llm.cost.total, and OpenTelemetry includes nemo_relay.llm.cost.total and nemo_relay.llm.cost.currency.
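
The exporter preference order amounts to a first-non-empty selection. In the sketch below, `pick_cost` and the assumption that an explicit cost sits under a top-level "cost" key in the raw payload are hypothetical; real payloads may carry cost elsewhere.

```python
from typing import Any, Optional


def pick_cost(
    raw_payload: Optional[dict[str, Any]],
    normalized_usage: Optional[dict[str, Any]],
    derived_estimate: Optional[float],
) -> tuple[Optional[Any], Optional[str]]:
    """Return (cost, origin) for the first available cost, else (None, None)."""
    candidates = (
        ("raw_payload", (raw_payload or {}).get("cost")),   # explicit cost in the raw payload
        ("usage", (normalized_usage or {}).get("cost")),    # normalized Usage.cost
        ("derived", derived_estimate),                      # estimate from model pricing
    )
    for origin, cost in candidates:
        if cost is not None:
            return cost, origin
    return None, None
```

Returning the origin alongside the value makes it easy to see in exported metrics whether a number was reported by the provider or estimated.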

Built-in Response Codecs

The built-in provider codecs also implement response decoding:

  • OpenAIChatCodec
  • OpenAIResponsesCodec
  • AnthropicMessagesCodec

Choose the codec that matches the actual provider response shape. For example, do not use OpenAIChatCodec for an OpenAI Responses API payload only because both came from an OpenAI-compatible provider.

Attach a Built-in Response Codec

The example below attaches the built-in OpenAIChatCodec to a managed LLM call whose provider returns an OpenAI Chat Completions response shape.

import nemo_relay
from nemo_relay import LLMRequest
from nemo_relay.codecs import OpenAIChatCodec

# Stand-in provider call that returns an OpenAI Chat Completions shaped response.
async def invoke_provider(request: LLMRequest):
    return {
        "id": "chatcmpl-demo",
        "model": request.content["model"],
        "choices": [
            {
                "finish_reason": "stop",
                "message": {"role": "assistant", "content": "Hello from the provider."},
            }
        ],
        "usage": {"prompt_tokens": 8, "completion_tokens": 5, "total_tokens": 13},
    }

codec = OpenAIChatCodec()
# Run this inside a coroutine; the returned response is the unmodified provider payload.
response = await nemo_relay.llm.execute(
    "openai-chat",
    LLMRequest({}, {"model": "gpt-4o-mini", "messages": []}),
    invoke_provider,
    model_name="gpt-4o-mini",
    response_codec=codec,
)

Read Annotated Responses

Subscribers can inspect annotated_response on LLM end events. The exact event fields depend on the language binding, and the annotation is absent when decoding fails, so confirm that it exists before reading it.

import nemo_relay

def on_event(event):
    annotated = getattr(event, "annotated_response", None)
    if annotated is None:
        return

    print("model", annotated.model)
    print("text", annotated.response_text())
    print("usage", annotated.usage)
    print("cost", (annotated.usage or {}).get("cost"))

nemo_relay.subscribers.register("response-debugger", on_event)

Custom Response Codecs

Use a custom response codec when the provider or framework response does not match a built-in shape.

In Python, a custom response codec can route to built-in codecs and return their native AnnotatedLLMResponse values:

from nemo_relay.codecs import OpenAIChatCodec, OpenAIResponsesCodec

class OpenAIRoutingResponseCodec:
    def __init__(self):
        self.chat = OpenAIChatCodec()
        self.responses = OpenAIResponsesCodec()

    def decode_response(self, response):
        if response.get("object") == "response":
            return self.responses.decode_response(response)
        return self.chat.decode_response(response)

In Node.js, implement decodeResponse and return the normalized response JSON shape:

import type { JsonValue, LlmResponseCodec } from 'nemo-relay-node/typed';

const frameworkResponseCodec: LlmResponseCodec = {
  decodeResponse(response: JsonValue): JsonValue {
    const raw = response as {
      id?: string;
      model_name?: string;
      text?: string;
      stop_reason?: string;
      token_usage?: {
        input?: number;
        output?: number;
      };
    };

    return {
      id: raw.id ?? null,
      model: raw.model_name ?? null,
      message: raw.text ?? '',
      finish_reason: raw.stop_reason === 'max_tokens' ? 'length' : 'complete',
      usage: {
        prompt_tokens: raw.token_usage?.input ?? null,
        completion_tokens: raw.token_usage?.output ?? null,
        total_tokens:
          raw.token_usage?.input === undefined || raw.token_usage?.output === undefined
            ? null
            : raw.token_usage.input + raw.token_usage.output,
      },
      provider_stop_reason: raw.stop_reason ?? null,
    };
  },
};

In Rust, implement LlmResponseCodec directly:

use nemo_relay::codec::request::MessageContent;
use nemo_relay::codec::response::{AnnotatedLlmResponse, FinishReason, Usage};
use nemo_relay::codec::traits::LlmResponseCodec;
use nemo_relay::error::{FlowError, Result};
use serde::Deserialize;
use serde_json::{Map, Value as Json};

#[derive(Deserialize)]
struct FrameworkResponse {
    id: Option<String>,
    model_name: Option<String>,
    text: Option<String>,
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
}

struct FrameworkResponseCodec;

impl LlmResponseCodec for FrameworkResponseCodec {
    fn decode_response(&self, response: &Json) -> Result<AnnotatedLlmResponse> {
        let raw: FrameworkResponse = serde_json::from_value(response.clone())
            .map_err(|error| FlowError::Internal(error.to_string()))?;
        let total_tokens = match (raw.input_tokens, raw.output_tokens) {
            (Some(input), Some(output)) => Some(input + output),
            _ => None,
        };

        Ok(AnnotatedLlmResponse {
            id: raw.id,
            model: raw.model_name,
            message: raw.text.map(MessageContent::Text),
            tool_calls: None,
            finish_reason: Some(FinishReason::Complete),
            usage: Some(Usage {
                prompt_tokens: raw.input_tokens,
                completion_tokens: raw.output_tokens,
                total_tokens,
                cache_read_tokens: None,
                cache_write_tokens: None,
                cost: None,
            }),
            api_specific: None,
            extra: Map::new(),
        })
    }
}

Streaming Responses

Streaming LLM wrappers decode the aggregated response produced by the stream finalizer. The response codec does not see each token or chunk. Use stream collectors for chunk-level behavior, and use response codecs for the final normalized end-event annotation.
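
The division of labor between a stream finalizer and a response codec can be sketched as follows. The chunk shape, `finalize_stream`, and this `decode_response` function are hypothetical; the sketch only shows that chunks are folded into one aggregated response and that decoding happens once, on that aggregate.

```python
from typing import Any, Iterable


def finalize_stream(chunks: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical finalizer: fold streamed chunks into one aggregated response."""
    aggregated: dict[str, Any] = {"id": None, "model_name": None, "text": "", "stop_reason": None}
    for chunk in chunks:
        aggregated["id"] = chunk.get("id") or aggregated["id"]
        aggregated["model_name"] = chunk.get("model_name") or aggregated["model_name"]
        aggregated["text"] += chunk.get("delta", "")
        aggregated["stop_reason"] = chunk.get("stop_reason") or aggregated["stop_reason"]
    return aggregated


decode_calls: list[dict[str, Any]] = []


def decode_response(response: dict[str, Any]) -> dict[str, Any]:
    """Stand-in codec: runs once on the aggregate, never per chunk."""
    decode_calls.append(response)
    return {
        "id": response["id"],
        "model": response["model_name"],
        "message": response["text"],
        "finish_reason": "length" if response["stop_reason"] == "max_tokens" else "complete",
    }


chunks = [
    {"id": "resp-1", "model_name": "demo-model", "delta": "Hel"},
    {"delta": "lo"},
    {"delta": "!", "stop_reason": "end"},
]
annotation = decode_response(finalize_stream(chunks))
```

If the finalizer produced a different shape than the codec expects (for example, a list of chunks instead of one object), decoding would fail and the end event would carry no annotation.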

Validation Checklist

Use this checklist to confirm the implementation preserves the expected runtime contract.

  • The response codec matches the actual provider response shape.
  • decode_response returns a normalized response with safe, JSON-compatible fields.
  • The provider response returned to the application is unchanged.
  • Subscribers see annotated_response only on LLM end events where decode succeeds.
  • Decode errors are tested and do not break the LLM call.
  • Streaming finalizers produce the same shape the response codec expects.
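
Two of the checklist items, JSON-compatible output and an unchanged provider response, lend themselves to a small test helper. `check_codec_contract` is a hypothetical utility written for this guide, not part of Relay; it takes any decode callable and a sample raw response.

```python
import copy
import json
from typing import Any, Callable


def check_codec_contract(
    decode: Callable[[dict[str, Any]], Any],
    raw_response: dict[str, Any],
) -> dict[str, bool]:
    """Report whether decoding succeeded, stayed JSON-compatible, and left the input alone."""
    before = copy.deepcopy(raw_response)
    results = {"decoded": False, "json_compatible": False, "response_unchanged": False}
    try:
        annotation = decode(raw_response)
        results["decoded"] = True
        json.dumps(annotation)  # Raises TypeError for values JSON cannot represent.
        results["json_compatible"] = True
    except Exception:
        pass  # Leave the remaining flags False; the caller inspects the report.
    results["response_unchanged"] = raw_response == before
    return results
```

Running this helper against representative provider payloads, including malformed ones, gives quick coverage of the contract before wiring the codec into a managed LLM call.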

Common Issues

Check these symptoms first when the workflow does not behave as expected.

  • No annotation appears: The response codec returned an error or the raw provider response did not match the codec.
  • Returned response changed unexpectedly: Response codecs are not the right place to mutate caller-visible output.
  • Tool calls are missing: The codec did not map the provider’s tool-call structure into tool_calls.
  • Usage is inconsistent across providers: Normalize known token fields and preserve provider-specific usage details in api_specific or extra.
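
For the usage-consistency issue, a custom codec can fold provider-specific token names into the normalized ones and keep whatever is left over. The alias table and `normalize_usage` below are assumptions for illustration: the provider field names are those used by OpenAI Chat Completions and Anthropic Messages, while the target names mirror the Usage fields shown in the Rust example.

```python
from typing import Any, Optional

# Hypothetical alias table: normalized field -> provider field names to try in order.
_ALIASES = {
    "prompt_tokens": ("prompt_tokens", "input_tokens"),
    "completion_tokens": ("completion_tokens", "output_tokens"),
    "cache_read_tokens": ("cache_read_input_tokens",),
    "cache_write_tokens": ("cache_creation_input_tokens",),
}


def normalize_usage(provider_usage: dict[str, Any]) -> tuple[dict[str, Optional[int]], dict[str, Any]]:
    """Return (normalized usage, leftover provider-specific fields)."""
    normalized: dict[str, Optional[int]] = {}
    consumed = {"total_tokens"}
    for target, names in _ALIASES.items():
        normalized[target] = None
        for name in names:
            if provider_usage.get(name) is not None:
                normalized[target] = provider_usage[name]
                consumed.add(name)
                break
    total = provider_usage.get("total_tokens")
    prompt, completion = normalized["prompt_tokens"], normalized["completion_tokens"]
    if total is None and prompt is not None and completion is not None:
        total = prompt + completion  # Derive the total only when both parts are known.
    normalized["total_tokens"] = total
    leftover = {k: v for k, v in provider_usage.items() if k not in consumed}
    return normalized, leftover
```

The leftover dictionary is a natural candidate for api_specific or extra, so provider details stay available to exporters without polluting the common fields.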

Next Steps

Use these links to continue from this workflow into the next related task.

  • Use Provider Codecs for request-side provider codecs and full request/response examples.
  • Use Wrap LLM Calls to add the managed LLM boundary first.
  • Use Observability after annotations are visible in local subscribers.