Provider Response Codecs
Use this guide when subscribers, exporters, or diagnostics need a provider-neutral view of raw LLM responses.
What You Build
You will attach a response codec to a managed LLM wrapper so NeMo Relay can decode provider responses into AnnotatedLLMResponse data for LLM end events.
Response codecs are observability-only:
- They do not rewrite the value returned to the application.
- They do not run response middleware.
- They attach normalized response data to lifecycle events for subscribers and exporters.
- Decode failures are non-fatal; the LLM call still returns the provider response and the end event is emitted without an annotation.
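The observability-only contract above can be sketched in plain Python. This is a minimal illustration, not NeMo Relay's implementation: `finish_llm_call`, `strict_decode`, and the event dictionary shape are hypothetical names chosen for the example.

```python
from typing import Any, Callable, Optional


def finish_llm_call(
    raw_response: dict[str, Any],
    decode: Callable[[dict[str, Any]], dict[str, Any]],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (value for the caller, end event) for one completed LLM call."""
    annotation: Optional[dict[str, Any]] = None
    try:
        annotation = decode(raw_response)
    except Exception:
        # Decode failures are non-fatal: the end event is emitted without an annotation.
        annotation = None

    end_event: dict[str, Any] = {"kind": "llm_end"}  # hypothetical event shape
    if annotation is not None:
        end_event["annotated_response"] = annotation

    # The caller always receives the untouched provider response.
    return raw_response, end_event


def strict_decode(raw: dict[str, Any]) -> dict[str, Any]:
    """Toy decoder that raises KeyError on an unexpected response shape."""
    return {"model": raw["model"], "text": raw["choices"][0]["message"]["content"]}


good = {"model": "example-model", "choices": [{"message": {"content": "hi"}}]}
bad = {"unexpected": True}
```

With `good`, the end event carries `annotated_response`; with `bad`, the decoder raises, the event is still produced, and the caller still gets the original object back.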
Before You Start
You need:
- A managed LLM boundary from Wrap LLM Calls.
- A raw provider response that is JSON-compatible.
- A built-in response codec or a custom response codec for the provider response shape.
- A subscriber or exporter that consumes annotated_response from LLM end events.
What Response Codecs Decode
Response codecs normalize provider output into fields that subscribers can inspect consistently:
Use these annotations for observability, export, and debugging. Keep business logic that changes the caller-visible response in the framework or provider adapter, not in the response codec.
Cost Estimation
Response codecs should keep reporting provider usage fields without rewriting
the caller-visible response. If a provider or framework reports cost, map it to
Usage.cost with source: "provider_reported". Otherwise Relay can layer cost
estimation onto AnnotatedLlmResponse.usage.cost when all required inputs are
available:
- The decoded response includes model.
- The managed LLM call name identifies the provider or route, such as openai, anthropic, or azure/openai, when provider-specific pricing is needed.
- The decoded response includes prompt and/or completion token usage.
- Relay has an explicit pricing entry for that model or alias.
Pricing estimates carry pricing_provider, pricing_model, pricing_as_of,
pricing_source, and currency metadata so stale pricing can be audited
without failing response decoding. Normalized cost uses currency-neutral amount
fields such as total, input, output, cache_read, and cache_write.
Unknown model pricing and missing token data are non-fatal: Relay omits the cost
field and still exports token metrics and response annotations.
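As a rough illustration of the rules above, the following sketch estimates a flat per-token cost and omits the cost when a required input is missing. The `PRICING` table, its rates, and the `estimate_cost` function are hypothetical; the per-million-token rate unit is assumed for the example, and this is not Relay's resolver.

```python
from typing import Optional

# Hypothetical pricing table: rates are per one million tokens, keyed by model.
PRICING = {
    "example-model": {"input": 2.0, "output": 8.0, "currency": "USD"},
}


def estimate_cost(
    model: Optional[str],
    prompt_tokens: Optional[int],
    completion_tokens: Optional[int],
    pricing: dict = PRICING,
) -> Optional[dict]:
    """Return a currency-neutral cost dict, or None when inputs are missing."""
    entry = pricing.get(model) if model else None
    if entry is None:
        return None  # unknown model pricing: omit the cost field
    if prompt_tokens is None and completion_tokens is None:
        return None  # no token usage at all: omit the cost field

    input_cost = (prompt_tokens or 0) * entry["input"] / 1_000_000
    output_cost = (completion_tokens or 0) * entry["output"] / 1_000_000
    return {
        "currency": entry["currency"],
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }
```

Returning `None` instead of raising mirrors the non-fatal behavior: token metrics and annotations can still be exported when no estimate is possible.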
Relay resolves pricing through an active PricingResolver source chain. Provider
or framework-reported cost remains authoritative; the resolver is used only when
Usage.cost is missing. Relay does not ship provider price data by default:
estimates require a configured inline, file, or embedding-provided pricing
source. With no configured source, every model is treated as unknown for pricing.
Pricing is runtime state, not a CLI-only feature. Any host that initializes
Relay plugins can activate the built-in pricing component before it runs
managed LLM calls. This includes application code, eval harnesses, custom
agents, framework integrations, and third-party patches. The CLI commands below
are a file-management convenience for the local gateway; embedded hosts can pass
the same component config directly through the plugin APIs.
Source precedence is deployment controlled:
- Project or application overrides.
- User/global device pricing.
- Enterprise-managed sources, such as a remotely synced file or a service backed by a database.
The built-in pricing plugin component accepts inline catalogs or JSON catalog
files in precedence order. In discovered plugins.toml config, system config
loads first, project config loads next, and user config loads last. For the
pricing component, higher-priority sources are prepended instead of
replacing lower-priority sources, so a user override can win for one model while
enterprise or fleet pricing remains available for everything else:
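The prepend behavior can be sketched in plain Python. The tuple-based source representation and the `prepend_source` and `resolve` helpers are assumptions made for illustration; they do not reflect the real plugin config schema.

```python
def prepend_source(chain, source):
    """Put a higher-priority source in front instead of replacing the chain."""
    return [source] + list(chain)


def resolve(chain, model):
    """Return (source_name, entry) from the first source that knows the model."""
    for name, catalog in chain:
        if model in catalog:
            return name, catalog[model]
    return None  # no configured source knows this model


# Hypothetical catalogs: a fleet-wide file and a narrower user override.
fleet = ("fleet", {"model-a": {"input": 1.0}, "model-b": {"input": 3.0}})
user = ("user", {"model-a": {"input": 0.5}})

chain = prepend_source([fleet], user)
```

Here `resolve(chain, "model-a")` is answered by the user source, while `resolve(chain, "model-b")` falls through to the fleet source.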
Each catalog entry declares:
- provider and canonical model_id.
- aliases for dated or provider-specific model IDs.
- currency, defaulting to USD.
- unit, defaulting to per_token. Relay estimates only per_token entries in this version; per_request, per_second, and gpu_hour are representable for future source integrations but are not estimated.
- rates per one million input, output, cache-read, and cache-write tokens for flat per_token entries.
- rate_schedule for data-driven threshold pricing, such as models whose full-request input/output rates change after a prompt-token threshold.
- prompt_cache.read_accounting, which tells Relay whether cache-read tokens are already included in prompt tokens.
- pricing_as_of and pricing_source for auditability.
Relay validates catalogs at startup and rejects duplicate canonical IDs or
aliases within the same normalized provider/model key. The same model ID can
appear under distinct providers, such as openai/gpt-4o-mini and
azure/openai/gpt-4o-mini. Adding a model should be a catalog/source update
plus tests; it should not require adding another Rust match arm.
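A duplicate check of this kind might look like the sketch below. The normalization rule (trim and lowercase) and the `validate_catalog` helper are assumptions for illustration; only the provider, model_id, and aliases field names come from the catalog description.

```python
def validate_catalog(entries):
    """Raise ValueError if a canonical ID or alias repeats under one provider.

    Returns the number of distinct (provider, model) keys when valid.
    """
    seen = set()
    for entry in entries:
        provider = entry["provider"].strip().lower()  # assumed normalization
        names = [entry["model_id"], *entry.get("aliases", [])]
        for name in names:
            key = (provider, name.strip().lower())
            if key in seen:
                raise ValueError(f"duplicate pricing key: {key[0]}/{key[1]}")
            seen.add(key)
    return len(seen)


# The same model ID under two distinct providers is allowed.
valid = [
    {"provider": "openai", "model_id": "gpt-4o-mini"},
    {"provider": "azure/openai", "model_id": "gpt-4o-mini"},
]

# An alias that collides with a canonical ID under one provider is rejected.
invalid = [
    {"provider": "openai", "model_id": "gpt-4o-mini"},
    {"provider": "openai", "model_id": "other-model", "aliases": ["gpt-4o-mini"]},
]
```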
Use the CLI to validate catalog files and manage file-backed pricing sources:
pricing init creates or enables the pricing plugin component in the selected
plugins.toml. The initialized component has an empty sources list; use
pricing add-source or an inline config edit to provide pricing data.
pricing add-source validates the referenced JSON catalog before updating
plugins.toml. It creates the pricing component if needed and prepends the new
file source by default, making it the highest-priority source in that scope. Use
--append when the file should be a lower-priority fallback. Both commands
default to user config at $XDG_CONFIG_HOME/nemo-relay/plugins.toml; pass
--project for .nemo-relay/plugins.toml or --global for
/etc/nemo-relay/plugins.toml.
pricing resolve uses the same discovered config path as the gateway. It
reports the winning catalog source, matched provider/model, and, when token
counts are supplied, the estimated total cost. The source line is one of
file:<path> or inline:<index>, which makes overlapping project/user/fleet
entries debuggable. This is a read-only diagnostic command; it does not mutate
configuration.
nemo-relay doctor also validates enabled pricing sources and reports missing,
unreadable, or invalid catalogs before the gateway starts.
Model lookup is provider-aware and route-aware. Relay uses the managed LLM call
name as the provider/route and first tries provider-scoped keys for the full
model and terminal model name, then falls back to model-only suffixes. For
example, a call named azure/openai with response model = "gpt-4o-mini" tries
azure/openai/gpt-4o-mini before generic gpt-4o-mini. If the model string is
itself routed, such as azure/openai/gpt-4o-mini, Relay can infer
azure/openai for the terminal model before trying slash-delimited model-only
suffixes. This keeps route-specific enterprise pricing authoritative when
configured while still allowing generic model pricing to apply to routed names.
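One way to picture the lookup order is as a list of candidate keys tried in sequence. The sketch below is an interpretation of the description above, not Relay's actual algorithm: `candidate_keys` is a hypothetical helper and the exact ordering of fallbacks is an assumption.

```python
def candidate_keys(call_name, model):
    """Return pricing keys to try, most specific first, without duplicates."""
    keys = []

    def add(key):
        if key and key not in keys:
            keys.append(key)

    parts = model.split("/")
    terminal = parts[-1]

    if call_name:
        add(f"{call_name}/{model}")     # provider-scoped full model
        add(f"{call_name}/{terminal}")  # provider-scoped terminal model name

    # Model-only suffixes, longest first. For a routed model string the
    # longest suffix already carries its own route prefix.
    for start in range(len(parts)):
        add("/".join(parts[start:]))

    return keys
```

For a call named azure/openai with model gpt-4o-mini, this yields the provider-scoped key first and the generic model name second, which matches the example in the text.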
For threshold pricing, use rate_schedule.type = "prompt_token_threshold".
Relay selects exactly one tier from prompt_tokens and applies that tier to the
full request; it does not price only the overflow tokens at the higher rate.
This matches providers that publish “short context” and “long context” prices
for the entire request/session. If prompt_tokens is missing for a thresholded
entry, Relay omits the estimate instead of guessing.
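The difference between whole-request tier pricing and overflow pricing is easiest to see with numbers. In this sketch the schedule field names other than type, the rates, and the inclusive treatment of the threshold are all hypothetical.

```python
# Hypothetical schedule: rates are per one million tokens.
SCHEDULE = {
    "type": "prompt_token_threshold",
    "threshold": 200_000,
    "at_or_below": {"input": 1.0, "output": 4.0},
    "above": {"input": 2.0, "output": 8.0},
}


def estimate_thresholded(schedule, prompt_tokens, completion_tokens):
    """Pick exactly one tier from prompt_tokens and apply it to the full request."""
    if prompt_tokens is None:
        return None  # no prompt token count: omit the estimate instead of guessing

    if prompt_tokens > schedule["threshold"]:
        tier = schedule["above"]
    else:
        tier = schedule["at_or_below"]

    total = prompt_tokens * tier["input"] + (completion_tokens or 0) * tier["output"]
    return total / 1_000_000


short_cost = estimate_thresholded(SCHEDULE, 100_000, 1_000)
long_cost = estimate_thresholded(SCHEDULE, 300_000, 1_000)
```

With these made-up rates, the 300,000-token prompt is priced entirely at the higher tier (0.608), whereas overflow-only pricing would have produced 0.408.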
Database-backed or remote pricing should be implemented as a source that returns
a validated PricingCatalog snapshot to Relay. Keep database queries, service
auth, refresh cadence, and caching outside the LLM response hot path. A fleet
deployment can refresh /etc/nemo-relay/pricing.json from an IT-managed service,
or embed a custom Rust PricingSource that reads from a database and installs a
PricingResolver snapshot during process startup.
External pricing catalogs should be converted into Relay catalog JSON
out-of-band and then loaded through a file source, unless the embedding
application installs a custom Rust PricingSource directly.
Embedded applications and eval harnesses can initialize the built-in pricing component directly:
Python
Node.js
Rust
Initialize pricing once during process or harness startup, before the managed LLM calls whose responses should be cost-annotated. In tests or reusable harnesses, clear plugin configuration during teardown if later cases need a different resolver.
Built-in response codecs attach estimated cost directly to
AnnotatedLlmResponse.usage.cost when pricing is known. Managed LLM wrappers
also enrich decoded custom response-codec output when the custom codec returns
model and usage but omits usage.cost. Existing cost values are preserved,
so provider-reported cost remains authoritative in the annotation.
Observability exporters prefer an explicit cost in the raw payload, then
normalized Usage.cost, then a derived estimate from model pricing. When cost is
available, ATIF step metrics and final metrics include cost_usd,
OpenInference includes the USD-denominated llm.cost.total, and OpenTelemetry
includes nemo_relay.llm.cost.total and nemo_relay.llm.cost.currency.
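The exporter preference order can be summarized as a small fallback chain. The payload and usage dictionary shapes and the `export_cost` helper below are assumptions for illustration, not exporter internals.

```python
def export_cost(raw_payload, usage, derive_estimate):
    """Return (cost, origin) following the exporter preference order."""
    explicit = raw_payload.get("cost")
    if explicit is not None:
        return explicit, "raw_payload"  # explicit cost in the raw payload wins

    normalized = (usage or {}).get("cost")
    if normalized is not None:
        return normalized, "usage"  # then the normalized usage cost

    derived = derive_estimate()
    if derived is not None:
        return derived, "derived"  # last resort: estimate from model pricing

    return None, None  # no cost available: export nothing for cost
```

Passing `derive_estimate` as a callable keeps the pricing lookup lazy, so it only runs when neither explicit value is present.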
Built-in Response Codecs
The built-in provider codecs also implement response decoding:
- OpenAIChatCodec
- OpenAIResponsesCodec
- AnthropicMessagesCodec
Choose the codec that matches the actual provider response shape. For example, do not use OpenAIChatCodec for an OpenAI Responses API payload only because both came from an OpenAI-compatible provider.
Attach a Built-in Response Codec
The examples below attach built-in response codecs for supported provider response shapes.
Python
Node.js
Rust
Read Annotated Responses
Subscribers can inspect annotated_response on LLM end events. The exact event category fields are binding-provided, so defensive checks should confirm the annotation exists before reading it.
Python
Node.js
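Independent of any binding, the defensive pattern can be sketched as follows. The event is modeled as a plain dictionary, and the category and phase keys are hypothetical stand-ins for whatever fields the binding actually provides; only annotated_response comes from this guide.

```python
def read_annotation(event):
    """Return annotated_response from an LLM end event, or None if absent."""
    if not isinstance(event, dict):
        return None
    # Hypothetical event fields; real bindings may expose these differently.
    if event.get("category") != "llm" or event.get("phase") != "end":
        return None
    annotation = event.get("annotated_response")
    # Decode may have failed, in which case no annotation is attached.
    return annotation if isinstance(annotation, dict) else None


annotated = {
    "category": "llm",
    "phase": "end",
    "annotated_response": {"model": "example-model"},
}
unannotated = {"category": "llm", "phase": "end"}
```

A subscriber written this way treats a missing annotation as a normal outcome rather than an error.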
Custom Response Codecs
Use a custom response codec when the provider or framework response does not match a built-in shape.
In Python, a custom response codec can route to built-in codecs and return their native AnnotatedLLMResponse values:
In Node.js, implement decodeResponse and return the normalized response JSON shape:
In Rust, implement LlmResponseCodec directly:
Streaming Responses
Streaming LLM wrappers decode the aggregated response produced by the stream finalizer. The response codec does not see each token or chunk. Use stream collectors for chunk-level behavior, and use response codecs for the final normalized end-event annotation.
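The split between chunk-level collectors and the final decode can be sketched like this. The chunk shape, `finalize`, and `run_stream` are hypothetical and only illustrate the ordering: collectors see every chunk, while the codec runs once on the aggregate.

```python
def finalize(chunks):
    """Aggregate streamed chunks into one final response (hypothetical shape)."""
    return {
        "model": chunks[0].get("model") if chunks else None,
        "text": "".join(chunk.get("delta", "") for chunk in chunks),
    }


def run_stream(chunks, on_chunk, decode):
    """Feed each chunk to the collector, then decode the aggregate once."""
    collected = []
    for chunk in chunks:
        on_chunk(chunk)  # chunk-level behavior belongs in the collector
        collected.append(chunk)
    final_response = finalize(collected)
    return final_response, decode(final_response)  # codec sees only the final shape


chunk_log = []
decode_calls = []


def decode(response):
    decode_calls.append(response)
    return {"model": response["model"], "text": response["text"]}


stream = [{"model": "example-model", "delta": "Hel"}, {"delta": "lo"}]
final_response, annotation = run_stream(stream, chunk_log.append, decode)
```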
Validation Checklist
Use this checklist to confirm the implementation preserves the expected runtime contract.
- The response codec matches the actual provider response shape.
- decode_response returns a normalized response with safe, JSON-compatible fields.
- The provider response returned to the application is unchanged.
- Subscribers see annotated_response only on LLM end events where decode succeeds.
- Decode errors are tested and do not break the LLM call.
- Streaming finalizers produce the same shape the response codec expects.
Common Issues
Check these symptoms first when the workflow does not behave as expected.
- No annotation appears: The response codec returned an error or the raw provider response did not match the codec.
- Returned response changed unexpectedly: Response codecs are not the right place to mutate caller-visible output.
- Tool calls are missing: The codec did not map the provider’s tool-call structure into tool_calls.
- Usage is inconsistent across providers: Normalize known token fields and preserve provider-specific usage details in api_specific or extra.
Next Steps
Use these links to continue from this workflow into the next related task.
- Use Provider Codecs for request-side provider codecs and full request/response examples.
- Use Wrap LLM Calls to add the managed LLM boundary first.
- Use Observability after annotations are visible in local subscribers.