Vendor Usage Field Reference
This document catalogues the exact JSON shape of the usage field that each LLM provider returns in chat / completion responses, cross-referenced against their official SDK source code. It exists so that:
- A maintainer adding a new vendor knows what to look for and where existing vendors agree or differ.
- A debugger investigating “why doesn’t my usage metric show a value” can find the canonical field-name list per vendor.
- A reviewer of a future usage-parsing change can verify that no vendor’s wire format was missed.
The verification work behind this document was performed by inspecting each provider’s Python SDK source (or REST API documentation when no SDK type was available). All conclusions are dated against the SDK / docs commit at the time of verification (early 2026).
Quick reference: vendor shape map
How AIPerf normalizes these shapes
AIPerf wraps every API-reported usage dict in a Usage class (src/aiperf/common/models/usage_models.py). On construction, recognized vendor envelopes are unwrapped to the top level so all properties read from a single flat dict:
- Gemini: usageMetadata → top-level (lifts promptTokenCount, candidatesTokenCount, etc.).
- Cohere v1: meta → top-level (lifts meta.tokens.{input,output}_tokens, meta.cached_tokens).
- Cohere v2: top-level tokens sub-dict → top-level (lifts tokens.{input,output}_tokens).
If a lifted key would collide with an existing top-level key, the original key is preserved and wins.
After normalization, each property reads through an ordered synonym list (the *_KEYS class attributes). The first present key wins. Properties return None when no synonym is present, so 0 is correctly distinguished from “missing”.
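Below is a minimal, self-contained sketch of that lookup pattern. It is not the actual AIPerf class: the key lists are abbreviated and the class name is invented, but it shows the first-present-key behavior and the 0-versus-missing distinction.

```python
# Sketch of the first-present-key synonym lookup described above. The real
# Usage class in usage_models.py has many more properties, longer *_KEYS
# lists, and the envelope unwrapping in __init__.
class UsageSketch(dict):
    PROMPT_TOKENS_KEYS = ("prompt_tokens", "input_tokens", "promptTokenCount", "inputTokens")
    COMPLETION_TOKENS_KEYS = ("completion_tokens", "output_tokens", "candidatesTokenCount", "outputTokens")

    def _first_present(self, keys):
        for key in keys:          # first present key wins
            if key in self:
                return self[key]
        return None               # missing is None, so 0 stays a real value

    @property
    def prompt_tokens(self):
        return self._first_present(self.PROMPT_TOKENS_KEYS)

    @property
    def completion_tokens(self):
        return self._first_present(self.COMPLETION_TOKENS_KEYS)


assert UsageSketch({"input_tokens": 12}).prompt_tokens == 12
assert UsageSketch({"prompt_tokens": 0}).prompt_tokens == 0
assert UsageSketch({}).completion_tokens is None
```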
Per-vendor verification details
OpenAI
Verified against: openai-python / src/openai/types/completion_usage.py.
All field names match AIPerf's modelled synonyms. OpenAI's cached_tokens counts cache reads only (cache writes are transparent and free), so when the cache-write metric is queried for an OpenAI response we return None rather than raising NoMetricValue. OpenAI does NOT surface a separate cache-miss count; it can be derived as prompt_tokens - prompt_tokens_details.cached_tokens if needed.
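For example, a cache-miss count can be derived from the OpenAI shape as shown below. The helper is illustrative (not part of AIPerf), the field names are the documented OpenAI ones, and the values are invented.

```python
def openai_cache_miss_tokens(usage: dict):
    """Derive the prompt cache-miss count that OpenAI does not report directly."""
    prompt = usage.get("prompt_tokens")
    details = usage.get("prompt_tokens_details")
    cached = details.get("cached_tokens") if isinstance(details, dict) else None
    if prompt is None or cached is None:
        return None
    return prompt - cached


usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 80,
    "total_tokens": 1280,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
assert openai_cache_miss_tokens(usage) == 176
```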
vLLM
Verified against: vllm / vllm/entrypoints/openai/engine/protocol.py.
vLLM is OpenAI-compatible. Its prompt_tokens_details is narrower than OpenAI’s (only cached_tokens, no audio_tokens). vLLM may emit prompt_tokens_details: null and completion_tokens_details: null explicitly; AIPerf’s nested-field walk handles that case (the isinstance(details, dict) guard returns False, and the property returns None).
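A stripped-down version of that guard, written as a standalone helper rather than the real property; the payload mirrors vLLM's explicit nulls and the values are invented.

```python
def cached_tokens(usage: dict):
    details = usage.get("prompt_tokens_details")
    if not isinstance(details, dict):   # covers both a missing key and an explicit null
        return None
    return details.get("cached_tokens")


vllm_usage = {
    "prompt_tokens": 53,
    "completion_tokens": 17,
    "total_tokens": 70,
    "prompt_tokens_details": None,
    "completion_tokens_details": None,
}
assert cached_tokens(vllm_usage) is None
```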
Anthropic
Verified against: anthropic-sdk-python / src/anthropic/types/usage.py, message_delta_usage.py, cache_creation.py, and server_tool_usage.py.
Streaming chunks use MessageDeltaUsage, which carries the same cache and token fields as Usage; for our purposes the streaming and non-streaming shapes are identical.
Modelled: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens.
Not modelled (preserved on dict):
- cache_creation TTL breakdown (the sum of ephemeral_1h_input_tokens + ephemeral_5m_input_tokens equals the parent cache_creation_input_tokens; see the example below this list). Could be added if TTL-aware analysis is needed.
- server_tool_use (web_fetch_requests, web_search_requests). Non-token metadata.
- service_tier ("standard" / "priority" / "batch"). String label, not a count.
- inference_geo. String label.
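An illustrative Anthropic usage payload (values invented) showing the modelled fields next to the preserved-on-dict extras, including the TTL-breakdown sum:

```python
anthropic_usage = {
    # modelled by AIPerf
    "input_tokens": 2095,
    "output_tokens": 503,
    "cache_creation_input_tokens": 2048,
    "cache_read_input_tokens": 0,
    # preserved on the dict, not modelled
    "cache_creation": {"ephemeral_1h_input_tokens": 0, "ephemeral_5m_input_tokens": 2048},
    "server_tool_use": {"web_search_requests": 1, "web_fetch_requests": 0},
    "service_tier": "standard",
}

ttl = anthropic_usage["cache_creation"]
assert (ttl["ephemeral_1h_input_tokens"] + ttl["ephemeral_5m_input_tokens"]
        == anthropic_usage["cache_creation_input_tokens"])
```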
Google Gemini
Verified against: google-genai / google/genai/types.py (GenerateContentResponseUsageMetadata) and _common.py (alias_generator=to_camel).
The Python SDK declares fields in snake_case for Python ergonomics, but the Pydantic alias_generator=to_camel config means the wire (JSON) format is camelCase. AIPerf operates at the JSON level, so the camelCase names are what we synonym-match.
Wire-format field names (after to_camel): cachedContentTokenCount, candidatesTokenCount, promptTokenCount, thoughtsTokenCount, toolUsePromptTokenCount, totalTokenCount.
The whole object is wrapped in usageMetadata at the response top level; AIPerf’s Usage.__init__ unwraps it.
Not modelled (preserved on dict): the four *Details[] arrays of ModalityTokenCount objects (per-modality breakdowns: TEXT / IMAGE / AUDIO / VIDEO). Useful for multimodal benchmarks where you want to know what fraction of input tokens were images, but currently surfaced verbatim as a list rather than as a metric.
Note on prompt_token_count: Gemini’s docs say “When cached_content is set, prompt_token_count includes the number of tokens in the cached content.” So for Gemini, prompt_tokens is total-including-cached, and cached_content_token_count is the subset that was cached. This matches OpenAI’s semantic where prompt_tokens is the total and cached_tokens is the subset of those that hit cache.
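An illustrative Gemini response fragment (values invented) showing the camelCase wire names inside the usageMetadata envelope and the cached-subset relationship described above:

```python
gemini_response = {
    "usageMetadata": {
        "promptTokenCount": 1500,         # total prompt tokens, cached content included
        "cachedContentTokenCount": 1024,  # cached subset of promptTokenCount
        "candidatesTokenCount": 210,
        "thoughtsTokenCount": 64,
        "totalTokenCount": 1774,
    }
}

meta = gemini_response["usageMetadata"]
assert meta["cachedContentTokenCount"] <= meta["promptTokenCount"]
```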
AWS Bedrock
Verified against: AWS Bedrock TokenUsage API reference. No Python SDK clone needed — boto3 follows the documented API verbatim.
Modelled: inputTokens, outputTokens, totalTokens, cacheReadInputTokens, cacheWriteInputTokens. All of these appear as synonyms in the *_KEYS lists.
Not modelled (preserved on dict): cacheDetails[] TTL breakdown array.
Note that Bedrock’s field names exactly match Anthropic’s concept names but use camelCase, because Bedrock primarily proxies Anthropic models and converted the snake_case names to camelCase for AWS API conventions. The semantic mapping is one-to-one (e.g. inputTokens → input_tokens, cacheReadInputTokens → cache_read_input_tokens, cacheWriteInputTokens → cache_creation_input_tokens).
DeepSeek
Verified against: DeepSeek API documentation.
Modelled: the OpenAI baseline fields plus DeepSeek’s cache counters. prompt_cache_hit_tokens is mapped to prompt_cache_read_tokens via the synonym list. prompt_cache_miss_tokens is its own first-class metric (UsagePromptCacheMissTokensMetric), since DeepSeek bills hits and misses at different rates and no other vendor surfaces the miss count as its own field.
Invariant: prompt_tokens == prompt_cache_hit_tokens + prompt_cache_miss_tokens for DeepSeek responses. AIPerf has a test asserting this end-to-end.
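An invented-values payload demonstrating that invariant:

```python
deepseek_usage = {
    "prompt_tokens": 900,
    "completion_tokens": 120,
    "total_tokens": 1020,
    "prompt_cache_hit_tokens": 640,
    "prompt_cache_miss_tokens": 260,
}
assert (deepseek_usage["prompt_cache_hit_tokens"]
        + deepseek_usage["prompt_cache_miss_tokens"]
        == deepseek_usage["prompt_tokens"])
```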
Cohere
Cohere has TWO API versions with different envelopes. AIPerf handles both.
v1 — verified against: cohere-python / src/cohere/types/api_meta.py and api_meta_tokens.py.
The meta envelope is at the response root (not under a usage key). If the parser hands the full response to Usage(), meta is what’s there. AIPerf unwraps:
- meta.tokens.input_tokens → top-level (resolved via PROMPT_TOKENS_KEYS)
- meta.tokens.output_tokens → top-level (resolved via COMPLETION_TOKENS_KEYS)
- meta.cached_tokens → top-level (resolved via CACHE_READ_TOP_LEVEL_KEYS)
v2 — verified against: cohere-python / src/cohere/types/usage.py, usage_tokens.py, and usage_billed_units.py.
The usage field at the response root contains billed_units, tokens, and cached_tokens directly — no meta wrapper. AIPerf treats top-level tokens (a sub-dict) the same way as meta.tokens and unwraps it. Top-level cached_tokens is in CACHE_READ_TOP_LEVEL_KEYS.
billed_units is intentionally NOT surfaced as a metric. Cohere’s billed-vs-raw distinction is a Cohere-specific accounting filter (the framework injects special tokens that count toward the raw tokens total but aren’t billed). For perf benchmarks, the raw count is what the model actually processed — which is what every other vendor reports — so we keep prompt_tokens consistent across vendors. Callers that need billing reconciliation can read usage["meta"]["billed_units"] (v1) or usage["billed_units"] (v2) directly off the underlying dict.
billed_units for chat:
- input_tokens, output_tokens — billed token counts
- search_units, classifications — non-token billable units (RAG / classification endpoints)
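Illustrative fragments (values invented) contrasting the two envelopes AIPerf unwraps; only the structure matters here, not the exact sibling fields.

```python
# v1: meta envelope at the response root, tokens nested one level deeper
cohere_v1_response = {
    "meta": {
        "tokens": {"input_tokens": 75, "output_tokens": 32},
        "billed_units": {"input_tokens": 70, "output_tokens": 32},
        "cached_tokens": 0,
    }
}

# v2: usage field with tokens / billed_units / cached_tokens directly inside, no meta wrapper
cohere_v2_usage = {
    "tokens": {"input_tokens": 75, "output_tokens": 32},
    "billed_units": {"input_tokens": 70, "output_tokens": 32},
    "cached_tokens": 0,
}
```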
Mistral
Verified against: mistralai/client-python / src/mistralai/client/models/usageinfo.py.
The SDK type declares prompt_audio_seconds as Optional[int], but observed wire responses on Mistral’s agents endpoint have shown the field emitted as {} (an empty dict) when no audio is present in the prompt (visible in Mistral’s documented response examples). AIPerf’s prompt_audio_seconds property is therefore defensive: it only coerces numeric values (int / float, excluding bool); any other type returns None rather than raising TypeError from float({}). The defensiveness is cheap and protects against both SDK and wire-format drift.
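A standalone sketch of that coercion; the real logic lives in the Usage property, and this helper exists only for illustration.

```python
def coerce_audio_seconds(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None   # {}, None, strings, etc. read as "missing" instead of raising


assert coerce_audio_seconds(3) == 3.0
assert coerce_audio_seconds({}) is None   # the observed empty-dict sentinel
assert coerce_audio_seconds(True) is None
```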
Note: prompt_audio_seconds is in MetricTimeUnit.SECONDS, distinct from UsagePromptAudioTokensMetric which is in GenericMetricUnit.TOKENS. The two metrics can coexist for the same response when Mistral reports both.
Groq
Verified against: groq-python / src/groq/types/completion_usage.py.
Token fields are pure OpenAI shape. The four *_time fields are server-side timing in seconds — useful for performance benchmarks (queue time + prompt time + completion time = end-to-end latency components). Currently preserved on the dict but not surfaced as metrics. Adding them as optional BaseUsageRecordMetric[float] subclasses with MetricTimeUnit.SECONDS would be a small follow-up if Groq benchmarking becomes a priority.
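An invented-values fragment showing how those timing fields decompose; the field spellings below follow the queue/prompt/completion naming the text describes and should be treated as illustrative rather than authoritative.

```python
groq_usage = {
    "prompt_tokens": 40,
    "completion_tokens": 200,
    "total_tokens": 240,
    "queue_time": 0.004,
    "prompt_time": 0.012,
    "completion_time": 0.310,
    "total_time": 0.326,
}
components = ("queue_time", "prompt_time", "completion_time")
assert abs(sum(groq_usage[k] for k in components) - groq_usage["total_time"]) < 1e-6
```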
Together AI / Fireworks / Replicate / Azure OpenAI
These are passthrough proxies that emit OpenAI-compatible usage shapes. Together was verified via together-python / src/together/types/common.py, and Fireworks via fw-ai-external/python-sdk / src/fireworks/types/shared/usage_info.py.
Replicate’s SDK does not declare a fixed Usage type because it passes through whatever the underlying hosted model emits. Azure OpenAI uses the openai-python SDK directly, so it inherits OpenAI’s exact shape.
No vendor-specific changes needed for any of these; they’re covered by the OpenAI synonyms.
Cerebras
Verified against: Cerebras/cerebras-cloud-sdk-python / src/cerebras/cloud/sdk/types/chat/chat_completion.py.
OpenAI-shape token-count fields (Stainless-generated SDK), but the *_tokens_details sub-objects are a strict subset of OpenAI’s: no audio_tokens in either, no reasoning_tokens in completion details. AIPerf’s broader OpenAI-shape coverage is forward-compatible — Cerebras responses simply don’t populate the missing inner keys, and the corresponding metrics raise NoMetricValue rather than crashing.
AI21 Labs
Verified against: AI21Labs/ai21-python / ai21/models/usage_info.py.
Minimal OpenAI-shape — only the three baseline fields. No nested details, no cache info, no extras. Already covered.
SambaNova
Verified against: sambanova/sambanova-python / src/sambanova/types/chat/chat_completion_response.py.
The Usage class is unusually rich because SambaNova bakes server-side timing/throughput data directly into the usage envelope alongside the token counts.
Modelled: all token-count fields via OpenAI synonyms.
Not modelled (preserved on dict): the rich timing/throughput data. AIPerf computes equivalents client-side (TTFTMetric, RequestLatencyMetric, OutputTokenThroughputPerUserMetric, InterTokenLatencyMetric); SambaNova’s server-side measurements are parallel/redundant signals. They could be surfaced as their own metrics if a workflow needed server-vs-client divergence checking.
Bailian / DashScope (Alibaba Qwen)
Verified against: dashscope/dashscope-sdk-python / dashscope/api_entities/dashscope_response.py.
Modelled: input_tokens and output_tokens are already in PROMPT_TOKENS_KEYS / COMPLETION_TOKENS_KEYS (Anthropic-shape synonyms).
Notable absences: no total_tokens field (in either Bailian variant). The total_tokens property returns None for native DashScope responses; callers that need it can compute input_tokens + output_tokens themselves.
Not modelled: characters (multimodal-only). It represents image/audio inputs measured in characters rather than tokens — useful for billing reconciliation but not a standard cross-vendor metric.
Note: Bailian also offers an OpenAI-compatible REST endpoint (compatible-mode) that emits standard OpenAI shape. AIPerf benchmarking either endpoint is supported.
Vertex AI (Gemini)
Verified against: googleapis/python-aiplatform / google/cloud/aiplatform_v1/types/usage_metadata.py (the protobuf message definition).
The Python proto attributes are snake_case but Google’s proto JSON serialization emits camelCase on the wire (per the protobuf JSON style: prompt_token_count → promptTokenCount). This matches Gemini Direct’s wire format exactly. Already covered by the existing Gemini synonyms.
The traffic_type enum (ON_DEMAND vs PROVISIONED_THROUGHPUT) is Vertex-specific — useful for cost attribution but not modelled as a metric. Preserved on the dict.
IBM watsonx
Verified against: IBM watsonx text generation API documentation. The IBM/ibm-watsonx-ai GitHub repo I cloned was a stub (README only) and has since been removed (returns 404 as of the verification re-check); the real Python SDK ships only via PyPI / IBM Cloud Pak Foundation Models endpoints, and I did not download it. This vendor is therefore documented from API reference rather than SDK type definitions — flagged here so future maintainers know it’s the lowest-confidence entry in this catalog.
watsonx is the only verified vendor that does not wrap usage in a usage (or equivalent) envelope. Token counts are emitted as response-root fields.
Modelled (added to synonym lists at lowest precedence): input_token_count (in PROMPT_TOKENS_KEYS), generated_token_count (in COMPLETION_TOKENS_KEYS). No total_tokens analog — callers needing it should compute the sum themselves.
Caveat: because watsonx has no usage envelope, an AIPerf parser for watsonx would need to either pass the response-root dict to Usage() directly or pluck out the relevant fields. The synonym lookup handles either approach.
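An invented-values watsonx generation result fragment showing the envelope-less shape; either the whole dict or a plucked subset can be handed to the synonym lookup.

```python
watsonx_result = {
    "generated_text": "...",
    "input_token_count": 412,
    "generated_token_count": 57,
}

# Either approach works with the synonym lookup:
full = dict(watsonx_result)   # pass the response root as-is
plucked = {k: watsonx_result[k] for k in ("input_token_count", "generated_token_count")}
```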
xAI Grok
Verified against: xai-org/xai-sdk-python / src/xai_sdk/chat.py.
xAI offers two APIs: a native gRPC API and an OpenAI-compatible REST endpoint at https://api.x.ai/v1/chat/completions.
The gRPC path exposes additional fields not present in the REST shape:
- cached_prompt_text_tokens — cache hits (top-level, not nested)
- reasoning_tokens — top-level (not under completion_tokens_details)
- prompt_text_tokens, prompt_image_tokens — multimodal input split
- cost_in_usd_ticks — pricing in micro-cents
AIPerf does not model these because we benchmark via REST endpoints, not gRPC. The REST endpoint is OpenAI-compatible, so xAI usage flows through the existing OpenAI synonyms.
If gRPC-native xAI benchmarking is ever needed, adding the four gRPC field names to the appropriate *_KEYS lists would be a one-line change per field.
Adding a new vendor: checklist
When you encounter a vendor not yet supported:
- Find the SDK source for the vendor. Look for the type that wraps the response’s usage field (often called Usage, UsageInfo, CompletionUsage, or similar). If no SDK exists, find the API documentation’s response schema.
- Identify the wrapper. Is the usage field at the response root, nested inside usage, nested inside usageMetadata, or in some other envelope? Snake_case or camelCase? If a Python SDK uses Pydantic with alias_generator=to_camel, the wire format is camelCase even though Python sees snake_case.
- Map each token-count field to AIPerf’s properties. Look for synonyms of prompt_tokens, completion_tokens, total_tokens, reasoning_tokens, cache reads, cache writes, etc. Add any new field names to the appropriate *_KEYS list in Usage.
- Identify any genuinely novel concepts (i.e. fields with no AIPerf-side analog). If they’re token-shaped and useful, add a new BaseUsageRecordMetric subclass in usage_extras_metrics.py (or usage_cache_metrics.py for cache-related fields) plus a matching DerivedSumMetric total in usage_total_metrics.py. Subclass declarations are 5–10 lines: just tag, header, unit, flags, usage_field, missing_message (a hypothetical sketch follows this checklist).
- If the vendor uses an envelope (like Gemini’s usageMetadata or Cohere’s meta), extend Usage.__init__ to unwrap it. Use setdefault so original keys win on collision.
- Add a fixture to tests/unit/common/models/test_usage_models_adversarial.py::VENDOR_FIXTURES with a verbatim payload from the vendor’s docs. Add it to the parametrized basic-token-count test.
- Add specific tests for any novel fields the vendor introduces (e.g. cache misses, audio durations, modality breakdowns).
- Update this document. Add a row to the quick-reference table and a per-vendor section with the SDK source citation.
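A hypothetical sketch of the 5–10 line metric-subclass shape the checklist describes. The base class and unit enum are AIPerf internals referenced elsewhere in this document; the class name, tag, header, field value, and flag choice below are stand-ins, so check usage_extras_metrics.py for the real signatures before copying.

```python
class UsageServerQueueTimeMetric(BaseUsageRecordMetric):  # base class from AIPerf
    tag = "usage_server_queue_time"                        # metric identifier
    header = "Server Queue Time"                           # display name
    unit = MetricTimeUnit.SECONDS                          # same unit enum used for prompt_audio_seconds
    flags = ...                                            # whichever flag set optional usage metrics use
    usage_field = "queue_time"                             # key read off the normalized Usage dict
    missing_message = "usage did not report queue_time"
```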
Change history
- 2026-05 — Initial cross-vendor verification. Added support for Gemini usageMetadata, AWS Bedrock camelCase, DeepSeek prompt_cache_hit_tokens / prompt_cache_miss_tokens, Mistral prompt_audio_seconds, Cohere v1 meta and v2 usage envelopes. Three real bugs found and fixed during SDK-source verification: the Cohere v1 meta.cached_tokens lift, the Cohere v2 envelope (no meta wrapper), and the Mistral {} sentinel defense.
- 2026-05 — Second-wave SDK-source verification covering AI21, Cerebras, SambaNova, Bailian/DashScope, Vertex AI, Fireworks, IBM watsonx. Added input_token_count (watsonx) to PROMPT_TOKENS_KEYS and generated_token_count (watsonx) to COMPLETION_TOKENS_KEYS. SambaNova’s rich server-side timing fields catalogued as preserved-on-dict (parallel to client-computed metrics). Bailian’s multimodal characters field catalogued as a non-token billing unit. Vertex AI confirmed identical to Gemini direct.