This document catalogues the exact JSON shape of the usage field that each LLM provider returns in chat / completion responses, cross-referenced against their official SDK source code. It exists so that:
The verification work behind this document was performed by inspecting each provider’s Python SDK source (or REST API documentation when no SDK type was available). All conclusions are dated against the SDK / docs commit at the time of verification (early 2026).
AIPerf wraps every API-reported usage dict in a Usage class (src/aiperf/common/models/usage_models.py). On construction, the recognized vendor envelopes listed below are unwrapped to the top level so all properties read from a single flat dict:
- usageMetadata → top-level (lifts promptTokenCount, candidatesTokenCount, etc.).
- meta → top-level (lifts meta.tokens.{input,output}_tokens, meta.cached_tokens).
- tokens sub-dict → top-level (lifts tokens.{input,output}_tokens).
The original keys are preserved if a normalized key would collide; the original wins.
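The unwrapping rules above can be sketched as follows. This is a minimal illustration, not AIPerf's actual Usage.__init__: the function name flatten_usage is hypothetical, and the only behavior taken from this document is which envelopes are lifted and that original keys win on collision.

```python
from typing import Any, Dict


def flatten_usage(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Lift known vendor envelopes to the top level (hypothetical helper)."""
    flat = dict(raw)  # shallow copy, so every original key is preserved

    def lift(source: Any) -> None:
        # setdefault never overwrites, so an original top-level key wins.
        if isinstance(source, dict):
            for key, value in source.items():
                flat.setdefault(key, value)

    # Gemini-style envelope: usageMetadata -> top level.
    lift(raw.get("usageMetadata"))

    # Cohere v1-style envelope: meta.tokens.* and meta.cached_tokens.
    meta = raw.get("meta")
    if isinstance(meta, dict):
        lift(meta.get("tokens"))
        if "cached_tokens" in meta:
            flat.setdefault("cached_tokens", meta["cached_tokens"])

    # Cohere v2-style: a top-level tokens sub-dict.
    lift(raw.get("tokens"))
    return flat
```

The payload values used to exercise this sketch are made up; only the key names come from the list above.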
After normalization, each property reads through an ordered synonym list (the *_KEYS class attributes). The first present key wins. Properties return None when no synonym is present, so 0 is correctly distinguished from “missing”.
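A minimal sketch of the ordered synonym lookup is shown here. The contents and order of the tuple are illustrative only (the real *_KEYS attributes are not reproduced in this document), and treating an explicit null value as "not present" is an assumption of this sketch.

```python
from typing import Any, Dict, Optional, Sequence

# Illustrative synonym list; the actual PROMPT_TOKENS_KEYS order may differ.
PROMPT_TOKENS_KEYS_EXAMPLE = (
    "prompt_tokens",
    "input_tokens",
    "promptTokenCount",
    "inputTokens",
    "input_token_count",
)


def first_present(usage: Dict[str, Any], keys: Sequence[str]) -> Optional[Any]:
    """Return the value of the first synonym found, or None if none is present."""
    for key in keys:
        value = usage.get(key)
        if value is not None:  # 0 is a real value and must not be skipped
            return value
    return None
```

Checking `is not None` rather than truthiness is what keeps a reported 0 distinct from a missing field.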
Verified against: openai-python / src/openai/types/completion_usage.py.
All field names match AIPerf’s modelled synonyms. cached_tokens is read-only on OpenAI (writes are transparent and free), so we do not raise NoMetricValue for OpenAI when the cache-write metric is queried — we just return None. OpenAI does NOT surface a separate cache-miss count; you can derive it from prompt_tokens - prompt_tokens_details.cached_tokens if needed.
Verified against: vllm / vllm/entrypoints/openai/engine/protocol.py.
vLLM is OpenAI-compatible. Its prompt_tokens_details is narrower than OpenAI’s (only cached_tokens, no audio_tokens). vLLM may emit prompt_tokens_details: null and completion_tokens_details: null explicitly; AIPerf’s nested-field walk handles that case (the isinstance(details, dict) guard returns False, and the property returns None).
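The nested-field walk and the derived cache-miss count for OpenAI-shape responses can be sketched together. Both function names are hypothetical and the example numbers are made up; the sketch only shows the isinstance guard and the subtraction described above.

```python
from typing import Any, Dict, Optional


def nested_count(usage: Dict[str, Any], outer: str, inner: str) -> Optional[int]:
    """Read usage[outer][inner], tolerating a missing or null outer object."""
    details = usage.get(outer)
    if not isinstance(details, dict):  # covers both absent and explicit null
        return None
    value = details.get(inner)
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def derived_cache_miss_tokens(usage: Dict[str, Any]) -> Optional[int]:
    """prompt_tokens minus prompt_tokens_details.cached_tokens, when both exist."""
    prompt = usage.get("prompt_tokens")
    cached = nested_count(usage, "prompt_tokens_details", "cached_tokens")
    if not isinstance(prompt, int) or cached is None:
        return None
    return prompt - cached
```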
Verified against: anthropic-sdk-python / src/anthropic/types/usage.py, message_delta_usage.py, cache_creation.py, and server_tool_usage.py.
Streaming chunks use MessageDeltaUsage, which carries the same cache and token fields as Usage (for our purposes, a non-streaming Usage and a streaming MessageDeltaUsage have the same shape).
Modelled: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens.
Not modelled (preserved on dict):
- cache_creation TTL breakdown (sum of ephemeral_1h_input_tokens + ephemeral_5m_input_tokens equals the parent cache_creation_input_tokens). Could be added if TTL-aware analysis is needed.
- server_tool_use (web_fetch_requests, web_search_requests). Non-token metadata.
- service_tier (“standard”/“priority”/“batch”). String label, not a count.
- inference_geo. String label.
Verified against: google-genai / google/genai/types.py (GenerateContentResponseUsageMetadata) and _common.py (alias_generator=to_camel).
The Python SDK declares fields in snake_case for Python ergonomics, but the Pydantic alias_generator=to_camel config means the wire (JSON) format is camelCase. AIPerf operates at the JSON level, so the camelCase names are what we synonym-match.
Wire-format field names (after to_camel): cachedContentTokenCount, candidatesTokenCount, promptTokenCount, thoughtsTokenCount, toolUsePromptTokenCount, totalTokenCount.
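To make the snake_case to camelCase relationship concrete, here is a small sketch. The helper below is a hand-written stand-in, not Pydantic's own to_camel; for simple lowercase names like these it is assumed to produce the same result.

```python
def to_camel(snake: str) -> str:
    """Convert a snake_case name to lower camelCase (illustrative stand-in)."""
    head, *rest = snake.split("_")
    return head + "".join(part.capitalize() for part in rest)


# Python-side attribute names, paired with the wire names listed above.
SDK_FIELDS = [
    "cached_content_token_count",
    "candidates_token_count",
    "prompt_token_count",
    "thoughts_token_count",
    "tool_use_prompt_token_count",
    "total_token_count",
]

WIRE_FIELDS = [to_camel(name) for name in SDK_FIELDS]
```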
The whole object is wrapped in usageMetadata at the response top level; AIPerf’s Usage.__init__ unwraps it.
Not modelled (preserved on dict): the four *Details[] arrays of ModalityTokenCount objects (per-modality breakdowns: TEXT / IMAGE / AUDIO / VIDEO). Useful for multimodal benchmarks where you want to know what fraction of input tokens were images, but currently surfaced verbatim as a list rather than as a metric.
Note on prompt_token_count: Gemini’s docs say “When cached_content is set, prompt_token_count includes the number of tokens in the cached content.” So for Gemini, prompt_tokens is total-including-cached, and cached_content_token_count is the subset that was cached. This matches OpenAI’s semantics, where prompt_tokens is the total and cached_tokens is the subset of those that hit cache.
Verified against: AWS Bedrock TokenUsage API reference. No Python SDK clone needed — boto3 follows the documented API verbatim.
Modelled: inputTokens, outputTokens, totalTokens, cacheReadInputTokens, cacheWriteInputTokens. All synonyms in the *_KEYS lists.
Not modelled (preserved on dict): cacheDetails[] TTL breakdown array.
Note that Bedrock’s field names exactly match Anthropic’s concept names but use camelCase. This is because Bedrock primarily proxies Anthropic models and converted the snake_case names to camelCase for AWS API conventions. The semantic mapping is one-to-one:
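The pairing can be written down as a lookup table. The pairs below are inferred from the modelled-field lists for Anthropic and Bedrock in this document rather than copied from an official mapping, and the helper name is hypothetical. Note that totalTokens has no Anthropic counterpart among the modelled fields, so it is passed through unchanged.

```python
from typing import Any, Dict

# Anthropic field name -> Bedrock field name (inferred pairing).
ANTHROPIC_TO_BEDROCK = {
    "input_tokens": "inputTokens",
    "output_tokens": "outputTokens",
    "cache_read_input_tokens": "cacheReadInputTokens",
    "cache_creation_input_tokens": "cacheWriteInputTokens",
}

BEDROCK_TO_ANTHROPIC = {v: k for k, v in ANTHROPIC_TO_BEDROCK.items()}


def bedrock_to_anthropic_names(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Rename Bedrock keys to Anthropic names; unknown keys are kept as-is."""
    return {BEDROCK_TO_ANTHROPIC.get(key, key): value for key, value in usage.items()}
```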
Verified against: DeepSeek API documentation.
Modelled: all of the above. prompt_cache_hit_tokens is mapped to prompt_cache_read_tokens via the synonym list. prompt_cache_miss_tokens is its own first-class metric (UsagePromptCacheMissTokensMetric) since DeepSeek bills hits and misses at different rates and no other vendor surfaces the miss count as its own field.
Invariant: prompt_tokens == prompt_cache_hit_tokens + prompt_cache_miss_tokens for DeepSeek responses. AIPerf has a test asserting this end-to-end.
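A minimal sketch of checking that invariant on a single usage dict follows. The function name is hypothetical and the numbers in the example are made up; this is not the AIPerf test itself.

```python
from typing import Any, Dict


def deepseek_cache_invariant_holds(usage: Dict[str, Any]) -> bool:
    """True when prompt_tokens equals cache hits plus cache misses."""
    return usage["prompt_tokens"] == (
        usage["prompt_cache_hit_tokens"] + usage["prompt_cache_miss_tokens"]
    )


# Made-up example values for illustration only.
example_usage = {
    "prompt_tokens": 120,
    "prompt_cache_hit_tokens": 96,
    "prompt_cache_miss_tokens": 24,
    "completion_tokens": 40,
}
```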
Cohere has TWO API versions with different envelopes. AIPerf handles both.
v1, verified against: cohere-python / src/cohere/types/api_meta.py and api_meta_tokens.py.
The meta envelope is at the response root (not under a usage key). If the parser hands the full response to Usage(), meta is what’s there. AIPerf unwraps:
- meta.tokens.input_tokens → top-level (resolved via PROMPT_TOKENS_KEYS)
- meta.tokens.output_tokens → top-level (resolved via COMPLETION_TOKENS_KEYS)
- meta.cached_tokens → top-level (resolved via CACHE_READ_TOP_LEVEL_KEYS)
v2, verified against: cohere-python / src/cohere/types/usage.py, usage_tokens.py, and usage_billed_units.py.
The usage field at the response root contains billed_units, tokens, and cached_tokens directly — no meta wrapper. AIPerf treats top-level tokens (a sub-dict) the same way as meta.tokens and unwraps it. Top-level cached_tokens is in CACHE_READ_TOP_LEVEL_KEYS.
billed_units is intentionally NOT surfaced as a metric. Cohere’s billed-vs-raw distinction is a Cohere-specific accounting filter (the framework injects special tokens that count toward the raw tokens total but aren’t billed). For perf benchmarks, the raw count is what the model actually processed — which is what every other vendor reports — so we keep prompt_tokens consistent across vendors. Callers that need billing reconciliation can read usage["meta"]["billed_units"] (v1) or usage["billed_units"] (v2) directly off the underlying dict.
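For callers doing billing reconciliation, reading billed_units from either envelope might look like the sketch below. The helper name is hypothetical and the payload values are made up; only the two lookup paths come from the paragraph above.

```python
from typing import Any, Dict, Optional


def cohere_billed_units(usage: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return billed_units from a v1 (meta-wrapped) or v2 (direct) dict."""
    meta = usage.get("meta")
    if isinstance(meta, dict) and isinstance(meta.get("billed_units"), dict):
        return meta["billed_units"]  # v1: usage["meta"]["billed_units"]
    billed = usage.get("billed_units")
    if isinstance(billed, dict):
        return billed  # v2: usage["billed_units"]
    return None
```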
billed_units for chat:
- input_tokens, output_tokens: billed token counts
- search_units, classifications: non-token billable units (RAG / classification endpoints)
Verified against: mistralai/client-python / src/mistralai/client/models/usageinfo.py.
The SDK type declares prompt_audio_seconds as Optional[int], but wire responses observed on Mistral’s agents endpoint have emitted the field as {} (an empty dict) when no audio is present in the prompt, as is visible in Mistral’s documented response examples. AIPerf’s prompt_audio_seconds property is therefore defensive: it only coerces numeric values (int / float, excluding bool); any other type returns None rather than raising TypeError from float({}). The defensiveness is cheap and protects against SDK / wire-format drift.
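The defensive coercion described above can be sketched as a standalone function. This is an illustration of the rule (numeric only, bool excluded), not the actual property implementation.

```python
from typing import Any, Dict, Optional


def prompt_audio_seconds(usage: Dict[str, Any]) -> Optional[float]:
    """Coerce prompt_audio_seconds to float only when it is a real number."""
    value = usage.get("prompt_audio_seconds")
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)
```

With this guard, the {} sentinel falls through to None instead of reaching float(), which would raise TypeError.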
Note: prompt_audio_seconds is in MetricTimeUnit.SECONDS, distinct from UsagePromptAudioTokensMetric which is in GenericMetricUnit.TOKENS. The two metrics can coexist for the same response when Mistral reports both.
Verified against: groq-python / src/groq/types/completion_usage.py.
Token fields are pure OpenAI shape. The four *_time fields are server-side timing in seconds — useful for performance benchmarks (queue time + prompt time + completion time = end-to-end latency components). Currently preserved on the dict but not surfaced as metrics. Adding them as optional BaseUsageRecordMetric[float] subclasses with MetricTimeUnit.SECONDS would be a small follow-up if Groq benchmarking becomes a priority.
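Until such metrics exist, a caller could read the timing values straight off the preserved dict. The sketch below selects keys by their _time suffix so it does not depend on exact field names; the key names and values in the example payload are illustrative assumptions, not a verbatim Groq response.

```python
from typing import Any, Dict


def server_time_fields(usage: Dict[str, Any]) -> Dict[str, float]:
    """Collect numeric *_time entries (seconds) from a usage dict."""
    return {
        key: float(value)
        for key, value in usage.items()
        if key.endswith("_time")
        and isinstance(value, (int, float))
        and not isinstance(value, bool)
    }


# Illustrative payload; key names and numbers are assumptions.
example_usage = {
    "prompt_tokens": 20,
    "completion_tokens": 50,
    "queue_time": 0.25,
    "prompt_time": 0.5,
    "completion_time": 1.0,
}
```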
These are passthrough proxies that emit OpenAI-compatible usage shapes. Verified Together via together-python / src/together/types/common.py:
Verified Fireworks via fw-ai-external/python-sdk / src/fireworks/types/shared/usage_info.py:
Replicate’s SDK does not declare a fixed Usage type because it passes through whatever the underlying hosted model emits. Azure OpenAI uses the openai-python SDK directly, so it inherits OpenAI’s exact shape.
No vendor-specific changes needed for any of these; they’re covered by the OpenAI synonyms.
Verified against: Cerebras/cerebras-cloud-sdk-python / src/cerebras/cloud/sdk/types/chat/chat_completion.py.
OpenAI-shape token-count fields (Stainless-generated SDK), but the *_tokens_details sub-objects are a strict subset of OpenAI’s: no audio_tokens in either, no reasoning_tokens in completion details. AIPerf’s broader OpenAI-shape coverage is forward-compatible — Cerebras responses simply don’t populate the missing inner keys, and the corresponding metrics raise NoMetricValue rather than crashing.
Verified against: AI21Labs/ai21-python / ai21/models/usage_info.py.
Minimal OpenAI-shape — only the three baseline fields. No nested details, no cache info, no extras. Already covered.
Verified against: sambanova/sambanova-python / src/sambanova/types/chat/chat_completion_response.py.
The Usage class is unusually rich because SambaNova bakes server-side timing/throughput data directly into the usage envelope:
Modelled: all token-count fields via OpenAI synonyms.
Not modelled (preserved on dict): the rich timing/throughput data. AIPerf computes equivalents client-side (TTFTMetric, RequestLatencyMetric, OutputTokenThroughputPerUserMetric, InterTokenLatencyMetric); SambaNova’s server-side measurements are parallel/redundant signals. They could be surfaced as their own metrics if a workflow needed server-vs-client divergence checking.
Verified against: dashscope/dashscope-sdk-python / dashscope/api_entities/dashscope_response.py.
Modelled: input_tokens and output_tokens are already in PROMPT_TOKENS_KEYS / COMPLETION_TOKENS_KEYS (Anthropic-shape synonyms).
Notable absences: no total_tokens field (in either Bailian variant). The total_tokens property returns None for native DashScope responses; callers that need it can compute input_tokens + output_tokens themselves.
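A caller-side fallback for the missing total could look like the following sketch. The function name is hypothetical; it prefers a reported total_tokens and otherwise sums the two counts only when both are present.

```python
from typing import Any, Dict, Optional


def total_tokens_or_sum(usage: Dict[str, Any]) -> Optional[int]:
    """Use total_tokens when reported, else input_tokens + output_tokens."""
    total = usage.get("total_tokens")
    if total is not None:
        return total
    input_tokens = usage.get("input_tokens")
    output_tokens = usage.get("output_tokens")
    if input_tokens is None or output_tokens is None:
        return None  # not enough information to compute a total
    return input_tokens + output_tokens
```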
Not modelled: characters (multimodal-only). It represents image/audio inputs measured in characters rather than tokens — useful for billing reconciliation but not a standard cross-vendor metric.
Note: Bailian also offers an OpenAI-compatible REST endpoint (compatible-mode) that emits the standard OpenAI shape. AIPerf supports benchmarking against either endpoint.
Verified against: googleapis/python-aiplatform / google/cloud/aiplatform_v1/types/usage_metadata.py (the protobuf message definition).
The Python proto attributes are snake_case but Google’s proto JSON serialization emits camelCase on the wire (per the protobuf JSON style: prompt_token_count → promptTokenCount). This matches Gemini Direct’s wire format exactly. Already covered by the existing Gemini synonyms.
The traffic_type enum (ON_DEMAND vs PROVISIONED_THROUGHPUT) is Vertex-specific — useful for cost attribution but not modelled as a metric. Preserved on the dict.
Verified against: IBM watsonx text generation API documentation. The IBM/ibm-watsonx-ai GitHub repo I cloned was a stub (README only) and has since been removed (returns 404 as of the verification re-check); the real Python SDK ships only via PyPI / IBM Cloud Pak Foundation Models endpoints, and I did not download it. This vendor is therefore documented from API reference rather than SDK type definitions — flagged here so future maintainers know it’s the lowest-confidence entry in this catalog.
watsonx is the only verified vendor that does not wrap usage in a usage (or equivalent) envelope. Token counts are emitted as response-root fields:
Modelled (added to synonym lists at lowest precedence): input_token_count (in PROMPT_TOKENS_KEYS), generated_token_count (in COMPLETION_TOKENS_KEYS). No total_tokens analog — callers needing it should compute the sum themselves.
Caveat: because watsonx has no usage envelope, an AIPerf parser for watsonx would need to either pass the response-root dict to Usage() directly or pluck out the relevant fields. The synonym lookup handles either approach.
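The "pluck out the relevant fields" option might look like the sketch below. No such parser is described in this document, so the function name is hypothetical and the response dict is a made-up stand-in that follows the response-root layout stated above.

```python
from typing import Any, Dict

# The two watsonx count fields named in this document.
WATSONX_USAGE_FIELDS = ("input_token_count", "generated_token_count")


def pluck_watsonx_usage(response_root: Dict[str, Any]) -> Dict[str, Any]:
    """Build a small usage dict from response-root token-count fields."""
    return {
        field: response_root[field]
        for field in WATSONX_USAGE_FIELDS
        if field in response_root
    }


# Made-up response for illustration; non-usage keys are ignored.
example_response = {
    "generated_text": "hello",
    "input_token_count": 9,
    "generated_token_count": 2,
}
```

The resulting dict could then be handed to Usage(), where the synonym lists resolve both names.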
Verified against: xai-org/xai-sdk-python / src/xai_sdk/chat.py.
xAI offers two APIs: a native gRPC API and an OpenAI-compatible REST endpoint at https://api.x.ai/v1/chat/completions.
The gRPC path exposes additional fields not present in the REST shape:
- cached_prompt_text_tokens: cache hits (top-level, not nested)
- reasoning_tokens: top-level (not under completion_tokens_details)
- prompt_text_tokens, prompt_image_tokens: multimodal input split
- cost_in_usd_ticks: pricing in micro-cents
AIPerf does not model these because we benchmark via REST endpoints, not gRPC. The REST endpoint is OpenAI-compatible, so xAI usage flows through the existing OpenAI synonyms.
If gRPC-native xAI benchmarking is ever needed, adding the gRPC field names to the appropriate *_KEYS lists would be a one-line change per field.
When you encounter a vendor not yet supported:
1. Find the SDK type for the usage field (often called Usage, UsageInfo, CompletionUsage, or similar). If no SDK exists, find the API documentation’s response schema.
2. Check the wire shape: is the data under usage, nested inside usageMetadata, or in some other envelope? Snake-case or camelCase? If a Python SDK uses Pydantic with alias_generator=to_camel, the wire format is camelCase even though Python sees snake_case.
3. Map the vendor’s fields onto prompt_tokens, completion_tokens, total_tokens, reasoning_tokens, cache reads, cache writes, etc. Add any new field names to the appropriate *_KEYS list in Usage.
4. If the vendor reports a count that no existing metric covers, add a BaseUsageRecordMetric subclass in usage_extras_metrics.py (or usage_cache_metrics.py for cache-related) plus a matching DerivedSumMetric total in usage_total_metrics.py. Subclass declarations are 5–10 lines: just tag, header, unit, flags, usage_field, missing_message.
5. If the vendor uses a new envelope (like Gemini’s usageMetadata or Cohere’s meta), extend Usage.__init__ to unwrap it. Use setdefault so original keys win on collision.
6. Add a fixture to tests/unit/common/models/test_usage_models_adversarial.py::VENDOR_FIXTURES with a verbatim payload from the vendor’s docs. Add it to the parametrized basic-token-count test.
Verification notes:
- Shapes verified include Gemini usageMetadata, AWS Bedrock camelCase, DeepSeek prompt_cache_hit_tokens/prompt_cache_miss_tokens, Mistral prompt_audio_seconds, and the Cohere v1 meta and v2 usage envelopes. Three real bugs were found and fixed during SDK-source verification: the Cohere v1 meta.cached_tokens lift, the Cohere v2 envelope (no meta wrapper), and the Mistral {} sentinel defense.
- Added input_token_count (watsonx) to PROMPT_TOKENS_KEYS and generated_token_count (watsonx) to COMPLETION_TOKENS_KEYS. SambaNova’s rich server-side timing fields are catalogued as preserved-on-dict (parallel to client-computed metrics). Bailian’s multimodal characters field is catalogued as a non-token billing unit. Vertex AI is confirmed identical to Gemini direct.