Vendor Usage Field Reference

This document catalogues the exact JSON shape of the usage field that each LLM provider returns in chat / completion responses, cross-referenced against their official SDK source code. It exists so that:

  • A maintainer adding a new vendor knows what to look for and where existing vendors agree or differ.
  • A debugger investigating “why doesn’t my usage metric show a value” can find the canonical field-name list per vendor.
  • A reviewer of a future usage-parsing change can verify that no vendor’s wire format was missed.

The verification work behind this document was performed by inspecting each provider’s Python SDK source (or REST API documentation when no SDK type was available). All conclusions are dated against the SDK / docs commit at the time of verification (early 2026).

Quick reference: vendor shape map

| Vendor | Wrapper | Token-count fields | Cache fields | Notable extras |
|---|---|---|---|---|
| OpenAI | flat usage | prompt_tokens, completion_tokens, total_tokens | prompt_tokens_details.cached_tokens (read-only) | nested *_tokens_details for audio / reasoning / prediction |
| vLLM | flat usage | OpenAI-shape | prompt_tokens_details.cached_tokens | matches OpenAI; sometimes emits prompt_tokens_details: null |
| Anthropic | flat usage | input_tokens, output_tokens | cache_creation_input_tokens, cache_read_input_tokens | cache_creation TTL sub-object; service_tier; server_tool_use |
| Google Gemini | usageMetadata envelope (camelCase) | promptTokenCount, candidatesTokenCount, totalTokenCount | cachedContentTokenCount (read-only) | thoughtsTokenCount, toolUsePromptTokenCount, modality *Details[] arrays |
| AWS Bedrock | flat usage (camelCase) | inputTokens, outputTokens, totalTokens | cacheReadInputTokens, cacheWriteInputTokens | cacheDetails[] TTL array |
| DeepSeek | flat usage | OpenAI-shape | prompt_cache_hit_tokens, prompt_cache_miss_tokens | OpenAI-style completion_tokens_details.reasoning_tokens for thinking mode |
| Cohere v1 | meta envelope (response root) | meta.tokens.{input,output}_tokens | meta.cached_tokens | meta.billed_units (raw vs billed split); api_version; warnings[] |
| Cohere v2 | flat usage | top-level tokens.{input,output}_tokens | top-level cached_tokens | top-level billed_units (same split) |
| Mistral | flat usage | OpenAI-shape | OpenAI-style nested cached_tokens | prompt_audio_seconds (audio duration, NOT tokens; emits {} sentinel when absent) |
| Groq | flat usage | OpenAI-shape | OpenAI-shape | per-stage timings: prompt_time, completion_time, queue_time, total_time (seconds) |
| Together / Fireworks / Replicate | flat usage | OpenAI-shape | OpenAI-shape | passthrough proxies; whatever the underlying model emits |
| Cerebras | flat usage | OpenAI-shape | OpenAI-shape (prompt_tokens_details.cached_tokens) | OpenAI-compatible Stainless-generated SDK |
| AI21 Labs | flat usage | OpenAI-shape | n/a | basic prompt_tokens / completion_tokens / total_tokens only |
| SambaNova | flat usage | OpenAI-shape | OpenAI-shape | rich server-side timing/throughput (time_to_first_token, total_latency, acceptance_rate, *_tokens_per_sec, etc.) |
| Bailian / DashScope (Alibaba Qwen) | flat usage | input_tokens / output_tokens (Anthropic-style) | n/a | multimodal endpoint adds characters (non-token billing); OpenAI-compat endpoint emits OpenAI shape |
| Vertex AI (Gemini) | usageMetadata envelope | same camelCase as Gemini direct | same | identical wire format to Gemini |
| IBM watsonx | response root (no usage envelope) | input_token_count, generated_token_count | n/a | distinct _count suffix; sibling fields stop_reason, response_time at response root too |
| xAI Grok (REST) | flat usage | OpenAI-shape | OpenAI-shape | xAI’s REST endpoint is OpenAI-compatible |
| xAI Grok (gRPC) | proto message | prompt_tokens, completion_tokens, total_tokens | cached_prompt_text_tokens (top-level) | top-level reasoning_tokens, prompt_text_tokens, prompt_image_tokens, cost_in_usd_ticks — NOT exposed via REST so AIPerf doesn’t model them |

How AIPerf normalizes these shapes

AIPerf wraps every API-reported usage dict in a Usage class (src/aiperf/common/models/usage_models.py). On construction, two recognized vendor envelopes are unwrapped to the top level so all properties read from a single flat dict:

  • Gemini usageMetadata → top-level (lifts promptTokenCount, candidatesTokenCount, etc.).
  • Cohere v1 meta → top-level (lifts meta.tokens.{input,output}_tokens, meta.cached_tokens).
  • Cohere v2 top-level tokens sub-dict → top-level (lifts tokens.{input,output}_tokens).

If a lifted key would collide with a key already present at the top level, the original top-level key is preserved and its value wins.

After normalization, each property reads through an ordered synonym list (the *_KEYS class attributes). The first present key wins. Properties return None when no synonym is present, so 0 is correctly distinguished from “missing”.
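
For illustration, a first-present-key-wins lookup can be as simple as the sketch below. The PROMPT_TOKENS_KEYS contents are drawn from the vendor fields documented here, but the exact ordering and the internals of Usage are simplified; only the watsonx key is known (from the watsonx section below) to sit at lowest precedence.

PROMPT_TOKENS_KEYS = (
    "prompt_tokens",        # OpenAI and OpenAI-compatible vendors
    "input_tokens",         # Anthropic, DashScope
    "promptTokenCount",     # Gemini / Vertex AI
    "inputTokens",          # AWS Bedrock
    "input_token_count",    # IBM watsonx (lowest precedence)
)

def resolve(usage: dict, keys: tuple[str, ...]) -> int | None:
    """Return the value of the first synonym present, or None if all are missing."""
    for key in keys:
        if key in usage:
            return usage[key]
    return None   # None, not 0, so "missing" stays distinguishable from a real zero

resolve({"promptTokenCount": 42}, PROMPT_TOKENS_KEYS)   # -> 42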

Per-vendor verification details

OpenAI

Verified against: openai-python / src/openai/types/completion_usage.py.

class CompletionUsage(BaseModel):
    completion_tokens: int
    prompt_tokens: int
    total_tokens: int
    completion_tokens_details: Optional[CompletionTokensDetails] = None
    prompt_tokens_details: Optional[PromptTokensDetails] = None

class CompletionTokensDetails(BaseModel):
    accepted_prediction_tokens: Optional[int] = None
    audio_tokens: Optional[int] = None
    reasoning_tokens: Optional[int] = None
    rejected_prediction_tokens: Optional[int] = None

class PromptTokensDetails(BaseModel):
    audio_tokens: Optional[int] = None
    cached_tokens: Optional[int] = None

All field names match AIPerf’s modelled synonyms. cached_tokens is read-only on OpenAI (writes are transparent and free), so we do not raise NoMetricValue for OpenAI when the cache-write metric is queried — we just return None. OpenAI does NOT surface a separate cache-miss count; you can derive it from prompt_tokens - prompt_tokens_details.cached_tokens if needed.
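
For example, deriving the miss count from a parsed response body (illustrative values; resp stands in for the decoded JSON):

resp = {"usage": {"prompt_tokens": 1000,
                  "prompt_tokens_details": {"cached_tokens": 600}}}
usage = resp["usage"]
cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
cache_miss = usage["prompt_tokens"] - cached   # 400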

vLLM

Verified against: vllm / vllm/entrypoints/openai/engine/protocol.py.

class UsageInfo(OpenAIBaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: int | None = 0
    prompt_tokens_details: PromptTokenUsageInfo | None = None

class PromptTokenUsageInfo(OpenAIBaseModel):
    cached_tokens: int | None = None

vLLM is OpenAI-compatible. Its prompt_tokens_details is narrower than OpenAI’s (only cached_tokens, no audio_tokens). vLLM may emit prompt_tokens_details: null and completion_tokens_details: null explicitly; AIPerf’s nested-field walk handles that case (the isinstance(details, dict) guard returns False, and the property returns None).
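
A standalone sketch of that kind of guarded walk (the helper name is hypothetical; the real logic lives in the Usage properties):

def nested_cached_tokens(usage: dict) -> int | None:
    details = usage.get("prompt_tokens_details")
    if not isinstance(details, dict):   # covers both a missing key and an explicit null/None
        return None
    return details.get("cached_tokens")

nested_cached_tokens({"prompt_tokens": 10, "prompt_tokens_details": None})   # -> None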

Anthropic

Verified against: anthropic-sdk-python / src/anthropic/types/usage.py, message_delta_usage.py, cache_creation.py, and server_tool_usage.py.

class Usage(BaseModel):
    cache_creation: Optional[CacheCreation] = None
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None
    inference_geo: Optional[str] = None
    input_tokens: int
    output_tokens: int
    server_tool_use: Optional[ServerToolUsage] = None
    service_tier: Optional[Literal["standard", "priority", "batch"]] = None

class CacheCreation(BaseModel):
    ephemeral_1h_input_tokens: int
    ephemeral_5m_input_tokens: int

class ServerToolUsage(BaseModel):
    web_fetch_requests: int
    web_search_requests: int

Streaming responses report usage via MessageDeltaUsage, which carries the same token and cache fields as Usage, so streaming and non-streaming responses present the same shape for our purposes.

Modelled: input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens.

Not modelled (preserved on dict):

  • cache_creation TTL breakdown (the sum of ephemeral_1h_input_tokens + ephemeral_5m_input_tokens equals the parent cache_creation_input_tokens; see the sanity-check sketch after this list). Could be added if TTL-aware analysis is needed.
  • server_tool_use (web_fetch_requests, web_search_requests). Non-token metadata.
  • service_tier (“standard”/“priority”/“batch”). String label, not a count.
  • inference_geo. String label.
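
The TTL sanity check mentioned in the first bullet, as an illustrative snippet (values are made up):

usage = {"cache_creation_input_tokens": 2048,
         "cache_creation": {"ephemeral_1h_input_tokens": 1536,
                            "ephemeral_5m_input_tokens": 512}}
ttl = usage["cache_creation"]
assert (ttl["ephemeral_1h_input_tokens"] + ttl["ephemeral_5m_input_tokens"]
        == usage["cache_creation_input_tokens"])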

Google Gemini

Verified against: google-genai / google/genai/types.py (GenerateContentResponseUsageMetadata) and _common.py (alias_generator=to_camel).

The Python SDK declares fields in snake_case for Python ergonomics, but the Pydantic alias_generator=to_camel config means the wire (JSON) format is camelCase. AIPerf operates at the JSON level, so the camelCase names are what we synonym-match.

class GenerateContentResponseUsageMetadata(BaseModel):
    cached_content_token_count: Optional[int]
    candidates_token_count: Optional[int]
    prompt_token_count: Optional[int]
    thoughts_token_count: Optional[int]
    tool_use_prompt_token_count: Optional[int]
    total_token_count: Optional[int]

    # Modality-detail breakdown arrays (not modelled)
    cache_tokens_details: Optional[list[ModalityTokenCount]]
    candidates_tokens_details: Optional[list[ModalityTokenCount]]
    prompt_tokens_details: Optional[list[ModalityTokenCount]]
    tool_use_prompt_tokens_details: Optional[list[ModalityTokenCount]]
    traffic_type: Optional[TrafficType]

Wire-format field names (after to_camel): cachedContentTokenCount, candidatesTokenCount, promptTokenCount, thoughtsTokenCount, toolUsePromptTokenCount, totalTokenCount.

The whole object is wrapped in usageMetadata at the response top level; AIPerf’s Usage.__init__ unwraps it.

Not modelled (preserved on dict): the four *Details[] arrays of ModalityTokenCount objects (per-modality breakdowns: TEXT / IMAGE / AUDIO / VIDEO). Useful for multimodal benchmarks where you want to know what fraction of input tokens were images, but currently surfaced verbatim as a list rather than as a metric.

Note on prompt_token_count: Gemini’s docs say “When cached_content is set, prompt_token_count includes the number of tokens in the cached content.” So for Gemini, prompt_tokens is total-including-cached, and cached_content_token_count is the subset that was cached. This matches OpenAI’s semantics, where prompt_tokens is the total and cached_tokens is the subset that hit the cache.
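
So, for example, the non-cached portion of a Gemini prompt can be derived directly from the wire fields (illustrative values):

um = {"promptTokenCount": 900, "cachedContentTokenCount": 700,
      "candidatesTokenCount": 120, "totalTokenCount": 1020}
uncached_prompt = um["promptTokenCount"] - um.get("cachedContentTokenCount", 0)   # 200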

AWS Bedrock

Verified against: AWS Bedrock TokenUsage API reference. No Python SDK clone needed — boto3 follows the documented API verbatim.

TokenUsage:
  inputTokens: int                  (required)
  outputTokens: int                 (required)
  totalTokens: int                  (required)
  cacheReadInputTokens: int         (optional)
  cacheWriteInputTokens: int        (optional)
  cacheDetails: list[CacheDetail]   (optional, sorted by TTL: 1h before 5m)

Modelled: inputTokens, outputTokens, totalTokens, cacheReadInputTokens, cacheWriteInputTokens. All synonyms in the *_KEYS lists.

Not modelled (preserved on dict): cacheDetails[] TTL breakdown array.

Note that Bedrock’s field names exactly match Anthropic’s concept names but use camelCase. This is because Bedrock primarily proxies Anthropic models and converted the snake_case names to camelCase for AWS API conventions. The semantic mapping is one-to-one:

| Anthropic | Bedrock |
|---|---|
| input_tokens | inputTokens |
| output_tokens | outputTokens |
| cache_read_input_tokens | cacheReadInputTokens |
| cache_creation_input_tokens | cacheWriteInputTokens |

DeepSeek

Verified against: DeepSeek API documentation.

usage:
  prompt_tokens: int
  completion_tokens: int
  total_tokens: int
  prompt_cache_hit_tokens: int      # DeepSeek-specific
  prompt_cache_miss_tokens: int     # DeepSeek-specific (genuinely novel)
  completion_tokens_details:        # OpenAI-shape (thinking mode)
    reasoning_tokens: int

Modelled: all of the above. prompt_cache_hit_tokens is mapped to prompt_cache_read_tokens via the synonym list. prompt_cache_miss_tokens is its own first-class metric (UsagePromptCacheMissTokensMetric) since DeepSeek bills hits and misses at different rates and no other vendor surfaces the miss count as its own field.

Invariant: prompt_tokens == prompt_cache_hit_tokens + prompt_cache_miss_tokens for DeepSeek responses. AIPerf has a test asserting this end-to-end.
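
The invariant is cheap to check against a raw usage dict (illustrative values):

usage = {"prompt_tokens": 500, "prompt_cache_hit_tokens": 320,
         "prompt_cache_miss_tokens": 180, "completion_tokens": 64}
assert usage["prompt_tokens"] == (usage["prompt_cache_hit_tokens"]
                                  + usage["prompt_cache_miss_tokens"])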

Cohere

Cohere has TWO API versions with different envelopes. AIPerf handles both.

v1 — verified against: cohere-python / src/cohere/types/api_meta.py and api_meta_tokens.py.

class ApiMeta(BaseModel):
    api_version: Optional[ApiMetaApiVersion]
    billed_units: Optional[ApiMetaBilledUnits]
    tokens: Optional[ApiMetaTokens]
    cached_tokens: Optional[float]
    warnings: Optional[List[str]]

The meta envelope is at the response root (not under a usage key), so if the parser hands the full response to Usage(), meta is what it finds. AIPerf unwraps (a sketch of the lift follows this list):

  • meta.tokens.input_tokens → top-level (resolved via PROMPT_TOKENS_KEYS)
  • meta.tokens.output_tokens → top-level (resolved via COMPLETION_TOKENS_KEYS)
  • meta.cached_tokens → top-level (resolved via CACHE_READ_TOP_LEVEL_KEYS)
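
A minimal sketch of that lift, assuming payload is the decoded v1 response body (the real unwrap lives in Usage.__init__ and uses setdefault so pre-existing top-level keys win on collision):

payload = {"text": "...", "meta": {"tokens": {"input_tokens": 80, "output_tokens": 20},
                                   "cached_tokens": 64}}
meta = payload.get("meta") or {}
for key, value in (meta.get("tokens") or {}).items():
    payload.setdefault(key, value)                      # lifts input_tokens / output_tokens
if "cached_tokens" in meta:
    payload.setdefault("cached_tokens", meta["cached_tokens"])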

v2 — verified against: cohere-python / src/cohere/types/usage.py, usage_tokens.py, and usage_billed_units.py.

class Usage(BaseModel):
    billed_units: Optional[UsageBilledUnits]
    tokens: Optional[UsageTokens]
    cached_tokens: Optional[float]

The usage field at the response root contains billed_units, tokens, and cached_tokens directly — no meta wrapper. AIPerf treats top-level tokens (a sub-dict) the same way as meta.tokens and unwraps it. Top-level cached_tokens is in CACHE_READ_TOP_LEVEL_KEYS.

billed_units is intentionally NOT surfaced as a metric. Cohere’s billed-vs-raw distinction is a Cohere-specific accounting filter (the framework injects special tokens that count toward the raw tokens total but aren’t billed). For perf benchmarks, the raw count is what the model actually processed — which is what every other vendor reports — so we keep prompt_tokens consistent across vendors. Callers that need billing reconciliation can read usage["meta"]["billed_units"] (v1) or usage["billed_units"] (v2) directly off the underlying dict.

billed_units for chat:

  • input_tokens, output_tokens — billed token counts
  • search_units, classifications — non-token billable units (RAG / classification endpoints)

Mistral

Verified against: mistralai/client-python / src/mistralai/client/models/usageinfo.py.

class UsageInfo(BaseModel):
    prompt_tokens: Optional[int] = 0
    completion_tokens: Optional[int] = 0
    total_tokens: Optional[int] = 0
    prompt_audio_seconds: OptionalNullable[int] = UNSET

The SDK type declares prompt_audio_seconds as an optional int, but observed wire responses on Mistral’s agents endpoint have emitted the field as {} (an empty dict) when no audio is present in the prompt (visible in Mistral’s documented response examples). AIPerf’s prompt_audio_seconds property is therefore defensive: it only coerces numeric values (int / float, excluding bool); any other type returns None rather than raising TypeError from float({}). The defensiveness is cheap and protects against SDK or wire-format drift.
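
The coercion rule as a standalone sketch (the helper name is hypothetical; the real logic is a property on Usage):

def coerce_seconds(value) -> float | None:
    if isinstance(value, bool):          # bool is an int subclass; treat it as non-numeric
        return None
    if isinstance(value, (int, float)):
        return float(value)
    return None                          # covers the {} sentinel, None, strings, ...

coerce_seconds(3)    # -> 3.0
coerce_seconds({})   # -> None, rather than a TypeError from float({})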

Note: prompt_audio_seconds is in MetricTimeUnit.SECONDS, distinct from UsagePromptAudioTokensMetric which is in GenericMetricUnit.TOKENS. The two metrics can coexist for the same response when Mistral reports both.

Groq

Verified against: groq-python / src/groq/types/completion_usage.py.

class CompletionUsage(BaseModel):
    completion_tokens: int
    prompt_tokens: int
    total_tokens: int
    completion_time: Optional[float]   # seconds
    prompt_time: Optional[float]       # seconds
    queue_time: Optional[float]        # seconds
    total_time: Optional[float]        # seconds
    completion_tokens_details: Optional[CompletionTokensDetails]
    prompt_tokens_details: Optional[PromptTokensDetails]

class CompletionTokensDetails(BaseModel):
    reasoning_tokens: int

class PromptTokensDetails(BaseModel):
    cached_tokens: int

Token fields are pure OpenAI shape. The four *_time fields are server-side timing in seconds — useful for performance benchmarks (queue time + prompt time + completion time = end-to-end latency components). Currently preserved on the dict but not surfaced as metrics. Adding them as optional BaseUsageRecordMetric[float] subclasses with MetricTimeUnit.SECONDS would be a small follow-up if Groq benchmarking becomes a priority.
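
If they were surfaced, the decomposition would be straightforward (illustrative values; field names as documented above):

usage = {"queue_time": 0.012, "prompt_time": 0.004,
         "completion_time": 0.180, "total_time": 0.196}
server_side = usage["queue_time"] + usage["prompt_time"] + usage["completion_time"]
# server_side should roughly equal usage["total_time"]; any gap is unaccounted server overhead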

Together AI / Fireworks / Replicate / Azure OpenAI

These are passthrough proxies that emit OpenAI-compatible usage shapes. Verified Together via together-python / src/together/types/common.py:

class UsageData(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

Verified Fireworks via fw-ai-external/python-sdk / src/fireworks/types/shared/usage_info.py:

class UsageInfo(BaseModel):
    prompt_tokens: int
    total_tokens: int
    completion_tokens: Optional[int] = None
    prompt_tokens_details: Optional[PromptTokensDetails] = None   # {cached_tokens}

Replicate’s SDK does not declare a fixed Usage type because it passes through whatever the underlying hosted model emits. Azure OpenAI uses the openai-python SDK directly, so it inherits OpenAI’s exact shape.

No vendor-specific changes needed for any of these; they’re covered by the OpenAI synonyms.

Cerebras

Verified against: Cerebras/cerebras-cloud-sdk-python / src/cerebras/cloud/sdk/types/chat/chat_completion.py.

class ChatCompletionResponseUsage(BaseModel):
    completion_tokens: Optional[int]
    completion_tokens_details: Optional[ChatCompletionResponseUsageCompletionTokensDetails]
    prompt_tokens: Optional[int]
    prompt_tokens_details: Optional[ChatCompletionResponseUsagePromptTokensDetails]
    total_tokens: Optional[int]

class ChatCompletionResponseUsageCompletionTokensDetails(BaseModel):
    accepted_prediction_tokens: Optional[int]
    rejected_prediction_tokens: Optional[int]
    # NOTE: NO audio_tokens, NO reasoning_tokens (narrower than OpenAI)

class ChatCompletionResponseUsagePromptTokensDetails(BaseModel):
    cached_tokens: Optional[int]
    # NOTE: NO audio_tokens (narrower than OpenAI)

OpenAI-shape token-count fields (Stainless-generated SDK), but the *_tokens_details sub-objects are a strict subset of OpenAI’s: no audio_tokens in either, no reasoning_tokens in completion details. AIPerf’s broader OpenAI-shape coverage is forward-compatible — Cerebras responses simply don’t populate the missing inner keys, and the corresponding metrics raise NoMetricValue rather than crashing.

AI21 Labs

Verified against: AI21Labs/ai21-python / ai21/models/usage_info.py.

class UsageInfo(AI21BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

Minimal OpenAI-shape — only the three baseline fields. No nested details, no cache info, no extras. Already covered.

SambaNova

Verified against: sambanova/sambanova-python / src/sambanova/types/chat/chat_completion_response.py.

The Usage class is unusually rich because SambaNova bakes server-side timing/throughput data directly into the usage envelope:

class Usage(BaseModel):
    # Standard OpenAI token-count fields (already covered):
    prompt_tokens: Optional[int]
    completion_tokens: Optional[int]
    total_tokens: Optional[int]
    prompt_tokens_details: Optional[UsagePromptTokensDetails]
    completion_tokens_details: Optional[UsageCompletionTokensDetails]

    # SambaNova-specific server-side timing (preserved on dict, not modelled):
    acceptance_rate: Optional[float]                                  # speculative-decoding accept rate
    completion_tokens_after_first_per_sec: Optional[float]            # post-TTFT throughput
    completion_tokens_after_first_per_sec_first_ten: Optional[float]  # first-10 post-TTFT throughput
    completion_tokens_after_first_per_sec_graph: Optional[float]      # adjusted for graph rendering
    completion_tokens_per_sec: Optional[float]                        # full-run completion throughput
    end_time: Optional[float]                                         # Unix timestamp seconds
    start_time: Optional[float]                                       # Unix timestamp seconds
    time_to_first_token: Optional[float]                              # TTFT seconds
    time_to_first_token_graph: Optional[float]                        # adjusted TTFT
    total_latency: Optional[float]                                    # full-run latency seconds
    total_tokens_per_sec: Optional[float]                             # full-run throughput
    is_last_response: Optional[Literal[True]]
    stop_reason: Optional[str]

Modelled: all token-count fields via OpenAI synonyms.

Not modelled (preserved on dict): the rich timing/throughput data. AIPerf computes equivalents client-side (TTFTMetric, RequestLatencyMetric, OutputTokenThroughputPerUserMetric, InterTokenLatencyMetric); SambaNova’s server-side measurements are parallel/redundant signals. They could be surfaced as their own metrics if a workflow needed server-vs-client divergence checking.
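
A divergence check could be as simple as comparing the two signals; the snippet below is a sketch with made-up numbers, where client_ttft_s stands in for AIPerf’s own client-side TTFTMetric value:

usage = {"time_to_first_token": 0.201, "total_latency": 1.42}
client_ttft_s = 0.215                               # hypothetical client-side measurement
server_ttft_s = usage.get("time_to_first_token")    # SambaNova-reported, in seconds
if server_ttft_s is not None:
    drift_s = abs(client_ttft_s - server_ttft_s)    # flag when this exceeds a tolerance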

Bailian / DashScope (Alibaba Qwen)

Verified against: dashscope/dashscope-sdk-python / dashscope/api_entities/dashscope_response.py.

@dataclass
class GenerationUsage:                  # text endpoints
    input_tokens: int
    output_tokens: int

@dataclass
class MultiModalConversationUsage:      # multimodal endpoints
    input_tokens: int
    output_tokens: int
    characters: int                     # non-token billing for non-tokenizable inputs

Modelled: input_tokens and output_tokens are already in PROMPT_TOKENS_KEYS / COMPLETION_TOKENS_KEYS (Anthropic-shape synonyms).

Notable absences: no total_tokens field (in either Bailian variant). The total_tokens property returns None for native DashScope responses; callers that need it can compute input_tokens + output_tokens themselves.

Not modelled: characters (multimodal-only). It represents image/audio inputs measured in characters rather than tokens — useful for billing reconciliation but not a standard cross-vendor metric.

Note: Bailian also offers an OpenAI-compatible REST endpoint (compatible-mode) that emits the standard OpenAI shape. AIPerf supports benchmarking either endpoint.

Vertex AI (Gemini)

Verified against: googleapis/python-aiplatform / google/cloud/aiplatform_v1/types/usage_metadata.py (the protobuf message definition).

class UsageMetadata(proto.Message):
    prompt_token_count: int
    candidates_token_count: int
    total_token_count: int
    tool_use_prompt_token_count: int
    thoughts_token_count: int
    cached_content_token_count: int
    prompt_tokens_details: MutableSequence[ModalityTokenCount]
    cache_tokens_details: MutableSequence[ModalityTokenCount]
    candidates_tokens_details: MutableSequence[ModalityTokenCount]
    tool_use_prompt_tokens_details: MutableSequence[ModalityTokenCount]
    traffic_type: TrafficType           # ON_DEMAND or PROVISIONED_THROUGHPUT

The Python proto attributes are snake_case, but Google’s proto JSON serialization emits camelCase on the wire (per the protobuf JSON style: prompt_token_count → promptTokenCount). This matches Gemini Direct’s wire format exactly. Already covered by the existing Gemini synonyms.

The traffic_type enum (ON_DEMAND vs PROVISIONED_THROUGHPUT) is Vertex-specific — useful for cost attribution but not modelled as a metric. Preserved on the dict.

IBM watsonx

Verified against: IBM watsonx text generation API documentation. The IBM/ibm-watsonx-ai GitHub repo I cloned was a stub (README only) and has since been removed (returns 404 as of the verification re-check); the real Python SDK ships only via PyPI / IBM Cloud Pak Foundation Models endpoints, and I did not download it. This vendor is therefore documented from API reference rather than SDK type definitions — flagged here so future maintainers know it’s the lowest-confidence entry in this catalog.

watsonx is the only verified vendor that does not wrap usage in a usage (or equivalent) envelope. Token counts are emitted as response-root fields:

{
  "generated_text": "...",
  "input_token_count": 100,
  "generated_token_count": 50,
  "stop_reason": "eos_token",
  "response_time": 1234,
  "scoring_id": "..."
}

Modelled (added to synonym lists at lowest precedence): input_token_count (in PROMPT_TOKENS_KEYS), generated_token_count (in COMPLETION_TOKENS_KEYS). No total_tokens analog — callers needing it should compute the sum themselves.

Caveat: because watsonx has no usage envelope, an AIPerf parser for watsonx would need to either pass the response-root dict to Usage() directly or pluck out the relevant fields. The synonym lookup handles either approach.
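
Either approach resolves through the same synonym lists; a sketch, with body standing in for the decoded response root and the Usage constructor signature assumed from the description above:

body = {"generated_text": "...", "input_token_count": 100,
        "generated_token_count": 50, "stop_reason": "eos_token"}
usage = Usage(body)   # whole response root; no envelope to unwrap
# or pluck just the counts before constructing:
usage = Usage({k: body[k] for k in ("input_token_count", "generated_token_count")})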

xAI Grok

Verified against: xai-org/xai-sdk-python / src/xai_sdk/chat.py.

xAI offers two APIs: a native gRPC API and an OpenAI-compatible REST endpoint at https://api.x.ai/v1/chat/completions.

The gRPC path exposes additional fields not present in the REST shape:

  • cached_prompt_text_tokens — cache hits (top-level, not nested)
  • reasoning_tokens — top-level (not under completion_tokens_details)
  • prompt_text_tokens, prompt_image_tokens — multimodal input split
  • cost_in_usd_ticks — pricing in micro-cents

AIPerf does not model these because we benchmark via REST endpoints, not gRPC. The REST endpoint is OpenAI-compatible, so xAI usage flows through the existing OpenAI synonyms.

If gRPC-native xAI benchmarking is ever needed, adding the four gRPC field names to the appropriate *_KEYS lists would be a one-line change per field.

Adding a new vendor: checklist

When you encounter a vendor not yet supported:

  1. Find the SDK source for the vendor. Look for the type that wraps the response’s usage field (often called Usage, UsageInfo, CompletionUsage, or similar). If no SDK exists, find the API documentation’s response schema.
  2. Identify the wrapper. Is the usage field at the response root, nested inside usage, nested inside usageMetadata, or in some other envelope? Snake-case or camelCase? If a Python SDK uses Pydantic with alias_generator=to_camel, the wire format is camelCase even though Python sees snake_case.
  3. Map each token-count field to AIPerf’s properties. Look for synonyms of prompt_tokens, completion_tokens, total_tokens, reasoning_tokens, cache reads, cache writes, etc. Add any new field names to the appropriate *_KEYS list in Usage.
  4. Identify any genuinely novel concepts (i.e. fields with no AIPerf-side analog). If they’re token-shaped and useful, add a new BaseUsageRecordMetric subclass in usage_extras_metrics.py (or usage_cache_metrics.py for cache-related) plus a matching DerivedSumMetric total in usage_total_metrics.py. Subclass declarations are 5–10 lines: just tag, header, unit, flags, usage_field, missing_message (a hedged sketch of such a declaration follows this checklist).
  5. If the vendor uses an envelope (like Gemini’s usageMetadata or Cohere’s meta), extend Usage.__init__ to unwrap it. Use setdefault so original keys win on collision.
  6. Add a fixture to tests/unit/common/models/test_usage_models_adversarial.py::VENDOR_FIXTURES with a verbatim payload from the vendor’s docs. Add it to the parametrized basic-token-count test.
  7. Add specific tests for any novel fields the vendor introduces (e.g. cache misses, audio durations, modality breakdowns).
  8. Update this document. Add a row to the quick-reference table and a per-vendor section with the SDK source citation.
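
As a rough illustration of step 4, a new metric declaration might look like the sketch below. The attribute names follow the tag / header / unit / flags / usage_field / missing_message list from that step, but the base-class API, flag values, and exact strings are assumptions; copy an existing subclass from usage_extras_metrics.py rather than this sketch.

class UsageExampleNovelTokensMetric(BaseUsageRecordMetric):
    """Hypothetical metric for a vendor-specific 'example_novel_tokens' usage field."""
    tag = "usage_example_novel_tokens"
    header = "Example Novel Tokens"
    unit = GenericMetricUnit.TOKENS
    flags = MetricFlags.NONE              # placeholder; use the project's real flag enum
    usage_field = "example_novel_tokens"
    missing_message = "endpoint did not return usage.example_novel_tokens"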

Change history

  • 2026-05 — Initial cross-vendor verification. Added support for Gemini usageMetadata, AWS Bedrock camelCase, DeepSeek prompt_cache_hit_tokens/prompt_cache_miss_tokens, Mistral prompt_audio_seconds, Cohere v1 meta and v2 usage envelopes. Three real bugs found and fixed during SDK-source verification: Cohere v1 meta.cached_tokens lift, Cohere v2 envelope (no meta wrapper), Mistral {} sentinel defense.
  • 2026-05 — Second-wave SDK-source verification covering AI21, Cerebras, SambaNova, Bailian/DashScope, Vertex AI, Fireworks, IBM watsonx. Added input_token_count (watsonx) to PROMPT_TOKENS_KEYS and generated_token_count (watsonx) to COMPLETION_TOKENS_KEYS. SambaNova’s rich server-side timing fields catalogued as preserved-on-dict (parallel to client-computed metrics). Bailian’s multimodal characters field catalogued as non-token billing unit. Vertex AI confirmed identical to Gemini direct.