For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
    • Welcome to AIPerf Documentation
  • Getting Started
    • Profiling with AIPerf
    • Comprehensive LLM Benchmarking
    • Migrating from GenAI-Perf
    • GenAI-Perf vs AIPerf CLI Feature Comparison Matrix
  • Tutorials
      • Architecture of AIPerf
      • Metrics Flow
      • Mixins
      • Code Patterns
      • Global Property-Test Invariants
      • Sweep Orchestrator (Dev Reference)
      • YAML Config Future Goals
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogoDocumentation
On this page
  • Architecture Overview
  • Three-Plane Architecture
  • Request Lifecycle
  • Core Components
  • System Controller
  • Dataset Manager
  • Timing Manager
  • Worker Manager
  • Workers
  • Record Processor
  • Records Manager
  • GPU Telemetry Manager
  • Server Metrics Manager
  • How AIPerf Works
  • Credit System & Request Timing
  • Data Flow & Messaging
  • Communication Architecture
  • Why ZMQ?
  • Communication Patterns
  • State Management
  • Design Principles
  • Deployment Modes
  • Configuration Envelope
  • External Dependencies
  • Telemetry Plane
  • OTelMetricsResultsProcessor Registration
  • multiprocessing.Queue and Fanout Process
  • Strategy-Protocol Dispatch via —stream
  • Deferred MLflow Path in ExporterManager
  • Data Flow: Record Telemetry
  • Data Flow: Timing Telemetry
Architecture & Internals

Architecture of AIPerf

||View as Markdown|
Previous

Search History API Reference

Next

Metrics Flow

AIPerf is a modular benchmarking tool for measuring AI inference performance. It generates load against inference endpoints, collects detailed performance metrics, and provides comprehensive analysis of throughput, latency, and resource utilization.

Architecture Overview

AIPerf is designed as a modular, extensible benchmarking framework that separates concerns across three architectural planes. The system scales horizontally as more workers are added while maintaining centralized orchestration.

AIPerf High-Level Architecture

Three-Plane Architecture

PlaneComponentsPurpose
Control PlaneSystemController, Timing Manager, Dataset Manager, Worker ManagerDecides what, when, and how many requests to send
Data PlaneWorkers, Inference ServerExecutes the actual I/O and request/response cycle
Analytic PlaneRecord Processors, Records Manager, GPU Telemetry Manager, Server Metrics ManagerComputes metrics and collects telemetry

Request Lifecycle

  1. Initialization: Dataset Manager loads data, Timing Manager prepares schedule
  2. Warmup (optional): Workers send warmup requests to prime JIT, caches, and connection pools. Results are discarded.
  3. Profiling: Workers receive credits, access data, send requests to inference server
  4. Collection: Workers capture response timing and content
  5. Processing: Record Processors compute metrics in parallel
  6. Aggregation: Records Manager collects and exports results

Core Components

System Controller

The System Controller is the central orchestrator that manages the lifecycle and coordination of all major modules involved in a benchmarking run.

Key Responsibilities:

  • Registering and initializing core components
  • Orchestrating the start, execution, and shutdown of benchmarking tasks
  • Handling configuration, resource allocation, and inter-module communication
  • Monitoring the overall progress and health of the benchmarking process
  • Managing error handling, cleanup, and graceful termination of all modules

Dataset Manager

The Dataset Manager handles all aspects of input data management during benchmarking runs.

Key Responsibilities:

  • Loading datasets from various sources (JSONL, CSV, synthetic generators, trace replay formats)
  • Parsing and validating input data to ensure it matches the expected format
  • Writing dataset to memory-mapped files, enabling workers to access data directly without message passing
  • Supporting custom dataset types, such as MoonCake traces, for advanced benchmarking scenarios
  • Managing the lifecycle of datasets, including initialization, iteration, and cleanup

Timing Manager

The Timing Manager controls and coordinates the timing of requests during benchmarking runs through a credit-based system.

Key Responsibilities:

  • Scheduling when each request should be sent based on the selected timing mode (fixed schedule, request-rate, or user-centric rate)
  • Managing precise timing to accurately reproduce real-world or synthetic load patterns
  • Supporting advanced timing scenarios, such as replaying traces with specific inter-arrival times or simulating bursty traffic
  • Ensuring that requests are dispatched to workers at the correct intervals for reliable measurement

Worker Manager

The Worker Manager orchestrates and manages the pool of worker processes that execute benchmarking tasks.

Key Responsibilities:

  • Coordinating with the system controller to spawn and shut down workers that send requests to the inference server
  • Monitoring worker status, progress, and resource usage
  • Handling worker lifecycle events, such as startup, shutdown, and error recovery
  • Managing worker pool size based on benchmarking requirements

Workers

Workers are the processes that send HTTP requests to the inference server and measure response times.

Key Responsibilities:

  • Send HTTP requests to inference servers and measure response timing
  • Wait for timing credits before sending requests (enables precise load control)
  • Track conversation state for multi-turn interactions
  • Report timing measurements to Record Processors for analysis

Scalability:

  • Run multiple workers (e.g., 10, 50, 100+) to support different workload patterns
  • No coordination between workers
  • Adding more workers increases load capacity and request rates

Record Processor

The Record Processor processes and interprets the responses received from the inference server during benchmarking.

Key Responsibilities:

  • Parsing raw inference results to extract relevant metrics (latency, output tokens, correctness)
  • Handling different response formats from various model endpoints (OpenAI, vLLM, Triton, custom APIs)
  • Validating and normalizing results to ensure consistency across benchmarking runs
  • Computing metrics derived from individual requests (TTFT, ITL, Request Latency, Request Throughput etc.)
  • Supporting error detection and handling for malformed or unexpected responses
  • Scales horizontally to handle high-volume metric computation

Records Manager

The Records Manager handles the collection, organization, and storage of benchmarking records and results.

Key Responsibilities:

  • Aggregating data from the records processors (inference results, timing information, metrics)
  • Storing records in memory and/or exporting them to files (CSV, JSON, Parquet) for later analysis
  • Providing interfaces for querying, filtering, and summarizing benchmarking results
  • Supporting the generation of reports and artifacts for performance evaluation
  • Managing the final export of aggregated performance summaries and per-request details

GPU Telemetry Manager

The GPU Telemetry Manager collects GPU metrics during benchmarking runs via pluggable collectors.

Key Responsibilities:

  • Collecting GPU metrics (power, utilization, memory, temperature, errors) via two collector backends:
    • DCGM: Scrapes DCGM Exporter HTTP endpoints (Prometheus format)
    • PyNVML: Queries NVIDIA GPUs directly via the pynvml Python library (no external endpoint required)
  • Auto-discovering DCGM endpoints
  • Supporting custom endpoints via --gpu-telemetry flag
  • Exporting GPU telemetry alongside benchmark results

Server Metrics Manager

The Server Metrics Manager collects metrics from Prometheus-compatible endpoints during benchmarking runs.

Key Responsibilities:

  • Collecting metrics from Prometheus-compatible endpoints (inference server application metrics, system metrics, custom metrics)
  • Auto-discovering metrics endpoints from configured inference server URLs (--url)
  • Supporting custom Prometheus endpoints via --server-metrics flag
  • Parsing any metrics exposed in Prometheus format (gauges, counters, histograms)
  • Typical metrics collected: inference server KV cache usage, request counts, latencies, batch sizes, model-specific metrics, and server resource metrics
  • Auto-detecting non-Prometheus endpoints (e.g. TRT-LLM serves an iteration-stats JSON array at /metrics by default), probing <base>/prometheus/metrics once as a fallback, and disabling collection for that endpoint after a single warning if neither path yields parseable Prometheus data — see Server Metrics Compatibility & auto-disable
  • Exporting server metrics alongside benchmark results

How AIPerf Works

Credit System & Request Timing

The Timing Manager uses a credit-based flow control system to control when requests are sent. This enables accurate load pattern reproduction and prevents server overload.

How Credits Work:

  • Each credit grants permission to send one request
  • The Timing Manager issues credits according to the configured timing mode:
    • Fixed schedule mode: Replays conversation traces at precise timestamps from dataset metadata
    • Request-rate mode: Issues credits at a specific rate with configurable arrival patterns (constant, Poisson, gamma, concurrency burst)
    • User-centric rate mode: Each session acts as a separate user with calculated gaps between turns

Flow Control Benefits:

  • Prevents overwhelming the inference server
  • Enables precise reproduction of load patterns
  • Provides natural backpressure when the server slows down
  • Allows accurate measurement without artificial delays

Credit Distribution:

  • Credits are routed to workers via ROUTER/DEALER pattern
  • Router selects workers based on sticky sessions (multi-turn conversations) or least-loaded worker selection
  • No coordination required between workers
  • Scales to large numbers of workers without bottlenecks
  • Efficient message routing minimizes overhead

Data Flow & Messaging

This section describes the end-to-end message flow during a benchmark run, showing how data moves between components through the ZMQ message bus.

Data Flow

Key Data Structures:

  • Timing Credit: Grants permission to send one request
  • Dataset Entry: Prompt and conversation context
  • Raw Result: Request timing, tokens, response text
  • Metric Record: Per-request computed metrics plus trace data
  • Aggregated Results: Final performance summary and per-request details

Message Flow:

  1. Credit Router routes credits to workers via ROUTER/DEALER pattern
  2. Workers access dataset entries via memory-mapped files
  3. Workers send requests to Inference Server (external HTTP)
  4. Workers push raw results to Record Processors
  5. Record Processors push metric records to Records Manager
  6. Records Manager aggregates and exports final results

Communication Architecture

AIPerf services communicate internally via a ZeroMQ (ZMQ) message bus, designed for low-latency, high-throughput message passing between components.

Why ZMQ?

AIPerf uses ZMQ to maintain measurement accuracy by decoupling orchestration logic from execution:

  • Low-overhead messaging: Credits are routed directly to workers
  • Asynchronous by design: No blocking calls between services, ensuring workers spend maximum time on I/O and timing
  • Efficient transport: ZMQ is designed for low-overhead inter-process communication
  • Scalability: Supports many local worker processes; Kubernetes is referenced by future-facing code paths, but no Kubernetes service-manager plugin or ServiceRunType is registered in this checkout, and distributed Kubernetes execution is not supported

Communication Patterns

AIPerf uses ZMQ proxies for message routing between services and workers:

  • Services publish strongly-typed messages to specific topics (Pub/Sub pattern)
  • Services subscribe to relevant message types
  • Router/Dealer patterns for credit distribution to workers
  • Request/Reply patterns for synchronous operations

State Management

Stateless design for scalability:

  • Workers: No shared state between workers; each maintains only local conversation context for multi-turn requests
  • Services: All service state is ephemeral and can be reconstructed from configuration
  • Coordination: Credit distribution happens through the message bus; dataset access via memory-mapped files
  • Results: Only aggregated results are persistent (exported to files)

Design Principles

AIPerf is built on three core principles:

  • Separation of Concerns: Control plane orchestrates, workers execute, record processors compute metrics
  • Scalability: Horizontal scaling for workers and processors with credit-based flow control
  • Extensibility: Plugin system for datasets, endpoints, transports, and metrics

Deployment Modes

AIPerf currently supports one local deployment model:

  • Multiprocess Mode: Each service runs as a separate process on a single node (default for single-node deployments)

Kubernetes is referenced by future-facing code paths, but no Kubernetes service-manager plugin or ServiceRunType is registered in this checkout; do not treat Kubernetes distributed execution as supported.

Configuration Envelope

The top-level AIPerfConfig YAML accepts several optional sibling keys alongside the core benchmark: block — sweep:, multi_run:, variables:, random_seed:, and plot:. Each owns a single concern and is loaded by its own Pydantic model.

The plot: envelope describes which plots are rendered after the run. It accepts two forms: a bare-string path reference (e.g. plot: ./plots/baseline.yaml, resolved relative to the AIPerf YAML’s directory) or an inline mapping mirroring src/aiperf/plot/default_plot_config.yaml 1:1. When plot: is set it replaces the ~/.aiperf/plot_config.yaml fallback (the envelope is the spec) and presence implies artifacts.auto_plot=True unless the user explicitly sets it to false. The auto-plot callback materializes the resolved envelope to <artifact_dir>/.aiperf-plot-config.yaml so aiperf plot <dir> later reproduces the same plots without the original YAML. See src/aiperf/config/plot.py for the Pydantic models.

External Dependencies

AIPerf integrates with external systems:

  • Inference Server: The target system being benchmarked (vLLM, Dynamo, SGLang, etc.)
  • DCGM Exporter: Optional GPU telemetry source (exposes GPU metrics in Prometheus format). Alternative: PyNVML queries GPUs directly without an external endpoint.
  • Prometheus-compatible endpoints: Optional server/application metrics source for Server Metrics Manager (inference servers like vLLM expose metrics in Prometheus format at their /metrics endpoint)

Telemetry Plane

The Telemetry Plane provides real-time streaming of benchmark metrics to OpenTelemetry collectors and MLflow tracking servers. It operates as a sidecar to the Analytic Plane, consuming processed results without affecting the core benchmarking pipeline.

OTelMetricsResultsProcessor Registration

OTelMetricsResultsProcessor is a results processor registered with RecordsManager. When --otel-url is set (with --stream controlling which domains are active), the processor is instantiated and added to the Records Manager’s processor chain. It receives every MetricRecordsData and CreditPhaseStats event that flows through the analytic plane, acting as the entry point into the telemetry pipeline.

The processor does not emit metrics directly. Instead, it delegates to strategy objects that decide whether a given result type is relevant and how to transform it into telemetry events.

multiprocessing.Queue and Fanout Process

Telemetry export runs in a dedicated child process (Fanout_Process) to isolate the benchmarking hot path from network I/O to collectors and tracking servers. Communication between the main process and the fanout process uses a bounded multiprocessing.Queue.

Back-pressure policy: When the queue is full, the oldest event is dropped and the put is retried once. A _fanout_dropped_events counter tracks discards so operators can detect saturation without impacting benchmark accuracy.

The fanout process drains the queue with a configurable poll_timeout_sec, batches events through the OTel SDK periodic reader for OTLP export, and optionally forwards metrics to MLflow when --mlflow-tracking-uri is set.

Strategy-Protocol Dispatch via —stream

The --stream flag activates strategy-based dispatch inside OTelMetricsResultsProcessor. Each strategy implements a two-method protocol:

  • supports(result) — returns True if the strategy handles this result type.
  • process(result) — transforms the result into one or more telemetry events pushed to the queue.

Two built-in strategies ship with the system:

StrategyTriggers onEmits
MetricResultsStrategyMetricRecordsDataHistogram record events (latency, throughput)
TimingResultsStrategyCreditPhaseStatsCounter add / UpDownCounter add events

Strategies are evaluated in registration order; the first whose supports returns True wins.

Deferred MLflow Path in ExporterManager

Post-run artifact upload follows a deferred export path. MLflowDataExporter is registered as a deferred exporter in ExporterManager so it runs after all local exporters (JSON, CSV, Parquet) have written their files. This guarantees that every artifact is available on disk before the upload begins.

The deferred path handles run identity, metric logging, and artifact upload in a single pass:

Data Flow: Record Telemetry

The following diagram shows how per-request metrics flow from workers through the analytic plane into the telemetry pipeline:

Data Flow: Timing Telemetry

Credit-phase timing events follow a similar path but use counters and up-down counters rather than histograms: