
Copyright © 2026, NVIDIA Corporation.


GPU Telemetry with AIPerf


This guide shows you how to collect GPU metrics (power, utilization, memory, temperature, etc.) during AIPerf benchmarking. GPU telemetry provides insights into GPU performance and resource usage while running inference workloads.

Overview

This guide covers four setup paths depending on your inference backend and requirements:

Path 1: Dynamo (Built-in DCGM)

Dynamo ships with DCGM pre-configured on port 9401, so no additional setup is needed. Pass the --gpu-telemetry flag to enable console display and, optionally, to add further DCGM exporter URLs. URLs can be specified with or without the http:// prefix (e.g., localhost:9400 or http://localhost:9400).

Path 2: Other Inference Servers (Custom DCGM)

If you’re using any other inference backend, you’ll need to set up DCGM separately.

Path 3: Local GPU Monitoring (pynvml)

If you want simple local GPU monitoring without DCGM, use --gpu-telemetry pynvml. This uses NVIDIA’s nvidia-ml-py Python library (commonly known as pynvml) to collect metrics directly from the GPU driver. No HTTP endpoints or additional containers required.

Path 4: AMD ROCm GPUs (amdsmi)

If you’re benchmarking against an inference server running on AMD ROCm GPUs (Instinct MI300X, MI355X, etc.), use --gpu-telemetry amdsmi. This uses the amdsmi Python bindings shipped with ROCm to collect metrics directly from the AMD driver. No HTTP endpoints required. Install the bindings via pip install /opt/rocm/share/amd_smi/amdsmi-*.whl if not already present (they ship with ROCm).

Prerequisites

  • NVIDIA GPU with CUDA support, or AMD GPU with ROCm 6.x/7.x
  • Docker installed and configured

Understanding GPU Telemetry in AIPerf

AIPerf provides GPU telemetry collection with the --gpu-telemetry flag. Here’s how it works:

How the --gpu-telemetry Flag Works

| Usage | Command | What Gets Collected (If Available) | Console Display | Dashboard View | CSV/JSON Export |
| --- | --- | --- | --- | --- | --- |
| No flag | aiperf profile --model MODEL ... | http://localhost:9400/metrics + http://localhost:9401/metrics | ❌ No | ❌ No | ✅ Yes |
| Flag only | aiperf profile --model MODEL ... --gpu-telemetry | http://localhost:9400/metrics + http://localhost:9401/metrics | ✅ Yes | ❌ No | ✅ Yes |
| Dashboard mode | aiperf profile --model MODEL ... --gpu-telemetry dashboard | http://localhost:9400/metrics + http://localhost:9401/metrics | ✅ Yes | ✅ Yes (see dashboard) | ✅ Yes |
| Custom URLs | aiperf profile --model MODEL ... --gpu-telemetry node1:9400 http://node2:9400/metrics | http://localhost:9400/metrics + http://localhost:9401/metrics + custom URLs | ✅ Yes | ❌ No | ✅ Yes |
| Dashboard + URLs | aiperf profile --model MODEL ... --gpu-telemetry dashboard localhost:9400 | http://localhost:9400/metrics + http://localhost:9401/metrics + custom URLs | ✅ Yes | ✅ Yes (see dashboard) | ✅ Yes |
| Custom metrics | aiperf profile --model MODEL ... --gpu-telemetry custom_gpu_metrics.csv | http://localhost:9400/metrics + http://localhost:9401/metrics + custom metrics from CSV | ✅ Yes | ❌ No | ✅ Yes |
| pynvml mode | aiperf profile --model MODEL ... --gpu-telemetry pynvml | Local GPUs via pynvml library (see pynvml section) | ✅ Yes | ❌ No | ✅ Yes |
| pynvml + dashboard | aiperf profile --model MODEL ... --gpu-telemetry pynvml dashboard | Local GPUs via pynvml library | ✅ Yes | ✅ Yes (see dashboard) | ✅ Yes |
| amdsmi mode | aiperf profile --model MODEL ... --gpu-telemetry amdsmi | Local AMD ROCm GPUs via amdsmi library | ✅ Yes | ❌ No | ✅ Yes |
| Disabled | aiperf profile --model MODEL ... --no-gpu-telemetry | None | ❌ No | ❌ No | ❌ No |

DCGM mode (default): The default endpoints http://localhost:9400/metrics and http://localhost:9401/metrics are always attempted for telemetry collection, regardless of whether the --gpu-telemetry flag is used. The flag primarily controls whether metrics are displayed on the console and allows you to specify additional custom DCGM exporter endpoints.

pynvml mode: When using --gpu-telemetry pynvml, DCGM endpoints are NOT used. Metrics are collected directly from local GPUs via the nvidia-ml-py library.

amdsmi mode: When using --gpu-telemetry amdsmi, DCGM endpoints are NOT used. Metrics are collected directly from local AMD GPUs via the amdsmi library and emitted under vendor-namespaced amd_* field names (amd_power, amd_gfx_activity, amd_temperature, etc.) rather than NVML-shaped names. On Instinct datacenter parts amd_mm_activity is generally absent (sensor returns 'N/A'); amd_throttle_status is a 0.0/1.0 snapshot per scrape (amdsmi exposes a boolean state, not a duration counter).

To completely disable GPU telemetry collection, use --no-gpu-telemetry.

When specifying custom DCGM exporter URLs, the http:// prefix is optional. URLs like localhost:9400 will automatically be treated as http://localhost:9400. Both formats work identically.
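
The prefix handling described above can be illustrated with a short sketch. The helper name `normalize_endpoint` is hypothetical and is not taken from AIPerf's code; it only mirrors the documented behavior of treating a bare `host:port` as `http://host:port`.

```python
def normalize_endpoint(url: str) -> str:
    """Add an http:// scheme when none was given (illustrative helper only)."""
    url = url.strip()
    if url.startswith(("http://", "https://")):
        return url
    return f"http://{url}"


if __name__ == "__main__":
    for raw in ["localhost:9400", "http://localhost:9400", "node2:9400"]:
        print(raw, "->", normalize_endpoint(raw))
```

Both spellings of the same endpoint resolve to one URL, which is why the two formats can be mixed freely on the command line.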

For simple local GPU monitoring without DCGM setup, use --gpu-telemetry pynvml. This collects metrics directly from the NVIDIA driver using the nvidia-ml-py library. See Path 3: pynvml for details.

Real-Time Dashboard View

Adding dashboard to the --gpu-telemetry flag enables a live terminal UI (TUI) that displays GPU metrics in real-time during your benchmark runs:

aiperf profile --model MODEL ... --gpu-telemetry dashboard

1: Using Dynamo

Dynamo includes DCGM out of the box on port 9401 - no extra setup needed!

Setup Dynamo Server

# Set environment variables
export AIPERF_REPO_TAG="main"
export DYNAMO_PREBUILT_IMAGE_TAG="nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
export MODEL="Qwen/Qwen3-0.6B"

# Download the Dynamo container
docker pull ${DYNAMO_PREBUILT_IMAGE_TAG}
export DYNAMO_REPO_TAG=$(docker run --rm --entrypoint "" ${DYNAMO_PREBUILT_IMAGE_TAG} cat /workspace/version.txt | cut -d'+' -f2)

# Start up required services
curl -O https://raw.githubusercontent.com/ai-dynamo/dynamo/${DYNAMO_REPO_TAG}/deploy/docker-compose.yml
docker compose -f docker-compose.yml down || true
docker compose -f docker-compose.yml up -d

# Launch Dynamo in the background
docker run \
  --rm \
  --gpus all \
  --network host \
  ${DYNAMO_PREBUILT_IMAGE_TAG} \
  /bin/bash -c "python3 -m dynamo.frontend & python3 -m dynamo.vllm --model ${MODEL} --enforce-eager --no-enable-prefix-caching" > server.log 2>&1 &

# Set up AIPerf
docker run \
  -it \
  --rm \
  --gpus all \
  --network host \
  -e AIPERF_REPO_TAG=${AIPERF_REPO_TAG} \
  -e MODEL=${MODEL} \
  ubuntu:24.04

# Run the remaining commands inside the container started above
apt update && apt install -y curl git

curl -LsSf https://astral.sh/uv/install.sh | sh

source $HOME/.local/bin/env

uv venv --python 3.10

source .venv/bin/activate

git clone -b ${AIPERF_REPO_TAG} --depth 1 https://github.com/ai-dynamo/aiperf.git

uv pip install ./aiperf

Verify Dynamo is Running

# Wait for Dynamo API to be ready (up to 15 minutes)
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"a\"}],\"max_completion_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "Dynamo not ready after 15min"; exit 1; }

# Wait for DCGM Exporter to be ready (up to 2 minutes after Dynamo is ready)
echo "Dynamo ready, waiting for DCGM metrics to be available..."
timeout 120 bash -c 'while true; do STATUS=$(curl -s -o /dev/null -w "%{http_code}" localhost:9401/metrics); if [ "$STATUS" = "200" ]; then if curl -s localhost:9401/metrics | grep -q "DCGM_FI_DEV_GPU_UTIL"; then break; fi; fi; echo "Waiting for DCGM metrics..."; sleep 5; done' || { echo "GPU utilization metrics not found after 2min"; exit 1; }
echo "DCGM GPU metrics are now available"
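
The readiness loops above follow a poll-until-ready-or-timeout pattern. For cases where the check is scripted outside the shell, the same pattern might be sketched in Python as follows. `wait_until` and `fake_probe` are hypothetical names used only for illustration; a real probe would issue the HTTP request shown in the shell commands.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        # Give up if another sleep would carry us past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for an HTTP probe: reports ready on the third attempt.
    attempts = {"n": 0}

    def fake_probe() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    ready = wait_until(fake_probe, timeout_s=1.0, interval_s=0.01)
    print("ready" if ready else "timed out")
```

The two shell loops correspond to `timeout_s=900, interval_s=2` for the inference API and `timeout_s=120, interval_s=5` for the DCGM exporter.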

Run AIPerf Benchmark

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --synthetic-input-tokens-mean 100 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:200 \
  --extra-inputs ignore_eos:true \
  --concurrency 4 \
  --request-count 64 \
  --warmup-request-count 1 \
  --num-dataset-entries 8 \
  --random-seed 100 \
  --gpu-telemetry

Sample Output (Successful Run):

INFO Starting AIPerf System
INFO AIPerf System is PROFILING
Profiling: 64/64 |████████████████████████| 100% [00:45<00:00]
INFO Benchmark completed successfully
NVIDIA AIPerf | GPU Telemetry Summary
1/1 DCGM endpoints reachable
• localhost:9401 ✔
localhost:9401 | GPU 0 | NVIDIA H100 80GB HBM3
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃ Metric                       ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│ GPU Power Usage (W)          │   348.69 │   120.57 │   386.02 │   386.02 │   386.02 │   378.34 │ 85.97 │
│ Energy Consumption (MJ)      │     0.24 │     0.23 │     0.25 │     0.25 │     0.25 │     0.23 │  0.01 │
│ GPU Utilization (%)          │    45.82 │     0.00 │    66.00 │    66.00 │    66.00 │    66.00 │ 24.52 │
│ Memory Copy Utilization (%)  │    21.10 │     0.00 │    29.00 │    29.00 │    29.00 │    29.00 │ 10.11 │
│ GPU Memory Used (GB)         │    92.70 │    92.70 │    92.70 │    92.70 │    92.70 │    92.70 │  0.00 │
│ GPU Memory Free (GB)         │     9.39 │     9.39 │     9.39 │     9.39 │     9.39 │     9.39 │  0.00 │
│ SM Clock Frequency (MHz)     │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │  0.00 │
│ Memory Clock Frequency (MHz) │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │  0.00 │
│ Memory Temperature (°C)      │    45.99 │    41.00 │    48.00 │    48.00 │    48.00 │    46.00 │  2.08 │
│ GPU Temperature (°C)         │    38.87 │    33.00 │    41.00 │    41.00 │    41.00 │    39.00 │  2.38 │
│ XID Errors (count)           │     0.00 │     0.00 │     0.00 │     0.00 │     0.00 │     0.00 │  0.00 │
└──────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────┘
CLI Command: aiperf profile --model "Qwen/Qwen3-0.6B" --endpoint-type "chat" ...
JSON Export: artifacts/Qwen_Qwen3-0.6B-chat-concurrency4/profile_export_aiperf.json
GPU Telemetry: artifacts/Qwen_Qwen3-0.6B-chat-concurrency4/gpu_telemetry_export.json

The GPU telemetry table summarizes the metrics collected from DCGM during the benchmark. Each GPU is listed separately, with its metrics aggregated over the benchmark duration.

To watch these metrics live, add the dashboard keyword (--gpu-telemetry dashboard), which enables a terminal UI for real-time GPU telemetry visualization. Press 5 to maximize the GPU Telemetry panel during the benchmark run.


2: Using Other Inference Servers

This path works with vLLM, SGLang, TRT-LLM, or any inference server. We’ll use vLLM as an example.

Setup vLLM Server with DCGM

The setup includes three steps: creating a custom metrics configuration, starting the DCGM Exporter, and launching the vLLM server.

# Step 1: Create a custom metrics configuration
cat > custom_gpu_metrics.csv << 'EOF'
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz)
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz)

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in °C)
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in °C)

# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W)
DCGM_FI_DEV_POWER_MGMT_LIMIT, gauge, Power management limit (in W)
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ)

# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB)
DCGM_FI_DEV_FB_TOTAL, gauge, Total framebuffer memory (in MiB)
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB)

# Utilization
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory copy utilization (in %)

# Errors and Violations
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered
DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us)
DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us)
EOF

# Step 2: Start DCGM Exporter container (forwards port 9400 → 9401)
export DCGM_EXPORTER_IMAGE="nvcr.io/nvidia/k8s/dcgm-exporter:4.2.0-4.1.0-ubuntu22.04"

docker run -d --name dcgm-exporter \
  --gpus all \
  --cap-add SYS_ADMIN \
  -p 9401:9400 \
  -v "$PWD/custom_gpu_metrics.csv:/etc/dcgm-exporter/custom.csv" \
  -e DCGM_EXPORTER_INTERVAL=33 \
  ${DCGM_EXPORTER_IMAGE} \
  -f /etc/dcgm-exporter/custom.csv

# Wait for DCGM to start
sleep 10

# Step 3: Start vLLM Inference Server
export MODEL="Qwen/Qwen3-0.6B"

docker pull vllm/vllm-openai:latest

docker run -d --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-0.6B \
  --host 0.0.0.0 \
  --port 8000

You can customize the custom_gpu_metrics.csv file by commenting out metrics you don’t need. Lines starting with # are ignored.
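
To make the file format concrete, here is a minimal sketch of how such a configuration could be parsed. `MetricSpec` and `parse_metrics_config` are hypothetical names, and this is not how dcgm-exporter or AIPerf actually read the file; it only demonstrates the three-field layout and the comment rule described above.

```python
from typing import NamedTuple


class MetricSpec(NamedTuple):
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_GPU_UTIL
    metric_type: str  # Prometheus metric type: gauge or counter
    help_text: str    # Free-form description


def parse_metrics_config(text: str) -> list[MetricSpec]:
    """Parse 'FIELD, type, help' lines, skipping blank lines and '#' comments."""
    specs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only; the help text may contain commas.
        field, metric_type, help_text = (part.strip() for part in line.split(",", 2))
        specs.append(MetricSpec(field, metric_type, help_text))
    return specs


SAMPLE = """\
# Utilization
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %)
# DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory copy utilization (in %)
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W)
"""

if __name__ == "__main__":
    for spec in parse_metrics_config(SAMPLE):
        print(spec.field, spec.metric_type)
```

In the sample, the commented-out memory copy line is skipped, so only the two active fields are returned.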

Key Configuration:

  • -p 9401:9400 - Forward container’s port 9400 to host’s port 9401 (AIPerf’s default)
  • -e DCGM_EXPORTER_INTERVAL=33 - Collect metrics every 33ms for fine-grained profiling
  • -v custom_gpu_metrics.csv:... - Mount your custom metrics configuration

# Set up AIPerf
export AIPERF_REPO_TAG="main"

docker run \
  -it \
  --rm \
  --gpus all \
  --network host \
  -e AIPERF_REPO_TAG=${AIPERF_REPO_TAG} \
  -e MODEL=${MODEL} \
  ubuntu:24.04

apt update && apt install -y curl git

curl -LsSf https://astral.sh/uv/install.sh | sh

source $HOME/.local/bin/env

uv venv --python 3.10

source .venv/bin/activate

git clone -b ${AIPERF_REPO_TAG} --depth 1 https://github.com/ai-dynamo/aiperf.git

uv pip install ./aiperf

Replace the vLLM command above with your preferred backend (SGLang, TRT-LLM, etc.). The DCGM setup works with any server.

Verify Everything is Running

# Wait for vLLM inference server to be ready (up to 15 minutes)
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"Qwen/Qwen3-0.6B\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "vLLM not ready after 15min"; exit 1; }

# Wait for DCGM Exporter metrics to be available (up to 2 minutes after vLLM is ready)
echo "vLLM ready, waiting for DCGM metrics to be available..."
timeout 120 bash -c 'while true; do OUTPUT=$(curl -s localhost:9401/metrics); if echo "$OUTPUT" | grep -q "DCGM_FI_DEV_GPU_UTIL"; then break; fi; echo "Waiting for DCGM metrics..."; sleep 5; done' || { echo "GPU utilization metrics not found after 2min"; exit 1; }
echo "DCGM GPU metrics are now available"
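
The check above only greps for the metric name. If you also want the values, the exporter's Prometheus text output can be parsed with a few lines of Python. The sketch below is illustrative: `gpu_values` is a hypothetical helper, and the sample payload is hand-written with abbreviated labels rather than captured from a real exporter.

```python
import re

# Matches sample lines such as: NAME{label="value",...} 42.0
SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpu_values(payload: str, metric: str) -> dict[str, float]:
    """Return {gpu label: value} for one metric in a Prometheus text payload."""
    values = {}
    for line in payload.splitlines():
        if line.startswith("#"):  # HELP/TYPE comment lines
            continue
        match = SAMPLE_LINE.match(line.strip())
        if match and match["name"] == metric:
            labels = dict(LABEL.findall(match["labels"]))
            values[labels.get("gpu", "?")] = float(match["value"])
    return values


# Illustrative payload; real dcgm-exporter output carries more labels.
PAYLOAD = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %)
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 66
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa"} 348.7
'''

if __name__ == "__main__":
    print(gpu_values(PAYLOAD, "DCGM_FI_DEV_GPU_UTIL"))
```

An empty result for DCGM_FI_DEV_GPU_UTIL corresponds to the "Waiting for DCGM metrics..." branch of the shell loop.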

Run AIPerf Benchmark

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --synthetic-input-tokens-mean 100 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:200 \
  --extra-inputs ignore_eos:true \
  --concurrency 4 \
  --request-count 64 \
  --warmup-request-count 1 \
  --num-dataset-entries 8 \
  --random-seed 100 \
  --gpu-telemetry

To watch these metrics live, add the dashboard keyword (--gpu-telemetry dashboard), which enables a terminal UI for real-time GPU telemetry visualization. Press 5 to maximize the GPU Telemetry panel during the benchmark run.


3: Using pynvml (Local GPU Monitoring)

For simple local GPU monitoring without DCGM infrastructure, AIPerf supports direct GPU metrics collection using NVIDIA’s nvidia-ml-py Python library (commonly known as pynvml). This approach requires no additional containers, HTTP endpoints, or DCGM setup.

Prerequisites

  • NVIDIA GPU with driver installed
  • nvidia-ml-py package: pip install nvidia-ml-py

When to Use pynvml

| Scenario | Recommended Approach |
| --- | --- |
| Local development/testing | pynvml |
| Single-node inference server | pynvml or DCGM |
| Multi-node distributed setup | DCGM (HTTP endpoints required) |
| Production with existing DCGM | DCGM |
| Quick GPU monitoring without setup | pynvml |

Run AIPerf with pynvml

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --synthetic-input-tokens-mean 100 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:200 \
  --extra-inputs ignore_eos:true \
  --concurrency 4 \
  --request-count 64 \
  --warmup-request-count 1 \
  --num-dataset-entries 8 \
  --random-seed 100 \
  --gpu-telemetry pynvml

Add dashboard after pynvml for the real-time terminal UI: --gpu-telemetry pynvml dashboard

Metrics Collected via pynvml

The nvidia-ml-py library (pynvml) collects the following metrics directly from the NVIDIA driver:

| Metric | Description | Unit |
| --- | --- | --- |
| GPU Power Usage | Current power draw | W |
| Energy Consumption | Total energy since boot | MJ |
| GPU Utilization | GPU compute utilization | % |
| Memory Utilization | Memory controller utilization | % |
| GPU Memory Used | Framebuffer memory in use | GB |
| GPU Temperature | GPU die temperature | °C |
| SM Utilization | Streaming multiprocessor utilization | % |
| Decoder Utilization | Video decoder utilization | % |
| Encoder Utilization | Video encoder utilization | % |
| JPEG Utilization | JPEG decoder utilization | % |
| Power Violation | Throttling duration due to power limits | µs |

Not all metrics are available on all GPU models. AIPerf gracefully handles missing metrics and reports only what the hardware supports.
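
For a sense of what driver-level collection looks like, the following sketch reads a few of these metrics with nvidia-ml-py directly. It is not AIPerf's collector: `read_gpu_snapshot` and the dictionary keys are hypothetical, and the unit conversions (mW to W, bytes to GB using 1e9) are assumptions chosen to match the units listed above. Each query is wrapped separately to mirror the "report only what the hardware supports" behavior.

```python
def read_gpu_snapshot() -> list[dict]:
    """Return one dict of driver-level readings per local NVIDIA GPU.

    Returns an empty list when nvidia-ml-py or a usable driver is missing.
    """
    try:
        import pynvml  # module provided by the nvidia-ml-py package
        pynvml.nvmlInit()
    except Exception:
        return []

    snapshot = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            reading = {"gpu_index": index}
            queries = {
                # NVML reports power in milliwatts
                "power_w": lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
                "gpu_util_pct": lambda: pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,
                # NVML reports memory in bytes
                "mem_used_gb": lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e9,
                "temp_c": lambda: pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),
            }
            for name, query in queries.items():
                try:
                    reading[name] = query()
                except pynvml.NVMLError:
                    pass  # metric unsupported on this GPU: leave the key out
            snapshot.append(reading)
    finally:
        pynvml.nvmlShutdown()
    return snapshot


if __name__ == "__main__":
    for gpu in read_gpu_snapshot():
        print(gpu)
```

On a machine without an NVIDIA driver or without the package installed, the function returns an empty list instead of raising.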

Comparing DCGM vs pynvml

| Feature | DCGM | pynvml |
| --- | --- | --- |
| Setup complexity | Requires container/service | Just install nvidia-ml-py Python package |
| Multi-node support | Yes (via HTTP endpoints) | No (local only) |
| Metrics granularity | High (profiling-level metrics) | Standard (driver-level metrics) |
| Kubernetes integration | Native with dcgm-exporter | Not applicable |
| XID error reporting | Yes | No |

4: Using amdsmi (Local AMD ROCm GPU Monitoring)

For inference workloads on AMD Instinct GPUs (MI300X, MI355X, etc.), use --gpu-telemetry amdsmi. This collects metrics directly from local AMD GPUs via the amdsmi Python library shipped with ROCm.

When to Use amdsmi

  • Benchmarking against vLLM-ROCm, SGLang-ROCm, TGI, or any ROCm-backed inference server running on the same machine as AIPerf.
  • Local single-node monitoring with no need for HTTP exporters.

Run AIPerf with amdsmi

aiperf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint-type chat \
  --url http://localhost:8000 \
  --concurrency 16 --request-count 200 \
  --gpu-telemetry amdsmi

Metrics Collected via amdsmi

AMD signals are emitted under their own vendor-namespaced field names (not aliased onto NVML-shaped names) because the underlying sensors do not always measure the same physical quantity (e.g. amdsmi gfx_activity and NVML sm_utilization sample at different scopes).

| Metric | Source | Notes |
| --- | --- | --- |
| amd_power (W) | amdsmi_get_power_info().current_socket_power | Already in W; no scaling. Falls back to average_socket_power if current_socket_power is 'N/A'. |
| amd_energy_consumption (MJ) | amdsmi_get_energy_count() | accumulator * counter_resolution (µJ) → MJ. Cumulative counter: accumulator computes a delta against the pre-profile baseline. Reads energy_accumulator first; falls back to the older power field name on ROCm < 6.2. |
| amd_gfx_activity (%) | amdsmi_get_gpu_activity().gfx_activity | Graphics engine activity. |
| amd_umc_activity (%) | amdsmi_get_gpu_activity().umc_activity | Memory controller activity. |
| amd_mm_activity (%) | amdsmi_get_gpu_activity().mm_activity | Multimedia engine activity. Generally 'N/A' on Instinct datacenter GPUs; the field will be absent rather than emitted as zero. |
| amd_memory_used (GB) | amdsmi_get_gpu_memory_usage(VRAM) | bytes → GB. |
| amd_temperature (°C) | amdsmi_get_temp_metric(JUNCTION) | Falls back to HOTSPOT. EDGE is unsupported on Instinct GPUs. Unit conversion is gated on amdsmi.__version__ (≥ 26.x already returns Celsius; older bindings return millidegrees and are divided by 1000). |
| amd_ecc_uncorrectable | amdsmi_get_gpu_total_ecc_count().uncorrectable_count | Cumulative uncorrectable ECC error count. Counter: accumulator computes a delta. |
| amd_throttle_status | amdsmi_get_gpu_metrics_info().throttle_status (and indep_throttle_status) | 0.0/1.0 snapshot per scrape: 1.0 if any throttle indicator is active. amdsmi exposes a state (bool/bitfield), not a duration counter; a fraction-throttled summary can be derived from the average. Field is left absent when both signals return 'N/A' (sensor unsupported), so “unsupported” is not silently reported as “not throttled”. |
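
Two of the conversions in the table are simple arithmetic and can be written out as a sketch. The function names and the numeric inputs below are made up for illustration and do not come from AIPerf or from a real device; only the formulas (accumulator × resolution in µJ converted to MJ, a delta against a baseline, and the mean of 0.0/1.0 snapshots) follow the notes above.

```python
from typing import Optional


def energy_mj(accumulator: int, counter_resolution_uj: float) -> float:
    """Convert a raw energy accumulator reading to megajoules (1 MJ = 1e12 µJ)."""
    return accumulator * counter_resolution_uj / 1e12


def energy_delta_mj(baseline_mj: float, current_mj: float) -> float:
    """Energy consumed since the pre-profile baseline reading."""
    return current_mj - baseline_mj


def throttled_fraction(snapshots: list[float]) -> Optional[float]:
    """Mean of 0.0/1.0 throttle snapshots, or None when no samples exist.

    None keeps "sensor unsupported" distinct from "never throttled" (0.0).
    """
    if not snapshots:
        return None
    return sum(snapshots) / len(snapshots)


if __name__ == "__main__":
    # Hypothetical readings: 15.3 µJ per count, two accumulator samples.
    before = energy_mj(1_000_000_000, 15.3)
    after = energy_mj(1_200_000_000, 15.3)
    print(f"energy used: {energy_delta_mj(before, after):.5f} MJ")
    print("throttled fraction:", throttled_fraction([0.0, 0.0, 1.0, 1.0]))
```

Averaging the per-scrape flag gives the share of scrapes during which a throttle indicator was active, which is the closest available stand-in for a duration counter.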

Comparing DCGM vs pynvml vs amdsmi

| Feature | DCGM | pynvml | amdsmi |
| --- | --- | --- | --- |
| Hardware | NVIDIA | NVIDIA | AMD ROCm |
| Setup complexity | Requires container/service | pip install nvidia-ml-py | Ships with ROCm; install wheel from /opt/rocm/share/amd_smi/ if missing |
| Multi-node support | Yes (HTTP) | No (local) | No (local) |
| Field naming | gpu_* (NVML-shaped) | gpu_* (NVML-shaped) | amd_* (vendor-namespaced) |
| Encoder/decoder util | Yes | Yes | No (Instinct GPUs report 'N/A') |
| Error reporting | XID errors | (none) | ECC uncorrectable count (amd_ecc_uncorrectable) |
| SM-level utilization | Yes (DCGM_FI_PROF_SM_ACTIVE) | Yes (GPM API) | Aliased to gfx_activity |

Multi-Node GPU Telemetry Example

For distributed setups with multiple nodes, you can collect GPU telemetry from all nodes simultaneously:

# Example: Collecting telemetry from 3 nodes in a distributed setup
# Note: The default endpoints http://localhost:9400/metrics and http://localhost:9401/metrics
# are always attempted in addition to these custom URLs
# URLs can be specified with or without the http:// prefix
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --url localhost:8000 \
  --synthetic-input-tokens-mean 100 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0 \
  --extra-inputs min_tokens:200 \
  --extra-inputs ignore_eos:true \
  --concurrency 4 \
  --request-count 64 \
  --warmup-request-count 1 \
  --num-dataset-entries 8 \
  --random-seed 100 \
  --gpu-telemetry node1:9400 node2:9400 http://node3:9400/metrics

This will collect GPU metrics from:

  • http://localhost:9400/metrics (default, always attempted)
  • http://localhost:9401/metrics (default, always attempted)
  • http://node1:9400 (custom node 1, normalized from node1:9400)
  • http://node2:9400 (custom node 2, normalized from node2:9400)
  • http://node3:9400/metrics (custom node 3)

All metrics are displayed on the console and saved to the output CSV and JSON files, with GPU indices and hostnames distinguishing metrics from different nodes.

Customizing Displayed Metrics

You can customize which GPU metrics are displayed in AIPerf by creating a custom metrics CSV file and passing it to --gpu-telemetry:

aiperf profile --model MODEL ... --gpu-telemetry custom_gpu_metrics.csv

aiperf profile --model MODEL ... --gpu-telemetry localhost:9400 dashboard custom_gpu_metrics.csv

Custom Metrics CSV Format

The CSV format is identical to DCGM exporter configuration. See the vLLM setup section above (Step 1: Create a custom metrics configuration) for the complete CSV format example with all available DCGM fields.

Behavior: Custom metrics extend (not replace) the 7 core default metrics:

  • GPU Power Usage
  • Energy Consumption
  • GPU Utilization
  • GPU Memory Used
  • GPU Temperature
  • XID Errors
  • Power Violation

The file path can be absolute or relative. Use .csv extension so AIPerf can distinguish it from DCGM endpoint URLs.
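
Because --gpu-telemetry accepts keywords, endpoint URLs, and a CSV path in the same argument list, the extension rule above is what keeps them apart. The sketch below shows one way such values could be sorted. It is a hypothetical illustration, not AIPerf's argument parser; the keyword set is taken from the modes described in this guide.

```python
def classify_telemetry_args(values: list[str]) -> dict[str, list[str]]:
    """Group --gpu-telemetry values into keywords, CSV files, and endpoint URLs."""
    keywords = {"dashboard", "pynvml", "amdsmi"}
    groups: dict[str, list[str]] = {"keywords": [], "csv_files": [], "endpoints": []}
    for value in values:
        if value in keywords:
            groups["keywords"].append(value)
        elif value.lower().endswith(".csv"):
            # The .csv extension marks a custom metrics file
            groups["csv_files"].append(value)
        else:
            # Anything else is treated as a DCGM exporter endpoint
            groups["endpoints"].append(value)
    return groups


if __name__ == "__main__":
    args = ["localhost:9400", "dashboard", "custom_gpu_metrics.csv"]
    print(classify_telemetry_args(args))
```

Under this scheme a metrics file without the .csv extension would be mistaken for an endpoint, which is the failure the naming advice guards against.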

You can start with the example CSV from the vLLM setup section and customize it by commenting out metrics you don’t need or adding new DCGM metrics.

Example Console Display:

NVIDIA AIPerf | GPU Telemetry Summary
1/1 DCGM endpoints reachable
• localhost:9401 ✔
localhost:9401 | GPU 0 | NVIDIA H100 80GB HBM3
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃ Metric                       ┃      avg ┃      min ┃      max ┃      p99 ┃      p90 ┃      p50 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│ GPU Power Usage (W)          │   348.69 │   120.57 │   386.02 │   386.02 │   386.02 │   378.34 │ 85.97 │
│ Energy Consumption (MJ)      │     0.24 │     0.23 │     0.25 │     0.25 │     0.25 │     0.23 │  0.01 │
│ GPU Utilization (%)          │    45.82 │     0.00 │    66.00 │    66.00 │    66.00 │    66.00 │ 24.52 │
│ Memory Copy Utilization (%)  │    21.10 │     0.00 │    29.00 │    29.00 │    29.00 │    29.00 │ 10.11 │
│ GPU Memory Used (GB)         │    92.70 │    92.70 │    92.70 │    92.70 │    92.70 │    92.70 │  0.00 │
│ GPU Memory Free (GB)         │     9.39 │     9.39 │     9.39 │     9.39 │     9.39 │     9.39 │  0.00 │
│ SM Clock Frequency (MHz)     │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │ 1,980.00 │  0.00 │
│ Memory Clock Frequency (MHz) │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │ 2,619.00 │  0.00 │
│ Memory Temperature (°C)      │    45.99 │    41.00 │    48.00 │    48.00 │    48.00 │    46.00 │  2.08 │
│ GPU Temperature (°C)         │    38.87 │    33.00 │    41.00 │    41.00 │    41.00 │    39.00 │  2.38 │
│ XID Errors (count)           │     0.00 │     0.00 │     0.00 │     0.00 │     0.00 │     0.00 │  0.00 │
└──────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴───────┘

Example CSV Export

Endpoint,GPU_Index,GPU_Name,GPU_UUID,Metric,avg,min,max,p1,p5,p10,p25,p50,p75,p90,p95,p99,std
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Power Usage (W),348.69,120.57,386.02,120.57,120.57,,378.34,378.34,386.02,386.02,386.02,386.02,85.97
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,Energy Consumption (MJ),0.24,0.23,0.25,0.23,0.23,,0.23,0.23,0.25,0.25,0.25,0.25,0.01
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Utilization (%),45.82,0.00,66.00,0.00,0.00,,27.00,66.00,66.00,66.00,66.00,66.00,24.52
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,Memory Copy Utilization (%),21.10,0.00,29.00,0.00,0.00,,15.00,29.00,29.00,29.00,29.00,29.00,10.11
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Memory Used (GB),92.70,92.70,92.70,92.70,92.70,,92.70,92.70,92.70,92.70,92.70,92.70,0.00
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Memory Free (GB),9.39,9.39,9.39,9.39,9.39,,9.39,9.39,9.39,9.39,9.39,9.39,0.00
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,SM Clock Frequency (MHz),1980.00,1980.00,1980.00,1980.00,1980.00,,1980.00,1980.00,1980.00,1980.00,1980.00,1980.00,0.00
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,Memory Clock Frequency (MHz),2619.00,2619.00,2619.00,2619.00,2619.00,,2619.00,2619.00,2619.00,2619.00,2619.00,2619.00,0.00
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,Memory Temperature (°C),45.99,41.00,48.00,41.00,41.00,,46.00,46.00,48.00,48.00,48.00,48.00,2.08
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Temperature (°C),38.87,33.00,41.00,33.00,33.00,,39.00,39.00,41.00,41.00,41.00,41.00,2.38
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,XID Errors (count),0.00,0.00,0.00,0.00,0.00,,0.00,0.00,0.00,0.00,0.00,0.00,0.00
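
A CSV in this shape can be loaded with Python's standard csv module. The sketch below uses two rows copied from the example above and reads them from an in-memory string; with a real run you would pass an open file handle instead. `load_rows` and `to_float` are hypothetical helper names. Note the empty p10 cells, which are mapped to None rather than parsed as numbers.

```python
import csv
import io

SAMPLE_CSV = """\
Endpoint,GPU_Index,GPU_Name,GPU_UUID,Metric,avg,min,max,p1,p5,p10,p25,p50,p75,p90,p95,p99,std
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Power Usage (W),348.69,120.57,386.02,120.57,120.57,,378.34,378.34,386.02,386.02,386.02,386.02,85.97
localhost:9401,0,NVIDIA H100 80GB HBM3,GPU-afc3c15a-48a5-d669-0634-191c629f95fa,GPU Utilization (%),45.82,0.00,66.00,0.00,0.00,,27.00,66.00,66.00,66.00,66.00,66.00,24.52
"""

STAT_COLUMNS = ["avg", "min", "max", "p1", "p5", "p10", "p25",
                "p50", "p75", "p90", "p95", "p99", "std"]


def to_float(cell: str):
    """Convert a cell to float; empty cells become None instead of raising."""
    return float(cell) if cell else None


def load_rows(handle) -> list[dict]:
    """Read telemetry rows and convert the statistic columns to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        for column in STAT_COLUMNS:
            row[column] = to_float(row[column])
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in load_rows(io.StringIO(SAMPLE_CSV)):
        print(row["Metric"], row["avg"], row["p10"])
```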

Example JSON Export

"telemetry_data": {
  "summary": {
    "endpoints_configured": [
      "http://localhost:9401/metrics"
    ],
    "endpoints_successful": [
      "http://localhost:9401/metrics"
    ],
    "start_time": "2025-10-13T01:48:03.689885",
    "end_time": "2025-10-13T01:48:55.971544"
  },
  "endpoints": {
    "localhost:9401": {
      "gpus": {
        "gpu_0": {
          "gpu_index": 0,
          "gpu_name": "NVIDIA H100 80GB HBM3",
          "gpu_uuid": "GPU-afc3c15a-48a5-d669-0634-191c629f95fa",
          "hostname": "69450c620e4d",
          "metrics": {
            "gpu_power_usage": {
              "avg": 348.6908823529412,
              "min": 120.57,
              "max": 386.022,
              "p1": 120.57,
              "p5": 120.57,
              "p10": null,
              "p25": 378.343,
              "p50": 378.343,
              "p75": 386.022,
              "p90": 386.022,
              "p95": 386.022,
              "p99": 386.022,
              "std": 85.96769288258695,
              "count": 153,
              "unit": "W"
            },
            "energy_consumption": {
              "avg": 0.23782271866013072,
              "min": 0.229901671,
              "max": 0.246497393,
              "p1": 0.229901671,
              "p5": 0.229901671,
              "p10": null,
              "p25": 0.23499845600000002,
              "p50": 0.23499845600000002,
              "p75": 0.246497393,
              "p90": 0.246497393,
              "p95": 0.246497393,
              "p99": 0.246497393,
              "std": 0.005916380392210164,
              "count": 153,
              "unit": "MJ"
            },
            "gpu_utilization": {
              "avg": 45.8235294117647,
              "min": 0.0,
              "max": 66.0,
              "p1": 0.0,
              "p5": 0.0,
              "p10": null,
              "p25": 27.0,
              "p50": 66.0,
              "p75": 66.0,
              "p90": 66.0,
              "p95": 66.0,
              "p99": 66.0,
              "std": 24.51706559093709,
              "count": 153,
              "unit": "%"
            },
            "memory_copy_utilization": {
              "avg": 21.098039215686274,
              "min": 0.0,
              "max": 29.0,
              "p1": 0.0,
              "p5": 0.0,
              "p10": null,
              "p25": 15.0,
              "p50": 29.0,
              "p75": 29.0,
              "p90": 29.0,
              "p95": 29.0,
              "p99": 29.0,
              "std": 10.109702002863262,
              "count": 153,
              "unit": "%"
            },
            "gpu_memory_used": {
              "avg": 92.69685977516342,
              "min": 92.69621555200001,
              "max": 92.698312704,
              "p1": 92.69621555200001,
              "p5": 92.69621555200001,
              "p10": null,
              "p25": 92.69621555200001,
              "p50": 92.69621555200001,
              "p75": 92.698312704,
              "p90": 92.698312704,
              "p95": 92.698312704,
              "p99": 92.698312704,
              "std": 0.0009674763104592773,
              "count": 153,
              "unit": "GB"
            },
            "gpu_memory_free": {
              "avg": 9.387256704836602,
              "min": 9.385803776000001,
              "max": 9.387900928,
              "p1": 9.385803776000001,
              "p5": 9.385803776000001,
              "p10": null,
              "p25": 9.385803776000001,
              "p50": 9.387900928,
              "p75": 9.387900928,
              "p90": 9.387900928,
              "p95": 9.387900928,
              "p99": 9.387900928,
              "std": 0.0009674763104633748,
              "count": 153,
              "unit": "GB"
            },
            "sm_clock_frequency": {
              "avg": 1980.0,
              "min": 1980.0,
              "max": 1980.0,
              "p1": 1980.0,
              "p5": 1980.0,
              "p10": null,
              "p25": 1980.0,
              "p50": 1980.0,
              "p75": 1980.0,
              "p90": 1980.0,
              "p95": 1980.0,
              "p99": 1980.0,
              "std": 0.0,
              "count": 153,
              "unit": "MHz"
            },
            "memory_clock_frequency": {
              "avg": 2619.0,
              "min": 2619.0,
              "max": 2619.0,
              "p1": 2619.0,
              "p5": 2619.0,
              "p10": null,
              "p25": 2619.0,
              "p50": 2619.0,
              "p75": 2619.0,
              "p90": 2619.0,
              "p95": 2619.0,
              "p99": 2619.0,
              "std": 0.0,
              "count": 153,
              "unit": "MHz"
            },
            "memory_temperature": {
              "avg": 45.99346405228758,
              "min": 41.0,
              "max": 48.0,
              "p1": 41.0,
              "p5": 41.0,
              "p10": null,
              "p25": 46.0,
              "p50": 46.0,
              "p75": 48.0,
              "p90": 48.0,
              "p95": 48.0,
              "p99": 48.0,
              "std": 2.081655738762016,
              "count": 153,
              "unit": "°C"
            },
            "gpu_temperature": {
              "avg": 38.869281045751634,
              "min": 33.0,
              "max": 41.0,
              "p1": 33.0,
              "p5": 33.0,
              "p10": null,
              "p25": 39.0,
              "p50": 39.0,
              "p75": 41.0,
              "p90": 41.0,
              "p95": 41.0,
              "p99": 41.0,
              "std": 2.383748929780352,
              "count": 153,
              "unit": "°C"
            },
            "xid_errors": {
              "avg": 0.0,
              "min": 0.0,
              "max": 0.0,
              "p1": 0.0,
              "p5": 0.0,
              "p10": null,
              "p25": 0.0,
              "p50": 0.0,
              "p75": 0.0,
              "p90": 0.0,
              "p95": 0.0,
              "p99": 0.0,
              "std": 0.0,
              "count": 153,
              "unit": "count"
            }
          }
        }
      }
    }
  }
}
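
The nesting shown above (endpoint, then GPU, then metric) can be walked with a short generator. This sketch assumes that telemetry_data is a top-level key of the exported JSON document, which the excerpt suggests but does not confirm; `iter_metric_stats` is a hypothetical helper, and the sample document is a trimmed version of the example with rounded values.

```python
import json

SAMPLE_JSON = """
{
  "telemetry_data": {
    "endpoints": {
      "localhost:9401": {
        "gpus": {
          "gpu_0": {
            "gpu_index": 0,
            "gpu_name": "NVIDIA H100 80GB HBM3",
            "hostname": "69450c620e4d",
            "metrics": {
              "gpu_power_usage": {"avg": 348.69, "max": 386.022, "unit": "W"},
              "gpu_utilization": {"avg": 45.82, "max": 66.0, "unit": "%"}
            }
          }
        }
      }
    }
  }
}
"""


def iter_metric_stats(document: dict):
    """Yield (endpoint, hostname, gpu_index, metric, stats) for every metric."""
    endpoints = document["telemetry_data"]["endpoints"]
    for endpoint, endpoint_data in endpoints.items():
        for gpu in endpoint_data["gpus"].values():
            for metric, stats in gpu["metrics"].items():
                yield endpoint, gpu["hostname"], gpu["gpu_index"], metric, stats


if __name__ == "__main__":
    for endpoint, host, index, metric, stats in iter_metric_stats(json.loads(SAMPLE_JSON)):
        print(f"{endpoint} {host} gpu{index} {metric}: avg={stats['avg']} {stats['unit']}")
```

In a multi-node run, the endpoint and hostname fields are what distinguish GPUs that share the same index on different machines.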