NVIDIA-Certified Storage#
Introduction#
As the artificial intelligence landscape shifts from experimental pilot projects to full-scale production, the underlying infrastructure must evolve. We have entered the era of the “AI Factory”—data centers dedicated not just to storing data, but to manufacturing intelligence. In this new paradigm, data is the fuel, and the storage system is the fuel pump. If the pump cannot keep up with the engine, performance stalls.
The NVIDIA-Certified™ Storage program is a comprehensive validation framework designed to bridge the gap between high-performance storage architectures and NVIDIA’s accelerated computing platform. By rigorously testing partner solutions against real-world AI workloads and synthetic benchmarks at scale, we provide enterprise customers with a verified path to deployment.
Overview#
NVIDIA-Certified Storage Certification is more than throughput validation. It is a comprehensive framework designed to validate and certify storage architectures for a general AI Factory as well as for special-purpose systems such as the AI Data Platform (AIDP). It uses a variety of tools, from GPU-based workloads that verify compatibility to synthetic CPU-based benchmarks that simulate traffic from thousands of GPUs. The certification ensures that storage solutions can meet the demanding requirements of large-scale AI training, fine-tuning, inference, KV cache, and retrieval-augmented generation (RAG) pipelines. It also validates enterprise features and functional requirements.
Key objectives are:
Validate Storage Performance: The certification benchmarks storage platforms for critical AI workloads, focusing on multiple I/O patterns, throughput, latency, reliability, scalability, multitenancy, security, and data services. This is essential for environments with traffic from hundreds to thousands of GPUs.
Support Multiple AI Use Cases: The framework covers training, fine-tuning, inference, KV cache, and query/retrieval workloads, each with distinct I/O profiles and object sizes.
Benchmark at Scale: Certification is structured across three levels—Foundation, Enterprise, and NCP—each simulating increasing GPU counts (up to 10,000 GPUs) against progressively more demanding benchmarks.
Certification Levels#
Three certification levels are defined so that storage partners can be certified across a range of system sizes, as well as for additional requirements and features.
Foundation#
The Foundation level is designed to validate that a storage platform can meet essential AI workload demands in smaller or mid-scale environments. The tests at this level are configured to simulate workloads involving up to 128 GPUs.
Enterprise#
The Enterprise Certification level is designed for storage solutions targeting large-scale, high-performance AI workloads. The tests at this level are configured to simulate workloads involving up to 1,000 GPUs.
NVIDIA Cloud Partner#
The NVIDIA Cloud Partner (NCP) Certification level is the highest tier in the NVIDIA-Certified Storage (NVCS) program, focused on validating storage platforms for multi-tenant, cloud-scale AI workloads. This level is tailored for storage providers aiming to support very large-scale NVIDIA GPU and infrastructure deployments in dynamic, cloud-native environments demanding performance, high availability, and isolation. The tests at this level are configured to simulate workloads involving up to 10,000 GPUs.
Certification Types#
General-Purpose Performance Certification#
The general-purpose performance certification is designed to validate storage architectures for training, fine-tuning, inference, and key-value cache workloads. It focuses on I/O patterns critical to these environments, including sequential and random I/O, small I/O for inference, and single-node/single-GPU performance that simulates traffic from hundreds to thousands of GPUs.
The tests run various benchmark packages, including GPU-specific tests and synthetic scale-out tests that generate load indicative of an AI Factory workload with thousands of GPUs.
The certification also evaluates reliability, scale-out capabilities, QoS, multitenancy, security, and data services for each certified platform—all essential for AI Factory deployments.
Below are the various AI use cases and typical I/O profiles:
| Workload | Primary I/O Pattern | Typical Object Size |
|---|---|---|
| Training | Large sequential reads for model/data; periodic large sequential writes for checkpoints; many parallel streams | 256 KiB – 4 MiB |
| Fine-tuning | Frequent reads of base weights; iterative small/medium writes of adapter deltas and checkpoints | 64 KiB – 1 MiB |
| Inference | Hot reads of model weights, token embeddings, prompt/context features; metadata-heavy lookups in serving graph | 4 KiB – 256 KiB |
| KV Cache | Random reads/writes of attention keys/values; heavy temporal locality; evict/persist between sessions | 1 KiB – 5,000 KiB |
| Query and Retrieval | Random reads for ANN/vector lookups; small writes for ingestion/updates | 1 KiB – 120 KiB |
The sections below provide more detail for each workload.
Model Training#
Training large-scale deep learning models requires sustained throughput and parallel I/O across multiple nodes. Any storage bottleneck can reduce GPU utilization, slow down training cycles, and increase operational costs.
A key consideration for write throughput during training is checkpointing—the process of periodically saving model states to storage. Checkpointing ensures recoverability in the event of failure and supports resuming long-running training jobs. Two approaches can be used: synchronous checkpointing and asynchronous checkpointing.
Synchronous checkpointing halts training until the checkpoint data is successfully written to high-performance storage, so faster storage write speeds are required to minimize downtime. Asynchronous checkpointing speeds up the process by first copying model parameters to the CPU, then saving the checkpoint to high-performance storage in the background with minimal interruption to training. This approach can reduce the write-bandwidth requirement and lets users spread the transfer of an asynchronous checkpoint from memory to high-performance storage over a longer period of time without halting the training phase.
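The difference between the two approaches can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration, not NVIDIA tooling: the `serialize` helper and byte-string tensors are stand-ins, and the host-side copy stands in for the device-to-CPU transfer that real training stacks perform.

```python
import io
import threading


def serialize(state: dict) -> bytes:
    """Stand-in for a framework's checkpoint serializer."""
    buf = io.BytesIO()
    for name, tensor_bytes in sorted(state.items()):
        buf.write(name.encode() + b"\n" + bytes(tensor_bytes) + b"\n")
    return buf.getvalue()


def save_sync(model_state: dict, path: str) -> None:
    """Synchronous checkpoint: training is blocked until the write
    completes, so checkpoint time is bounded by storage write bandwidth."""
    with open(path, "wb") as f:
        f.write(serialize(model_state))


def save_async(model_state: dict, path: str) -> threading.Thread:
    """Asynchronous checkpoint: snapshot the parameters first (stands in
    for the device-to-CPU copy), then write to storage in the background
    so training can resume immediately."""
    snapshot = {k: bytes(v) for k, v in model_state.items()}  # host-side copy

    def _write() -> None:
        with open(path, "wb") as f:
            f.write(serialize(snapshot))

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # the caller can join() before the next checkpoint interval
```

With the asynchronous variant, the storage write only needs to finish before the next checkpoint interval, which is why it tolerates lower write bandwidth than the synchronous variant.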
Model Fine-Tuning#
Fine-tuning adapts pre-trained foundation models to specific tasks or domains using custom datasets. It is less resource-intensive than full-scale training, but it places its own distinct demands on storage.
Despite being more efficient than full-scale training, fine-tuning still demands a high-performance infrastructure. Fast, low-latency access to model checkpoints and training datasets is essential to maintain throughput during iterative training cycles. Equally important is the ability to perform rapid, low-latency writes for saving updated model weights, which supports agile experimentation and frequent model updates. These requirements make high-bandwidth, scalable storage systems a foundational component of any fine-tuning pipeline.
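The small/medium write pattern characteristic of fine-tuning can be illustrated with a short, hedged sketch: the hypothetical `save_adapter_checkpoint` below persists only the adapter deltas (for example, LoRA matrices) rather than the full base model, which is why each iterative save is small even though saves are frequent.

```python
import os
import pickle


def save_adapter_checkpoint(adapter_state: dict, step: int, out_dir: str) -> str:
    """Persist only the adapter deltas for one fine-tuning step.

    The base model weights (often tens of GiB) are read-only and shared;
    each checkpoint write here covers only the adapter tensors, so the
    storage system sees many small/medium writes instead of periodic
    multi-gigabyte dumps. Illustrative only -- real pipelines use their
    framework's checkpoint format rather than pickle.
    """
    path = os.path.join(out_dir, f"adapter_step{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(adapter_state, f)
    return path
```

This write pattern is what makes low-latency small writes, rather than raw sequential bandwidth, the limiting factor for rapid fine-tuning experimentation.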
Inferencing#
AI inference pipelines, particularly for large language models (LLMs) and multimodal architectures, require extremely fast and consistent access to critical data components such as model weights, token embeddings, and feature data. These assets are accessed frequently in real time during inference, making high storage throughput and predictable low-latency performance essential for overall system responsiveness and user experience.
As these inference workloads scale to support real-time applications or multiple users, the demands on the underlying storage infrastructure intensify. The storage system must deliver high bandwidth and predictable low-latency performance that scales linearly with compute resources. This linear scalability ensures that inference pipelines remain performant and responsive as demand, model size, and complexity grow.
Key-Value Cache#
The Key-Value (KV) cache is a foundational optimization in transformer-based models, especially for autoregressive tasks like text generation. It stores intermediate key and value tensors from previous forward passes, allowing the model to reuse these computations instead of recalculating them for each new token. This dramatically reduces latency and compute load, which is especially important when serving large models with long context windows or in real-time applications. By leveraging the KV cache, systems can maintain high throughput and responsiveness without sacrificing model performance.
Persisting and retrieving the KV cache state is preferred over recomputation because it ensures both efficiency and consistency. Recomputing the cache for every interaction would be computationally expensive and could introduce inconsistencies due to changes in model parameters or input handling. Storing the cache allows for seamless continuation of sessions, supports speculative decoding, and enables efficient branching in generation workflows. This approach is particularly valuable in high-performance environments where minimizing resource usage and maximizing speed are critical.
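The persist-and-restore pattern described above can be sketched with a toy Python cache. This is an assumption-laden simplification: real serving stacks cache attention tensors per layer and per head, but the reuse/evict/persist lifecycle it models is the same.

```python
import pickle


class KVCache:
    """Toy per-session KV cache: one (key, value) entry per generated token."""

    def __init__(self) -> None:
        self.keys: list = []
        self.values: list = []

    def append(self, k, v) -> None:
        # One forward pass appends the new token's K/V; all earlier
        # entries are reused instead of being recomputed.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)

    def persist(self, path: str) -> None:
        # Persisting beats recomputation: restoring the session later
        # costs one read instead of a full prefill over the prompt.
        with open(path, "wb") as f:
            pickle.dump((self.keys, self.values), f)

    @classmethod
    def restore(cls, path: str) -> "KVCache":
        cache = cls()
        with open(path, "rb") as f:
            cache.keys, cache.values = pickle.load(f)
        return cache
```

The storage implication: each session generates many small-to-medium random reads and writes (the 1 KiB – 5,000 KiB range in the table above) with strong temporal locality, rather than large sequential streams.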
Query and Retrieval#
Retrieval-Augmented Generation (RAG) pipelines and agentic workflows, such as AI Data Platform, enhance large language models (LLMs) by integrating them with enterprise-specific knowledge stored as vector embeddings in high-performance vector databases. These systems must deliver low-latency, high-throughput access to vector indexes during inference to ensure timely and relevant responses. By grounding LLM outputs in proprietary data, RAG pipelines enable more accurate and context-aware responses. Relevant documents, structured data, and contextual information are embedded as vectors and stored in the database. When a user submits a query, the system performs a semantic search to retrieve the most contextually relevant information. These retrieved vectors are then passed to the LLM, allowing it to generate responses that reflect the organization's unique domain knowledge.
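The embed-then-retrieve flow can be sketched as follows. Both pieces are stand-ins labeled as such: the hash-based `embed` function substitutes for a real embedding model, and the brute-force cosine-similarity scan substitutes for the ANN index a production vector database would use.

```python
import math


def embed(text: str, dim: int = 8) -> list:
    """Hypothetical hash-based embedding; a real pipeline would call an
    embedding model instead."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def retrieve(query: str, corpus: list, top_k: int = 2) -> list:
    """Brute-force cosine-similarity search over document embeddings.
    Production vector databases use ANN indexes (e.g. HNSW) so each query
    becomes a handful of small random reads rather than a full scan."""
    q = embed(query)
    scored = sorted(
        corpus,
        key=lambda doc: -sum(a * b for a, b in zip(q, embed(doc))),
    )
    return scored[:top_k]
```

Even in this toy form, the access pattern is visible: queries are read-dominated random lookups over many small vectors, which is why the table above lists small object sizes and random reads for this workload.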
To support this process effectively, the underlying storage infrastructure must deliver low-latency, high-throughput access to vector indexes. As RAG pipelines scale to serve real-time applications—such as customer support, internal knowledge assistants, or decision support tools—storage performance becomes a critical bottleneck. High IOPS and minimal query latency are essential to ensure that semantic retrieval does not delay the overall response time. This is especially important when supporting multiple concurrent users or integrating with time-sensitive workflows.
In addition to raw performance, the storage system must support intelligent data services such as versioning of embeddings, secure access controls, and efficient update mechanisms for dynamic content. As enterprise knowledge evolves, the ability to incrementally update vector indexes without downtime is vital. Furthermore, integration with governance frameworks—such as role-based access control (RBAC) and encryption—ensures that sensitive data used in RAG pipelines remains protected and compliant with organizational policies. Together, these capabilities enable scalable, secure, and responsive semantic search within enterprise-grade AI systems.
Special-purpose Certifications#
As AI workloads grow larger and more complex, there is an increasing need for specialized storage systems that can improve performance even further. The NVIDIA-Certified Storage program will be expanded over time to include these specialized systems as part of the certification.
Storage Certification for AIDP#
AIDP is the first special-purpose certification in the NVIDIA-Certified Storage program. AIDP use cases differ from AI Factory workflows and require a distinct set of tests to validate performance and scalability. The approved certification process focuses on two primary areas: KV cache performance and the extraction performance of text and multimodal pages from unstructured data. KV-cache certification uses nixlbench scale-out testing, with results evaluated based on performance and latency across representative large-model configurations. Document extraction tests validate storage throughput to ensure bulk ingestion performance meets AIDP workflow requirements.