NVIDIA-Certified Storage#
Introduction#
As the artificial intelligence landscape shifts from experimental pilot projects to full-scale production, the underlying infrastructure must evolve. We have entered the era of the “AI Factory”—data centers dedicated not just to storing data, but to manufacturing intelligence. In this new paradigm, data is the fuel, and the storage system is the fuel pump. If the pump cannot keep up with the engine, performance stalls.
The NVIDIA-Certified™ Storage program is a comprehensive validation framework designed to bridge the gap between high-performance storage architectures and NVIDIA’s accelerated computing platform. By rigorously testing partner solutions against real-world AI workloads and synthetic benchmarks at scale, we provide enterprise customers with a verified path to deployment.
Overview#
NVIDIA-Certified Storage Certification is more than throughput validation. It is a comprehensive framework designed to validate and certify storage architectures for a general AI Factory as well as for special-purpose systems such as the AI Data Platform (AIDP). It uses a variety of tools, from GPU-based workloads that verify compatibility to synthetic CPU-based benchmarks that simulate traffic from thousands of GPUs. The certification ensures that storage solutions can meet the demanding requirements of large-scale AI training, fine-tuning, inference, KV cache, and retrieval-augmented generation (RAG) pipelines. It also validates enterprise features and functional requirements.
Key objectives are:
Validate Storage Performance: The certification benchmarks storage platforms for critical AI workloads, focusing on multiple I/O patterns, throughput, latency, reliability, scalability, multitenancy, security, and data services. This is essential for environments with traffic from hundreds to thousands of GPUs.
Support Multiple AI Use Cases: The framework covers training, fine-tuning, inference, KV cache, and query/retrieval workloads, each with distinct I/O profiles and object sizes.
Benchmark at Scale: Certification is structured across three levels—Foundation, Enterprise, and NCP—each simulating increasing GPU counts (up to 10,000 GPUs) against progressively more demanding benchmarks.
Certification Levels#
Three certification levels are defined so that storage partners can be certified across a range of system sizes, as well as for additional requirements and features.
Foundation#
The Foundation level is designed to validate that a storage platform can meet essential AI workload demands in smaller or mid-scale environments. The tests at this level are configured to simulate workloads involving up to 128 GPUs.
Enterprise#
The Enterprise Certification level is designed for storage solutions targeting large-scale, high-performance AI workloads. The tests at this level are configured to simulate workloads involving up to 1,000 GPUs.
NVIDIA Cloud Partner#
The NVIDIA Cloud Partner (NCP) Certification level is the highest tier in the NVIDIA-Certified Storage (NVCS) program, focused on validating storage platforms for multi-tenant, cloud-scale AI workloads. This level is tailored for storage providers aiming to support very large-scale NVIDIA GPU and infrastructure deployments in dynamic, cloud-native environments demanding performance, high availability, and isolation. The tests at this level are configured to simulate workloads involving up to 10,000 GPUs.
Certification Types#
General-Purpose Performance Certification#
The general-purpose performance certification is designed to validate storage architectures for training, fine-tuning, inference, and key-value cache workloads. It focuses on I/O patterns critical to these environments, including sequential and random I/O, small I/O for inference, and single-node/single-GPU performance that simulates traffic from hundreds to thousands of GPUs.
The tests run various benchmark packages, including GPU-specific tests and synthetic scale-out tests that generate load indicative of an AI Factory workload with thousands of GPUs.
The certification also evaluates reliability, scale-out capabilities, QoS, multitenancy, security, and data services for each certified platform—all essential for AI Factory deployments.
Below are the various AI use cases and typical I/O profiles:
| Workload | Primary I/O Pattern | Typical Object Size |
|---|---|---|
| Training | Large sequential reads for model/data; periodic large sequential writes for checkpoints; many parallel streams | 256 KiB – 4 MiB |
| Fine-tuning | Frequent reads of base weights; iterative small/medium writes of adapter deltas and checkpoints | 64 KiB – 1 MiB |
| Inference | Hot reads of model weights, token embeddings, prompt/context features; metadata-heavy lookups in serving graph | 4 KiB – 256 KiB |
| KV Cache | Random reads/writes of attention keys/values; heavy temporal locality; evict/persist between sessions | 1 KiB – 5,000 KiB |
| Query and Retrieval | Random reads for ANN/vector lookups; small writes for ingestion/updates | 1 KiB – 120 KiB |
The sections below provide more detail for each workload.
Model Training#
Training large-scale deep learning models requires sustained throughput and parallel I/O across multiple nodes. Any storage bottleneck can reduce GPU utilization, slow down training cycles, and increase operational costs.
A key consideration for write throughput during training is checkpointing—the process of periodically saving model states to storage. Checkpointing ensures recoverability in the event of failure and supports resuming long-running training jobs. Two approaches can be used: synchronous checkpointing and asynchronous checkpointing.
Synchronous checkpointing halts training until the checkpoint data is successfully written to high-performance storage, so faster storage write speeds are required to minimize downtime. Asynchronous checkpointing speeds up the process by first copying model parameters to the CPU, then saving the checkpoint to high-performance storage in the background with minimal interruption to training. This approach can reduce the write-bandwidth requirement and lets users spread the transfer of an asynchronous checkpoint from memory to high-performance storage over a longer period of time without halting the training phase.
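The difference between the two approaches can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration, not NVIDIA tooling: the `serialize` helper and byte-string tensors are stand-ins, and the host-side copy stands in for the device-to-CPU transfer that real training stacks perform.

```python
import io
import threading


def serialize(state: dict) -> bytes:
    """Stand-in for a framework's checkpoint serializer."""
    buf = io.BytesIO()
    for name, tensor_bytes in sorted(state.items()):
        buf.write(name.encode() + b"\n" + bytes(tensor_bytes) + b"\n")
    return buf.getvalue()


def save_sync(model_state: dict, path: str) -> None:
    """Synchronous checkpoint: training is blocked until the write
    completes, so checkpoint time is bounded by storage write bandwidth."""
    with open(path, "wb") as f:
        f.write(serialize(model_state))


def save_async(model_state: dict, path: str) -> threading.Thread:
    """Asynchronous checkpoint: snapshot the parameters first (stands in
    for the device-to-CPU copy), then write to storage in the background
    so training can resume immediately."""
    snapshot = {k: bytes(v) for k, v in model_state.items()}  # host-side copy

    def _write() -> None:
        with open(path, "wb") as f:
            f.write(serialize(snapshot))

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # the caller can join() before the next checkpoint interval
```

With the asynchronous variant, the storage write only needs to finish before the next checkpoint interval, which is why it tolerates lower write bandwidth than the synchronous variant.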
Model Fine-Tuning#
Fine-tuning adapts pre-trained foundation models to specific tasks or domains using custom datasets. It is less resource-intensive than full-scale training, but it places its own distinct demands on storage.
Despite being more efficient than full-scale training, fine-tuning still demands a high-performance infrastructure. Fast, low-latency access to model checkpoints and training datasets is essential to maintain throughput during iterative training cycles. Equally important is the ability to perform rapid, low-latency writes for saving updated model weights, which supports agile experimentation and frequent model updates. These requirements make high-bandwidth, scalable storage systems a foundational component of any fine-tuning pipeline.
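The small/medium write pattern characteristic of fine-tuning can be illustrated with a short, hedged sketch: the hypothetical `save_adapter_checkpoint` below persists only the adapter deltas (for example, LoRA matrices) rather than the full base model, which is why each iterative save is small even though saves are frequent.

```python
import os
import pickle


def save_adapter_checkpoint(adapter_state: dict, step: int, out_dir: str) -> str:
    """Persist only the adapter deltas for one fine-tuning step.

    The base model weights (often tens of GiB) are read-only and shared;
    each checkpoint write here covers only the adapter tensors, so the
    storage system sees many small/medium writes instead of periodic
    multi-gigabyte dumps. Illustrative only -- real pipelines use their
    framework's checkpoint format rather than pickle.
    """
    path = os.path.join(out_dir, f"adapter_step{step:06d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(adapter_state, f)
    return path
```

This write pattern is what makes low-latency small writes, rather than raw sequential bandwidth, the limiting factor for rapid fine-tuning experimentation.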
Inferencing#
AI inference pipelines, particularly for large language models (LLMs) and multimodal architectures, require extremely fast and consistent access to critical data components such as model weights, token embeddings, and feature data. These assets are accessed frequently in real time during inference, making high storage throughput and predictable low-latency performance essential for overall system responsiveness and user experience.
As these inference workloads scale to support real-time applications or multiple users, the demands on the underlying storage infrastructure intensify. The storage system must deliver high bandwidth and predictable low-latency performance that scales linearly with compute resources. This linear scalability ensures that inference pipelines remain performant and responsive as demand, model size, and complexity grow.
Key-Value Cache#
The Key-Value (KV) cache is a foundational optimization in transformer-based models, especially for autoregressive tasks like text generation. It stores intermediate key and value tensors from previous forward passes, allowing the model to reuse these computations instead of recalculating them for each new token. This dramatically reduces latency and compute load, which is especially important when serving large models with long context windows or in real-time applications. By leveraging the KV cache, systems can maintain high throughput and responsiveness without sacrificing model performance.
Persisting and retrieving the KV cache state is preferred over recomputation because it ensures both efficiency and consistency. Recomputing the cache for every interaction would be computationally expensive and could introduce inconsistencies due to changes in model parameters or input handling. Storing the cache allows for seamless continuation of sessions, supports speculative decoding, and enables efficient branching in generation workflows. This approach is particularly valuable in high-performance environments where minimizing resource usage and maximizing speed are critical.
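The persist-and-restore pattern described above can be sketched with a toy Python cache. This is an assumption-laden simplification: real serving stacks cache attention tensors per layer and per head, but the reuse/evict/persist lifecycle it models is the same.

```python
import pickle


class KVCache:
    """Toy per-session KV cache: one (key, value) entry per generated token."""

    def __init__(self) -> None:
        self.keys: list = []
        self.values: list = []

    def append(self, k, v) -> None:
        # One forward pass appends the new token's K/V; all earlier
        # entries are reused instead of being recomputed.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self) -> int:
        return len(self.keys)

    def persist(self, path: str) -> None:
        # Persisting beats recomputation: restoring the session later
        # costs one read instead of a full prefill over the prompt.
        with open(path, "wb") as f:
            pickle.dump((self.keys, self.values), f)

    @classmethod
    def restore(cls, path: str) -> "KVCache":
        cache = cls()
        with open(path, "rb") as f:
            cache.keys, cache.values = pickle.load(f)
        return cache
```

The storage implication: each session generates many small-to-medium random reads and writes (the 1 KiB – 5,000 KiB range in the table above) with strong temporal locality, rather than large sequential streams.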
Query and Retrieval#
Retrieval-Augmented Generation (RAG) pipelines and agentic workflows, such as AI Data Platform, enhance large language models (LLMs) by integrating them with enterprise-specific knowledge stored as vector embeddings in high-performance vector databases. These systems must deliver low-latency, high-throughput access to vector indexes during inference to ensure timely and relevant responses. By grounding LLM outputs in proprietary data, RAG pipelines enable more accurate and context-aware responses. Relevant documents, structured data, and contextual information are embedded as vectors and stored in the database. When a user submits a query, the system performs a semantic search to retrieve the most contextually relevant information. These retrieved vectors are then passed to the LLM, allowing it to generate responses that reflect the organization's unique domain knowledge.
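The embed-then-retrieve flow can be sketched as follows. Both pieces are stand-ins labeled as such: the hash-based `embed` function substitutes for a real embedding model, and the brute-force cosine-similarity scan substitutes for the ANN index a production vector database would use.

```python
import math


def embed(text: str, dim: int = 8) -> list:
    """Hypothetical hash-based embedding; a real pipeline would call an
    embedding model instead."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def retrieve(query: str, corpus: list, top_k: int = 2) -> list:
    """Brute-force cosine-similarity search over document embeddings.
    Production vector databases use ANN indexes (e.g. HNSW) so each query
    becomes a handful of small random reads rather than a full scan."""
    q = embed(query)
    scored = sorted(
        corpus,
        key=lambda doc: -sum(a * b for a, b in zip(q, embed(doc))),
    )
    return scored[:top_k]
```

Even in this toy form, the access pattern is visible: queries are read-dominated random lookups over many small vectors, which is why the table above lists small object sizes and random reads for this workload.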
To support this process effectively, the underlying storage infrastructure must deliver low-latency, high-throughput access to vector indexes. As RAG pipelines scale to serve real-time applications—such as customer support, internal knowledge assistants, or decision support tools—storage performance becomes a critical bottleneck. High IOPS and minimal query latency are essential to ensure that semantic retrieval does not delay the overall response time. This is especially important when supporting multiple concurrent users or integrating with time-sensitive workflows.
In addition to raw performance, the storage system must support intelligent data services such as versioning of embeddings, secure access controls, and efficient update mechanisms for dynamic content. As enterprise knowledge evolves, the ability to incrementally update vector indexes without downtime is vital. Furthermore, integration with governance frameworks—such as role-based access control (RBAC) and encryption—ensures that sensitive data used in RAG pipelines remains protected and compliant with organizational policies. Together, these capabilities enable scalable, secure, and responsive semantic search within enterprise-grade AI systems.
Special-purpose Certifications#
As AI workloads grow larger and more complex, there is an increasing need for specialized storage systems that can improve performance even further. The NVIDIA-Certified Storage program will be expanded over time to include these specialized systems as part of the certification.
Storage Certification for AIDP#
AIDP is the first special-purpose certification in the NVIDIA-Certified Storage program. AIDP use cases differ from AI Factory workflows and require a distinct set of tests to validate performance and scalability. The approved certification process focuses on two primary areas: KV cache performance and the extraction performance of text and multimodal pages from unstructured data. KV-cache certification uses nixlbench scale-out testing, with results evaluated based on performance and latency across representative large-model configurations. Document extraction tests validate storage throughput to ensure bulk ingestion performance meets AIDP workflow requirements.