CloudAI Benchmark Framework v1.6.1

Overview

The CloudAI benchmark framework aims to provide an industry-standard benchmark for grading data center (DC) scale AI systems in the cloud. Its primary motivation is to automate benchmarking across a variety of systems.

This document contains the following chapters:

Getting Started

git clone git@github.com:NVIDIA/cloudai.git
cd cloudai
uv run cloudai --help

Note

For more instructions on how to set up access for enroot, see Installation Requirements.

pip-based Installation

Check the required Python version in the .python-version file and make sure it is installed (if it is not, see Custom Python Version Installation). Then follow these steps:

git clone git@github.com:NVIDIA/cloudai.git
cd cloudai
python -m venv venv
source venv/bin/activate
pip install -e .

Custom Python Version Installation

If your system Python version is not supported, you can install a custom version using the uv tool:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --seed  # picks the python version from .python-version
                # --seed installs pip and setuptools
source .venv/bin/activate

Key Concepts

CloudAI operates on three main schemas:

System Schema: Describes the system, including the scheduler type, node list, and global environment variables
Test Schema: An instance of a test template with custom arguments and environment variables
Test Scenario Schema: A set of tests with dependencies and additional descriptions about the test scenario

These schemas enable CloudAI to be flexible and compatible with different systems and configurations.

CloudAI Modes Usage Examples

The following are the global options for the cloudai command:

--log-file <path>: specifies the file that log output is written to; defaults to debug.log in the current directory. The file contains log entries of level DEBUG and higher.
--log-level <level>: specifies the logging level for standard output; defaults to INFO.
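
As a minimal sketch of how these two options might be combined, the function below prepends both to any mode invocation. It assumes that global options are placed before the mode name and that DEBUG is accepted as a level value (the option descriptions above mention the DEBUG and INFO levels). The names cloudai_verbose, CLOUDAI_BIN, and CLOUDAI_LOG_FILE are hypothetical and are not part of CloudAI.

```shell
# Hypothetical helper: run any CloudAI mode with explicit logging options.
# CLOUDAI_BIN is an illustrative override, so the command can be swapped
# out (for example with "echo") to preview what would be executed.
# CLOUDAI_LOG_FILE is likewise illustrative; it falls back to debug.log,
# the default file name described above.
cloudai_verbose() {
    "${CLOUDAI_BIN:-cloudai}" \
        --log-file "${CLOUDAI_LOG_FILE:-debug.log}" \
        --log-level DEBUG \
        "$@"
}
```

Calling cloudai_verbose followed by a mode and its arguments (for instance, the dry-run arguments shown later in this document) would then pass everything through unchanged after the logging options.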

Usage Examples

run

This mode runs workloads. It automatically installs prerequisites if they are not met.

cloudai run \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml

dry-run

This mode simulates running experiments without actually executing them. This is useful for verifying configurations and testing experiment setups.

cloudai dry-run \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml
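
Because dry-run and run take the same arguments, one way to use them together is to gate the real run on a successful dry run. The following is a sketch only: dry_run_then_run and CLOUDAI_BIN are hypothetical names, and it assumes that a failed dry run is reported through a non-zero exit status.

```shell
# Hypothetical helper: dry-run a scenario first, and only run it for real
# if the dry run exits successfully.
# Arguments: <system-config> <tests-dir> <test-scenario>
dry_run_then_run() {
    local mode
    if [ "$#" -ne 3 ]; then
        echo "usage: dry_run_then_run <system-config> <tests-dir> <test-scenario>" >&2
        return 2
    fi
    for mode in dry-run run; do
        # Assumption: cloudai signals failure with a non-zero exit status.
        # CLOUDAI_BIN is an illustrative override for the executable name.
        "${CLOUDAI_BIN:-cloudai}" "$mode" \
            --system-config "$1" \
            --tests-dir "$2" \
            --test-scenario "$3" || return
    done
}
```

With the example files used in this document, the call would be dry_run_then_run conf/common/system/example_slurm_cluster.toml conf/common/test conf/common/test_scenario/sleep.toml.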

generate-report

This mode generates reports under the scenario directory. It automatically runs as part of the run mode after experiments are completed.

cloudai generate-report \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml \
    --result-dir /path/to/result_directory

install

This mode installs test prerequisites. For more details, refer to the Installation Requirements guide. It automatically runs as part of the run mode if prerequisites are not met.

cloudai install \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml

uninstall

This mode is the opposite of the install mode: it removes installed test prerequisites.

cloudai uninstall \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml

list

This mode lists internal components available within CloudAI.

cloudai list <component_type>

verify-configs

This mode verifies the correctness of system, test, and test scenario configuration files.

# verify all at once
cloudai verify-configs conf

# verify a single file
cloudai verify-configs conf/common/system/example_slurm_cluster.toml

# verify all scenarios using a specific folder with test TOMLs
cloudai verify-configs --tests-dir conf/release/spcx/l40s/test conf/release/spcx/l40s/test_scenario
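
When several files need checking, the single-file form can be wrapped in a loop that reports every failing path instead of stopping at the first one. This is a sketch under stated assumptions: verify_each and CLOUDAI_BIN are hypothetical names, and the code assumes that verify-configs exits with a non-zero status when a file is invalid.

```shell
# Hypothetical helper: verify each given config path separately and
# report which ones fail, instead of stopping at the first problem.
verify_each() {
    local path failed=0
    for path in "$@"; do
        # Assumption: verify-configs exits non-zero for an invalid config.
        # CLOUDAI_BIN is an illustrative override for the executable name.
        if "${CLOUDAI_BIN:-cloudai}" verify-configs "$path"; then
            echo "OK:     $path"
        else
            echo "FAILED: $path" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

For example, verify_each conf/common/system/*.toml would check every system configuration in that folder and return a non-zero status if any of them failed.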

Documentation Update History

Version	Date	Description
v1.5.0	April xx, 2026	Initial release of this documentation version.