The CloudAI benchmark framework aims to develop an industry-standard benchmark for grading Data Center (DC) scale AI systems in the cloud. Its primary motivation is to provide automated benchmarking across a variety of systems.
Get Started
git clone git@github.com:NVIDIA/cloudai.git
cd cloudai
uv run cloudai --help
For instructions on setting up access for enroot, see Workloads Requirements Installation.
pip-based Installation
See the required Python version in the .python-version file and make sure you have it installed (for installation, see Install Custom Python Version). Follow these steps:
git clone git@github.com:NVIDIA/cloudai.git
cd cloudai
python -m venv venv
source venv/bin/activate
pip install -e .
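The version check mentioned before these steps can be scripted. The following is a minimal sketch and not part of CloudAI: version_matches is a hypothetical helper, and the commented usage assumes that the first line of .python-version is a plain version string such as 3.10 or 3.10.14 (not verified against the repository).

```shell
# Hypothetical helper: succeeds when version "$2" equals "$1" or is a more
# specific release of it (for example, required 3.10 accepts 3.10.14).
version_matches() {
    required="$1"
    current="$2"
    case "$current" in
        "$required"|"$required".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use from the repository root (assumes .python-version holds a plain
# version string on its first line):
#   required="$(head -n 1 .python-version | tr -d '[:space:]')"
#   current="$(python3 -c 'import platform; print(platform.python_version())')"
#   version_matches "$required" "$current" || echo "install Python $required first"
```

The comparison works on whole dot-separated components, so a required version of 3.1 does not accidentally accept 3.10.2.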
Install Custom Python Version
If your system Python version is not supported, you can install a custom version using the uv tool:
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --seed # picks the python version from .python-version
# --seed installs pip and setuptools
source .venv/bin/activate
Key Concepts
CloudAI operates on three main schemas:
System Schema: Describes the system, including the scheduler type, node list, and global environment variables.
Test Schema: An instance of a test template with custom arguments and environment variables.
Test Scenario Schema: A set of tests with dependencies and additional descriptions about the test scenario.
These schemas enable CloudAI to be flexible and compatible with different systems and configurations.
CloudAI Modes Usage Examples
Global options for the cloudai command:
--log-file <path>: specify a file to log output; by default, debug.log in the current directory is used. It contains log entries of level DEBUG and higher.
--log-level <level>: specify the logging level for standard output; the default is INFO.
run
This mode runs workloads. It automatically installs prerequisites if they are not met.
cloudai run \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml
dry-run
This mode simulates running experiments without actually executing them. This is useful for verifying configurations and testing experiment setups.
cloudai dry-run \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml
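Because dry-run and run take the same arguments, one way to use dry-run as a gate is a small wrapper like the sketch below. It is not part of CloudAI: run_if_dry_run_passes and the CLOUDAI_BIN override are hypothetical names, and the sketch assumes that dry-run exits with a non-zero status when it finds a problem.

```shell
# Hypothetical wrapper: dry-run a scenario first and start the real run only
# if the dry run exits successfully.
# CLOUDAI_BIN is an assumed override for the executable name, so the wrapper
# can be exercised without a cluster; it defaults to "cloudai".
run_if_dry_run_passes() {
    system_config="$1"
    tests_dir="$2"
    test_scenario="$3"
    for mode in dry-run run; do
        "${CLOUDAI_BIN:-cloudai}" "$mode" \
            --system-config "$system_config" \
            --tests-dir "$tests_dir" \
            --test-scenario "$test_scenario" || return 1
    done
}

# Example use with the configuration files shown above:
#   run_if_dry_run_passes \
#       conf/common/system/example_slurm_cluster.toml \
#       conf/common/test \
#       conf/common/test_scenario/sleep.toml
```

Setting CLOUDAI_BIN=echo prints the two commands instead of executing them, which is a cheap way to inspect what the wrapper would run.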
generate-report
This mode generates reports under the scenario directory. It automatically runs as part of the run mode after experiments are completed.
cloudai generate-report \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml \
    --result-dir /path/to/result_directory
install
This mode installs test prerequisites. For more details, refer to the Workloads Requirements Installation guide. It automatically runs as part of the run mode if prerequisites are not met.
cloudai install \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml
uninstall
The opposite of the install mode: this mode removes installed test prerequisites.
cloudai uninstall \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/sleep.toml
list
This mode lists internal components available within CloudAI.
cloudai list <component_type>
verify-configs
This mode verifies the correctness of system, test, and test scenario configuration files.
# verify all at once
cloudai verify-configs conf
# verify a single file
cloudai verify-configs conf/common/system/example_slurm_cluster.toml
# verify all scenarios using a specific folder with Test TOMLs
cloudai verify-configs --tests-dir conf/release/spcx/l40s/test conf/release/spcx/l40s/test_scenario
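When only a handful of files have changed, the single-file form can be looped to report exactly which ones fail. The sketch below is not part of CloudAI: verify_each and the CLOUDAI_BIN override are hypothetical names, and it assumes that verify-configs exits with a non-zero status for an invalid file.

```shell
# Hypothetical helper: verify each given config file separately and list the
# ones that fail. Returns 1 if any file failed, 0 otherwise.
# CLOUDAI_BIN is an assumed override for the executable name; it defaults to
# "cloudai".
verify_each() {
    failed=0
    for config in "$@"; do
        # Discard the tool's own output; only the per-file verdict is printed.
        if ! "${CLOUDAI_BIN:-cloudai}" verify-configs "$config" >/dev/null 2>&1; then
            echo "invalid: $config"
            failed=1
        fi
    done
    return "$failed"
}

# Example use:
#   verify_each conf/common/system/example_slurm_cluster.toml
```

Dropping the output redirection shows the full diagnostics from verify-configs for each failing file.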