
Copyright © 2026, NVIDIA Corporation.

Machine Validation

Machine Validation is the in-band validation framework of NVIDIA Infra Controller (NICo) for checking a machine before it is made available to tenants. NICo uses Scout to run validation tests on the host, collect the results, and report them back to the site controller.

The framework is intended to be extensible. NICo provides a catalog of built-in hardware validation tests, and site administrators can add site-specific tests when the deployment enables test mutation workflows.

Summary

Machine Validation helps operators answer a simple question: is this machine healthy enough to enter or return to the tenant-ready pool?

NICo can run validation during lifecycle workflows such as discovery and release, and administrators can also start validation on demand for a specific machine. Each validation run selects tests based on context, platform support, test enablement, verification state, tags, and any allow list supplied by the operator.

In normal lifecycle validation, NICo runs only tests that are both enabled and verified. Unverified tests can be exercised through on-demand validation before they are promoted into the standard workflow.

Audience

This guide is written for site administrators, SREs, platform administrators, and developers who manage or extend NICo machine validation. The examples assume the operator has access to the target site through nico-admin-cli and has the permissions required to view or modify machine validation configuration.

Prerequisites

Before using Machine Validation, confirm the following:

  • Machine Validation is enabled for the site.
  • The operator has privileges to view validation runs and manage validation tests.
  • The target machine is under platform control and is not allocated to a tenant.
  • Required validation images, tools, and external configs are available for the selected tests.

How Machine Validation Fits Into NICo

Machine Validation runs while a machine is under platform control and before it is allocated to a tenant. Typical entry points include:

  • Initial discovery, before a newly discovered machine reaches Ready.
  • Cleanup or release workflows, before a returned machine is made available again.
  • On-demand validation, when an administrator explicitly starts validation for a machine that needs additional checks.

Machine Validation complements SKU validation. SKU validation checks that the machine inventory matches the expected hardware model. Machine Validation runs tests on the machine to prove that the hardware and relevant host-side software paths behave correctly.

Framework Concepts

| Concept | Description |
| --- | --- |
| Validation run | One execution of Machine Validation for a machine. A run contains the selected tests and their results. |
| Test definition | The stored definition of a validation test, including command, arguments, image, contexts, supported platforms, timeout, tags, and version. |
| Context | The lifecycle situation in which a test is eligible to run. Common contexts are Discovery, Cleanup, and OnDemand. |
| Platform mapping | The list of machine platforms on which a test is supported. Scout uses the discovered machine platform to select compatible tests. |
| Enabled flag | Controls whether the test is eligible for selection. Disabled tests are not selected for normal validation. |
| Verified flag | Indicates that an administrator has validated the test itself. Normal lifecycle runs skip unverified tests. |
| Tags | Optional selectors that allow administrators to group tests and run targeted suites. |
| External config | A named configuration file, such as registry credentials, that can be referenced by a test without embedding secrets in the test command. |
| Result | The recorded output for one test execution, including status, timing, exit code, and captured output. |

Test Selection

When a validation run starts, NICo and Scout select tests using the following criteria:

  1. The Machine Validation feature must be enabled for the site.
  2. The test must be enabled, unless the site configuration explicitly overrides the catalog selection mode.
  3. The test must be verified for normal lifecycle runs.
  4. The test context must match the run context, such as Discovery, Cleanup, or OnDemand.
  5. The test must support the machine platform.
  6. If tags are supplied, the test must match the requested tags.
  7. If an allow list is supplied, the test must be included in the allow list.
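
The core of these criteria for a normal lifecycle run can be sketched as a small shell predicate. This is an illustration only, not NICo or Scout source code: the function names and argument order are hypothetical, and the sketch deliberately leaves out the site-level feature switch, the site selection-mode override, tags, and allow lists.

```shell
#!/bin/sh
# Sketch only: mirrors the documented lifecycle criteria (enabled, verified,
# context, platform). All names here are hypothetical, not part of NICo.

in_list() {
    # Succeeds when $2 is an exact member of the comma-separated list $1.
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

is_selected() {
    # $1 = enabled (true|false)        $2 = verified (true|false)
    # $3 = test contexts (a,b,c)       $4 = run context
    # $5 = supported platforms (a,b)   $6 = machine platform
    [ "$1" = "true" ] || return 1   # criterion 2: test is enabled
    [ "$2" = "true" ] || return 1   # criterion 3: test is verified
    in_list "$3" "$4" || return 1   # criterion 4: context matches
    in_list "$5" "$6" || return 1   # criterion 5: platform is supported
    return 0
}
```

For example, `is_selected true true "Discovery,Cleanup" Discovery "platform-a" platform-a` succeeds, while the same call with `verified` set to `false` does not.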

On-demand validation can intentionally include unverified tests by using the current CLI flag --run-unverfied-tests. The spelling of unverfied is part of the current CLI interface and must be used exactly as shown.

Built-In Validation Coverage

The exact test IDs, versions, enabled state, and supported platforms are deployment and release specific. Use nico-admin-cli machine-validation tests show as the source of truth for the running site.

The built-in catalog commonly includes the following test groups:

| Area | Common tests | What they validate |
| --- | --- | --- |
| GPU health | CudaSample, DcgmFullShort, DcgmFullLong | CUDA execution, DCGM diagnostics, and basic GPU health. |
| GPU performance | Nvbandwidth, RaytracingVk | GPU memory bandwidth and graphics or compute paths used by supported platforms. |
| CPU | CPUTestShort, CPUTestLong, CpuBenchmarkingFp, CpuBenchmarkingInt | CPU stress and benchmark coverage for short and long validation windows. |
| Memory | MemoryTestShort, MemoryTestLong, MmMemBandwidth, MmMemLatency, MmMemPeakBandwidth, MqStresserShort, MqStresserLong | Memory stress, latency, bandwidth, and queue pressure. |
| Storage | FioFile, FioPath, FioSSD | File, path, and device-level I/O validation with fio-based tests. |
| Operational extensions | DefaultTestCase, runbook-style tests | Site or release-specific checks used to extend the validation workflow. |

Built-in tests that are delivered through NICo migrations are normally read-only. Site-specific tests can be added and modified by administrators when the deployment enables those mutation APIs.

Site Configuration

Machine Validation is controlled by the site configuration. A minimal configuration enables the feature:

[machine_validation_config]
enabled = true

A site can also control the catalog selection behavior:

[machine_validation_config]
enabled = true
test_selection_mode = "Default"
run_interval = "60s"
tests = [
    { id = "CudaSample", enable = true },
]

| Setting | Description |
| --- | --- |
| enabled | Enables or disables Machine Validation for the site. |
| test_selection_mode | Controls how configured tests are selected. Default uses the catalog and per-test settings, EnableAll enables all configured tests, and DisableAll disables all configured tests. |
| run_interval | Controls how often the controller processes pending validation work. |
| tests | Optional per-test overrides. Use the test identifiers reported by tests show for the running site. |

External Configuration

Some validation tests require external configuration, such as container registry credentials. Store those inputs as named external configs instead of embedding secrets in test definitions.

For example, to add or update the container authentication file:

nico-admin-cli machine-validation external-config add-update \
    --name container_auth \
    --description "Container registry credentials for machine validation" \
    --file-name /tmp/config.json

To view or remove external configuration:

nico-admin-cli machine-validation external-config show --name container_auth
nico-admin-cli machine-validation external-config remove --name container_auth

Managing the Test Catalog

List Tests

Use the test catalog to see the tests available in the site:

nico-admin-cli machine-validation tests show

Show a specific test:

nico-admin-cli machine-validation tests show --test-id <test_id>

Filter by platform or context:

nico-admin-cli machine-validation tests show --platforms <platform>
nico-admin-cli machine-validation tests show --contexts Discovery

Show unverified tests:

nico-admin-cli machine-validation tests show --show-un-verfied

The current CLI flag is spelled --show-un-verfied; use the spelling shown above.

Enable or Disable Tests

Enable a test when it should be eligible for selection:

nico-admin-cli machine-validation tests enable \
    --test-id <test_id> \
    --version <version>

Disable a test when it should not be selected:

nico-admin-cli machine-validation tests disable \
    --test-id <test_id> \
    --version <version>

Use the test_id and version values returned by tests show.

Verify Tests

Verify a test after it has been proven safe and correct for the target site:

nico-admin-cli machine-validation tests verify \
    --test-id <test_id> \
    --version <version>

Verification is a promotion step. A newly added test should be run on demand first, reviewed, and then marked verified before it is allowed into normal lifecycle validation.

Adding Site-Specific Tests

When test mutation workflows are enabled for the deployment, administrators can add tests to extend the validation framework. A site-specific test should be small, deterministic, non-interactive, and safe to run under platform control.

The following example adds a host-side smoke test:

nico-admin-cli machine-validation tests add \
    --name AcmeGpuSmoke \
    --description "Runs the Acme GPU smoke validation" \
    --command /usr/local/bin/acme-gpu-smoke \
    --args "--quick" \
    --contexts OnDemand \
    --supported-platforms <platform> \
    --timeout 1800 \
    --is-enabled false

The following example adds a container-based test:

nico-admin-cli machine-validation tests add \
    --name AcmeContainerHealth \
    --description "Runs Acme container health checks" \
    --command /opt/acme/health-check \
    --args "" \
    --img-name nvcr.io/nvidian/nvforge/acme-health:1.0.0 \
    --contexts Discovery \
    --supported-platforms <platform> \
    --external-config-file container_auth \
    --timeout 3600 \
    --is-enabled false

After adding a test:

  1. Run it on demand with --run-unverfied-tests.
  2. Review the result output.
  3. Enable the test if it should be selected.
  4. Verify the test after it is accepted for normal validation.
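
The steps above can be wrapped in a few shell helpers. This is a minimal sketch that only strings together the commands documented on this page. The function names and the `NICO_CLI` variable are hypothetical conveniences, and the sketch assumes the operator reviews the results manually between qualification and promotion.

```shell
#!/bin/sh
# Sketch only: helper names and NICO_CLI are hypothetical. The CLI
# subcommands and flags are the ones shown elsewhere in this guide.
NICO_CLI="${NICO_CLI:-nico-admin-cli}"

start_qualification() {
    # $1 = machine ID, $2 = test ID
    # Step 1: run the new test on demand, including unverified tests.
    $NICO_CLI machine-validation on-demand start \
        --machine "$1" \
        --allowed-tests "$2" \
        --run-unverfied-tests
}

show_results() {
    # $1 = machine ID
    # Step 2: print the results so the operator can review them.
    $NICO_CLI machine-validation results show --machine "$1"
}

promote_test() {
    # $1 = test ID, $2 = version
    # Steps 3 and 4: enable the test, then verify it only if enabling worked.
    $NICO_CLI machine-validation tests enable --test-id "$1" --version "$2" &&
        $NICO_CLI machine-validation tests verify --test-id "$1" --version "$2"
}
```

Keeping promotion in a separate function makes the review in step 2 an explicit operator decision rather than something the script decides. Setting `NICO_CLI="echo nico-admin-cli"` prints the commands instead of running them, which can be useful for a dry run.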

Execution Models

Machine Validation tests can be implemented as host commands or container-based commands.

| Model | When to use it | Common fields |
| --- | --- | --- |
| Host command | Use when the test tool is already present in the discovery environment or host filesystem. | --command, --args, --timeout |
| Container command | Use when the test needs a packaged dependency set or must run from a validation image. | --img-name, --container-arg, --external-config-file |
| Host filesystem execution | Use when a containerized test must execute against the host filesystem. | --execute-in-host true |

Tests can also declare output file locations with --extra-output-file and --extra-err-file when a command writes important diagnostics outside stdout or stderr. Keep those outputs concise. Scout records command output for result review, but Machine Validation is not a replacement for long-term log storage.

Updating Site-Specific Tests

Use tests update to change a mutable test definition:

nico-admin-cli machine-validation tests update \
    --test-id <test_id> \
    --version <version> \
    --timeout 3600 \
    --description "Updated validation timeout"

Use updates for site-specific tests only. Built-in tests may be read-only, depending on how they were delivered to the site.

Extension Design Guidelines

Use the following guidelines when designing a new validation test:

| Area | Recommendation |
| --- | --- |
| Naming | Use a stable, descriptive PascalCase name such as GpuFabricSmoke or StorageFioPath. Avoid embedding temporary incident names or one-off ticket IDs. |
| Scope | Keep each test focused on one hardware or software concern. Prefer separate tests over a large script that hides multiple failure modes. |
| Contexts | Use Discovery for pre-allocation checks, Cleanup for checks after release, and OnDemand for operator-triggered validation or test qualification. |
| Platform support | Map tests only to platforms where the command, devices, firmware, and drivers are expected to exist. |
| Verification | Treat verification as a release gate for the test definition. Do not verify a test until it has passed on representative hardware. |
| Timeouts | Set an explicit timeout that matches the expected runtime. Long tests should be intentional and documented. |
| Secrets | Use external config files for credentials and sensitive inputs. Do not pass secrets directly in command arguments. |
| Output | Write concise stdout and stderr that explains what failed. Use extra output files only for diagnostics that cannot be emitted directly. |
| Pre-conditions | Use pre-conditions to skip tests that do not apply to a machine rather than failing unrelated platforms. |
| Tags | Add tags when operators need to run a targeted suite such as gpu-smoke, storage, or burn-in. |
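
As an illustration of the scope, output, and pre-condition guidelines, the following is a hypothetical skeleton for the body of a site-specific host command. The function name, the messages, and the choice to report a skip with a zero exit code are all assumptions for this sketch. How a skip should be signalled depends on what the site documents for its tests.

```shell
#!/bin/sh
# Hypothetical skeleton for a site-specific validation command.
# run_check and its messages are illustrative, not part of NICo.

run_check() {
    # $1 = path of the device or file this test is about
    # Pre-condition: skip, rather than fail, when the target is absent.
    # Assumption: this site treats exit code 0 plus a SKIP line as a skip.
    if [ ! -e "$1" ]; then
        echo "SKIP: $1 not present on this machine"
        return 0
    fi
    # One focused check: the target must be readable.
    if [ ! -r "$1" ]; then
        echo "FAIL: $1 exists but is not readable" >&2
        return 1
    fi
    # One concise line on success keeps captured output small.
    echo "PASS: $1 is readable"
    return 0
}
```

A real script would end by calling `run_check` on its target so that the function's return code becomes the command's exit code, and would replace the readability check with the actual hardware or software check.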

Running On-Demand Validation

Start validation for a specific machine:

nico-admin-cli machine-validation on-demand start --machine <machine_id>

Run only selected contexts:

nico-admin-cli machine-validation on-demand start \
    --machine <machine_id> \
    --contexts OnDemand

Run selected tests:

nico-admin-cli machine-validation on-demand start \
    --machine <machine_id> \
    --allowed-tests <test_id_1> \
    --allowed-tests <test_id_2>

Run a tagged suite:

nico-admin-cli machine-validation on-demand start \
    --machine <machine_id> \
    --tags gpu-smoke

Run unverified tests during qualification:

nico-admin-cli machine-validation on-demand start \
    --machine <machine_id> \
    --allowed-tests <test_id> \
    --run-unverfied-tests

The current CLI flag is spelled --run-unverfied-tests; use the spelling shown above.

Viewing Runs and Results

Show validation runs:

nico-admin-cli machine-validation runs show

Show runs for one machine:

nico-admin-cli machine-validation runs show --machine <machine_id>

Include historical runs:

nico-admin-cli machine-validation runs show --machine <machine_id> --history

Show validation results for a machine:

nico-admin-cli machine-validation results show --machine <machine_id>

Show results for a specific validation run:

nico-admin-cli machine-validation results show --validation-id <validation_id>

Show a specific test result from a run:

nico-admin-cli machine-validation results show \
    --validation-id <validation_id> \
    --test-name <test_name>

Interpreting Results

Each test result records the command execution outcome, timing, exit code, and captured output. A non-zero exit code indicates failure unless the test command implements a documented skip or pre-condition behavior.

Scout captures stdout and stderr after the command exits. Captured output is bounded, so tests should print useful progress and final diagnostic information without producing unbounded logs. Live log streaming should not be assumed unless the deployment has additional logging integration.
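
One way for a test command to stay within bounded output is to send the full log to a file and print only the tail. The sketch below assumes nothing about Scout's actual capture limit. The helper name and its arguments are hypothetical, and whether the log file is then declared with --extra-output-file is a site decision.

```shell
#!/bin/sh
# Sketch only: run_bounded is a hypothetical helper, not part of NICo or Scout.

run_bounded() {
    # $1 = log file, $2 = number of trailing lines to print,
    # remaining arguments = the command to run
    log_file=$1
    max_lines=$2
    shift 2
    status=0
    # Send all output to the log file and remember the command's exit code.
    "$@" >"$log_file" 2>&1 || status=$?
    # Print only the last lines so the captured output stays small.
    tail -n "$max_lines" "$log_file"
    # Propagate the original exit code so a failure is still a failure.
    return "$status"
}
```

For example, `run_bounded /tmp/acme.log 50 /usr/local/bin/acme-gpu-smoke --quick` would print at most the last 50 lines of output while preserving the exit code of the wrapped command.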

When a validation run fails, review:

  • Whether the selected test is supported on the machine platform.
  • Whether the test was verified and enabled intentionally.
  • The command exit code and captured output.
  • Any referenced external configuration.
  • Whether the test timed out.
  • Whether a pre-condition skipped or changed the intended execution path.

Operational Guidance

Use Machine Validation as a controlled pre-allocation gate. Do not enable or verify a new test in the standard lifecycle until it has been qualified with on-demand runs on representative hardware.

For production sites:

  • Keep the built-in catalog enabled according to the site’s hardware and release policy.
  • Use short tests for routine lifecycle validation and long tests for burn-in, repair validation, or targeted on-demand workflows.
  • Prefer tags and allow lists for targeted validation instead of modifying the global catalog for temporary needs.
  • Keep site-specific test names stable across releases so operators can compare historical results.
  • Store registry credentials and sensitive test inputs as external config.
  • Review the output format of extension tests so failures are actionable from the CLI and admin UI.

Troubleshooting

| Symptom | Common causes | Next step |
| --- | --- | --- |
| No tests are selected | Feature disabled, tests disabled, tests unverified, context mismatch, platform mismatch, tags do not match, or allow list excludes all tests. | Run tests show with the relevant platform and context, and confirm enabled and verified state. |
| A new test does not run in lifecycle validation | The test is unverified or disabled. | Run the test on demand with --run-unverfied-tests, review the result, then enable and verify it. |
| A test is no longer listed after it is modified or updated | The updated test may have become unverified, or the current tests show filter may exclude its new context, platform, or version. | Run tests show --show-un-verfied --test-id <test_id>, review the updated definition, then re-enable or re-verify the test as needed. |
| A test fails only on one platform | Platform mapping is too broad, a platform-specific dependency is missing, or the test command assumes hardware that is not present. | Restrict supported-platforms or add a pre-condition. |
| A container test cannot start | Image name, registry credentials, or external config are incorrect. | Confirm the image exists and refresh container_auth. |
| A test times out | Timeout is too short, the test is hung, or the machine is unhealthy. | Review captured output and set a deliberate timeout for the test's expected runtime. |
| Result output is incomplete | The test wrote too much output, or wrote logs outside captured stdout and stderr. | Keep CLI output concise and write important diagnostics before exit. |
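
When no tests are selected or a test seems to have disappeared, the catalog checks suggested above can be run together. The helper below is a hypothetical convenience that only calls the tests show forms documented in this guide. It does not interpret the output, so the operator still has to read the enabled and verified state from it.

```shell
#!/bin/sh
# Sketch only: show_test_state and NICO_CLI are hypothetical names.
NICO_CLI="${NICO_CLI:-nico-admin-cli}"

show_test_state() {
    # $1 = test ID, $2 = machine platform, $3 = run context
    # Show the test definition, including unverified versions.
    $NICO_CLI machine-validation tests show --show-un-verfied --test-id "$1"
    # List the tests mapped to the machine platform.
    $NICO_CLI machine-validation tests show --platforms "$2"
    # List the tests eligible for the run context.
    $NICO_CLI machine-validation tests show --contexts "$3"
}
```

A test that appears in the first listing but not in the platform or context listings points to a mapping problem rather than an enablement or verification problem.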