Machine Validation is the in-band validation framework in NVIDIA Infra Controller (NICo) for checking a machine before it is made available to tenants. NICo uses Scout to run validation tests on the host, collect the results, and report them back to the site controller.
The framework is intended to be extensible. NICo provides a catalog of built-in hardware validation tests, and site administrators can add site-specific tests when the deployment enables test mutation workflows.
Machine Validation helps operators answer a simple question: is this machine healthy enough to enter or return to the tenant-ready pool?
NICo can run validation during lifecycle workflows such as discovery and release, and administrators can also start validation on demand for a specific machine. Each validation run selects tests based on context, platform support, test enablement, verification state, tags, and any allow list supplied by the operator.
In normal lifecycle validation, NICo runs only tests that are both enabled and verified. Unverified tests can be exercised through on-demand validation before they are promoted into the standard workflow.
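The selection rules described above can be sketched as a simple filter. This is an illustrative model only, not NICo or Scout code: the `CatalogTest` class, its field names, the `select_tests` function, and the "any matching tag" rule are assumptions made for the example.

```python
from dataclasses import dataclass


# Hypothetical catalog entry; field names are illustrative, not the NICo schema.
@dataclass(frozen=True)
class CatalogTest:
    test_id: str
    contexts: frozenset
    platforms: frozenset
    enabled: bool = True
    verified: bool = False
    tags: frozenset = frozenset()


def select_tests(catalog, context, platform, *, tags=None, allow_list=None,
                 include_unverified=False):
    """Return the catalog entries eligible for one validation run."""
    selected = []
    for test in catalog:
        if context not in test.contexts:
            continue  # test does not apply to this context
        if platform not in test.platforms:
            continue  # test does not support this platform
        if not test.enabled:
            continue  # disabled tests are never selected
        # Lifecycle runs require verification; on-demand runs may opt out.
        if not test.verified and not include_unverified:
            continue
        # Assumption: a test matches if it carries any requested tag.
        if tags and not (set(tags) & test.tags):
            continue
        if allow_list is not None and test.test_id not in allow_list:
            continue
        selected.append(test)
    return selected


catalog = [
    CatalogTest("cpu_check", frozenset({"Discovery", "OnDemand"}),
                frozenset({"platform-a"}), verified=True),
    CatalogTest("site_smoke", frozenset({"OnDemand"}),
                frozenset({"platform-a"}), verified=False),
]
lifecycle = select_tests(catalog, "Discovery", "platform-a")
qualification = select_tests(catalog, "OnDemand", "platform-a",
                             include_unverified=True)
```

In this sketch, the lifecycle run picks up only the verified test, while the on-demand run with `include_unverified=True` also picks up the unverified site test.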
This guide is written for site administrators, SREs, platform administrators, and
developers who manage or extend NICo machine validation. The examples assume the
operator has access to the target site through nico-admin-cli and has the
permissions required to view or modify machine validation configuration.
Before using Machine Validation, confirm the following:
Machine Validation runs while a machine is under platform control and before it is allocated to a tenant. Typical entry points include:
Ready.
Machine Validation complements SKU validation. SKU validation checks that the machine inventory matches the expected hardware model. Machine Validation runs tests on the machine to prove that the hardware and relevant host-side software paths behave correctly.
When a validation run starts, NICo and Scout select tests using the following criteria:
Discovery, Cleanup, or OnDemand.
On-demand validation can intentionally include unverified tests by using the
current CLI flag --run-unverfied-tests. The spelling of unverfied is part of
the current CLI interface and must be used exactly as shown.
The exact test IDs, versions, enabled state, and supported platforms are
deployment- and release-specific. Use nico-admin-cli machine-validation tests show as the source of truth for the running site.
The built-in catalog commonly includes the following test groups:
Built-in tests that are delivered through NICo migrations are normally read-only. Site-specific tests can be added and modified by administrators when the deployment enables those mutation APIs.
Machine Validation is controlled by the site configuration. A minimal configuration enables the feature:
A site can also control the catalog selection behavior:
Some validation tests require external configuration, such as container registry credentials. Store those inputs as named external configs instead of embedding secrets in test definitions.
For example, to add or update the container authentication file:
To view or remove external configuration:
Use the test catalog to see the tests available in the site:
Show a specific test:
Filter by platform or context:
Show unverified tests:
The current CLI flag is spelled --show-un-verfied; use the spelling shown
above.
Enable a test when it should be eligible for selection:
Disable a test when it should not be selected:
Use the test_id and version values returned by tests show.
Verify a test after it has been proven safe and correct for the target site:
Verification is a promotion step. A newly added test should be run on demand first, reviewed, and then marked verified before it is allowed into normal lifecycle validation.
When test mutation workflows are enabled for the deployment, administrators can add tests to extend the validation framework. A site-specific test should be small, deterministic, non-interactive, and safe to run under platform control.
The following example adds a host-side smoke test:
The following example adds a container-based test:
After adding a test:
--run-unverfied-tests.
Machine Validation tests can be implemented as host commands or container-based commands.
Tests can also declare output file locations with --extra-output-file and
--extra-err-file when a command writes important diagnostics outside stdout or
stderr. Keep those outputs concise. Scout records command output for result
review, but Machine Validation is not a replacement for long-term log storage.
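As a rough illustration of a small, non-interactive host-side test that keeps stdout concise, the sketch below writes its details to a separate report file and prints a single summary line. The function name `run_smoke_test`, its parameters, and the specific checks are hypothetical. The report path stands in for whatever location a test would declare with --extra-output-file.

```python
import os
from pathlib import Path


def run_smoke_test(min_cpus, required_paths, report_path):
    """Run a few quick host checks and return a process exit code.

    Returns 0 when every check passes and 1 otherwise. All inputs are
    arguments, so the test never prompts and is repeatable on the same host.
    """
    failures = []

    cpus = os.cpu_count() or 0
    if cpus < min_cpus:
        failures.append(f"cpu count {cpus} < required {min_cpus}")

    for path in required_paths:
        if not Path(path).exists():
            failures.append(f"missing path: {path}")

    # Details go to the report file; stdout carries one summary line only.
    if failures:
        report = "\n".join(failures) + "\n"
    else:
        report = "all checks passed\n"
    Path(report_path).write_text(report)

    print(f"smoke test: {len(failures)} failure(s), details in {report_path}")
    return 1 if failures else 0
```

A script built around this function would pass its return value to `sys.exit`, so that the exit code reflects the outcome of the checks.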
Use tests update to change a mutable test definition:
Use updates for site-specific tests only. Built-in tests may be read-only, depending on how they were delivered to the site.
Use the following guidelines when designing a new validation test:
Start validation for a specific machine:
Run only selected contexts:
Run selected tests:
Run a tagged suite:
Run unverified tests during qualification:
The current CLI flag is spelled --run-unverfied-tests; use the spelling shown
above.
Show validation runs:
Show runs for one machine:
Include historical runs:
Show validation results for a machine:
Show results for a specific validation run:
Show a specific test result from a run:
Each test result records the command execution outcome, timing, exit code, and captured output. A non-zero exit code indicates failure unless the test command implements a documented skip or pre-condition behavior.
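The exit-code rule above can be expressed as a small classifier. This is a sketch under assumptions: the `CommandResult` record and its field names are invented for the example, and the set of skip codes is whatever a given test documents, since the source does not define a standard skip code.

```python
from dataclasses import dataclass


# Hypothetical result record mirroring the fields described above.
@dataclass(frozen=True)
class CommandResult:
    test_id: str
    exit_code: int
    duration_seconds: float
    stdout: str = ""
    stderr: str = ""


def classify(result, skip_codes=frozenset()):
    """Map a command result to 'passed', 'skipped', or 'failed'.

    skip_codes holds the exit codes that this particular test documents as
    a skip or unmet pre-condition; it is empty by default.
    """
    if result.exit_code == 0:
        return "passed"
    if result.exit_code in skip_codes:
        return "skipped"
    return "failed"
```

Keeping the skip codes per test, rather than global, means an undocumented non-zero exit is always treated as a failure.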
Scout captures stdout and stderr after the command exits. Captured output is bounded, so tests should print useful progress and final diagnostic information without producing unbounded logs. Live log streaming should not be assumed unless the deployment has additional logging integration.
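To make the idea of bounded, post-exit capture concrete, the following sketch runs a command, waits for it to exit, and keeps only the last bytes of each stream. It is not Scout's implementation: the 4096-byte limit, the choice to keep the tail rather than the head, and the function names are assumptions for illustration.

```python
import subprocess
import sys

MAX_CAPTURE_BYTES = 4096  # assumed limit, chosen only for this example


def truncate_tail(data, limit):
    """Keep at most the last `limit` bytes of `data`."""
    if limit <= 0:
        return b""
    if len(data) <= limit:
        return data
    return data[-limit:]


def run_and_capture(argv, limit=MAX_CAPTURE_BYTES, timeout=60):
    """Run a command to completion and return (exit_code, stdout, stderr).

    Output is collected only after the process exits, so nothing is
    streamed while the command is still running.
    """
    completed = subprocess.run(argv, capture_output=True, timeout=timeout,
                               check=False)
    return (completed.returncode,
            truncate_tail(completed.stdout, limit),
            truncate_tail(completed.stderr, limit))


# Example: a child process that prints far more than the limit allows.
code, out, err = run_and_capture(
    [sys.executable, "-c", "print('x' * 1000)"], limit=100)
```

Under a tail-keeping policy like this one, a test that prints its final diagnostic summary last is the most likely to have that summary survive truncation.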
When a validation run fails, review:
Use Machine Validation as a controlled pre-allocation gate. Do not enable or verify a new test in the standard lifecycle until it has been qualified with on-demand runs on representative hardware.
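One way to make "qualified" concrete is an explicit promotion check over on-demand run history. The policy below is an assumption for illustration, not a NICo rule: every representative platform needs a minimum number of passing runs, and any failure on such a platform blocks promotion. The function name and the threshold of three passes are likewise hypothetical.

```python
def is_qualified(runs, required_platforms, min_passes=3):
    """Decide whether a test is ready to be verified.

    runs: iterable of (platform, passed) tuples from on-demand runs.
    required_platforms: the representative platforms that must be covered.
    """
    if not required_platforms:
        return False  # nothing representative was named, so nothing is proven

    passes = {platform: 0 for platform in required_platforms}
    for platform, passed in runs:
        if platform not in passes:
            continue  # runs on other platforms do not count either way
        if not passed:
            return False  # any failure on required hardware blocks promotion
        passes[platform] += 1

    return all(count >= min_passes for count in passes.values())
```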
For production sites: