Host Validation
Getting Started
Overview
This page provides a workflow for machine validation in NCX Infra Controller (NICo).
Machine validation is the process of testing and verifying the hardware components and peripherals of a machine before it is handed over to a tenant. Its purpose is to avoid disrupting tenant usage and to ensure that the machine meets the expected benchmarks and performance. It involves running a series of regression tests and burn-in tests that stress the machine to its maximum capability and expose potential issues or failures. By performing machine validation, NICo ensures that the machine is in optimal condition and that hardware issues are detected and resolved before they affect the tenant's workloads.
Machine validation is performed using different tools, which are available in the discovery image. Most of these tools require root privileges and are non-interactive. The tools run the tests and send the results to the site controller.
Purpose
End-to-end user guide for the machine validation feature in NICo.
Audience
SRE, Provider admin, Developer
Prerequisites
- Access to NICo sites
Features and Functionalities
Features
Feature gate
The NICo site controller has site settings that provide mechanisms to enable and disable features. The machine validation feature is controlled through these settings: the feature gate enables or disables machine validation at deploy time.
Test case management
Test case management is the process of adding and updating test cases. There are two types of test cases:
- Test cases added during deployment: These are common across all sites and are read-only. They are added through NICo DB migration.
- Site-specific test cases: These are added by the site admin.
Enable/disable tests
If a test case is enabled, forge-scout selects it for running.
Verify tests
When a site admin adds a test case, its verified flag is set to false by default. An unverified test case is one that has been added to the NICo datastore but has not yet been verified on hardware. By default, forge-scout never runs unverified test cases. Using on-demand machine validation, an admin can run unverified test cases.
View test results
Once forge-scout completes the test cases, the view results feature gives a detailed report of the executed test cases.
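The enable and verify rules above can be sketched in a few lines of Python. This is a minimal illustration, not forge-scout's actual logic: the `TestCase` fields and the `allow_unverified` parameter are hypothetical names, and the sketch assumes that an unverified test must still be enabled to be picked up.

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    # Hypothetical fields; the real NICo datastore schema may differ.
    name: str
    enabled: bool = False   # test cases are disabled by default
    verified: bool = False  # site-added test cases start unverified


def select_tests(tests, allow_unverified=False):
    """Return the names of the test cases that would be selected to run.

    A test case must be enabled. Unverified test cases are skipped unless
    allow_unverified is set, which models an on-demand run requested by an admin.
    """
    return [
        t.name
        for t in tests
        if t.enabled and (t.verified or allow_unverified)
    ]


catalog = [
    TestCase("forge_CudaSample", enabled=True, verified=True),
    TestCase("site_custom_test", enabled=True, verified=False),  # hypothetical site test
    TestCase("forge_FioSSD", enabled=False, verified=True),
]

print(select_tests(catalog))                         # regular run
print(select_tests(catalog, allow_unverified=True))  # on-demand run that includes unverified tests
```

In this sketch a regular run picks only the enabled and verified test, while the on-demand variant also picks the enabled but unverified one.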
On Demand tests
If a machine has not been allocated for a long time and remains in the ready state, the site admin can run on-demand testing. Only the selected tests are run.
List of test cases
How to use Machine Validation feature
Initial setup
NICo has a Machine validation feature gate. By default the feature is disabled.
To enable this feature, add the following section in the API site config TOML file (/site/site-controller/files/carbide-api/carbide-api-site-config.toml):
Machine Validation allows site operators to configure the NGC container registry
Finally add the config to the site:
Note: You can copy the imagepullsecret from Kubernetes with the following command:
kubectl get secrets -n forge-system imagepullsecret -o yaml | awk '$1==".dockerconfigjson:" {print $2}'
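Kubernetes stores the `.dockerconfigjson` value base64-encoded, so the string printed by the command above is not human-readable. As a hedged sanity check before pasting it into the site config, the following Python sketch decodes such a value and lists the registries it covers. The helper name `registries_in_pull_secret` and the sample secret are made up for this example.

```python
import base64
import json


def registries_in_pull_secret(encoded):
    """Decode a base64 .dockerconfigjson value and return its registry hosts."""
    config = json.loads(base64.b64decode(encoded))
    return sorted(config.get("auths", {}))


# Placeholder secret built locally for illustration; not a real credential.
sample = base64.b64encode(
    json.dumps({"auths": {"nvcr.io": {"auth": "cGxhY2Vob2xkZXI="}}}).encode()
).decode()

print(registries_in_pull_secret(sample))
```

If the NGC registry host does not appear in the list, the secret is unlikely to allow pulling validation images from it.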
Enable test cases
By default all the test cases are disabled.
To enable tests
Note: Due to a known bug, the current workaround is to use two commands. This will be fixed in an upcoming release.
For example, to enable forge_CudaSample, execute the following steps:
Enabling different test cases
CPU Benchmarking test cases
- forge_CpuBenchmarkingFp
- forge_CpuBenchmarkingInt
CUDA sample test cases
- forge_CudaSample
FIO test cases
- forge_FioFile
- forge_FioPath
- forge_FioSSD
Memory test cases
- forge_MmMemBandwidth
- forge_MmMemLatency
- forge_MmMemPeakBandwidth
NV test cases
- forge_Nvbandwidth
stress-ng test cases
- forge_CPUTestLong
- forge_CPUTestShort
- forge_MemoryTestLong
- forge_MemoryTestShort
- forge_MqStresserLong
- forge_MqStresserShort
DCGMI test cases
- forge_DcgmFullShort
- forge_DcgmFullLong
Shoreline Agent test case
- forge_ForgeRunBook
Verify tests
If a test case is added or modified by the site admin, its verified flag is set to false by default.
To mark a test as verified
For example, to mark forge_CudaSample as verified, execute the following steps:
Add test case
Site admin can add test cases per site.
Add new test case
Usage: carbide-admin-cli machine-validation tests add [OPTIONS] --name <NAME> --command <COMMAND> --args <ARGS>
Options:
For example, to add a test case that prints 'newtest':
By default, the new test case's verified flag is set to false. Mark it as verified (see Verify tests) once it has been checked on hardware.
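As a hedged illustration of the usage line above, the Python sketch below assembles the argument list for `tests add` using only the three required flags shown in the usage. The test name, command path, and argument are hypothetical values chosen to match the "prints 'newtest'" example; optional flags are not shown because they are not listed here.

```python
import shlex


def build_add_test_command(name, command, args):
    """Build the argv list for 'machine-validation tests add' from its required flags."""
    return [
        "carbide-admin-cli", "machine-validation", "tests", "add",
        "--name", name,
        "--command", command,
        "--args", args,
    ]


# Hypothetical values: a test named 'newtest' that runs echo with one argument.
argv = build_add_test_command("newtest", "/bin/echo", "newtest")
print(shlex.join(argv))
```

The sketch only prints the command. On a host where carbide-admin-cli is installed and configured, the same list could be passed to `subprocess.run(argv, check=True)`.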
Update test case
Site admin can update existing test cases.
Update existing test case
Usage: carbide-admin-cli machine-validation tests update [OPTIONS] --test-id <TEST_ID> --version <VERSION>
Options:
Fields of a test case can be updated selectively. Once a test case is updated, its verified flag is reset to false, and the site admin has to explicitly mark it as verified again.
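The update behavior described above (change only the chosen fields, then reset the verified flag) can be modeled as follows. This is a sketch under assumptions: the `TestCase` fields, the `update_test` helper, and the test ID are hypothetical and do not reflect the actual NICo schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TestCase:
    # Hypothetical fields used only for this illustration.
    test_id: str
    command: str
    args: str
    verified: bool


def update_test(test, **changes):
    """Apply the selected field changes and reset the verified flag."""
    changes.pop("verified", None)  # an update never carries the verified flag forward
    return replace(test, verified=False, **changes)


before = TestCase("site_echo_test", "/bin/echo", "hello", verified=True)
after = update_test(before, args="hello again")

print(after)
```

Only `args` changes in this example; `command` is carried over untouched, and `verified` drops back to false until an admin verifies the test again.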
Run On-Demand Validation
Machine validation has three contexts:
- Discovery - Test cases with this context are executed during node ingestion.
- Cleanup - Test cases with this context are executed during node cleanup (between tenants).
- On-Demand - Test cases with this context are executed when on-demand machine validation is triggered.
Start on demand machine validation
Usage: carbide-admin-cli machine-validation on-demand start [OPTIONS] --machine <MACHINE>
Options:
Use case 1 - Run tests whose context is On-Demand
Use case 2 - Run tests whose context is Discovery
Use case 3 - Run a specific test case
Use case 4 - Run the unverified forge_CudaSample test case
View results
This feature shows the progress of ongoing machine validation runs.
Show Runs
Usage: carbide-admin-cli machine-validation runs show [OPTIONS]
Options:
To view individual completed test results, use the results command. By default, it shows only the tests from the last run in each context (Discovery, On-Demand, Cleanup).
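The "last run in each context" default can be illustrated with a small grouping function. The record fields (`validation_id`, `context`, `ended`) and the sample data are hypothetical; the actual fields reported by NICo may differ.

```python
from datetime import datetime

# Hypothetical run records; the real fields returned by NICo may differ.
runs = [
    {"validation_id": "run-1", "context": "Discovery", "ended": datetime(2024, 1, 1, 9, 0)},
    {"validation_id": "run-2", "context": "OnDemand", "ended": datetime(2024, 1, 2, 9, 0)},
    {"validation_id": "run-3", "context": "OnDemand", "ended": datetime(2024, 1, 3, 9, 0)},
]


def latest_run_per_context(runs):
    """Keep only the most recently finished run for each context."""
    latest = {}
    for run in runs:
        current = latest.get(run["context"])
        if current is None or run["ended"] > current["ended"]:
            latest[run["context"]] = run
    return {ctx: run["validation_id"] for ctx, run in latest.items()}


print(latest_run_per_context(runs))
```

Here the older on-demand run is hidden by the newer one, which mirrors why an earlier run has to be requested explicitly, for example by its validation ID.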
Show results
Usage: carbide-admin-cli machine-validation results show [OPTIONS] <--validation-id <VALIDATION_ID>|--test-name <TEST_NAME>|--machine <MACHINE>>
Options:
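The usage line for `results show` requires a selector: `--validation-id`, `--test-name`, or `--machine`. The hedged Python sketch below builds the argument list and rejects a call with no selector. The helper name and the machine identifier are placeholders, and the sketch does not check whether the CLI accepts several selectors at once.

```python
import shlex


def build_results_show_command(validation_id=None, test_name=None, machine=None):
    """Build the argv list for 'machine-validation results show'.

    The usage line requires a selector, so this raises ValueError if none is given.
    """
    selectors = [
        ("--validation-id", validation_id),
        ("--test-name", test_name),
        ("--machine", machine),
    ]
    argv = ["carbide-admin-cli", "machine-validation", "results", "show"]
    for flag, value in selectors:
        if value is not None:
            argv += [flag, value]
    if len(argv) == 4:  # nothing was appended after the four base words
        raise ValueError("pass at least one of --validation-id, --test-name, --machine")
    return argv


# Hypothetical machine identifier used as a placeholder.
print(shlex.join(build_results_show_command(machine="example-machine-id")))
```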
How to add new platform support
To add a new platform for individual tests
- Get system sku id-