Nodewright and nodewright-customizations are two halves of the integration. Nodewright is a Kubernetes Operator that applies nodewright packages with consistent, repeatable, and tested lifecycles within a cluster. Nodewright-customizations are instances of the Skyhook Custom Resource that define one or more nodewright packages to deploy. These packages were selected to provide two main functions:
Uses tuned to apply a sequence of profiles that primarily optimize grub and sysctl settings. Results may vary when not running on bare metal, depending on the particulars of the virtualization.
Package: nvidia-tuned
Configuration documentation:
A full configuration supplies intent, accelerator, and service. A minimal configuration supplies only accelerator.
Supported accelerators: h100, gb200
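The difference between a minimal and a full configuration can be sketched in Python. This is an illustration only, not the package's actual schema: the `TunedConfig` class, its field defaults, and the `is_full` property are hypothetical names, and the accelerator set is simply copied from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Accelerators named in this document; hard-coded here for illustration only.
SUPPORTED_ACCELERATORS = {"h100", "gb200"}


@dataclass(frozen=True)
class TunedConfig:
    """Hypothetical container for the three nvidia-tuned inputs."""

    accelerator: str               # the only required field (minimal configuration)
    intent: Optional[str] = None   # optional in this sketch
    service: Optional[str] = None  # optional in this sketch

    def __post_init__(self) -> None:
        # Reject accelerators that are not in the supported list.
        if self.accelerator not in SUPPORTED_ACCELERATORS:
            raise ValueError(
                f"unsupported accelerator {self.accelerator!r}; "
                f"expected one of {sorted(SUPPORTED_ACCELERATORS)}"
            )

    @property
    def is_full(self) -> bool:
        """True when intent, accelerator, and service are all supplied."""
        return self.intent is not None and self.service is not None


minimal = TunedConfig(accelerator="h100")
full = TunedConfig(accelerator="gb200", intent="inference", service="eks")
```

Here `minimal.is_full` is `False` and `full.is_full` is `True`; constructing a `TunedConfig` with an accelerator outside the supported set raises `ValueError`.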
Integration notes:
eks has overrides that remove the kernel.sched_latency_ns and kernel.sched_min_granularity_ns settings, because these are not available on AWS kernels. Such settings cannot fail silently: the package tests that each requested change actually took effect and errors if it did not.
A second, more stripped-down optimizer is available for operating systems that are mostly read-only, such as GKE's Container-Optimized OS (COS). In this case the nvidia-tuning-gke package performs the sysctl writes directly. Also note the change in Nodewright configuration, which writes to a different directory tree in order to have a writable filesystem and to re-apply changes on every boot: recipes/overlays/gke-cos.yaml
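The verify-after-apply behaviour described in the integration notes can be sketched as follows. This is a minimal illustration of the idea, not the package's own check: the function names are hypothetical, the `reader` parameter exists only so the check can be exercised without touching a real kernel, and the key-to-path mapping ignores the rare sysctl names that contain literal dots.

```python
from pathlib import Path
from typing import Callable, Dict, List


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key such as 'vm.swappiness' to its procfs path."""
    return Path(root) / key.replace(".", "/")


def read_sysctl(key: str) -> str:
    """Read the live value of a sysctl key from /proc/sys."""
    return sysctl_path(key).read_text().strip()


def verify_sysctls(
    expected: Dict[str, str],
    reader: Callable[[str], str] = read_sysctl,
) -> None:
    """Raise RuntimeError if any expected key is missing or has another value."""
    problems: List[str] = []
    for key, want in expected.items():
        try:
            got = reader(key)
        except OSError:
            # The key does not exist on this kernel, e.g. a setting the
            # distribution's kernel does not expose.
            problems.append(f"{key}: not available on this kernel")
            continue
        # Compare token by token so tab- or space-separated values match.
        if got.split() != want.split():
            problems.append(f"{key}: wanted {want!r}, found {got!r}")
    if problems:
        raise RuntimeError("sysctl verification failed: " + "; ".join(problems))
```

Under this sketch, a key that is absent from the running kernel is reported as an error rather than skipped, which is why settings a platform cannot provide have to be removed from the requested set instead of being ignored.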
Both of these packages (nvidia-tuned and nvidia-tuning-gke) extend other nodewright packages (tuned and tuning), so the base packages could be used directly with the configuration provided via configmaps. Specific versioned packages were chosen instead to provide a clearer path for upgrades and for understanding the differences between versions. The base packages remain useful for iterating quickly on configurations without requiring new versions of the extended packages used in AICR.
Uses a set of bash scripts to bring an Ubuntu worker to the desired AICR spec.
Package: nvidia-setup
Currently supports: eks with gb200 or h100. Each service must be added explicitly; the readme documents how to add one.
The version overview lists what each version installs or configures for a given service + accelerator pair.
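The "added explicitly" rule for nvidia-setup amounts to an allow-list of service + accelerator pairs. The sketch below shows that idea in Python under stated assumptions: nvidia-setup itself is a set of bash scripts, and `SUPPORTED_PAIRS` and `require_supported` are hypothetical names populated only with the pairs named above.

```python
from typing import Dict, Set

# Explicit allow-list: service -> accelerators, taken from the pairs named above.
SUPPORTED_PAIRS: Dict[str, Set[str]] = {
    "eks": {"h100", "gb200"},
}


def require_supported(service: str, accelerator: str) -> None:
    """Raise ValueError unless the service + accelerator pair is listed."""
    accelerators = SUPPORTED_PAIRS.get(service)
    if accelerators is None:
        # Unknown services are rejected rather than handled with defaults.
        raise ValueError(f"service {service!r} has not been added")
    if accelerator not in accelerators:
        raise ValueError(
            f"accelerator {accelerator!r} is not supported on {service!r}"
        )
```

Rejecting unknown services outright, rather than falling back to defaults, mirrors the requirement that each new service be added deliberately.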
Includes the setup and optimizations for a specific service, accelerator, and intent. Note that although it has if statements around service and intent, the inclusion of nvidia-setup (which requires service) means these values are not optional today; the packages would need to be split out to properly support omitting them. Currently tested with:
To support non-service-specific tuning (for example, h100 + inference), either the nvidia-tuned package would need to be separated out, or nvidia-setup would need to be updated to support additional services or to make fewer assumptions about what it installs. It is currently opinionated towards EKS.
See recipes/components/nodewright-customizations/manifests/tuning.yaml
A GKE + Container-Optimized OS (COS) specific tuning that sets only some of the sysctl settings and does NOT require any interrupts, because it can be applied while workloads are running.
See recipes/components/nodewright-customizations/manifests/tuning-gke.yaml
A no-op package may be used as a placeholder until a full package suite can be tested. See recipes/components/nodewright-customizations/manifests/no-op.yaml