Autonomous Recovery Engine (ARE) v1.3.1 Release Notes
New Features
Version 1.3.1 of the Autonomous Recovery Engine (ARE) has the following new features:
Developed the FACT attribution service as the anomaly attribution engine
Added an anomaly attribution pipeline to process slow attribution signals
Enhanced the auto-resume policy to support temporary and sticky node exclusion based on anomaly attributions
Integrated with Shoreline break and fix for diagnosis and repair of confirmed problematic hardware
Extended internal and external data models and HTTP endpoints to support persisting and querying node-centric data
Added new components to the Cockpit UI to display the attribution summary banner and attribution details
Added new types of notifications for job action events and anomaly attribution
Developed a client logging library that produces training lifecycle events in a standardized logging format
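To illustrate the difference between temporary and sticky node exclusion in the auto-resume policy, here is a minimal sketch. It is not ARE's actual API: the `ExclusionList` class, its methods, and the TTL-based expiry are assumptions made for illustration only. A temporary exclusion expires after a time-to-live, while a sticky exclusion stays in place until it is cleared explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ExclusionList:
    """Hypothetical model of node exclusion; not part of ARE."""

    # Maps node name to expiry time in seconds; None marks a sticky exclusion.
    entries: Dict[str, Optional[float]] = field(default_factory=dict)

    def exclude(self, node: str, now: float, ttl: Optional[float] = None) -> None:
        """Exclude a node temporarily (ttl given) or stickily (ttl omitted)."""
        # A sticky exclusion is never downgraded to a temporary one.
        if node in self.entries and self.entries[node] is None:
            return
        self.entries[node] = None if ttl is None else now + ttl

    def clear(self, node: str) -> None:
        """Remove an exclusion, for example after the hardware is repaired."""
        self.entries.pop(node, None)

    def excluded(self, now: float) -> Set[str]:
        """Return the nodes to skip when resuming a job at time `now`."""
        return {
            node
            for node, expiry in self.entries.items()
            if expiry is None or expiry > now
        }


exclusions = ExclusionList()
exclusions.exclude("node-a", now=0.0, ttl=600.0)  # temporary, 10 minutes
exclusions.exclude("node-b", now=0.0)             # sticky
print(sorted(exclusions.excluded(now=100.0)))
print(sorted(exclusions.excluded(now=700.0)))
```

In this sketch, `node-a` drops off the exclusion set once its TTL has passed, whereas `node-b` remains excluded until `clear("node-b")` is called.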
Improvements
ARE version 1.3.1 contains the following improvements:
Refined attribution categorization rulesets for common XID and SXID errors
Added a job state-based anomaly detection mechanism to continue the auto-resume process for unregistered failure patterns
Fixed corner cases in which the job state and the job state reason were inconsistent
Migrated to the Kubernetes native leader election mechanism and removed etcd as a dependency
Extended the watcher ruleset to recognize training lifecycle events produced by the client logging library
Improved the KPI service and dashboard using the training lifecycle events
Fully deprecated DynamoDB support
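The relationship between the client logging library and the watcher ruleset can be sketched as follows. The release notes do not specify the log format, so everything here is an assumption: the `[LIFECYCLE]` prefix, the JSON payload fields, and the `format_event` and `parse_event` helpers are hypothetical names chosen for illustration. The idea is that a training job emits events in one fixed format, and a watcher rule recognizes exactly that format in the job logs.

```python
import json
import re
from typing import Optional

# Hypothetical marker that identifies a lifecycle event line in a job log.
EVENT_PREFIX = "[LIFECYCLE]"

# Hypothetical watcher rule: the marker followed by a JSON object.
_LIFECYCLE_RULE = re.compile(r"^\[LIFECYCLE\] (\{.*\})$")


def format_event(event: str, job_id: str, step: int, ts: float) -> str:
    """Client side: render one lifecycle event as a single log line."""
    payload = {"event": event, "job_id": job_id, "step": step, "ts": ts}
    return f"{EVENT_PREFIX} {json.dumps(payload, sort_keys=True)}"


def parse_event(line: str) -> Optional[dict]:
    """Watcher side: return the event payload, or None for other log lines."""
    match = _LIFECYCLE_RULE.match(line.strip())
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The line looked like an event but carried a malformed payload.
        return None


line = format_event("checkpoint_saved", job_id="job-1", step=100, ts=1.5)
print(line)
print(parse_event(line))
print(parse_event("loss=0.42 step=100"))
```

Because both sides share one format, events such as checkpoint saves could in principle feed KPI computations without per-framework log parsing; how ARE actually does this is not described in these notes.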