Autonomous Recovery Engine (ARE) v1.3.1 Release Notes

New Features

Version 1.3.1 of the Autonomous Recovery Engine (ARE) has the following new features:

  • Developed the FACT attribution service as the anomaly attribution engine

  • Added an anomaly attribution pipeline to process slow attribution signals

  • Enhanced the auto-resume policy to support temporary and sticky node exclusion based on anomaly attributions

  • Integrated with Shoreline break and fix to diagnose and repair hardware confirmed to be problematic

  • Extended internal and external data models and HTTP endpoints to support persisting and querying node-centric data

  • Added new components to the Cockpit UI to display the attribution summary banner and attribution details

  • Added new types of notifications for job action events and anomaly attribution

  • Developed a client logging library that produces training lifecycle events in a standardized logging format
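
To illustrate the difference between temporary and sticky node exclusion, here is a minimal sketch. It is not ARE's implementation: the class names, the `ttl_seconds` parameter, and the rule that a sticky exclusion is never downgraded are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Exclusion:
    """One excluded node (hypothetical model, not ARE's data model)."""
    node: str
    reason: str
    expires_at: Optional[float]  # None marks a sticky exclusion

    def active(self, now: float) -> bool:
        # Sticky exclusions never expire; temporary ones lapse at expires_at.
        return self.expires_at is None or now < self.expires_at


@dataclass
class ExclusionList:
    entries: Dict[str, Exclusion] = field(default_factory=dict)

    def exclude(self, node: str, reason: str, now: float,
                ttl_seconds: Optional[float] = None) -> None:
        """Exclude a node temporarily (ttl_seconds given) or stickily (None)."""
        existing = self.entries.get(node)
        # Assumption: a sticky exclusion is never replaced by a temporary one.
        if existing is not None and existing.expires_at is None:
            return
        expires_at = None if ttl_seconds is None else now + ttl_seconds
        self.entries[node] = Exclusion(node, reason, expires_at)

    def schedulable(self, nodes: List[str], now: float) -> List[str]:
        """Return the nodes a resumed job may use at time `now`."""
        return [
            n for n in nodes
            if n not in self.entries or not self.entries[n].active(now)
        ]
```

In this sketch, a node excluded with `ttl_seconds=600` becomes schedulable again ten minutes later, while a node excluded without a TTL stays out until an operator removes it by some other means.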

Improvements

ARE version 1.3.1 contains the following improvements:

  • Refined attribution categorization rulesets for common XID and SXID errors

  • Added a job state-based anomaly detection mechanism to continue the auto-resume process for unregistered failure patterns

  • Fixed corner cases in which the job state and the job state reason were inconsistent

  • Migrated to the Kubernetes native leader election mechanism and removed etcd as a dependency

  • Extended the watcher ruleset to recognize training lifecycle events produced by the client logging library

  • Improved the KPI service and dashboard based on the training lifecycle events

  • Fully deprecated DynamoDB support
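
The relationship between standardized training lifecycle events and a watcher rule that recognizes them can be sketched as follows. The actual log format used by the client logging library is not specified in these notes, so the `[LIFECYCLE]` prefix, the JSON payload, and the field names `event`, `job_id`, and `ts` are all hypothetical.

```python
import json
import re
from typing import Optional

# Hypothetical marker; the real library's format is not described here.
EVENT_PREFIX = "[LIFECYCLE]"

# Matches the marker followed by a JSON object at the end of a log line.
_EVENT_LINE = re.compile(re.escape(EVENT_PREFIX) + r" (\{.*\})\s*$")


def format_event(event: str, job_id: str, timestamp: float, **fields) -> str:
    """Render one lifecycle event as a single, machine-parseable log line."""
    payload = {"event": event, "job_id": job_id, "ts": timestamp, **fields}
    return f"{EVENT_PREFIX} {json.dumps(payload, sort_keys=True)}"


def parse_event(line: str) -> Optional[dict]:
    """Return the event payload if the line carries one, otherwise None."""
    match = _EVENT_LINE.search(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Only accept JSON objects that name an event.
    if isinstance(payload, dict) and "event" in payload:
        return payload
    return None
```

A fixed prefix plus a structured payload lets a watcher pick lifecycle events out of ordinary training output with one rule, instead of one pattern per framework-specific message.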