Changes and New Features
Feature/Change | Description |
Switches | Added support for NVIDIA Quantum-2 switches with NDR speed |
Adapter Cards | Added support for NVIDIA ConnectX-7 adapter card with 400 Gb/s speed |
SHARPD | The sharpd daemon has been removed. sharpd-related activity is now performed from within the user application process |
AM | After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs |
AM | Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them |
General | Added support for the new data types BFLOAT16, INT8 and UINT8 in reduction operations |
Parameter | Component | Description |
recovery_retry_interval | sharp_am | New parameter: A timeout, in seconds, for tree recovery retries. A value of 0 disables tree recovery. Default: 300 |
enable_seamless_restart | sharp_am | New parameter: A boolean flag. If enabled, AM tries to recover its state from the last AM run and continue operating the current jobs. Default: True |
seamless_restart_trees_file | sharp_am | New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using ‘dump_dir’. Default: sharp_am_trees_structure.dump |
seamless_restart_max_retries | sharp_am | New parameter: Sets the number of consecutive seamless restart retries. If seamless restart fails more than this number of times in a row, it is disabled in the next run. Default: 3 |
max_tree_radix | sharp_am | Update: Changed default to 252 |
ib_sat_max_mtu | sharp_am | Update: Changed default to 5, to support the MAD value that represents 4K MTU. |
per_prio_default_quota | sharp_am | Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch. |
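
The semantics of the new seamless restart and recovery parameters can be sketched in a few lines of Python. This is only an illustration of the descriptions in the table above, not the AM implementation: the class and function names are hypothetical, the `/var/log/sharp` directory in the usage line is a made-up value for ‘dump_dir’, and the exact boundary at which seamless restart is disabled (more than `seamless_restart_max_retries` consecutive failures) is an assumption read from the wording of the description.

```python
import os
from dataclasses import dataclass


@dataclass
class SharpAmDefaults:
    """Hypothetical container holding the documented defaults of the new parameters."""
    recovery_retry_interval: int = 300          # seconds; 0 means no tree recovery
    enable_seamless_restart: bool = True
    seamless_restart_trees_file: str = "sharp_am_trees_structure.dump"
    seamless_restart_max_retries: int = 3


def trees_file_path(dump_dir: str, cfg: SharpAmDefaults) -> str:
    # Only the file name is configured; the full path is built from dump_dir.
    return os.path.join(dump_dir, cfg.seamless_restart_trees_file)


def tree_recovery_enabled(cfg: SharpAmDefaults) -> bool:
    # A value of 0 means "do not try to recover trees".
    return cfg.recovery_retry_interval > 0


def seamless_restart_allowed(consecutive_failures: int, cfg: SharpAmDefaults) -> bool:
    # Assumption: seamless restart stays enabled until the number of
    # consecutive failures exceeds seamless_restart_max_retries.
    return (cfg.enable_seamless_restart
            and consecutive_failures <= cfg.seamless_restart_max_retries)


if __name__ == "__main__":
    cfg = SharpAmDefaults()
    print(trees_file_path("/var/log/sharp", cfg))   # hypothetical dump_dir
    print(tree_recovery_enabled(cfg))
    print(seamless_restart_allowed(4, cfg))
```

With the defaults, a fourth consecutive failure would leave seamless restart disabled for the next run under this reading, while setting `recovery_retry_interval` to 0 turns tree recovery off entirely.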