Changes and New Features

| Feature/Change | Description |
|---|---|
| Switches | Added support for NVIDIA Quantum-2 switches with NDR speed. |
| Adapter Cards | Added support for the NVIDIA ConnectX-7 adapter card with 400 Gb/s speed. |
| SHARPD | The sharpd daemon process has been removed. sharpd-related activity is now performed from the user application process. |
| AM | After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs. |
| AM | Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them. |
| General | Added support for the new data types BFLOAT16, INT8 and UINT8 for performing reduction operations. |
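
For readers unfamiliar with the new reduction data types, the following is a minimal Python sketch of their layouts. It illustrates the formats only and does not show how SHARP implements reductions. The helper names are hypothetical, and the conversion uses simple truncation, which is an assumption (other implementations may round instead).

```python
import struct

# Value ranges of the two new 8-bit integer types.
INT8_RANGE = (-128, 127)
UINT8_RANGE = (0, 255)


def float32_to_bfloat16_bits(value: float) -> int:
    """Return the 16-bit BFLOAT16 pattern for a value (hypothetical helper).

    BFLOAT16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa
    bits of an IEEE 754 float32, so truncation is a 16-bit right shift.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    return bits32 >> 16


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Expand a 16-bit BFLOAT16 pattern back into a float (hypothetical helper)."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


if __name__ == "__main__":
    bits = float32_to_bfloat16_bits(3.14159)
    print(hex(bits), bfloat16_bits_to_float32(bits))
```

BFLOAT16 keeps the full exponent range of float32 at reduced precision, which is why the round trip above returns a nearby value rather than the original one.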

| Parameter | Component | Description |
|---|---|---|
| recovery_retry_interval | sharp_am | New parameter: A timeout, in seconds, for tree recovery retries. A value of 0 means do not try to recover trees. Default: 300 |
| enable_seamless_restart | sharp_am | New parameter: A boolean flag. If enabled, AM tries to recover its state from the last AM run and continue the operation of the current jobs. Default: True |
| seamless_restart_trees_file | sharp_am | New parameter: Sets the SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using 'dump_dir'. Default: sharp_am_trees_structure.dump |
| seamless_restart_max_retries | sharp_am | New parameter: Sets the number of consecutive retries of seamless restart. If seamless restart fails more than this number of times in a row, it is disabled in the next run. Default: 3 |
| max_tree_radix | sharp_am | Update: Changed default to 252. |
| ib_sat_max_mtu | sharp_am | Update: Changed default to 5, to support the MAD value that represents a 4K MTU. |
| per_prio_default_quota | sharp_am | Update: Changed default to 3 instead of 20, enabling more SAT jobs to take place in parallel on each switch. |
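
The parameter semantics above can be sketched in Python as follows. This is an illustrative model only, not sharp_am code: all helper names are hypothetical, the dump directory is a placeholder, and the exact boundary at which seamless restart is disabled is an assumption based on the wording "fails more times in a row". The MTU decoding follows the standard InfiniBand MTU enumeration (1 = 256 bytes through 5 = 4096 bytes).

```python
import os

# Defaults as listed in the parameter table (parameter names in lowercase).
SHARP_AM_DEFAULTS = {
    "recovery_retry_interval": 300,
    "enable_seamless_restart": True,
    "seamless_restart_trees_file": "sharp_am_trees_structure.dump",
    "seamless_restart_max_retries": 3,
    "max_tree_radix": 252,
    "ib_sat_max_mtu": 5,
    "per_prio_default_quota": 3,
}


def mtu_code_to_bytes(code: int) -> int:
    """Decode an InfiniBand MTU enumeration value (1..5) into bytes."""
    if not 1 <= code <= 5:
        raise ValueError(f"unsupported MTU code: {code}")
    return 128 << code  # 1 -> 256, 2 -> 512, ..., 5 -> 4096


def tree_recovery_enabled(recovery_retry_interval: int) -> bool:
    """A value of 0 means trees are not recovered."""
    return recovery_retry_interval != 0


def trees_file_path(dump_dir: str, file_name: str) -> str:
    """Build the full trees file path from 'dump_dir' and the file name."""
    return os.path.join(dump_dir, file_name)


def seamless_restart_enabled_next_run(consecutive_failures: int,
                                      max_retries: int) -> bool:
    """Assumed rule: disabled once failures in a row exceed max_retries."""
    return consecutive_failures <= max_retries


if __name__ == "__main__":
    cfg = SHARP_AM_DEFAULTS
    print(mtu_code_to_bytes(cfg["ib_sat_max_mtu"]))
    # "/tmp/sharp_dumps" is a placeholder, not a documented default.
    print(trees_file_path("/tmp/sharp_dumps", cfg["seamless_restart_trees_file"]))
```

With the new default of 5, `mtu_code_to_bytes` yields 4096 bytes, which is the 4K MTU referred to in the table.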

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.