Page History
Feature/Change | Description |
---|---|
Rev 1.8.1 | |
Aggregation Manager (sharp_am) | |
Resiliency | Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup. |
Mellanox SHARP Daemon (sharpd) | |
Out-of-Box Improvements | Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED. |
Rev 1.7.2 | |
Mellanox SHARP Daemon (sharpd) | |
Bug Fix | Fixed the issue where mcast distribution did not work with the IPoIB interface name format. |
Rev 1.7.1 | |
Aggregation Manager (sharp_am) | |
Resources Allocation | Added support for the following resources allocation features:
|
Mellanox SHARP Daemon (sharpd)/Job Scheduler | |
Resources Allocation | Added support for configuring job priority and requested resources in percentage in managed mode. |
Rev 1.5 | |
Aggregation Manager (sharp_am) | |
Fabric Topologies | Added support for DF+ topologies. |
Virtualization | Added support for running Mellanox SHARP software on virtual ports. |
Resiliency | Added the option to handle the reboot of a switch connected to AM. |
Mellanox SHARP Daemon (sharpd) | |
Quality of service | Added support for non-default SL for data path. |
Aggregation Manager (sharp_am) / Mellanox SHARP Daemon (sharpd) | |
Multi-rail | Added support for the following multi-rail, single-subnet fabrics:
|
Rev 1.4 | |
Aggregation Manager (sharp_am) | |
Fabric Extension | Enabled adding/replacing new non-root aggregation nodes without restarting Aggregation Manager. |
Fabric Extension | Optimized root placement on tree topologies (improved the location of Mellanox SHARP trees roots on the tree topologies). |
Resiliency | Added the option to notify running jobs about Aggregation Manager (sharp_am) restart. |
Mellanox SHARP Daemon (sharpd) | |
Out-of-the-box Improvements | Added Systemd support. |
Out-of-the-box Improvements | Added Socket-Based-Activation support for Mellanox SHARP daemons on systems with Systemd. |
Out-of-the-box Improvements | Removed static binding to network IP interface in Mellanox SHARP daemons. |
Rev 1.3 | |
Aggregation Manager (sharp_am) | |
Out-of-the-box improvement | Added support for extended fabric format (SMDB). Note: This requires Subnet Manager 4.9 or later. |
Fabric extension | Compute hosts can be added/replaced without Aggregation Manager restart. |
Configuration | Added the ability to update some configuration parameters in runtime without application restart. |
Mellanox SHARP Daemon (sharpd) | |
Out-of-the-box improvement | Removed static binding to IB port. |
Configuration | Added the ability to update some configuration parameters in runtime without application restart. |
Rev 1.2 | |
Aggregation Manager (sharp_am) | Added support for IB fabric events (flapping links, switch/host reboot). |
Aggregation Manager (sharp_am) | Resiliency: Mellanox SHARP Tree QP Recovery. |
Aggregation Manager (sharp_am) | Added support for Hyper-cube topology (requires OpenSM 4.8.1 or later). |
HCOLL | Added new non-blocking API for Mellanox SHARP collectives |
Job Scheduler | Added new API for integration with Job Scheduler |
UFM | Enabled Aggregation Manager integration with UFM |
Rev 1.1 | |
HCOLL | Enables UD MCAST result distribution. |
HCOLL | Enables multiple group leaders per compute node. |
HCOLL | Delivers errors to the application. |
HCOLL | Enables Mellanox SHARP Group trim. |
HCOLL | Added support for ppcle platform. |
Rev 1.0 | |
MPI 2.x | Barrier and Allreduce collective operations using Mellanox SHARP protocol are supported in Open MPI, MPICH, Scalable SHMEM with HCOLL library. |
HCOLL | Enables running of Mellanox SHARP collective with the mpirun utility. For the complete list of flags that can be used when running Mellanox SHARP software, please refer to the SHARP Deployment Guide. |
Parameters Changes
Parameter | Component | Description | |
---|---|---|---|
Rev 1.8.1 | |||
pending_mode_timeout | sharp_am | New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup. | |
job_info_polling_interval | sharp_am | New parameter: Defines job status polling interval when waiting for jobs to complete upon startup. | |
Rev 1.7.1 | |||
max_quota | sharp_am | Deprecated by per_pri_max_quota | |
default_quota | sharp_am | Deprecated by per_pri_default_quota | |
per_pri_max_quota | sharp_am | New parameter: Defines maximum percentage of resources to allocate per job by priority | |
per_pri_default_quota | sharp_am | New parameter: Defines default percentage of resources to allocate per job by priority | |
low_prio_max_accumulated_quota | sharp_am | New parameter: Defines maximum accumulated quota for all low priority jobs. | |
max_trees_per_job | sharp_am | New parameter: Defines maximum number of trees allowed per job. | |
default_trees_per_job | sharp_am | New parameter: Defines default number of trees per job. | |
max_compute_ports_per_agg_node | sharp_am | New parameter: Defines number of compute ports per AN for the purpose of resource allocation | |
coll_job_quota_percentage | sharp_job_quota | New parameter: Set requested quota in percentage for job | |
job_priority | sharp_job_quota | New parameter: Set requested priority for the job | |
SHARP_COLL_JOB_PRIORITY | HCOLL | New parameter: Set requested priority for the job | |
SHARP_COLL_OSTS_PER_GROUP | HCOLL | New parameter: Set number of OSTs per group | |
Rev 1.5 | |||
config_file | sharp_am/sharpd | Modified behavior: This parameter now defines the path to a configuration file. If the path is specified with a '-' prefix, configuration file read errors are ignored and the default configuration file is used instead. Note: Cannot be updated at runtime. | |
fabric_virt_file | sharp_am | New Parameter: Defines the path to the fabric virtualization info file. Note: Cannot be updated at runtime. | |
trimming_mode | sharp_am | New Parameter: Configures the group trimming mode. Note: Cannot be updated at runtime. | |
Rev 1.4 | |||
accumulate_log | sharp_am/sharpd | New Parameter: Accumulates the log file over multiple sessions. If set to FALSE and log rotation is disabled, the log file is truncated on startup. Note: Cannot be updated at runtime. | |
syslog_verbosity | sharp_am/sharpd | New Parameter: Syslog verbosity level: 1 - Errors, 2 - Warnings. Default value is "1". Note: Can be updated at runtime. | |
persistent_dir | sharp_am | New Parameter: Path to the persistent data directory. Note: Cannot be updated at runtime. | |
Rev 1.3 | |||
ib_mad_timeout | sharp_am | Removed | |
ib_mad_retries | sharp_am | Removed | |
hyper_cube_coordinates_file | sharp_am | Deprecated (with Subnet Manager 4.9 and later). | |
root_guids_file | sharp_am | Deprecated (with Subnet Manager 4.9 and later). | |
ib_dev | sharpd | Removed | |
log_verbosity | sharp_am / sharpd | Modified behavior: Can now be updated at runtime. | |
lst_file_timeout | sharp_am | Modified behavior: Can now be updated at runtime. | |
lst_file_retries | sharp_am | Modified behavior: Can now be updated at runtime. | |
generate_dump_files | sharp_am | Modified behavior: Can now be updated at runtime. | |
max_quota | sharp_am | Modified behavior: Can now be updated at runtime. | |
default_quota | sharp_am | Modified behavior: Can now be updated at runtime. | |
span_all_agg_nodes | sharp_am | New Parameter: Generates trees that span all possible aggregation nodes. Relevant only if "topology_type" is tree. | |
Rev 1.2 | |||
Environment variable: SMX_SOCK_PORT | AM / SD | Replaced by smx_sock_port parameter | |
Environment variable: SMX_SOCK_INTERFACE | AM / SD | Replaced by smx_sock_interface | |
SHARP_COLL_SHARP_ENABLE_MCAST_TARGET | HCOLL | Replaced by SHARP_COLL_ENABLE_MCAST_TARGET | |
smx_sock_interface | sharp_am / sharpd | New Parameter: Network interface to be used by SMX. Default: empty string - Use first interface found in UP state | |
smx_sock_port | sharp_am / sharpd | New Parameter: The external port to be used by SMX. Default - 6126 | |
lst_file_timeout | sharp_am | New Parameter: Length of timeout in seconds between attempts to load the LST file. Default - 3 seconds. | |
lst_file_retries | sharp_am | New Parameter: Max number of retry attempts when loading the LST file and encountering "No such file" errors. Default - 0 meaning no retries. | |
log_max_backup_files | sharpd | New Parameter: Number of backup log files. Used for log rotation | |
log_file_max_size | sharpd | New Parameter: Maximum size of a log file, in MBs. If value is 0, log rotation isn't used | |
mgmt_mode | sharpd | New Parameter: When running in managed mode, SHARPD expects notifications from the Resource manager (Job scheduler). The possible values are: 0 - Unmanaged mode; 1 - Managed mode | |
smx_sock_backlog | sharpd | New Parameter: Defines the maximum length to which the queue of pending connections for the SMX listen socket may grow | |
group_allocate_timeout | sharpd | New Parameter: Maximum time [in milliseconds] to wait for group allocation transaction to complete. | |
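
The interaction between lst_file_timeout and lst_file_retries described in the Rev 1.2 rows above can be illustrated with a small sketch. This is not the Aggregation Manager implementation: the function name `load_lst_with_retries` and the injectable `sleep` argument are hypothetical, and the sketch only assumes what the table states (retries apply to "No such file" errors, the timeout is the pause in seconds between attempts, and the defaults are 0 retries and 3 seconds).

```python
import time
from typing import Callable


def load_lst_with_retries(
    path: str,
    lst_file_retries: int = 0,      # default from the table: 0 means no retries
    lst_file_timeout: float = 3.0,  # default from the table: 3 seconds between attempts
    sleep: Callable[[float], None] = time.sleep,  # hypothetical hook, eases testing
) -> str:
    """Read the file at `path`, retrying only on "No such file" errors."""
    attempts_left = lst_file_retries
    while True:
        try:
            with open(path, "r", encoding="utf-8") as handle:
                return handle.read()
        except FileNotFoundError:
            # Give up once the configured number of retries is exhausted.
            if attempts_left <= 0:
                raise
            attempts_left -= 1
            # Wait before the next attempt.
            sleep(lst_file_timeout)
```

With the default values the first "No such file" error is raised immediately; setting lst_file_retries to a positive value gives whatever produces the LST file time to write it before the load is abandoned.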
API Updates
API | Category | Description |
---|---|---|
Rev 1.7.1 | ||
sharp_coll_config | HCOLL | Modified storage class of ib_dev_list to static. |
Rev 1.5 | ||
sharp_coll_init_spec | HCOLL | Modified struct sharp_coll_init_spec: added the fields world_local_rank (int) and enable_thread_support (int). |
sharp_coll_do_allreduce | HCOLL | Removed the maximum message size limit (8k) from the sharp_coll_do_allreduce API. |
Configuration | HCOLL | Added MPI_THREAD_MULTIPLE support. |
Configuration | HCOLL | Removed the SHARP_COLL_ENABLE_GROUP_TRIM option. |
Rev 1.2 | ||
sharp_coll_do_allreduce_nb | HCOLL | Changed |
sharp_coll_do_barrier_nb | HCOLL | Changed |
sharp_coll_do_reduce_nb | HCOLL | Changed |
sharp_coll_req_test | HCOLL | Changed |
sharp_coll_req_wait | HCOLL | Changed |
sharp_coll_req_free | HCOLL | Changed |
sharp_job_quota | Job Scheduler | Added |