Feature/Change | Description
Rev 1.8.1
Aggregation Manager (sharp_am)

Resiliency

Added support for waiting for jobs to end prior to performing fabric reinitialization on AM startup.

Mellanox SHARP Daemon (sharpd)

Out-of-Box Improvements

Socket-based activation is now enabled by default when installed from RPM/MLNX_OFED.

Rev 1.7.2
Mellanox SHARP Daemon (sharpd)
Bug Fix

Fixed the issue where mcast distribution did not work with the IPoIB interface name format.

Rev 1.7.1
Aggregation Manager (sharp_am)
Resources Allocation

Added support for the following resources allocation features:

  • Allocate resources in percentage
  • Allocate resources per priority level
  • Define maximum accumulated resources to low priority jobs
Mellanox SHARP Daemon (sharpd)/Job Scheduler
Resources Allocation

Added support for configuring job priority and requested resources in percentage in managed mode.
Rev 1.5

Aggregation Manager (sharp_am)

Fabric Topologies

Added support for DF+ topologies.

Virtualization

Added support for running Mellanox SHARP software on virtual ports.

Resiliency

Added the option to handle the reboot of a switch connected to AM.

Mellanox SHARP Daemon (sharpd)

Quality of service

Added support for non-default SL for data path.

Aggregation Manager (sharp_am) / Mellanox SHARP Daemon (sharpd)

Multi-rail

Added support for the following on multi-rail, single-subnet fabrics:

  • Allocating multiple trees for a single job
  • Allocating trees that span multiple rails
Rev 1.4

Aggregation Manager (sharp_am)

Fabric Extension

Enabled adding/replacing new non-root aggregation nodes without restarting Aggregation Manager.

Fabric Extension

Optimized root placement on tree topologies (improved the location of Mellanox SHARP tree roots).

Resiliency

Added the option to notify running jobs about Aggregation Manager (sharp_am) restart.

Mellanox SHARP Daemon (sharpd)

Out-of-the-box Improvements

Added Systemd support.

Out-of-the-box Improvements

Added Socket-Based-Activation support for Mellanox SHARP daemons on systems with Systemd.

Out-of-the-box Improvements

Removed static binding to network IP interface in Mellanox SHARP daemons.

Rev 1.3

Aggregation Manager (sharp_am)

Out-of-the-box improvement

Added support for extended fabric format (SMDB). Note: This requires Subnet Manager 4.9 or later.

Fabric extension

Compute hosts can be added/replaced without Aggregation Manager restart.

Configuration

Added the ability to update some configuration parameters at runtime without restarting the application.

Mellanox SHARP Daemon (sharpd)

Out-of-the-box improvement

Removed static binding to IB port.

Configuration

Added the ability to update some configuration parameters at runtime without restarting the application.

Rev 1.2

Aggregation Manager (sharp_am)

Added support for IB fabric events (flapping links, switch/host reboot)

Resiliency: Mellanox SHARP Tree QP Recovery

Added support for Hyper-cube topology (needs OpenSM 4.8.1 or later)

HCOLL

Added new non-blocking API for Mellanox SHARP collectives

Job Scheduler

Added new API for integration with Job Scheduler

UFM

Enabled Aggregation Manager integration with UFM

Rev 1.1

HCOLL

Enables UD MCAST result distribution

Enables multiple group leaders per compute nodes

Delivers error to an application

Enables Mellanox SHARP Group trim

Added support for ppcle platform

Rev 1.0

MPI 2.x

Barrier and Allreduce collective operations using the Mellanox SHARP protocol are supported in Open MPI, MPICH, and Scalable SHMEM with the HCOLL library.

HCOLL

Enables running of Mellanox SHARP collective with the mpirun utility.

For the complete list of flags that can be used when running Mellanox SHARP software, please refer to the SHARP Deployment Guide.

Parameter Changes

Parameter | Component | Description
Rev 1.8.1

pending_mode_timeout

sharp_am

New parameter: Defines AM waiting time for jobs to complete prior to fabric re-initialization upon startup.

job_info_polling_interval

sharp_am

New parameter: Defines job status polling interval when waiting for jobs to complete upon startup.
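The way pending_mode_timeout and job_info_polling_interval interact can be pictured as a polling loop with a deadline. The following is a minimal Python sketch, not the Aggregation Manager implementation. The function name wait_for_jobs and the get_running_jobs callback are hypothetical, and both values are assumed to be in seconds, which the table above does not state.

```python
import time


def wait_for_jobs(get_running_jobs, pending_mode_timeout,
                  job_info_polling_interval,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until no jobs remain or the timeout expires.

    get_running_jobs is a hypothetical callback returning the jobs that
    are still active. Returns True if all jobs ended in time, else False.
    Units are assumed to be seconds.
    """
    deadline = clock() + pending_mode_timeout
    while True:
        if not get_running_jobs():
            return True          # safe to reinitialize the fabric
        if clock() >= deadline:
            return False         # timed out while jobs were still running
        # Sleep one polling interval, but never past the deadline.
        remaining = max(0.0, deadline - clock())
        sleep(min(job_info_polling_interval, remaining))
```

In this model a shorter polling interval detects job completion sooner at the cost of more status queries, while the timeout bounds how long startup can be delayed.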
Rev 1.7.1

max_quota

sharp_am

Deprecated in favor of per_pri_max_quota

default_quota

sharp_am

Deprecated in favor of per_pri_default_quota

per_pri_max_quota

sharp_am

New parameter: Defines maximum percentage of resources to allocate per job by priority

per_pri_default_quota

sharp_am

New parameter: Defines default percentage of resources to allocate per job by priority

low_prio_max_accumulated_quota

sharp_am

New parameter: Defines maximum accumulated quota for all low priority jobs.

max_trees_per_job

sharp_am

New parameter: Defines maximum number of trees allowed per job.

default_trees_per_job

sharp_am

New parameter: Defines default number of trees per job.

max_compute_ports_per_agg_node

sharp_am

New parameter: Defines number of compute ports per AN for the purpose of resource allocation

coll_job_quota_percentage

sharp_job_quota

New parameter: Set requested quota in percentage for job

job_priority

sharp_job_quota

New parameter: Set requested priority for the job

SHARP_COLL_JOB_PRIORITY

HCOLL

New parameter: Set requested priority for the job

SHARP_COLL_OSTS_PER_GROUP

HCOLL

New parameter: Set number of OSTs per group
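To make the Rev 1.7.1 quota parameters concrete, the sketch below models one plausible way a job's requested percentage and priority could be combined with per_pri_default_quota, per_pri_max_quota and low_prio_max_accumulated_quota. It is an illustration under stated assumptions, not the documented sharp_am algorithm. The function name, the dictionary representation of per-priority values, and the convention that priority 0 is the low priority level are all hypothetical.

```python
def resolve_job_quota(priority, requested_pct,
                      per_pri_max_quota, per_pri_default_quota,
                      low_prio_in_use=0.0,
                      low_prio_max_accumulated_quota=100.0,
                      low_priority=0):
    """Return the percentage of resources granted to a job (illustrative).

    per_pri_max_quota / per_pri_default_quota: dicts mapping a priority
    level to a percentage (a hypothetical representation).
    low_prio_in_use: percentage already held by low priority jobs.
    """
    # Fall back to the per-priority default when the job requests nothing.
    if requested_pct is None:
        pct = per_pri_default_quota[priority]
    else:
        pct = requested_pct
    # Never grant more than the per-priority maximum.
    pct = min(pct, per_pri_max_quota[priority])
    # Low priority jobs also share one accumulated budget (assumption).
    if priority == low_priority:
        remaining = max(0.0, low_prio_max_accumulated_quota - low_prio_in_use)
        pct = min(pct, remaining)
    return pct
```

For example, with a per-priority maximum of 50 for priority 1, a request for 80 percent would be capped at 50 in this model.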

Rev 1.5

config_file

sharp_am/sharpd

Modified behavior: This parameter now defines the path to a configuration file. If the path is specified with a '-' prefix, read errors on the configuration file are ignored and the default configuration file is used instead.

Note: Runtime update is not supported.
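The '-' prefix rule for config_file can be sketched as follows. This is a minimal Python illustration of the described behavior, not the code used by sharp_am or sharpd. The function name, the default_path argument and the is_readable hook are hypothetical.

```python
import os


def resolve_config_path(config_file, default_path, is_readable=None):
    """Pick the configuration file to load (illustrative only).

    A leading '-' marks the path as optional: if it cannot be read,
    the default configuration file is used instead of failing.
    """
    if is_readable is None:
        is_readable = lambda p: os.path.isfile(p) and os.access(p, os.R_OK)
    tolerant = config_file.startswith("-")
    path = config_file[1:] if tolerant else config_file
    if is_readable(path):
        return path
    if tolerant:
        return default_path      # ignore the read error, fall back
    raise OSError(f"cannot read configuration file: {path}")
```

Without the prefix, an unreadable file is treated as an error in this sketch.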

fabric_virt_file


sharp_am

New Parameter: Defines path to fabric virtualization info file.

Note: Runtime update is not supported.

trimming_mode


sharp_am

New Parameter: Configures group trimming mode.

Note: Runtime update is not supported.

Rev 1.4

accumulate_log

sharp_am/sharpd

New Parameter: Accumulates log file over multiple sessions. If set to FALSE and log rotation is disabled, the log file is truncated on startup

Note: Runtime update is not supported.

syslog_verbosity

sharp_am/sharpd

New Parameter: Syslog verbosity level: 1 - Errors, 2 - Warnings. Default value is "1".

Note: Runtime update is supported.

persistent_dir

sharp_am

New Parameter: Path to persistent data directory.

Note: Runtime update is not supported.

Rev 1.3

ib_mad_timeout

sharp_am

Removed

ib_mad_retries

sharp_am

Removed

hyper_cube_coordinates_file

sharp_am

Deprecated (with Subnet Manager 4.9 and later).

root_guids_file

sharp_am

Deprecated (with Subnet Manager 4.9 and later).

ib_dev

sharpd

Removed

log_verbosity

sharp_am / sharpd

Modified behavior: Added the option to update at runtime.

lst_file_timeout

sharp_am

Modified behavior: Added the option to update at runtime.

lst_file_retries

sharp_am

Modified behavior: Added the option to update at runtime.

generate_dump_files

sharp_am

Modified behavior: Added the option to update at runtime.

max_quota

sharp_am

Modified behavior: Added the option to update at runtime.

default_quota

sharp_am

Modified behavior: Added the option to update at runtime.

span_all_agg_nodes

sharp_am

New Parameter: Generate trees that span all possible aggregation nodes

Relevant only if "topology_type" is tree.

Rev 1.2

Environment variable: SMX_SOCK_PORT

AM / SD

Replaced by smx_sock_port parameter

Environment variable: SMX_SOCK_INTERFACE 

AM / SD

Replaced by smx_sock_interface

SHARP_COLL_SHARP_ENABLE_MCAST_TARGET

HCOLL

Replaced by SHARP_COLL_ENABLE_MCAST_TARGET

smx_sock_interface

sharp_am / sharpd

New Parameter: Network interface to be used by SMX.

Default: empty string - Use first interface found in UP state

smx_sock_port

sharp_am / sharpd

New Parameter: The external port to be used by SMX. Default - 6126

lst_file_timeout

sharp_am

New Parameter: Length of timeout in seconds between attempts to load the LST file. Default - 3 seconds.

lst_file_retries

sharp_am

New Parameter: Maximum number of retry attempts when loading the LST file fails with a "No such file" error. Default - 0, meaning no retries.
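The relationship between lst_file_timeout and lst_file_retries can be illustrated with a small retry loop. The Python sketch below is a model of the described behavior only. The function name load_lst_file and the injectable sleep argument are hypothetical, and the real Aggregation Manager logic may differ.

```python
import time


def load_lst_file(path, lst_file_timeout=3, lst_file_retries=0,
                  sleep=time.sleep):
    """Read the LST file, retrying only on "No such file" errors.

    lst_file_timeout: seconds to wait between attempts (default 3).
    lst_file_retries: extra attempts after the first (default 0).
    """
    retries_done = 0
    while True:
        try:
            with open(path) as f:
                return f.read()
        except FileNotFoundError:
            if retries_done >= lst_file_retries:
                raise                    # out of retries, give up
            retries_done += 1
            sleep(lst_file_timeout)      # wait before the next attempt
```

With the defaults, a missing file fails immediately. Raising lst_file_retries gives the Subnet Manager time to write the file before the load is abandoned.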

log_max_backup_files

sharpd

New Parameter: Number of backup log files. Used for log rotation

log_file_max_size

sharpd

New Parameter: Maximum size of a log file, in MB. If the value is 0, log rotation is not used.
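As an analogy for how log_file_max_size and log_max_backup_files work together, Python's standard RotatingFileHandler exposes the same two knobs. sharpd does not use Python logging. The sketch below only mirrors the semantics described above, and the helper name make_rotating_handler is hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_handler(path, log_file_max_size, log_max_backup_files):
    """Build a size-based rotating log handler (analogy, not sharpd code).

    log_file_max_size is given in MB. A value of 0 disables rotation,
    which matches RotatingFileHandler's behavior for maxBytes=0.
    """
    return RotatingFileHandler(
        path,
        maxBytes=log_file_max_size * 1024 * 1024,  # MB to bytes
        backupCount=log_max_backup_files,          # rotated files to keep
    )


# Usage sketch:
# logger = logging.getLogger("sharpd_example")
# logger.addHandler(make_rotating_handler("/tmp/example.log", 10, 5))
```

With a maximum size of 10 and 5 backup files, at most six files exist at any time in this model: the active log plus five rotated copies.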

mgmt_mode

sharpd

New Parameter: When running in managed mode, SHARPD expects notifications from the Resource manager (Job scheduler). The possible values are: 0 - Unmanaged mode; 1 - Managed mode

smx_sock_backlog

sharpd

New Parameter: Defines the maximum length to which the queue of pending connections for the SMX listen socket may grow
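The value of smx_sock_backlog corresponds to the backlog argument of the standard listen() socket call. The following Python sketch shows where such a value is applied on a generic TCP listener. It is not SMX code. The function name and defaults are illustrative, and port 0 is used so the operating system picks a free port (the documented SMX default port is 6126).

```python
import socket


def open_listen_socket(host="127.0.0.1", port=0, backlog=16):
    """Open a TCP listening socket with an explicit backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    # backlog caps the queue of connections waiting to be accepted.
    sock.listen(backlog)
    return sock
```

A larger backlog helps when many clients connect at once, for example at job start on a large cluster. The kernel may still clamp the value to a system-wide limit.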

group_allocate_timeout

sharpd

New Parameter: Maximum time [in milliseconds] to wait for group allocation transaction to complete.

API Updates

API | Category | Description
Rev 1.7.1

sharp_coll_config

HCOLL

Modified storage class of ib_dev_list to static.

Rev 1.5

sharp_coll_init_spec

HCOLL

Modified struct sharp_coll_init_spec:

1. Added field: world_local_rank (int)

2. Added field: enable_thread_support (int)

sharp_coll_do_allreduce

HCOLL

Removed the maximum message size limit (8k) from the sharp_coll_do_allreduce API.

Configuration

HCOLL

Added MPI_THREAD_MULTIPLE support.

Configuration

HCOLL

Removed SHARP_COLL_ENABLE_GROUP_TRIM option.

Rev 1.2

sharp_coll_do_allreduce_nb

HCOLL

Changed

sharp_coll_do_barrier_nb

HCOLL

Changed

sharp_coll_do_reduce_nb

HCOLL

Changed

sharp_coll_req_test

HCOLL

Changed

sharp_coll_req_wait

HCOLL

Changed

sharp_coll_req_free

HCOLL

Changed

sharp_job_quota

Job Scheduler

Added