NVIDIA UFM Enterprise User Manual v6.23.1

Troubleshooting

Showing UFM Processes Status

This functionality allows users to view the current status of main processes handled by the UFM.

  • To view the main UFM processes, run the show_ufm_status.sh script located under /opt/ufm/scripts.

    Example: /opt/ufm/scripts/show_ufm_status.sh

  • To view the UFM main and child processes, run the show_ufm_status.sh script with the -e (extended_processes) option.

    Example: /opt/ufm/scripts/show_ufm_status.sh -e


UFM Startup Logging

UFM provides comprehensive startup logging capabilities that track the initialization progress of all UFM services in real-time. When enabled via the startup_logging_enabled configuration flag, the system creates detailed logs of each startup phase and stage, including timestamps, service names, and status updates.

This option generates several key files:

  • /opt/ufm/files/log/ufm_startup.log (detailed startup logs)

  • /opt/ufm/files/log/ufm_startup_progress_ufm-ha-cluster.json (only relevant in UFM HA mode)

  • /opt/ufm/files/log/ufm_startup_progress_ufm-enterprise.json

  • /opt/ufm/files/log/ufm_startup_progress_ufm-modelmain.json

The JSON files provide persistent progress tracking across all services.

This logging system provides administrators with complete visibility into the UFM startup process, making it easier to diagnose startup issues, monitor service initialization progress, and ensure all components are starting correctly.

The logging level can be configured using the startup_logging_log_level parameter to control the verbosity of startup messages.

For more information about the configuration, refer to Configuring UFM Logging.

Example for the content of /opt/ufm/files/log/ufm_startup.log:


2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] ufm-ha-cluster Startup Log. Host: c-141-137-180-181, IP Address: 10.141.137.181
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [Service Startup] ufm-ha-cluster process started [START]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation] System Preparation started [START]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation - Check HA configurations] Check HA configurations completed, duration: 4ms [SUCCESS]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation - Validate DRBD Storage state & connectivity] Validate DRBD Storage state & connectivity completed, duration: 51ms [SUCCESS]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation - PCS Resources Cleanup] PCS Resources Cleanup completed, duration: 635ms [SUCCESS]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation - Refresh File System (DRBD Only)] Refresh File System (DRBD Only) completed, duration: 25ms [SUCCESS]
2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation] System Preparation completed (4/4 stages), duration: 727ms [COMPLETED]
2025-08-28 19:18:02 [ufm-ha-cluster] [INFO] [Pcs Resource Activation] Pcs Resource Activation started [START]
2025-08-28 19:18:04 [ufm-ha-cluster] [INFO] [Pcs Resource Activation - Enable PCS resources] Enable PCS resources completed - Enabled resources ufm-ha-watcher ufm-enterprise, duration: 529ms [SUCCESS]
2025-08-28 19:18:04 [ufm-ha-cluster] [INFO] [Pcs Resource Activation] Pcs Resource Activation completed (1/1 stages), duration: 530ms [COMPLETED]
2025-08-28 19:18:04 [ufm-ha-cluster] [INFO] [Service Startup] ufm-ha-cluster startup process completed successfully [COMPLETED]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] ufm-enterprise Startup Log. Host: c-141-137-180-181, IP Address: 10.141.137.181
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Service Startup] ufm-enterprise process started [START]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks - Check if UFM is already running] Check if UFM is already running completed - UFM is not running, duration: 0ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks] Pre Initialization Checks started [START]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks - Verify log directories exist] Verify log directories exist completed, duration: 1ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks - Validate UFM license] Validate UFM license completed, duration: 9ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks - Verify log directories permissions] Verify log directories permissions completed, duration: 50ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks - Validate UFM configuration files] Validate UFM configuration files completed, duration: 275ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Pre Initialization Checks] Pre Initialization Checks completed (5/5 stages), duration: 345ms [COMPLETED]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup] Environment Setup started [START]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check for other running Subnet Managers] Check for other running Subnet Managers completed, duration: 78ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check IB interface status] Check IB interface status completed, duration: 168ms [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check disk space on UFM partitions] Disk space check completed successfully on /opt/ufm [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check disk space on UFM partitions] Disk space check completed successfully on /opt/ufm/files [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check disk space on UFM partitions] Disk space check completed successfully on /opt/ufm/tmp [SUCCESS]
2025-08-28 19:18:10 [ufm-enterprise] [INFO] [Environment Setup - Check disk space on UFM partitions] Check disk space on UFM partitions completed, duration: 93ms [SUCCESS]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Environment Setup - Update logrotate configuration] Update logrotate configuration completed, duration: 258ms [SUCCESS]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Environment Setup - Sync UFM web client files] Sync UFM web client files completed, duration: 41ms [SUCCESS]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Environment Setup - Check multisubnet mode] Check multisubnet mode completed - Multisubnet mode is disabled, continuing with normal startup, duration: 0ms [SUCCESS]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Environment Setup] Environment Setup completed (6/6 stages), duration: 661ms [COMPLETED]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup] Core Services Startup started [START]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup - Start MAD Limiter] Start MAD Limiter skipped - Madlimiter is disabled, duration: 0ms [SKIPPED]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup - Start OpenSM] Start OpenSM completed, duration: 375ms [SUCCESS]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup - Start SM Communicators Manager] Start SM Communicators Manager skipped - SM Communicators Manager should be enabled in UFM Infra, duration: 0ms [SKIPPED]
2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup - Start SHArP Aggregation Manager] Start SHArP Aggregation Manager skipped - Sharp is disabled, duration: 0ms [SKIPPED]
2025-08-28 19:18:13 [ufm-enterprise] [INFO] [Core Services Startup - Start UFM Primary Telemetry] Start UFM Primary Telemetry completed, duration: 972ms [SUCCESS]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Core Services Startup - Start UFM Secondary Telemetry] Start UFM Secondary Telemetry completed, duration: 140ms [SUCCESS]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Core Services Startup] Core Services Startup completed (6/6 stages), duration: 22ms [COMPLETED]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Web Infra Services Startup] Web Infra Services Startup started [START]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Web Infra Services Startup - Configure SSL certificates] Configure SSL certificates completed, duration: 14ms [SUCCESS]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Web Infra Services Startup - Generate UFM web configuration] Generate UFM web configuration completed, duration: 483ms [SUCCESS]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Web Infra Services Startup - Restart web server] Restart web server completed, duration: 174ms [SUCCESS]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Web Infra Services Startup] Web Infra Services Startup completed (3/3 stages), duration: 679ms [COMPLETED]
2025-08-28 19:18:15 [ufm-enterprise] [INFO] [Ufm Services Startup] Ufm Services Startup started [START]
2025-08-28 19:18:18 [ufm-enterprise] [INFO] [Ufm Services Startup - Start Authentication service] Start Authentication service completed, duration: 140ms [SUCCESS]
2025-08-28 19:18:18 [ufm-enterprise] [INFO] [Ufm Services Startup - Start UFM Model Main] Start UFM Model Main completed, duration: 242ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] ufm-modelmain Startup Log. Host: c-141-137-180-181, IP Address: 10.141.137.181
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Service Startup] ufm-modelmain process started [START]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup] Startup started [START]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - SM client consumer] SM client consumer completed, duration: 3ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - Sysinfo JSON agent] Sysinfo JSON agent completed, duration: 6ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - Check network management IB interface] Check network management IB interface completed - Network management interface ib0 is running and has IP address, duration: 1ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - Initialize core infrastructure] Initialize core infrastructure completed, duration: 70ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - Initialize ModelMain license checker] Initialize ModelMain license checker completed, duration: 3ms [SUCCESS]
2025-08-28 19:18:20 [ufm-modelmain] [INFO] [Startup - Load data from database before discovery] Load data from database before discovery completed, duration: 27ms [SUCCESS]
2025-08-28 19:18:20 [ufm-enterprise] [INFO] [Ufm Services Startup - Start Daily Reports] Start Daily Reports skipped - Daily reports are disabled, duration: 0ms [SKIPPED]
2025-08-28 19:18:21 [ufm-enterprise] [INFO] [Ufm Services Startup - Start Unhealthy Ports] Start Unhealthy Ports completed, duration: 97ms [SUCCESS]
2025-08-28 19:18:21 [ufm-enterprise] [INFO] [Ufm Services Startup - Start Telemetry Sampling] Start Telemetry Sampling completed, duration: 111ms [SUCCESS]
2025-08-28 19:18:21 [ufm-enterprise] [INFO] [Ufm Services Startup - Check UFM hardware state] Check UFM hardware state completed, duration: 0ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Initialize fabric site] Initialize fabric site skipped - Default site already exists, duration: 0ms [SKIPPED]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - SM trap handler] SM trap handler completed, duration: 11ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Telemetry agent manager] Telemetry agent manager completed, duration: 1ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Network infiniband discovery] Network infiniband discovery completed, duration: 41ms [SUCCESS]
2025-08-28 19:18:22 [ufm-enterprise] [INFO] [Ufm Services Startup - Start UFM plugins] Start UFM plugins completed, duration: 766ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - REST API server] REST API server completed - UFM Main API started, duration: 379ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Fabric model construction] Fabric model construction completed, duration: 5ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Load model from database] Load model from database completed, duration: 59ms [SUCCESS]
2025-08-28 19:18:22 [ufm-enterprise] [INFO] [Ufm Services Startup - Start UFM health] Start UFM health completed, duration: 282ms [SUCCESS]
2025-08-28 19:18:22 [ufm-enterprise] [INFO] [Ufm Services Startup] Ufm Services Startup completed (8/8 stages), duration: 718ms [COMPLETED]
2025-08-28 19:18:22 [ufm-enterprise] [INFO] [Service Startup] ufm-enterprise startup process completed successfully [COMPLETED]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Load data from database after discovery] Load data from database after discovery completed, duration: 19ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup - Construct API resources] Construct API resources completed, duration: 4ms [SUCCESS]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Startup] Startup completed (15/15 stages), duration: 758ms [COMPLETED]
2025-08-28 19:18:22 [ufm-modelmain] [INFO] [Service Startup] ufm-modelmain startup process completed successfully [COMPLETED]
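When a startup is slow or incomplete, it can help to filter this log for entries that did not end normally. The following is a minimal sketch in Python, assuming every entry follows the layout seen in the sample above (timestamp, service, level, an optional bracketed phase and stage, a message, and an optional trailing status). The function names `parse_line` and `noteworthy` are hypothetical helpers, not part of UFM, and the set of "normal" statuses is taken only from the values that appear in the sample.

```python
import re

# Layout inferred from the sample log above; this is an assumption, not a UFM spec.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<service>[^\]]+)\] \[(?P<level>[^\]]+)\] "
    r"(?:\[(?P<scope>[^\]]+)\] )?"
    r"(?P<message>.*?)"
    r"(?: \[(?P<status>[A-Z]+)\])?$"
)
DURATION_RE = re.compile(r"duration: (\d+)ms")

# Statuses seen in the sample that indicate normal progress.
NORMAL = {"START", "SUCCESS", "COMPLETED"}


def parse_line(line):
    """Parse one startup log entry into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # The bracketed scope is either "Phase" or "Phase - Stage".
    phase, _, stage = (entry.pop("scope") or "").partition(" - ")
    entry["phase"] = phase or None
    entry["stage"] = stage or None
    duration = DURATION_RE.search(entry["message"])
    entry["duration_ms"] = int(duration.group(1)) if duration else None
    return entry


def noteworthy(lines):
    """Return parsed entries whose status is present and not in NORMAL."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e and e["status"] and e["status"] not in NORMAL]


SAMPLE = [
    "2025-08-28 19:18:01 [ufm-ha-cluster] [INFO] [System Preparation - Check HA configurations] Check HA configurations completed, duration: 4ms [SUCCESS]",
    "2025-08-28 19:18:11 [ufm-enterprise] [INFO] [Core Services Startup - Start MAD Limiter] Start MAD Limiter skipped - Madlimiter is disabled, duration: 0ms [SKIPPED]",
    "2025-08-28 19:18:10 [ufm-enterprise] [INFO] ufm-enterprise Startup Log. Host: c-141-137-180-181, IP Address: 10.141.137.181",
]

for entry in noteworthy(SAMPLE):
    print(entry["service"], entry["stage"], entry["status"])
```

To run it against a real log, read /opt/ufm/files/log/ufm_startup.log line by line and pass the lines to `noteworthy`.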

Example for the content of /opt/ufm/files/log/ufm_startup_progress_ufm-enterprise.json


{
  "service_name": "ufm-enterprise",
  "start_time": "2025-08-20T14:00:44.284780",
  "end_time": "2025-08-20T14:01:10.200929",
  "status": "COMPLETE",
  "phases": {
    "pre_initialization_checks": {
      "status": "COMPLETE",
      "start_time": "2025-08-20T14:00:44.353450",
      "end_time": "2025-08-20T14:00:45.552568",
      "stages": {
        "check_already_running": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:44.329303", "end_time": "2025-08-20T14:00:44.329303", "description": "Check if UFM is already running", "message": "UFM is not running", "duration_ms": 0 },
        "check_log_directories_existence": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:44.354393", "end_time": "2025-08-20T14:00:44.399704", "description": "Verify log directories exist", "message": "", "duration_ms": 45 },
        "check_log_directories_permissions": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:44.422848", "end_time": "2025-08-20T14:00:45.105769", "description": "Verify log directories permissions", "message": "", "duration_ms": 682 },
        "validate_license": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:45.130041", "end_time": "2025-08-20T14:00:45.165553", "description": "Validate UFM license", "message": "", "duration_ms": 35 },
        "validate_config_files": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:45.188146", "end_time": "2025-08-20T14:00:45.551215", "description": "Validate UFM configuration files", "message": "", "duration_ms": 363 }
      },
      "completed_stages": 5,
      "handled_stages": 5,
      "total_stages": 5,
      "duration_ms": 199
    },
    "environment_setup": {
      "status": "COMPLETE",
      "start_time": "2025-08-20T14:00:45.612348",
      "end_time": "2025-08-20T14:00:46.657803",
      "stages": {
        "check_other_sm": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:45.613700", "end_time": "2025-08-20T14:00:45.745571", "description": "Check for other running Subnet Managers", "message": "", "duration_ms": 131 },
        "check_ib_interface": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:45.768596", "end_time": "2025-08-20T14:00:45.974154", "description": "Check IB interface status", "message": "", "duration_ms": 205 },
        "check_disk_space": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:45.997901", "end_time": "2025-08-20T14:00:46.204390", "description": "Check disk space on UFM partitions", "message": "", "duration_ms": 206 },
        "logrotate_config": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:46.236403", "end_time": "2025-08-20T14:00:46.534769", "description": "Update logrotate configuration", "message": "", "duration_ms": 298 },
        "sync_web_client_files": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:46.565332", "end_time": "2025-08-20T14:00:46.610217", "description": "Sync UFM web client files", "message": "", "duration_ms": 44 },
        "check_multisubnet_mode": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:46.656620", "end_time": "2025-08-20T14:00:46.656620", "description": "Check multisubnet mode", "message": "Multisubnet mode is disabled, continuing with normal startup", "duration_ms": 0 }
      },
      "completed_stages": 6,
      "handled_stages": 6,
      "total_stages": 6,
      "duration_ms": 45
    },
    "core_services_startup": {
      "status": "COMPLETE",
      "start_time": "2025-08-20T14:00:46.725547",
      "end_time": "2025-08-20T14:00:55.173574",
      "stages": {
        "mad_limiter": { "status": "SKIPPED", "start_time": "2025-08-20T14:00:46.702630", "end_time": "2025-08-20T14:00:46.702630", "description": "Start MAD Limiter", "message": "Madlimiter is disabled", "duration_ms": 0 },
        "opensm": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:46.827668", "end_time": "2025-08-20T14:00:47.428092", "description": "Start OpenSM", "message": "", "duration_ms": 600 },
        "sharp": { "status": "SKIPPED", "start_time": "2025-08-20T14:00:47.469382", "end_time": "2025-08-20T14:00:47.469382", "description": "Start SHArP Aggregation Manager", "message": "Sharp is disabled", "duration_ms": 0 },
        "communicators_mgr": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:47.494769", "end_time": "2025-08-20T14:00:47.534151", "description": "Start Communicators Manager", "message": "", "duration_ms": 39 },
        "primary_telemetry": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:48.985180", "end_time": "2025-08-20T14:00:51.849616", "description": "Start UFM Primary Telemetry", "message": "", "duration_ms": 864 },
        "secondary_telemetry": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:52.355037", "end_time": "2025-08-20T14:00:55.172100", "description": "Start UFM Secondary Telemetry", "message": "", "duration_ms": 817 }
      },
      "completed_stages": 6,
      "handled_stages": 6,
      "total_stages": 6,
      "duration_ms": 448
    },
    "web_infra_services_startup": {
      "status": "COMPLETE",
      "start_time": "2025-08-20T14:00:55.202439",
      "end_time": "2025-08-20T14:00:58.032880",
      "stages": {
        "ssl_configuration": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:55.203511", "end_time": "2025-08-20T14:00:55.258061", "description": "Configure SSL certificates", "message": "", "duration_ms": 54 },
        "web_config_generation": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:55.282592", "end_time": "2025-08-20T14:00:56.690531", "description": "Generate UFM web configuration", "message": "", "duration_ms": 407 },
        "web_server_restart": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:56.715527", "end_time": "2025-08-20T14:00:58.031300", "description": "Restart web server", "message": "", "duration_ms": 315 }
      },
      "completed_stages": 3,
      "handled_stages": 3,
      "total_stages": 3,
      "duration_ms": 830
    },
    "ufm_services_startup": {
      "status": "COMPLETE",
      "start_time": "2025-08-20T14:00:58.095138",
      "end_time": "2025-08-20T14:01:10.140721",
      "stages": {
        "auth_service": { "status": "SUCCESS", "start_time": "2025-08-20T14:00:58.114734", "end_time": "2025-08-20T14:01:00.306014", "description": "Start Authentication service", "message": "", "duration_ms": 191 },
        "ufm_main": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:00.330510", "end_time": "2025-08-20T14:01:01.187987", "description": "Start UFM Model Main", "message": "", "duration_ms": 857 },
        "daily_reports": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:03.231535", "end_time": "2025-08-20T14:01:03.279463", "description": "Start Daily Reports", "message": "", "duration_ms": 47 },
        "unhealthy_ports": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:03.304244", "end_time": "2025-08-20T14:01:04.478741", "description": "Start Unhealthy Ports", "message": "", "duration_ms": 174 },
        "telemetry_sampling": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:04.513890", "end_time": "2025-08-20T14:01:04.627906", "description": "Start Telemetry Sampling", "message": "", "duration_ms": 114 },
        "check_ufm_hardware_state": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:04.654889", "end_time": "2025-08-20T14:01:04.691246", "description": "Check UFM hardware state", "message": "", "duration_ms": 36 },
        "plugins": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:04.714797", "end_time": "2025-08-20T14:01:09.110861", "description": "Start UFM plugins", "message": "", "duration_ms": 396 },
        "ufm_health": { "status": "SUCCESS", "start_time": "2025-08-20T14:01:09.390525", "end_time": "2025-08-20T14:01:10.114399", "description": "Start UFM health", "message": "", "duration_ms": 723 }
      },
      "completed_stages": 8,
      "handled_stages": 8,
      "total_stages": 8,
      "duration_ms": 45
    }
  },
  "current_phase": null,
  "current_stage": null
}
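Because the progress file is plain JSON, a short script can condense it into one row per phase. The sketch below is written in Python and assumes the structure shown in the example above (a top-level "phases" object whose entries carry "status", "handled_stages", "total_stages", and a "stages" object). The function name `summarize_progress` is a hypothetical helper, and the embedded sample is a trimmed excerpt of the example, not a complete file.

```python
import json


def summarize_progress(progress):
    """Return one (phase, status, handled, total, flagged_stages) tuple per phase.

    flagged_stages lists the stages whose status is anything other than SUCCESS.
    """
    rows = []
    for phase_name, phase in progress.get("phases", {}).items():
        flagged = [
            stage_name
            for stage_name, stage in phase.get("stages", {}).items()
            if stage.get("status") != "SUCCESS"
        ]
        rows.append(
            (
                phase_name,
                phase.get("status"),
                phase.get("handled_stages"),
                phase.get("total_stages"),
                flagged,
            )
        )
    return rows


# Excerpt of the example above, trimmed to two of the six stages in this phase.
SAMPLE = json.loads("""
{
  "service_name": "ufm-enterprise",
  "status": "COMPLETE",
  "phases": {
    "core_services_startup": {
      "status": "COMPLETE",
      "stages": {
        "mad_limiter": {"status": "SKIPPED", "message": "Madlimiter is disabled", "duration_ms": 0},
        "opensm": {"status": "SUCCESS", "message": "", "duration_ms": 600}
      },
      "completed_stages": 6,
      "handled_stages": 6,
      "total_stages": 6
    }
  },
  "current_phase": null,
  "current_stage": null
}
""")

for name, status, handled, total, flagged in summarize_progress(SAMPLE):
    print(f"{name}: {status} ({handled}/{total} stages), flagged: {flagged}")
```

For a real file, load it with `json.load` from /opt/ufm/files/log/ufm_startup_progress_ufm-enterprise.json and pass the result to `summarize_progress`.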


The ufm_sysdump.sh script collects detailed system dumps from multiple UFM components, including:

  • UFM Enterprise (running on Docker or bare-metal)

  • Host system information

  • Cyber-AI system data (if installed)

  • UFM High Availability (HA) information (if configured)

All collected data is packaged into a .tar.gz archive and saved in the specified output directory or the default backup location.

This archive should be shared with the UFM Support team for troubleshooting any UFM-related issues.

Usage


ufm_sysdump [OPTIONS]

Example: /opt/ufm/files/scripts/ufm_sysdump.sh


Options

| Option | Description | Default / Notes |
|---|---|---|
| --output, -O, -o <DIR> | Specify the output folder for the sysdump | Default: /tmp/ |
| --time, -t <TIME> | Collect data starting from the specified time | Format: YYYY-MM-DD_HH:MM:SS |
| --archives, -a <NUM> | Limit the number of archive files to keep | Default: 3 |
| --container | Allow running inside a container | By default, the script exits if run in a container |
| --help, -h | Show this help message and exit | |


Examples


/opt/ufm/files/scripts/ufm_sysdump.sh                                # Collect sysdump with default parameters
/opt/ufm/files/scripts/ufm_sysdump.sh --output /opt/backups/         # Collect sysdump to a specific folder
/opt/ufm/files/scripts/ufm_sysdump.sh --time 2025-01-01_00:00:00     # Collect data from a specific time
/opt/ufm/files/scripts/ufm_sysdump.sh --archives 5                   # Keep up to 5 archive files
/opt/ufm/files/scripts/ufm_sysdump.sh --container                    # Allow running inside a container
/opt/ufm/files/scripts/ufm_sysdump.sh --help                         # Show the help message

Note

On a UFM bare-metal instance, run this script from the host. On UFM-HA, run this script from the host that is currently the master.

Note

By default, the maximum in --archives is 15.
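When the sysdump is triggered from automation, the --time value has to match the YYYY-MM-DD_HH:MM:SS format exactly. The following Python sketch builds the argument list from the options documented above. The function name `build_sysdump_command` is a hypothetical helper, and the lower bound of 1 for --archives is an assumption; only the upper bound of 15 comes from the note above. The sketch prints the command and does not run it.

```python
from datetime import datetime

SCRIPT = "/opt/ufm/files/scripts/ufm_sysdump.sh"
MAX_ARCHIVES = 15  # maximum stated in the note above


def build_sysdump_command(output_dir=None, since=None, archives=None, container=False):
    """Build the ufm_sysdump.sh argument list from the documented options."""
    cmd = [SCRIPT]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    if since is not None:
        # Documented format: YYYY-MM-DD_HH:MM:SS
        cmd += ["--time", since.strftime("%Y-%m-%d_%H:%M:%S")]
    if archives is not None:
        # The lower bound of 1 is an assumption made for this sketch.
        if not 1 <= archives <= MAX_ARCHIVES:
            raise ValueError(f"--archives must be between 1 and {MAX_ARCHIVES}")
        cmd += ["--archives", str(archives)]
    if container:
        cmd.append("--container")
    return cmd


command = build_sysdump_command(
    output_dir="/opt/backups/", since=datetime(2025, 1, 1), archives=5
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on the host that is allowed to run the script.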


Note

These instructions apply to high-availability scenarios only.

If an in-service upgrade fails, the data of the previous UFM version is preserved as a backup in the /opt/ufm/BACKUP directory, in a file named ufm_upgrade_backup_<prev_version>-<new_version>_<date>.zip.

To restore the data on the un-upgraded node, follow these steps:

  1. Copy the backup file from the upgraded node to the un-upgraded node using the following command:


    scp /opt/ufm/BACKUP/ufm_upgrade_backup_<prev_version>-<new_version>_<date>.zip root@<unupgraded_node_ip>:/opt/ufm/BACKUP/

  2. Fail UFM over to the master node. This step is mandatory because it migrates the data mounts (including /opt/ufm/files) to the master node. On the master node, run:


    ufm_ha_cluster takeover

  3. Stop UFM on the un-upgraded node:


    ufm_ha_cluster stop

  4. Restore UFM configuration files from the backup:


    /opt/ufm/scripts/ufm_restore.sh -f /opt/ufm/BACKUP/ufm_upgrade_backup_<prev_version>-<new_version>_<date>.zip

  5. Start UFM on the un-upgraded node (Note: Only the upgraded node can function until the upgrade issue is resolved, and failovers will not work).

Now, the issue that caused the upgrade failure can be addressed. If the problem is resolved, you can attempt the in-service upgrade again by failing UFM over to the upgraded node.

Alternatively, if needed, you can revert the changes made by reinstalling the old UFM version on the upgraded node.
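Before starting the restore steps above, it can be useful to confirm which upgrade backups exist on a node. The following Python sketch lists files in the backup directory whose names start with ufm_upgrade_backup_ and end with .zip, newest first. The function name `find_upgrade_backups` is a hypothetical helper, and sorting by file modification time is an assumption made because the exact <date> format in the file name is not specified here.

```python
from pathlib import Path


def find_upgrade_backups(backup_dir="/opt/ufm/BACKUP"):
    """Return upgrade backup archives in backup_dir, newest first (by mtime)."""
    candidates = Path(backup_dir).glob("ufm_upgrade_backup_*.zip")
    return sorted(candidates, key=lambda path: path.stat().st_mtime, reverse=True)


for backup in find_upgrade_backups():
    print(backup)
```

The first path printed is the most recently modified backup, which is a reasonable candidate to pass to ufm_restore.sh with -f after verifying its versions.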

Refer to the NVIDIA UFM High-Availability User Guide for HA monitoring and troubleshooting.

Split-Brain Recovery in HA Installation

The split-brain problem is a DRBD synchronization issue (HA status shows DUnknown in the DRBD disk state) that occurs when both HA nodes are rebooted, for example after a power outage. To recover, follow the steps below:

  • Step 1: Manually choose a node where data modifications will be discarded.

    It is called the split-brain victim. Choose wisely; all modifications will be lost! When in doubt, run a backup of the victim’s data before you continue.

    When running a Pacemaker cluster, you can enable maintenance mode. If the split-brain victim is in the Primary role, bring down all applications using this resource. Now switch the victim to the Secondary role:


    victim# drbdadm secondary ha_data 

  • Step 2: Disconnect the resource if it’s in connection state WFConnection:


    victim# drbdadm disconnect ha_data 

  • Step 3: Force discard of all modifications on the split-brain victim:


    victim# drbdadm -- --discard-my-data connect ha_data 

    For DRBD 8.4.x:


    victim# drbdadm connect --discard-my-data ha_data 

  • Step 4: Resync starts automatically if the survivor is in the WFConnection connection state. If the split-brain survivor is still in the StandAlone connection state, reconnect it:


    survivor# drbdadm connect ha_data 

    Now the resynchronization from the survivor (SyncSource) to the victim (SyncTarget) starts immediately. There is no full sync initiated, but all modifications on the victim will be overwritten by the survivor’s data, and modifications on the survivor will be applied to the victim.
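The conditional parts of the procedure above (disconnect only when the victim is in WFConnection, reconnect the survivor only when it is in StandAlone) can be written down as a small planning function. The Python sketch below only assembles the commands from Steps 1 to 4 in order and prints them. It does not run anything, and it assumes the connection state of each node has already been read by the operator. The function name `recovery_commands` and its parameters are hypothetical.

```python
RESOURCE = "ha_data"  # DRBD resource name used in the steps above


def recovery_commands(victim_cstate, survivor_cstate, drbd_84=False):
    """Return the ordered (node, command) pairs for the split-brain steps above.

    victim_cstate and survivor_cstate are the DRBD connection states observed
    on each node; they are compared case-insensitively.
    """
    # Step 1: switch the victim to the Secondary role.
    plan = [("victim", f"drbdadm secondary {RESOURCE}")]

    # Step 2: disconnect only if the victim is waiting for a connection.
    if victim_cstate.lower() == "wfconnection":
        plan.append(("victim", f"drbdadm disconnect {RESOURCE}"))

    # Step 3: discard the victim's modifications (syntax differs for DRBD 8.4.x).
    if drbd_84:
        discard = f"drbdadm connect --discard-my-data {RESOURCE}"
    else:
        discard = f"drbdadm -- --discard-my-data connect {RESOURCE}"
    plan.append(("victim", discard))

    # Step 4: reconnect the survivor only if it is still standalone.
    if survivor_cstate.lower() == "standalone":
        plan.append(("survivor", f"drbdadm connect {RESOURCE}"))
    return plan


for node, command in recovery_commands("WFConnection", "StandAlone"):
    print(f"{node}# {command}")
```

Printing the plan first gives the operator a chance to review which node is treated as the victim before any data is discarded.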

Performing Failover on Non-Master Node

The ufm_ha_cluster failover action fails with the following error: "Cannot perform failover on non-master node". To fix this, follow the steps below:

  • Step 1: Verify that the /etc/hosts file on both the master and standby UFM hosts contains the correct mapping of host names to IP addresses.

  • Step 2: If necessary, fix the mapping and retry the failover command.
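The check in Step 1 can be done by eye, or with a short script run on each host. The Python sketch below parses hosts-file text and reports every expected host name that is missing or mapped to a different address. The function names `parse_hosts` and `check_mapping`, the host names ufm-node1 and ufm-node2, and the 192.0.2.x addresses are all hypothetical placeholders; substitute the real names and addresses of the master and standby hosts.

```python
def parse_hosts(text):
    """Map each host name or alias in hosts-file text to the set of IPs it is bound to."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, set()).add(ip)
    return mapping


def check_mapping(hosts_text, expected):
    """Return {host: found_ips} for every expected host that is missing or mismatched."""
    mapping = parse_hosts(hosts_text)
    problems = {}
    for host, ip in expected.items():
        found = mapping.get(host, set())
        if found != {ip}:
            problems[host] = sorted(found)
    return problems


# Hypothetical hosts-file content and expected mapping, for illustration only.
SAMPLE_HOSTS = """
127.0.0.1   localhost
192.0.2.10  ufm-node1   # master
192.0.2.21  ufm-node2   # standby, address differs from the expected one
"""
EXPECTED = {"ufm-node1": "192.0.2.10", "ufm-node2": "192.0.2.11"}

print(check_mapping(SAMPLE_HOSTS, EXPECTED))
```

On a real host, read the text from /etc/hosts and run the same check on both the master and the standby; an empty result means the expected mappings are present.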

© Copyright 2025, NVIDIA. Last updated on Nov 20, 2025