NVSM Call Home
The NVIDIA System Manager (NVSM) Call Home, when enabled and with an internet connection, provides additional automation to NVSM health monitoring functionality. Instead of having to contact NVIDIA Enterprise Support to report critical alerts from NVSM, submit system logs, nvsm dump health files, and DGX serial numbers to create a support ticket, NVSM Call Home automates those tasks. This reduces overall turnaround time for resolving issues.
NVSM Call Home Overview
When NVSM raises a critical alert, NVSM Call Home performs the following actions:
Proactively pushes Critical level alerts to NVIDIA Enterprise Services.
Collects the system and nvsm dumps and the system serial number, and uploads them to NVIDIA Enterprise Services.
Creates a Support Case on the Enterprise Support portal.
NVSM Call Home also clears resolved alerts and pushes the updated status to NVIDIA Enterprise Services.
The figure below illustrates the end-to-end flow of the NVSM Call Home feature.
When NVSM Call Home is enabled on the DGX system and a critical alert is raised, the NVSM daemon on the DGX system initiates an HTTPS connection to the secured NVIDIA Enterprise Services backend and communicates the alert details and logs using the RESTful interface. The information is validated and then a new Support Ticket is created on the NVIDIA Enterprise Support Portal. Communication of all alerts, including status changes for the alerts, is through REST calls.
NVSM Call Home operates in three different modes. It is not enabled by default, so to use NVSM Call Home you must enable one of the following modes:
Policy-enabled “automatic” Mode
This mode batches alert submissions at regular intervals and then pushes them to the NVIDIA Enterprise Support portal.
An internet connection is required.
To enable, issue
$ sudo nvsm set /policy callhome_enable=true
See the section Using NVSM Call Home in Automatic Mode for details as well as configuration options for automatic mode.
Policy-enabled Offline Mode
This mode is useful for air-gapped or highly-secured environments where access to the internet is limited.
Instead of sending batched submissions to NVIDIA Enterprise Support, the alert and system information are stored on the local system. Users need to manually provide the gathered information to NVIDIA Enterprise Services to create a support case.
To enable, issue
$ sudo nvsm set /policy offline_callhome_enable=true
See the section Using NVSM Offline Call Home for details as well as configuration options for offline mode.
On-demand Mode
This mode gathers alert information and creates a submission on-the-fly to the Enterprise Support Portal.
An internet connection is required.
To initiate a Call Home submission on-demand, issue
$ sudo nvsm set /callhome trigger=true
See the section Using NVSM On-Demand Mode for details as well as configuration options for on-demand mode.
Using NVSM Call Home
With an internet connection, NVSM Call Home can operate in two modes:
Automatic Mode - NVSM Call Home operates automatically at regular intervals.
On-demand Mode - NVSM Call Home sequence is initiated manually.
You can also set up NVSM Call Home to run offline; for example, on air-gapped systems.
Prerequisites for Using NVSM Call Home in Automatic or On-Demand Mode
Enabling Ports
Since NVSM Call Home communicates with the external NVIDIA server, port 443 must be enabled prior to operating NVSM Call Home.
Enabling Access
You need to register your system for NVSM Call Home so that the NVIDIA Services Cloud recognizes the system. Contact NVIDIA Enterprise Services to set up NVSM Call Home for your DGX system.
Validating NVSM Call Home Readiness
Before using NVSM Call Home, make sure the server is ready to support NVSM Call Home by performing a diagnostic test. The test does not create a ticket with NVIDIA Enterprise Services, but does test that the system is able to communicate with the NVIDIA Enterprise Services infrastructure.
To run the diagnostic, issue the following:
$ sudo nvsm set /callhome trigger=true diagtest=true
Note
This uses the on-demand mode of NVSM Call Home, explained in more detail in the section Using NVSM On-Demand Mode.
To see the result of the last diagnostic test run, issue the following:
$ sudo nvsm show /callhome
Example output confirming that the setup is ready for Call Home operation; Op_DiagTest = True and Op_State = Succeeded indicate that the diagnostic test ran and passed.
/callhome
Properties:
Trigger = False
Op_Description = User initiated call home operation.
Op_DiagTest = True
Op_CaseId = none
Op_State = Succeeded
Op_StartTime = 2019-06-24T06:10:17Z
Op_Message = Call Home operation Succeeded
Op_Email =
If the output reports errors or failures, contact NVIDIA technical support for assistance.
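The readiness check can also be scripted. The following Python sketch parses the text printed by nvsm show /callhome and reports whether the last diagnostic test succeeded. It assumes the output keeps the "Key = Value" layout shown in the example above; the function names parse_nvsm_properties and diagtest_passed are illustrative and are not part of NVSM.

```python
def parse_nvsm_properties(text):
    """Parse 'Key = Value' lines from `nvsm show` output into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines such as '/callhome' and 'Properties:'
            props[key.strip()] = value.strip()
    return props


def diagtest_passed(props):
    """True when the last operation was a diagnostic test that succeeded."""
    return (
        props.get("Op_DiagTest") == "True"
        and props.get("Op_State") == "Succeeded"
    )


# Sample text copied from the example output above.
sample = """/callhome
Properties:
Trigger = False
Op_Description = User initiated call home operation.
Op_DiagTest = True
Op_CaseId = none
Op_State = Succeeded
Op_StartTime = 2019-06-24T06:10:17Z
Op_Message = Call Home operation Succeeded
Op_Email =
"""

props = parse_nvsm_properties(sample)
print(diagtest_passed(props))  # prints True for the sample above
```

On a real system, the sample string would be replaced by the captured output of sudo nvsm show /callhome.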
Using NVSM Call Home in Automatic Mode
When automatic mode is enabled, NVSM monitors the server continuously and pushes critical or cleared alerts to NVESC and creates a support case on behalf of the registered user.
Automatic Mode Syntax
To enable automatic mode, first configure the email contact.
$ sudo nvsm set /policy callhome_email_contact=<email>
then enable Call Home.
$ sudo nvsm set /policy callhome_enable=true [callhome_batch_interval=<time-in-seconds>]
You can also configure the email contact and enable Call Home in the same command.
$ sudo nvsm set /policy callhome_email_contact=<email> callhome_enable=true [callhome_batch_interval=<time-in-seconds>]
Automatic Mode Configuration Arguments
Configure NVSM Call Home using the following parameters:
callhome_email_contact
Sets the contact email address. This should belong to a registered user of the NVIDIA Enterprise Support Portal. The address is embedded in the case/ticket created in the NVIDIA Enterprise Support Portal.
callhome_batch_interval
(Optional) Enabling automatic mode batches alert submissions at regular intervals and then pushes them to the NVIDIA Enterprise Support portal. Any raised alerts within that time frame will be sent (as individual Support Cases). By default, the interval is 600 seconds (10 minutes), but you can use this option to specify other intervals (in seconds).
Automatic Mode Example
The following example illustrates how to use these parameters.
$ sudo nvsm set /policy callhome_email_contact=123@example.com callhome_enable=true callhome_batch_interval=610
Verifying Automatic Mode Status
To verify the status of the current setup, issue the following.
$ sudo nvsm show /policy
Example output:
/policy
Properties:
callhome_batch_interval = 610
callhome_email_contact = 123@example.com
callhome_enable = True
email_recipients =
email_sender =
email_smtp_server_name =
email_smtp_server_port = 0
callhome_enable = True indicates that Call Home automatic mode is enabled.
Disabling Automatic Mode
The Call Home automatic mode will start listening for alerts and raise support cases in the background. If there are any maintenance activities such as reseating or swapping components that would cause NVSM to generate critical alerts, Call Home will raise support cases as well.
To avoid raising support cases during intentional maintenance activities, disable call-home by issuing the following.
$ sudo nvsm set /policy callhome_enable=false
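If maintenance is scripted, disabling and re-enabling automatic mode can be wrapped around the maintenance steps. The Python sketch below is one possible way to do that using the nvsm set /policy commands shown in this section. The names set_callhome and maintenance_window are hypothetical helpers, and the sketch assumes that automatic mode was enabled before maintenance began, since it always re-enables it afterwards.

```python
import subprocess
from contextlib import contextmanager


def set_callhome(enabled, run=subprocess.run):
    """Run `sudo nvsm set /policy callhome_enable=<true|false>`."""
    value = "true" if enabled else "false"
    run(["sudo", "nvsm", "set", "/policy", f"callhome_enable={value}"],
        check=True)


@contextmanager
def maintenance_window(run=subprocess.run):
    """Disable automatic Call Home for the duration of the with-block.

    Assumption: automatic mode was enabled beforehand, so it is
    unconditionally re-enabled on exit.
    """
    set_callhome(False, run=run)
    try:
        yield
    finally:
        # Re-enable even if the maintenance step raised an exception.
        set_callhome(True, run=run)
```

Placing the maintenance steps inside a "with maintenance_window():" block keeps Call Home disabled only while they run. The run parameter exists so that the command runner can be replaced, for example in a dry run.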
Using NVSM On-Demand Mode
NVSM Call Home On-Demand mode is a user-triggered call-home action. Triggering Call Home on-demand creates a Support Case with NVIDIA Enterprise Support Portal that includes a captured system dump (‘nvsm dump health’). NVSM Call Home On-Demand can be used whether or not automatic mode is enabled.
On-Demand Mode Syntax
To trigger an NVSM Call Home sequence on-demand, issue the following.
$ sudo nvsm set /callhome trigger=true [description="<description>"] [email=<email>]
To cancel an on-demand Call Home in progress, issue the following.
$ sudo nvsm set /callhome trigger=false
See the next section for an explanation of the optional parameters.
On-Demand Mode Configuration Options
You can configure NVSM Call Home triggered on-demand using the following parameters:
email
This option sets an email-id. The email gets embedded in the case/ticket created in NVIDIA Enterprise Support Portal.
description
This option lets you describe the purpose or the details for triggering On-Demand Call Home.
Examples of descriptive strings:
"Testing"
"System running low in performance, takes several minutes to perform an nvidia-smi command."
On-Demand Example
The following example illustrates how to use these optional parameters.
$ sudo nvsm set /callhome trigger=true description="testing" email=123@example.com
Verifying On-Demand Status
To check the status of a Call Home sequence initiated on-demand, issue the following.
$ sudo nvsm show /callhome
The following example output shows the progress of the Call Home sequence.
/callhome
Properties:
Trigger = True
Op_Description = testing
Op_CaseId = none
Op_State = Running
Op_StartTime = 2019-06-12T08:28:45Z
Op_Message = Collecting logs
Op_Email = 123@example.com
The following example output shows that a case ID was created in the NVIDIA Enterprise Support portal.
/callhome
Properties:
Trigger = False
Op_Description = testing
Op_CaseId = 0001XXX
Op_State = Succeeded
Op_StartTime = 2019-06-12T08:28:45Z
Op_Message = Call Home operation Succeeded
Op_Email = 123@example.com
Trigger = False indicates that the on-demand sequence is not running; in this case, because it has completed.
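Rather than re-running the status command by hand, a script can poll until the sequence leaves the Running state. The following Python sketch shows one way to do this. It assumes the "Key = Value" output layout shown above; the function names, the 10-second polling interval, and the 30-minute timeout are illustrative choices, not NVSM defaults.

```python
import subprocess
import time


def show_callhome():
    """Return the text printed by `sudo nvsm show /callhome`."""
    result = subprocess.run(
        ["sudo", "nvsm", "show", "/callhome"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_properties(text):
    """Turn 'Key = Value' lines into a dict, ignoring other lines."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def wait_for_callhome(fetch=show_callhome, interval=10.0, timeout=1800.0):
    """Poll until Op_State is no longer 'Running'; return the final properties."""
    deadline = time.monotonic() + timeout
    while True:
        props = parse_properties(fetch())
        if props.get("Op_State") != "Running":
            return props
        if time.monotonic() >= deadline:
            raise TimeoutError("Call Home operation still running")
        time.sleep(interval)
```

The returned dictionary carries the final Op_State, Op_Message, and Op_CaseId values, so the caller can record the case ID once the operation has succeeded.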
Using NVSM Offline Call Home
To support DGX systems installed in air-gapped or highly-secured environments where access to the internet is limited, NVSM Call Home can be operated in offline mode (Offline Call Home). Like standard Call Home, Offline Call Home software proactively monitors the health of the DGX system and automatically:
Collects system dump and logs.
Collects alerts and system information.
However, instead of sending the information to NVIDIA Enterprise Services, NVSM Offline Call Home stores the information in a user-specified directory on the DGX system. Also, unlike standard Call Home, Offline Call Home operates in automatic mode only; there is no on-demand mode in Offline Call Home.
Prerequisites
Offline Call Home and standard Call Home cannot be enabled at the same time. To ensure that standard automatic-mode Call Home is not enabled, issue the following before enabling Offline Call Home.
$ sudo nvsm set /policy callhome_enable=false
Enabling Offline Call Home
As with standard Call Home, use the nvsm set /policy command to enable Offline Call Home.
$ sudo nvsm set /policy offline_callhome_enable=true \
offline_callhome_dump_destination_location=<path/to/location> \
offline_callhome_batch_interval=<batch-interval> \
offline_callhome_no_of_dumps_allowed=<number>
Offline Call Home Configuration Options
offline_callhome_dump_destination_location
By default, Offline Call Home stores the system logs at /var/log/nvsm_offline_callhome. You can set a different location using this option.
offline_callhome_batch_interval
Enabling Offline Call Home creates a batch of alerts at regular intervals and then pushes them to local storage. By default, the interval is 600 seconds (10 minutes), but you can use this option to specify other intervals (in seconds).
offline_callhome_no_of_dumps_allowed
By default, NVSM Offline Call Home will store up to 9999999 different log files, but you can use this option to specify a smaller number and so limit the space used.
Example of Enabling Offline Call Home
The following example illustrates how to use these parameters.
$ sudo nvsm set /policy \
offline_callhome_enable=true \
offline_callhome_dump_destination_location=/tmp/offline_callhome_dump \
offline_callhome_batch_interval=610 \
offline_callhome_no_of_dumps_allowed=10
Verifying the Offline Call Home Configuration
To verify the status of the current setup, issue the following.
$ sudo nvsm show /policy
Example output showing the offline Call Home policy details.
/policy
Properties:
offline_callhome_batch_interval = 610
offline_callhome_enable = True
offline_callhome_dump_destination_location = /tmp/offline_callhome_dump
offline_callhome_no_of_dumps_allowed = 10
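The policy values can also be checked programmatically against the prerequisite that standard and offline Call Home are not enabled together. The Python sketch below works on policy properties that have already been read into a dictionary of strings. The function name offline_policy_problems is hypothetical, and the sketch assumes that callhome_enable would appear among the /policy properties with the value True if standard automatic mode were enabled.

```python
def offline_policy_problems(props):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if props.get("offline_callhome_enable") != "True":
        problems.append("offline_callhome_enable is not True")
    # Standard and offline Call Home cannot be enabled at the same time.
    if props.get("callhome_enable") == "True":
        problems.append(
            "callhome_enable must be false when offline mode is enabled"
        )
    return problems


# Values copied from the example output above.
policy = {
    "offline_callhome_batch_interval": "610",
    "offline_callhome_enable": "True",
    "offline_callhome_dump_destination_location": "/tmp/offline_callhome_dump",
    "offline_callhome_no_of_dumps_allowed": "10",
}
print(offline_policy_problems(policy))  # prints []
```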
Verifying Contents of the Dump File
The contents of each batch are stored in a tar file.
Tar file naming format:
offlinecallhome-nvsm-health_<timestamp>_<serial-number>.tar.xz
The tar file contains:
System dump
JSON metadata file that lists the system and critical alerts.
JSON file naming format:
offlinecallhome_notifications_<timestamp>_<hostname>_<serial-number>.json
JSON file format showing the type of data included:
{
  "system_serial": "<serial number>",
  "system_name": "<hostname>",
  "notifications": [
    {
      "alert_id": "",
      "clear_time": "-",
      "component_id": "",
      "description": "",
      "event_time": "",
      "message": "",
      "message_details": ".",
      "recommended_action": "",
      "severity": "",
      "system_name": "",
      "system_serial": "",
      "type": ""
    }
  ]
}
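Before handing a dump to NVIDIA Enterprise Services, it can be useful to inspect which alerts it contains. The Python sketch below opens an Offline Call Home archive, locates the JSON metadata file by the naming format given above, and summarizes its notifications. The function names read_notifications and summarize are illustrative. The sketch assumes the JSON file may sit at any directory depth inside the archive, because the internal layout is not described here.

```python
import json
import tarfile
from fnmatch import fnmatch


def read_notifications(archive_path):
    """Return the parsed JSON metadata from an Offline Call Home archive."""
    with tarfile.open(archive_path, "r:xz") as tar:
        for member in tar.getmembers():
            # Compare only the file name, ignoring any directory prefix.
            name = member.name.rsplit("/", 1)[-1]
            if member.isfile() and fnmatch(
                name, "offlinecallhome_notifications_*.json"
            ):
                with tar.extractfile(member) as fh:
                    return json.load(fh)
    raise FileNotFoundError(
        "no notifications JSON found in " + str(archive_path)
    )


def summarize(metadata):
    """Return (system_serial, system_name, [(severity, alert_id, message), ...])."""
    alerts = [
        (n.get("severity", ""), n.get("alert_id", ""), n.get("message", ""))
        for n in metadata.get("notifications", [])
    ]
    return metadata.get("system_serial"), metadata.get("system_name"), alerts
```

Calling summarize(read_notifications(path)) with the path to one of the .tar.xz files in the dump destination returns the serial number, the hostname, and one tuple per recorded alert.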