NVSM Call Home

The NVIDIA System Manager (NVSM) Call Home feature, when enabled on a system with an internet connection, adds automation to NVSM health monitoring. Instead of contacting NVIDIA Enterprise Support yourself to report critical NVSM alerts and submitting system logs, nvsm dump health files, and DGX serial numbers to create a support ticket, you can let NVSM Call Home perform those tasks. This reduces the overall turnaround time for resolving issues.

NVSM Call Home Overview

When NVSM raises a critical alert, NVSM Call Home performs the following actions:

  • Proactively pushes Critical level alerts to NVIDIA Enterprise Services.

  • Collects the system and nvsm dumps and the system serial number, and uploads them to NVIDIA Enterprise Services.

  • Creates a Support Case on the Enterprise Support portal.

NVSM Call Home also clears resolved alerts and pushes the updated status to NVIDIA Enterprise Services.

The figure below illustrates the end-to-end flow of the NVSM Call Home feature.

[Figure: NVSM Call Home end-to-end flow (_images/nvsm-call-home.png)]

When NVSM Call Home is enabled on the DGX system and a critical alert is raised, the NVSM daemon on the DGX system initiates an HTTPS connection to the secured NVIDIA Enterprise Services backend and communicates the alert details and logs using the RESTful interface. The information is validated and then a new Support Ticket is created on the NVIDIA Enterprise Support Portal. Communication of all alerts, including status changes for the alerts, is through REST calls.

NVSM Call Home operates in three different modes. It is not enabled by default, so to use NVSM Call Home you must enable one of the following modes:

Policy-enabled “automatic” Mode

  • This mode batches alert submissions at regular intervals and then pushes them to the NVIDIA Enterprise Support portal.

  • An internet connection is required.

  • To enable, issue

    $ sudo nvsm set /policy callhome_enable=true
    

See the section Using NVSM Call Home in Automatic Mode for details as well as configuration options for automatic mode.

Policy-enabled Offline Mode

  • This mode is useful for air-gapped or highly-secured environments where access to the internet is limited.

  • Instead of sending batched submissions to NVIDIA Enterprise Support, the alert and system information are stored on the local system. Users need to manually provide the gathered information to NVIDIA Enterprise Services to create a support case.

  • To enable, issue

    $ sudo nvsm set /policy offline_callhome_enable=true
    

See the section Using NVSM Offline Call Home for details as well as configuration options for offline mode.

On-demand Mode

  • This mode gathers alert information and creates a submission on-the-fly to the Enterprise Support Portal.

  • An internet connection is required.

  • To initiate a Call Home submission on-demand, issue

    $ sudo nvsm set /callhome trigger=true
    

See the section Using NVSM On-Demand Mode for details as well as configuration options for on-demand mode.

Using NVSM Call Home

With an internet connection, NVSM Call Home can operate in two modes:

  • Automatic Mode - NVSM Call Home operates automatically at regular intervals.

  • On-demand Mode - NVSM Call Home sequence is initiated manually.

You can also set up NVSM Call Home to run offline; for example, on air-gapped systems.

Prerequisites for Using NVSM Call Home in Automatic or On-Demand Mode

Enabling Ports

Since NVSM Call Home communicates with the external NVIDIA server, port 443 must be enabled prior to operating NVSM Call Home.

Enabling Access

You need to register your system for NVSM Call Home so that the NVIDIA Services Cloud recognizes the system. Contact NVIDIA Enterprise Services to set up NVSM Call Home for your DGX system.

Validating NVSM Call Home Readiness

Before using NVSM Call Home, make sure the server is ready to support NVSM Call Home by performing a diagnostic test. The test does not create a ticket with NVIDIA Enterprise Services, but does test that the system is able to communicate with the NVIDIA Enterprise Services infrastructure.

To run the diagnostic, issue the following:

$ sudo nvsm set /callhome trigger=true diagtest=true

Note

This uses the on-demand mode of NVSM Call Home, explained in more detail in the section Using NVSM On-Demand Mode.

To see the result of the last diagnostic test run, issue the following:

$ sudo nvsm show /callhome

Example output confirming the setup is ready for Call Home operation. Check that Op_DiagTest is True and Op_State is Succeeded.

/callhome
Properties:
    Trigger = False
    Op_Description = User initiated call home operation.
    Op_DiagTest = True
    Op_CaseId = none
    Op_State = Succeeded
    Op_StartTime = 2019-06-24T06:10:17Z
    Op_Message = Call Home operation Succeeded
    Op_Email =

If the output reports errors or failures, contact NVIDIA technical support for assistance.
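
The readiness check above can also be scripted. The following is a minimal sketch, assuming the output keeps the "Name = Value" layout shown in the example; the helper names callhome_prop and callhome_diag_ok are hypothetical and are not part of NVSM.

```shell
# callhome_prop NAME: read `nvsm show /callhome` output on stdin and print
# the value of property NAME (hypothetical helper, not part of NVSM).
callhome_prop() {
    awk -v key="$1" '
        {
            pos = index($0, "=")
            if (pos == 0) next                  # skip lines without a property
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}

# callhome_diag_ok: succeed only if the output on stdin reports a diagnostic
# run (Op_DiagTest = True) that finished with Op_State = Succeeded.
callhome_diag_ok() {
    local out
    out=$(cat)
    [ "$(printf '%s\n' "$out" | callhome_prop Op_DiagTest)" = "True" ] &&
    [ "$(printf '%s\n' "$out" | callhome_prop Op_State)" = "Succeeded" ]
}
```

With these functions defined, a check could look like: sudo nvsm show /callhome | callhome_diag_ok && echo "Call Home ready"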

Using NVSM Call Home in Automatic Mode

When automatic mode is enabled, NVSM monitors the server continuously, pushes critical or cleared alerts to NVESC, and creates a support case on behalf of the registered user.

Automatic Mode Syntax

To enable automatic mode, first configure the email contact.

$ sudo nvsm set /policy callhome_email_contact=<email>

Then enable Call Home.

$ sudo nvsm set /policy callhome_enable=true [callhome_batch_interval=<time-in-seconds>]

You can also configure the email contact and enable Call Home in the same command.

$ sudo nvsm set /policy callhome_enable=true callhome_email_contact=<email> [callhome_batch_interval=<time-in-seconds>]

Automatic Mode Configuration Arguments

Configure NVSM Call Home using the following parameters:

  • callhome_email_contact

    Sets the contact email address. This should belong to a registered user of the NVIDIA Enterprise Support Portal. The address is embedded in the case (ticket) created in the NVIDIA Enterprise Support Portal.

  • callhome_batch_interval

    (Optional) Automatic mode batches alert submissions at regular intervals and then pushes them to the NVIDIA Enterprise Support portal. Alerts raised within an interval are sent together, each as an individual Support Case. The default interval is 600 seconds (10 minutes); use this option to specify a different interval, in seconds.

Automatic Mode Example

The following example illustrates how to use these parameters.

$ sudo nvsm set /policy callhome_enable=true callhome_email_contact=123@example.com callhome_batch_interval=610

Verifying Automatic Mode Status

To verify the status of the current setup, issue the following.

$ sudo nvsm show /policy

Example output:

/policy
Properties:
   callhome_batch_interval = 610
   callhome_email_contact = 123@example.com
   callhome_enable = True
   email_recipients =
   email_sender =
   email_smtp_server_name =
   email_smtp_server_port = 0

callhome_enable = True indicates that Call Home automatic mode is enabled.

Disabling Automatic Mode

Once enabled, Call Home automatic mode listens for alerts and raises support cases in the background. Maintenance activities that cause NVSM to generate critical alerts, such as reseating or swapping components, will therefore also raise support cases.

To avoid raising support cases during intentional maintenance activities, disable call-home by issuing the following.

$ sudo nvsm set /policy callhome_enable=false
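
To avoid forgetting to turn Call Home back on after maintenance, the disable and enable steps can be wrapped around the maintenance command. This is a sketch only: with_callhome_paused and the NVSM variable are hypothetical names, and the function assumes automatic mode was enabled before the maintenance started.

```shell
# Command used to reach NVSM; override NVSM for dry runs (hypothetical variable).
NVSM=${NVSM:-"sudo nvsm"}

# with_callhome_paused CMD [ARGS...]: disable automatic Call Home, run CMD,
# then re-enable Call Home and return CMD's exit status.
with_callhome_paused() {
    local status
    $NVSM set /policy callhome_enable=false || return 1
    "$@"
    status=$?
    # Re-enable even if the maintenance command failed.
    $NVSM set /policy callhome_enable=true ||
        echo "warning: could not re-enable Call Home" >&2
    return "$status"
}
```

For example, with_callhome_paused ./swap_fan.sh would run a (hypothetical) maintenance script with automatic mode disabled. The sketch does not handle interruption by a signal; in that case, re-enable Call Home manually.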

Using NVSM On-Demand Mode

NVSM Call Home On-Demand mode is a user-triggered call-home action. Triggering Call Home on-demand creates a Support Case with NVIDIA Enterprise Support Portal that includes a captured system dump (‘nvsm dump health’). NVSM Call Home On-Demand can be used whether or not automatic mode is enabled.

On-Demand Mode Syntax

To trigger an NVSM Call Home sequence on-demand, issue the following.

$ sudo nvsm set /callhome trigger=true [description="<description>"] [email=<email>]

To cancel an on-demand Call Home in progress, issue the following.

$ sudo nvsm set /callhome trigger=false

See the next section for an explanation of the optional parameters.

On-Demand Mode Configuration Options

You can configure NVSM Call Home triggered on-demand using the following parameters:

  • email

    This option sets an email address. The address is embedded in the case (ticket) created in the NVIDIA Enterprise Support Portal.

  • description

    This option lets you describe the purpose or the details for triggering On-Demand Call Home.

    Examples of descriptive strings:

    "Testing"

    "System running low in performance, takes several minutes to perform an nvidia-smi command."

On-Demand Example

The following example illustrates how to use these optional parameters.

$ sudo nvsm set /callhome trigger=true description="testing" email=123@example.com

Verifying On-Demand Status

To check the status of a Call Home sequence initiated on-demand, issue the following.

$ sudo nvsm show /callhome

The following example output shows the progress of the Call Home sequence.

/callhome
Properties:
   Trigger = True
   Op_Description = testing
   Op_CaseId = none
   Op_State = Running
   Op_StartTime = 2019-06-12T08:28:45Z
   Op_Message = Collecting logs
   Op_Email = 123@example.com

The following example output shows that a case ID was created in the NVIDIA Enterprise Support portal.

/callhome
Properties:
   Trigger = False
   Op_Description = testing
   Op_CaseId = 0001XXX
   Op_State = Succeeded
   Op_StartTime = 2019-06-12T08:28:45Z
   Op_Message = Call Home operation Succeeded
   Op_Email = 123@example.com

Trigger = False indicates that the on-demand sequence is not running - in this case because it has completed.
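
Because the on-demand sequence runs in the background, a script that triggers it may need to wait for the result. The following sketch polls the status until Op_State is no longer Running. It assumes Op_State reads Running while the sequence is in progress and Succeeded on success, as in the examples above; any other value is treated as a failure. The names wait_for_callhome and SHOW_CALLHOME are hypothetical.

```shell
# Command that prints the /callhome target; override for testing
# (hypothetical variable).
SHOW_CALLHOME=${SHOW_CALLHOME:-"sudo nvsm show /callhome"}

# wait_for_callhome [INTERVAL] [MAX_POLLS]: poll until Op_State is no longer
# "Running", print the last state seen, and succeed only if it is "Succeeded".
wait_for_callhome() {
    local interval="${1:-10}"
    local max_polls="${2:-60}"
    local state=""
    local i=0
    while [ "$i" -lt "$max_polls" ]; do
        # Extract the value from the "Op_State = <value>" line.
        state=$($SHOW_CALLHOME | sed -n 's/^[[:space:]]*Op_State = //p')
        [ "$state" != "Running" ] && break
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$state"
    [ "$state" = "Succeeded" ]
}
```

Called as wait_for_callhome 10 60, the function checks every 10 seconds for up to 60 polls. If the sequence is still running after the last poll, it prints Running and returns a non-zero status.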

Using NVSM Offline Call Home

To support DGX systems installed in air-gapped or highly-secured environments where access to the internet is limited, NVSM Call Home can be operated in offline mode (Offline Call Home). Like standard Call Home, Offline Call Home proactively monitors the health of the DGX system and automatically:

  • Collects system dump and logs, and

  • Collects alerts and system information.

However, instead of sending the information to NVIDIA Enterprise Services, NVSM Offline Call Home stores the information in a user-specified directory on the DGX system. Also, unlike standard Call Home, Offline Call Home operates in automatic mode only; there is no on-demand mode in Offline Call Home.

Prerequisites

Offline Call Home and standard Call Home cannot be enabled at the same time. To ensure that standard automatic-mode Call Home is not enabled, issue the following before enabling Offline Call Home.

$ sudo nvsm set /policy callhome_enable=false

Enabling Offline Call Home

As with standard Call Home, use the nvsm set /policy command to enable Offline Call Home.

$ sudo nvsm set /policy offline_callhome_enable=true \
offline_callhome_dump_destination_location=<path/to/location> \
offline_callhome_batch_interval=<batch-interval> \
offline_callhome_no_of_dumps_allowed=<number>

Offline Call Home Configuration Options

  • offline_callhome_dump_destination_location

    By default, Offline Call Home stores the system logs at /var/log/nvsm_offline_callhome. You can set a different location using this option.

  • offline_callhome_batch_interval

    Enabling Offline Call Home creates a batch of alerts at regular intervals and then pushes them to local storage. By default, the interval is 600 seconds (10 minutes), but you can use this option to specify other intervals (in seconds).

  • offline_callhome_no_of_dumps_allowed

    By default, NVSM Offline Call Home will store up to 9999999 log files; specify a smaller number to limit the space allocated for them.

Example of Enabling Offline Call Home

The following example illustrates how to use these parameters.

$ sudo nvsm set /policy \
offline_callhome_enable=true \
offline_callhome_dump_destination_location=/tmp/offline_callhome_dump \
offline_callhome_batch_interval=610 \
offline_callhome_no_of_dumps_allowed=10
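
The prerequisite step and the enable step can be combined so that standard Call Home is always disabled first. The following is a sketch under stated assumptions: enable_offline_callhome and the NVSM variable are hypothetical names, the defaults are the documented ones (600 seconds, 9999999 dumps, /var/log/nvsm_offline_callhome), and the integer check is an added safeguard rather than documented NVSM behavior.

```shell
# Command used to reach NVSM; override NVSM for dry runs (hypothetical variable).
NVSM=${NVSM:-"sudo nvsm"}

# enable_offline_callhome [DEST] [INTERVAL] [MAX_DUMPS]: disable standard
# automatic-mode Call Home, then enable Offline Call Home.
enable_offline_callhome() {
    local dest="${1:-/var/log/nvsm_offline_callhome}"
    local interval="${2:-600}"
    local max_dumps="${3:-9999999}"
    local n
    # Reject non-numeric or zero values before touching the policy.
    for n in "$interval" "$max_dumps"; do
        case "$n" in
            *[!0-9]*|0) echo "not a positive integer: $n" >&2; return 2 ;;
        esac
    done
    # Prerequisite: standard and offline Call Home cannot both be enabled.
    $NVSM set /policy callhome_enable=false || return 1
    $NVSM set /policy offline_callhome_enable=true \
        offline_callhome_dump_destination_location="$dest" \
        offline_callhome_batch_interval="$interval" \
        offline_callhome_no_of_dumps_allowed="$max_dumps"
}
```

For example, enable_offline_callhome /tmp/offline_callhome_dump 610 10 issues the same settings as the example above, after first disabling standard Call Home.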

Verifying the Offline Call Home Configuration

To verify the status of the current setup, issue the following.

$ sudo nvsm show /policy

Example output showing the offline Call Home policy details.

/policy
Properties:
    offline_callhome_batch_interval = 610
    offline_callhome_enable = True
    offline_callhome_dump_destination_location = /tmp/offline_callhome_dump
    offline_callhome_no_of_dumps_allowed = 10
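
Since standard and offline Call Home cannot be enabled together, a script can read the policy output and report which one is active. This is a minimal sketch: callhome_mode is a hypothetical helper, it assumes the "name = value" layout shown above, it treats a missing property as not enabled, and the "conflict" result is a defensive case that NVSM itself may never produce.

```shell
# callhome_mode: read `nvsm show /policy` output on stdin and print which
# Call Home policy is active (hypothetical helper, not part of NVSM).
callhome_mode() {
    awk '
        $1 == "callhome_enable"         && $2 == "=" { online  = ($3 == "True") }
        $1 == "offline_callhome_enable" && $2 == "=" { offline = ($3 == "True") }
        END {
            if (online && offline) print "conflict"
            else if (online)       print "automatic"
            else if (offline)      print "offline"
            else                   print "disabled"
        }
    '
}
```

Usage: sudo nvsm show /policy | callhome_mode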

Verifying Contents of the Dump File

The contents of each batch are stored in a tar file.

  • Tar file naming format:

    offlinecallhome-nvsm-health_<timestamp>_<serial-number>.tar.xz

  • Tar file contents:

    • System dump

    • JSON metadata file that lists the system and critical alerts.

      JSON file naming format:

      offlinecallhome_notifications_<timestamp>_<hostname>_<serial-number>.json

  • JSON file format, showing the type of data included:

    {
        "system_serial": "<serial number>",
        "system_name": "<hostname>",
        "notifications": [
            {
                "alert_id": "",
                "clear_time": "-",
                "component_id": "",
                "description": "",
                "event_time": "",
                "message": "",
                "message_details": ".",
                "recommended_action": "",
                "severity": "",
                "system_name": "",
                "system_serial": "",
                "type": ""
            }
        ]
    }
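
Before handing a dump to NVIDIA Enterprise Services, it can help to get a quick summary of the alerts it contains. The following sketch prints one line per notification from the JSON metadata. It assumes the pretty-printed, one-key-per-line layout shown above and values without embedded double quotes; list_alerts is a hypothetical helper, and a real JSON parser is the safer choice for anything beyond a quick look.

```shell
# list_alerts: read an Offline Call Home JSON metadata file on stdin and
# print "alert_id<TAB>severity<TAB>message" for each notification
# (hypothetical helper; relies on one key per line, as in the format above).
list_alerts() {
    awk -F'"' '
        $2 == "alert_id" { id = $4; seen = 1 }
        $2 == "severity" { sev = $4 }
        $2 == "message"  { msg = $4 }
        # A closing brace after an alert_id ends one notification object.
        /^[ \t]*[}]/ && seen {
            print id "\t" sev "\t" msg
            id = sev = msg = ""
            seen = 0
        }
    '
}
```

After extracting the JSON file from the tar archive (for example with tar -xJf), run: list_alerts < offlinecallhome_notifications_<timestamp>_<hostname>_<serial-number>.json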