RMA Process - vR525 - Last updated March 15, 2023

Introduction

NVIDIA is committed to providing the highest level of quality, reliability, and support for the enterprise datacenter-class NVIDIA® Tesla® graphics processing unit (GPU) products. To that end, NVIDIA is focused on two primary goals with the Tesla RMA submission process:

  • Expeditious replacement of returned Tesla GPU products
  • Comprehensive understanding of the customer-observed issue and failure to allow for:
    • NVIDIA replication and confirmation of the failure
    • Root-cause analysis of the failure aimed at continuous improvement of the product and future Tesla offerings

NVIDIA has provided this guide so that the RMA requestor can supply the information necessary to meet these goals with each RMA request, helping ensure that such requests are quickly approved and processed.

Tools and Diagnostics

NVIDIA provides a few tools to help diagnose issues and failures observed with Tesla GPU products. These tools are:

  • nvidia-bug-report
  • nvidia-healthmon
  • NVIDIA Field Diagnostic

nvidia-bug-report

nvidia-bug-report.sh is a shell script included with the NVIDIA Linux driver that gathers system data that is highly valuable for understanding any reported field issue, including lspci output, system message log files, and nvidia-smi information. It is installed with the NVIDIA driver and placed in /usr/bin/nvidia-bug-report.sh. Running nvidia-bug-report.sh produces an output file, nvidia-bug-report.log.gz, in the current working directory.

Ideally, nvidia-bug-report.sh should be run immediately after an issue is observed. This will collect the most recent information about the failure.

If the script hangs or does not produce a complete report, power cycle the machine, save the partial file that was generated, and run nvidia-bug-report.sh again after the power cycle to produce a complete log. Both logs should be sent to NVIDIA as part of any RMA submission.

To run nvidia-bug-report.sh on Linux systems, first log in as root.

  1. At the command line, type nvidia-bug-report.sh
  2. nvidia-bug-report.sh will collect information about your system and create the file nvidia-bug-report.log.gz in the current directory

Note: This file should be included with any RMA request. Failure to include this log file may result in delays to the processing of the RMA request. For more information, see the section titled, “RMA Checklist and Flowchart.”
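
A minimal collection sequence on a Linux host might look like the following sketch (the script path is the default driver install location noted above; running under sudo is an assumption for systems where a direct root login is not used):

  # Run as root so all system logs can be collected
  sudo /usr/bin/nvidia-bug-report.sh
  # The archive is written to the current working directory
  ls -lh nvidia-bug-report.log.gz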

nvidia-healthmon

nvidia-healthmon detects and troubleshoots common problems affecting Tesla GPUs in a high performance computing environment. nvidia-healthmon contains limited hardware diagnostic capabilities and instead focuses on software and system configuration issues. nvidia-healthmon is designed to discover common problems that affect a GPU’s ability to run a compute job, including:

  • Software configuration issues
  • System configuration issues
  • System assembly issues, like loose cables
  • A limited number of hardware issues

To run nvidia-healthmon from the command line with default behavior on all supported GPUs:

user@hostname$ nvidia-healthmon

nvidia-healthmon will terminate once it completes the diagnostics on all specified devices. An exit code of zero is returned when nvidia-healthmon runs successfully; a non-zero exit code indicates that there was a problem with the nvidia-healthmon run. The output of the application must be read to determine the exact problem. nvidia-healthmon’s output may include a troubleshooting report designed to address common problems and will often suggest a number of possible solutions. These troubleshooting steps should be taken from the top down, as the most likely solution is listed first.
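
Because success and failure are reported through the exit code, the tool is straightforward to wrap in a simple script. The sketch below is an illustration only; the log filename is arbitrary, and device-selection options are described in the nvidia-healthmon User Guide:

  #!/bin/bash
  # Run nvidia-healthmon and capture its output for later review (healthmon.log is an arbitrary name)
  nvidia-healthmon > healthmon.log 2>&1
  if [ $? -eq 0 ]; then
      echo "nvidia-healthmon passed on all supported GPUs"
  else
      echo "nvidia-healthmon reported a problem; review healthmon.log for the troubleshooting report"
  fi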

For more details, command-line arguments, configuration options, and instructions for interpreting the results of the tool, refer to the nvidia-healthmon User Guide.

NVIDIA Field Diagnostic

The NVIDIA Field Diagnostic is a comprehensive Linux-based hardware diagnostic tool that provides confirmation of the numerical processing engines in the GPU, the integrity of data transfers to and from the GPU, and test coverage of the full onboard memory address space that is available to NVIDIA® CUDA® programs. If a software or system configuration issue cannot be identified (for example, by nvidia-healthmon) and resolved, the NVIDIA Field Diagnostic should be run to determine whether the Tesla GPU may be faulty.

  • The NVIDIA Field Diagnostic can be run with the command “./fieldiag”

Note: NVIDIA Tesla GPU products have ECC memory protection enabled by default. The NVIDIA Field Diagnostic runs only on boards that have ECC enabled. If the user has previously disabled ECC on a suspect board, ECC must be re-enabled prior to running the NVIDIA Field Diagnostic on that board. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.
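
If ECC was previously disabled on the suspect board, it can typically be re-enabled with nvidia-smi before running the diagnostic. The following is a sketch only: GPU index 0 is an example, the command requires root privileges, and a reboot is needed for the ECC mode change to take effect.

  # Check the current and pending ECC mode on GPU 0
  nvidia-smi -q -i 0 | grep -A 3 "Ecc Mode"
  # Re-enable ECC on GPU 0, then reboot for the change to take effect
  nvidia-smi -i 0 -e 1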

For more details or product-specific command-line arguments, refer to the NVIDIA Field Diagnostic Quick Start Guide (DU-05711-001) and the NVIDIA Field Diagnostic Software Guide (DU-05363-001) included in the NVIDIA Field Diagnostic software package.

Upon completion of the diagnostic, a fieldiag.log file is generated.

Note: This file should be included with any RMA request. Failure to include this log file may result in delays to the processing of the RMA request. For more information, see the section titled, “RMA Checklist and Flowchart.”
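
A typical run might look like the following sketch (the extraction directory name is hypothetical; refer to the Field Diagnostic guides noted above for product-specific options):

  # From the directory containing the extracted Field Diagnostic package (directory name is hypothetical)
  cd fieldiag
  sudo ./fieldiag
  # On completion, attach the generated log to the RMA request
  ls -lh fieldiag.log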

A passing result from the NVIDIA Field Diagnostic indicates that the NVIDIA Tesla GPU hardware is in good condition and points to a potential software or application-level issue.

Note: In the event that the NVIDIA Field Diagnostic returns a passing result, NVIDIA requests that data be provided illustrating that the failure follows the particular NVIDIA Tesla GPU board, along with details of the observed failures. Having this data will better allow NVIDIA to reproduce the issue and resolve any potential test weakness in the existing diagnostics.

Common System Level Issues

Depending on the type and severity of the observed issue, it may not be possible to run nvidia-bug-report, nvidia-healthmon, or the NVIDIA Field Diagnostic. To better ensure that the failure is attributable to the Tesla GPU rather than a system-level issue, and to avoid any resulting delays in processing the RMA request, NVIDIA recommends the following steps to further isolate the cause of the failure.

  1. In addition to the power provided by the PCIe slot connector, Tesla GPU boards also require additional power from the host system. Ensure that the appropriate PCIe 8-pin and/or 6-pin auxiliary power cables are properly connected to the board. Consult the product specifications for the specific Tesla GPU in use to determine the auxiliary power requirements for that particular product.
  2. Physically remove the Tesla GPU board from the system and reinstall it to ensure that it is fully seated in the PCIe slot.
  3. If available, replace the suspect Tesla GPU with a known good board to confirm that the observed issue or failure does not occur with the replacement.
  4. If possible, install the suspect Tesla GPU in a different system to determine whether the observed issue or failure follows the board (or system).

Note: The RMA submission process will request information demonstrating that common system-level causes have been eliminated. Submitting the RMA with the information described in Step 1 through Step 4, indicating that system-level issues have been eliminated, will help to accelerate the RMA approval process.
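
After reseating the board or moving it to a different system (Step 2 through Step 4), it can be useful to confirm that the board still enumerates on the PCIe bus and is visible to the driver. A minimal sketch:

  # Confirm the GPU enumerates on the PCIe bus
  lspci | grep -i nvidia
  # Confirm the driver can communicate with the board and report its identity
  nvidia-smi -q | grep -i -E "Product Name|Serial Number"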

RMA Checklist and Flowchart

Table 1. RMA Checklist

  [ ] nvidia-bug-report log file (nvidia-bug-report.log.gz)

  [ ] NVIDIA Field Diagnostic log file (fieldiag.log)

  [ ] In the event the NVIDIA Field Diagnostic returns a passing result, or the observed failure is such that the NVIDIA tools and diagnostics cannot be run, the following information should be included with the RMA request:
      • Steps taken to eliminate common system-level causes:
        - Check PCIe auxiliary power connections
        - Verify board seating in the PCIe slot
        - Determine whether the failure follows the board or the system
      • Details of the observed failure:
        - The application running at the time of failure
        - Description of how the product failed
        - Step-by-step instructions to reproduce the issue
        - Frequency of the failure

  [ ] Is there any known or obvious physical damage to the board?

  [ ] Submit the RMA request at http://portal.nvidia.com

Note: NVIDIA Tesla GPU products have ECC memory protection enabled by default. NVIDIA will not accept RMA requests for failures that occur only with ECC disabled. Any failure must occur with ECC enabled to be eligible for RMA return.

RMA Process Flow

[Figure: RMA process flowchart]

Notices

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

