Introduction to the NVIDIA DGX A100 System

The NVIDIA DGX™ A100 System is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA A100 Tensor Core GPUs.

_images/dgxa100-front.png

This document is for users and administrators of the DGX A100 system.

Hardware Overview

This section provides information about the hardware in DGX A100.

DGX A100 Models and Component Descriptions

There are two models of the NVIDIA DGX A100 system: the NVIDIA DGX A100 640GB system and the NVIDIA DGX A100 320GB system.

Model Differentiation

| Component | NVIDIA DGX A100 640GB System | NVIDIA DGX A100 320GB System |
|---|---|---|
| GPU | Qty 8 NVIDIA A100 GPUs; third-generation NVLinks | Qty 8 NVIDIA A100 GPUs; third-generation NVLinks |
| Total GPU Memory | 640 GB | 320 GB |
| NVIDIA NVSwitch | Qty 6; second generation (2x faster than first generation) | Qty 6; second generation (2x faster than first generation) |
| Networking | Up to 10 (Factory ship config); NVIDIA ConnectX-6 or ConnectX-7; InfiniBand/200 Gb/s Ethernet | Up to 9 (Factory ship config); NVIDIA ConnectX-6 or ConnectX-7; InfiniBand/200 Gb/s Ethernet (Optional Add-on: Second dual-port 200 Gb/s Ethernet) |
| CPU | 2 AMD Rome, 128 cores total | 2 AMD Rome, 128 cores total |
| System Memory | 2 TB (Factory ship config) | 1 TB (Factory ship config) (Optional Add-on: 1 TB to get 2 TB max.) |
| Storage | 30 TB (Factory ship config); U.2 NVMe Drives (Optional drive upgrade to 60 TB) | 15 TB (Factory ship config); U.2 NVMe Drives (Optional Add-on: 15 TB to get 30 TB max. Optional drive upgrade to 60 TB) |
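As a quick illustration of how the two models differ, the total GPU memory can be divided across the eight GPUs in each system. The following is a minimal sketch that uses only the figures from the Model Differentiation information; the function and constant names are hypothetical.

```python
# Figures taken from the Model Differentiation information above.
GPUS_PER_SYSTEM = 8
TOTAL_GPU_MEMORY_GB = {
    "DGX A100 640GB": 640,
    "DGX A100 320GB": 320,
}


def memory_per_gpu_gb(model: str) -> float:
    """Split the total GPU memory of a model evenly across its eight GPUs."""
    return TOTAL_GPU_MEMORY_GB[model] / GPUS_PER_SYSTEM


for model in TOTAL_GPU_MEMORY_GB:
    print(f"{model}: {memory_per_gpu_gb(model):.0f} GB per GPU")
```

Under this even split, the 640GB system works out to 80 GB per GPU and the 320GB system to 40 GB per GPU.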

Component Description

| Component | Description |
|---|---|
| GPU | NVIDIA A100 GPU |
| CPU | 2x AMD EPYC 7742 CPUs, 64 cores each |
| NVSwitch | 600 GB/s GPU-to-GPU bandwidth |
| Storage (OS) | 1.92 TB NVMe M.2 SSD (ea) in RAID 1 array |
| Storage (Data Cache) | 3.84 TB NVMe U.2 SED (ea) in RAID 0 array (Optional 7.68 TB NVMe U.2 SEDs) |
| Network (Cluster) card | NVIDIA ConnectX-6 or ConnectX-7 Single Port. InfiniBand (default): Up to 200 Gbps. Ethernet: 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE. Note: NVIDIA ConnectX-7 Single-Port network cards support InfiniBand protocol only. |
| Network (Storage) card | NVIDIA ConnectX-6 or ConnectX-7 Dual Port. Ethernet (default): 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE. InfiniBand: Up to 200 Gbps. |
| System Memory (DIMM) | 1 TB per 16 DIMMs |
| BMC (out-of-band system management) | 1 GbE RJ45 interface. Supports IPMI, SNMP, KVM, a Web user interface, and Redfish APIs. |
| In-band system management | 1 GbE RJ45 interface |
| Power Supply | 3 kW |
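The memory and storage entries above imply some simple arithmetic. The sketch below works through it under stated assumptions: 1 TB of memory is treated as 1024 GB, and the drive counts passed to `raid_usable_tb` are hypothetical inputs, because the component descriptions give only the per-drive size and RAID level. All names are illustrative.

```python
def dimm_size_gb(total_tb: float = 1.0, dimm_count: int = 16) -> float:
    """Size of each DIMM for "1 TB per 16 DIMMs" (assumes 1 TB = 1024 GB)."""
    return total_tb * 1024 / dimm_count


def raid_usable_tb(drive_tb: float, drive_count: int, level: int) -> float:
    """Usable capacity of an array of identical drives.

    RAID 0 stripes data, so capacities add up.
    RAID 1 mirrors data, so the array holds one drive's worth.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    if level == 0:
        return drive_tb * drive_count
    if level == 1:
        return drive_tb
    raise ValueError("only RAID 0 and RAID 1 are modelled here")


print(f"DIMM size: {dimm_size_gb():.0f} GB")
# Hypothetical drive counts, chosen only to exercise the function.
print(f"RAID 0, 8 x 3.84 TB: {raid_usable_tb(3.84, 8, 0):.2f} TB")
print(f"RAID 1, 2 x 1.92 TB: {raid_usable_tb(1.92, 2, 1):.2f} TB")
```

The point of the sketch is the difference between the two arrays: the OS array trades capacity for redundancy, while the data cache array trades redundancy for capacity and speed.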

Mechanical Specifications

This section lists the mechanical specifications of the DGX A100 system.

| Feature | Description |
|---|---|
| Form Factor | 6U Rackmount |
| Height | 10.4” (264 mm) |
| Width | 19” (482.3 mm) max |
| Depth | 35.3” (897.1 mm) max |
| System Weight | 271.5 lbs (123.16 kg) max |

Power Specifications

The DGX A100 system contains six power supplies with balanced distribution of the power load.

| Input | Specification for Each Power Supply |
|---|---|
| 200-240 volts AC, 6.5 kW max. | 3000 W @ 200-240 V, 16 A, 50-60 Hz |

Support for N+N Redundancy

The DGX A100 includes six power supply units (PSU) configured for 3+3 redundancy. If three PSUs fail, the system will continue to operate at full power with the remaining three PSUs.

Note

  • If only two PSUs are working, the GPUs will not be available but the server will still boot. This is to allow you to gather debug or system logs or other data from the cache SSDs.

  • If only one PSU is working, troubleshoot and correct the cause of the power loss from the other PSUs. If faulty PSUs need to be replaced, shut the system down and install working PSUs.
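The redundancy behaviour described above can be summarised as a small lookup. This is an illustrative model only, not firmware logic: the state labels, the function names, and the "no power" case for zero working PSUs are assumptions. The capacity margin compares the 3000 W rating of each supply with the 6.5 kW maximum input from the power specifications, which is a rough comparison rather than an official sizing rule.

```python
PSU_COUNT = 6              # six power supplies, 3+3 redundancy
PSU_RATED_W = 3000         # rating of each power supply
SYSTEM_MAX_INPUT_W = 6500  # 6.5 kW max. from the power specifications


def psu_state(working: int) -> str:
    """Map the number of working PSUs to the documented system behaviour."""
    if not 0 <= working <= PSU_COUNT:
        raise ValueError(f"working must be between 0 and {PSU_COUNT}")
    if working >= 3:
        return "full power"
    if working == 2:
        return "boots without GPUs"
    if working == 1:
        return "troubleshoot power loss"
    return "no power"  # assumption: not stated in the text above


def capacity_margin_w(working: int) -> int:
    """Rated PSU capacity minus the maximum system input, in watts."""
    return working * PSU_RATED_W - SYSTEM_MAX_INPUT_W


for n in range(PSU_COUNT, -1, -1):
    print(f"{n} PSUs: {psu_state(n)} (margin {capacity_margin_w(n)} W)")
```

The margin turns negative at two working PSUs, which is consistent with the GPUs being unavailable in that state.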

DGX A100 Locking Power Cord Specification

The DGX A100 is shipped with a set of six (6) locking power cords that have been qualified for use with the DGX A100 to ensure regulatory compliance.

The following locking power cord types are approved:

  • Switch-locking for the PSU side

  • Twist-locking for the PSU side

Warning

To avoid electric shock or fire, only use the NVIDIA-provided power cords to connect power to the DGX A100. For more details, see Electrical Precautions.

Important

Do not use the provided cables with any other product or for any other purpose.

Power Cord Specification

| Power Cord Feature | Specification |
|---|---|
| Electrical | 250VAC, 16A |
| Plug Standard | C19/C20 |
| Dimension | 1200mm length |
| Compliance | Cord: UL62, IEC60227. Connector/Plug: IEC60320-1 |

Using the Locking Power Cords

This section provides information about how to use the locking power cords.

Locking and Unlocking the PDU Side

Power Distribution Unit side

  • To INSERT, push the cable into the PDU socket.

  • To REMOVE, press the clips together and pull the cord out of the socket.

_images/lock-unlock-pdu-side.png

Locking/Unlocking the PSU Side (Cords with Switch-Lock Mechanism)

Power Supply (System) side - Switch locking

  • To INSERT or REMOVE, make sure the cable is UNLOCKED, then push the cord into the socket or pull it out of the socket.

_images/lock-unlock-psu-side.png

Locking/Unlocking the PSU Side (Cords with Twist-Lock Mechanism)

Power Supply (System) side - Twist locking

  • To INSERT or REMOVE, make sure the cable is UNLOCKED, then push the cord into the socket or pull it out of the socket.

_images/lock-unlock-psu-cord-twist-lock-mech.png

Environmental Specifications

Here are the environmental specifications for your DGX A100 system.

| Feature | Specification |
|---|---|
| Operating Temperature | 5° C to 30° C (41° F to 86° F) |
| Relative Humidity | 20% to 80% non-condensing |
| Airflow | 840 CFM @ 80% fan PWM |
| Heat Output | 22,179 BTU/hr |
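The heat output figure appears to correspond to the 6.5 kW maximum input listed in the power specifications, converted to BTU/hr. The sketch below checks that arithmetic and the Celsius-to-Fahrenheit temperature range; the function names are hypothetical, and the link between the two figures is an inference from the numbers rather than a statement in this guide.

```python
WATTS_TO_BTU_PER_HR = 3.412142  # 1 W is approximately 3.412142 BTU/hr


def watts_to_btu_per_hr(watts: float) -> float:
    """Convert electrical power in watts to heat output in BTU/hr."""
    return watts * WATTS_TO_BTU_PER_HR


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


print(f"6.5 kW = {watts_to_btu_per_hr(6500):,.0f} BTU/hr")
print(f"5 C = {celsius_to_fahrenheit(5):.0f} F")
print(f"30 C = {celsius_to_fahrenheit(30):.0f} F")
```

A conversion like this is useful when sizing data center cooling, which is commonly specified in BTU/hr while power budgets are specified in watts.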

Front Panel Connections and Controls

This section provides information about the front panel, connections, and controls of the DGX A100 system.

With a Bezel

Here is an image of the DGX A100 system with a bezel.

_images/dgxa100-with-bezel.png

| Control | Description |
|---|---|
| Power Button | Press to turn the DGX A100 system On or Off. Green flashing (1 Hz): Standby (BMC booted). Green flashing (4 Hz): POST in progress. Green solid On: Power On. |
| ID Button | Press to cause the button blue LED to turn On or blink (configurable through the BMC) as an identifier during servicing. Also causes an LED on the back of the unit to flash as an identifier during servicing. |
| Fault LED | Amber On: System or component faulted |

With the Bezel Removed

Here is an image of the DGX A100 system with the bezel removed.

_images/dgxa100-bezel-removed.png

Important

Refer to Turning DGX A100 On and Off for instructions on how to properly turn the system on or off.

Rear Panel Modules

Here is an image that shows the rear panel modules on DGX A100.

_images/rear-panel-modules.png

Motherboard Connections and Controls

Here is an image that shows the motherboard connections and controls in a DGX A100 system.

_images/mboard-conn-controls.png

Motherboard Controls

| Control | Description |
|---|---|
| Power Button | Press to turn the system On or Off. |
| ID LED Button | Blinks when ID button is pressed from the front of the unit as an aid in identifying the unit needing servicing. |
| BMC Reset button | Press to manually reset the BMC. |

See Network Connections, Cables, and Adaptors for details on the network connections.

Motherboard Tray Components

Here is an image that shows the motherboard tray components in DGX A100.

_images/mboard-tray-comp.png

GPU Tray Components

Here is an image of the GPU tray components in a DGX A100 system.

_images/gpu-tray-comp.png

Network Connections, Cables, and Adaptors

This section provides information about network connections, cables, and adaptors.

Network Ports

Here is an image that shows the network ports on a DGX A100 system.

_images/network-ports.png

Network Port Mapping

| Slot | PCI Bus | Port Designation: Default (Pre-DGX OS 5) | Port Designation: Default (DGX OS 5 and later) | Port Designation: Optional | RDMA: Slot 5 Not Populated | RDMA: Slot 5 Populated |
|---|---|---|---|---|---|---|
| 0 | 4b:00.0 | ib2 | ibp75s0 | enp75s0 | mlx5_2 | mlx5_2 |
| 1 | 54:00.0 | ib3 | ibp84s0 | enp84s0 | mlx5_3 | mlx5_3 |
| 2 | ba:00.0 | ib6 | ibp186s0 | enp186s0 | mlx5_6 | mlx5_8 |
| 3 | cc:00.0 [1], ca:00.0 [2] | ib7 | ibp204s0a [3], ibp202s0b [4] | enp204s0a [5], enp202s0b [6] | mlx5_7 | mlx5_9 |
| 4 port 0 (top) | e1:00.0 | enp225s0f0 | enp225s0f0 | (see note) | mlx5_8 | mlx5_10 |
| 4 port 1 (bottom) | e1:00.1 | enp225s0f1 | enp225s0f1 | (see note) | mlx5_9 | mlx5_11 |
| 5 port 0 (left) | 61:00.0 | enp97s0f0 | enp97s0f0 | (see note) | | mlx5_4 |
| 5 port 1 (right) | 61:00.1 | enp97s0f1 | enp97s0f1 | (see note) | | mlx5_5 |
| 6 | 0c:00.0 | ib0 | ibp12s0 | enp12s0 | mlx5_0 | mlx5_0 |
| 7 | 12:00.0 | ib1 | ibp18s0 | enp18s0 | mlx5_1 | mlx5_1 |
| 8 | 8d:00.1 | ib4 | ibp141s0 | enp141s0 | mlx5_4 | mlx5_6 |
| 9 | 94:00.0 | ib5 | ibp148s0 | enp148s0 | mlx5_5 | mlx5_7 |
| LAN | e2:00.0 | enp226s0 | enp226s0 | N/A | | |

Note

The enp37s0f3u1u3c2 interface or bmc_redfish0 is recognized by the OS and may be listed in response to such commands as ifconfig or ip addr. This interface is reserved for future support of BMC communication using Redfish APIs and is not available for configuration.

Note

The Optional column lists the port designations after reconfiguring the default InfiniBand ports to Ethernet. For DGX A100 systems configured with NVIDIA ConnectX-7 network cards, only the InfiniBand port designations are supported.

When switching from the default Ethernet to InfiniBand, the InfiniBand port designations will vary depending on changes made to the other ports.
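The DGX OS 5 and later port designations in the port mapping follow a visible pattern: the hexadecimal PCI bus number becomes the decimal number after `p`, and dual-port cards carry a function suffix. The sketch below reproduces that pattern for a few entries. It is an illustration of the naming convention only; the function name and the `multifunction` flag are hypothetical, and the operating system remains the authority on actual interface names.

```python
def predictable_name(pci_addr: str, prefix: str, multifunction: bool = False) -> str:
    """Derive a port designation from a PCI address such as "4b:00.0".

    prefix is "ib" for InfiniBand or "en" for Ethernet.
    multifunction adds the f<function> suffix used by dual-port cards.
    """
    bus_hex, rest = pci_addr.split(":")
    slot_hex, function = rest.split(".")
    # Bus and slot are hexadecimal in the PCI address, decimal in the name.
    name = f"{prefix}p{int(bus_hex, 16)}s{int(slot_hex, 16)}"
    if multifunction:
        name += f"f{int(function)}"
    return name


print(predictable_name("4b:00.0", "ib"))        # slot 0, InfiniBand
print(predictable_name("4b:00.0", "en"))        # slot 0, Ethernet
print(predictable_name("e1:00.1", "en", True))  # slot 4 port 1, dual-port card
```

For example, PCI bus `4b` is 75 in decimal, which gives `ibp75s0` for InfiniBand and `enp75s0` after reconfiguring the port to Ethernet.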

BMC Port LEDs

The BMC RJ-45 port has two LEDs.

The LED on the left indicates the speed. Solid green indicates the speed is 100M. Solid amber indicates the speed is 1G.

The LED on the right is green and flashes to indicate activity.

Supported Network Cables and Adaptors

The DGX A100 system is not shipped with network cables or adaptors. You will need to purchase supported cables or adaptors for your network.

The ConnectX-6 or ConnectX-7 firmware determines which cables and adaptors are supported. For a list of cables and adaptors compatible with the NVIDIA ConnectX cards installed in the DGX A100 system, complete the following steps:

  1. Visit the Mellanox Firmware Release page.

  2. From the left navigation menu, select the ConnectX model and corresponding firmware included in the DGX A100.

  3. Select Firmware Compatible Products.

DGX A100 System Topology

Here is an image of the DGX A100 system topology.

_images/dgxa100-system-topology.png

DGX OS Software

The DGX A100 system comes pre-installed with a DGX software stack incorporating the following components:

  • An Ubuntu server distribution with supporting packages.

  • The following system management and monitoring software:

    • NVIDIA System Management (NVSM)

      Provides active health monitoring and system alerts for NVIDIA DGX nodes in a data center. It also provides simple commands for checking the health of the DGX A100 system from the command line.

    • Data Center GPU Management (DCGM)

      This software enables node-wide administration of GPUs and can be used for cluster and data-center level management.

  • DGX A100 system support packages.

  • The NVIDIA GPU driver

  • Docker Engine

  • NVIDIA Container Toolkit

  • Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED)

  • Mellanox Software Tools (MST)

  • cachefilesd (daemon for managing cache data storage)
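One way to confirm that parts of this stack are present is to look for their command-line tools on the `PATH`. The sketch below does only that. The mapping of components to command names is an assumption based on the commands these packages commonly install, and a missing command does not by itself mean the component is absent or misconfigured.

```python
import shutil

# Assumed command names for components in the list above.
TOOLS = {
    "NVIDIA System Management (NVSM)": "nvsm",
    "Data Center GPU Management (DCGM)": "dcgmi",
    "NVIDIA GPU driver": "nvidia-smi",
    "Docker Engine": "docker",
    "NVIDIA Container Toolkit": "nvidia-ctk",
    "MOFED": "ofed_info",
    "Mellanox Software Tools (MST)": "mst",
    "cachefilesd": "cachefilesd",
}


def find_tools(tools=None):
    """Return a mapping of component name to executable path, or None if not found."""
    tools = TOOLS if tools is None else tools
    return {name: shutil.which(command) for name, command in tools.items()}


for component, path in find_tools().items():
    print(f"{component}: {path or 'not found on PATH'}")
```

Some of these tools are installed in administrator-only directories, so running the check as a regular user may report them as not found.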

Additional Documentation

This section provides links to additional documentation.

  • MIG User Guide

    The new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications.

  • NGC Container Registry for DGX

    Describes how to access the NGC container registry and use containerized, GPU-accelerated deep learning applications on your DGX A100 system.

  • NVSM Software User Guide

    Contains instructions for using the NVIDIA System Management software.

  • DCGM Software User Guide

    Contains instructions for using the Data Center GPU Manager software.

Customer Support

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX A100 system. Also contact NVIDIA Enterprise Support for assistance in moving the DGX A100 system.

Our support team can help collect appropriate information about your issue and involve internal resources as needed.

[1] Based on systems updated with DGX A100 Firmware Update Container 20.10.9 or later

[2] Based on systems updated with DGX A100 Firmware Update Container 20.05.12.3 or earlier

[3] Based on systems updated with DGX A100 Firmware Update Container 20.10.9 or later

[4] Based on systems updated with DGX A100 Firmware Update Container 20.05.12.3 or earlier

[5] Based on systems updated with DGX A100 Firmware Update Container 20.10.9 or later

[6] Based on systems updated with DGX A100 Firmware Update Container 20.05.12.3 or earlier