NVIDIA DGX Platform

NVIDIA DGX platform incorporates the best of NVIDIA software, infrastructure, and expertise for a modern, unified AI platform spanning clouds and on-premises.

NVIDIA Base Command powers the DGX Cloud and every NVIDIA DGX system, delivering the best of NVIDIA software innovation, making it faster, easier, and more cost-effective to realize the ROI on AI infrastructure.

NVIDIA Base Command Platform is a world-class infrastructure solution for businesses and their data scientists who need a premium AI development experience.
NVIDIA Base Command Manager streamlines cluster provisioning, workload management, and infrastructure monitoring. It provides all the tools you need to deploy and manage an AI data center. NVIDIA Base Command Manager Essentials comprises the features of NVIDIA Base Command Manager that are certified for use with NVIDIA AI Enterprise.
Base OS provides a stable and fully qualified software stack for running AI, machine learning, and analytics applications. It includes platform-specific configurations, drivers, and diagnostic and monitoring tools. The software stack is available for Ubuntu, Red Hat Enterprise Linux, and Rocky Linux, and is integrated in DGX OS, a customized Ubuntu installation.

Leadership-class AI infrastructure for on-premises and hybrid deployments from the AI data center to the AI supercomputer.

Deployment and management guides for NVIDIA DGX SuperPOD, an AI data center infrastructure platform that enables IT to deliver performance—without compromise—for every user and workload. DGX SuperPOD offers leadership-class accelerated infrastructure and agile, scalable performance for the most challenging AI and high-performance computing (HPC) workloads, with industry-proven results.
Deployment and management guides for DGX BasePOD, which provides a prescriptive AI infrastructure for enterprises, eliminating the design challenges, lengthy deployment cycle, and management complexity traditionally associated with scaling AI infrastructure.
System documentation for the DGX AI supercomputers that deliver world-class performance for large generative AI and mainstream AI workloads.
NVIDIA MAGNUM IO™ software development kit (SDK) enables developers to remove input/output (IO) bottlenecks in AI, high-performance computing (HPC), data science, and visualization applications, reducing the end-to-end time of their workflows. Magnum IO covers all aspects of data movement between CPUs, GPUs, DPUs, and storage subsystems in virtualized, containerized, and bare-metal environments.
NVIDIA NGC is the hub for GPU-optimized software for deep learning, machine learning, and HPC. It provides containers, models, model scripts, and industry solutions so that data scientists, developers, and researchers can focus on building solutions and gathering insights faster.
NVIDIA Optimized Frameworks such as Kaldi, NVIDIA Optimized Deep Learning Framework (powered by Apache MXNet), NVCaffe, PyTorch, and TensorFlow (which includes DLProf and TF-TRT) offer flexibility in designing and training custom deep neural networks (DNNs) for machine learning and AI applications.
Guide
NVIDIA Enterprise Support and Services Guide provides information for using NVIDIA Enterprise Support and services. This document is intended for NVIDIA’s potential and existing enterprise customers. This User Guide is a non-binding document and should be used to obtain information about NVIDIA Enterprise branded support and services.

Training to enable your team to make the most of DGX.

This course provides an overview of DGX POD components and related processes, including the NVIDIA DGX A100 System; InfiniBand and Ethernet networks; tools for in-band and out-of-band management; NGC; the basics of running workloads; and specific management tools and CLI commands. This course includes instructions for managing vendor-specific storage per the architecture of your specific POD solution.
This course is designed to help IT professionals successfully administer all aspects of a DGX SuperPOD cluster, including compute, storage, and networking.
This course provides an overview of the DGX A100 System and the DGX Station A100, covering tools for in-band and out-of-band management, the basics of running workloads, and specific management tools and CLI commands.
This course is based on NVIDIA Bright Cluster Manager and gives an overview of the usage and components of the software. It also walks step by step through installing NVIDIA Bright Cluster Manager on a head node and bringing up a functioning cluster managed by NVIDIA Bright Cluster Manager.
This course is based on NVIDIA Bright Cluster Manager and gives an overview of the cluster management tools, Bright View and CMSH. It also covers environment modules, the processes for provisioning nodes and creating software images, interacting with switches, NVIDIA GPU integration, and the Health Management Framework.
This course is based on NVIDIA Bright Cluster Manager and gives an overview of extending the cluster to the cloud with cluster as a service and cluster extension (i.e., hybrid cloud). The processes for deploying cluster as a service and cluster extensions for AWS and Azure are covered in detail. This course also shows how to set up and use the Bright auto scaler. The final topic is the setup and use of automated cloud job data management (CMJOB) for improving the flexibility and productivity of the cluster.
This is an entry-level certification that validates foundational concepts of adopting artificial intelligence computing by NVIDIA in a data center environment.
This is an intermediate-level certification that validates core concepts for designing, deploying, and managing NVIDIA InfiniBand fabrics.