Reference Build


CDP & AI Ready Systems

The following table outlines two recommended server configurations for deploying CDP Private Cloud Base for accelerated Apache Spark workloads. A cluster of "CDP Ready" servers is well suited to dedicated Apache Spark deployments, while a cluster of "AI Ready" servers is ideal for multi-purpose deployments that include both Spark and other accelerated workloads (e.g., machine learning). NVIDIA and Cloudera recommend that CDP Private Cloud Base users initially add four or eight of these servers to begin experiencing the benefits of GPU acceleration. Some server vendors have these bundles pre-configured and ready to buy; contact Cloudera or NVIDIA sales for details.

CDP & AI Ready Systems Configuration

Parameter | CDP-Ready Configuration | AI-Ready Configuration
GPU | A30 | A100 40GB
GPU Configuration | 2x GPUs per server | 1x GPU per server (option: 2x GPUs)
CPU | AMD EPYC (Rome or Milan) | Intel Xeon (Skylake, Cascade Lake, or Ice Lake)
CPU Sockets | 2 per server | 2 per server
CPU Speed | 2.1 GHz minimum base clock | 2.1 GHz minimum base clock
CPU Cores | 20 physical cores minimum per socket | 20 physical cores minimum per socket
System Memory | 512 GB | 512 GB
Network Adapter (NIC) | 2x Mellanox ConnectX-6, ConnectX-6 Dx, or BlueField-2 | 2x Mellanox ConnectX-6, ConnectX-6 Dx, or BlueField-2
Storage | 40 TB of SSD or NVMe devices for HDFS | 40 TB of SSD or NVMe devices for HDFS

Note

These are recommended configurations. Additional hardware configurations are available.
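
To illustrate how the GPUs in these server configurations are consumed, the sketch below shows a PySpark session requesting GPU executor resources through the RAPIDS Accelerator for Apache Spark. The property names follow Spark 3.x resource scheduling and the RAPIDS plugin, but the values and paths are illustrative assumptions rather than settings taken from this reference build.

```python
# Minimal sketch (not a tuned configuration): a PySpark session that schedules
# GPU resources for accelerated SQL/DataFrame work on GPU-equipped workers.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-accelerated-etl")
    # Load the RAPIDS Accelerator plugin so supported operators run on the GPU.
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    # One GPU per executor matches the 1-2 GPUs per server shown above.
    .config("spark.executor.resource.gpu.amount", "1")
    # Several tasks can share a GPU; a quarter of a GPU per task is a common start.
    .config("spark.task.resource.gpu.amount", "0.25")
    .getOrCreate()
)

# Ordinary DataFrame code is then accelerated where the plugin supports it.
df = spark.read.parquet("hdfs:///data/transactions")  # hypothetical path
df.groupBy("customer_id").count().write.parquet("hdfs:///data/transaction_counts")
```

On YARN, GPU scheduling must also be enabled in YARN itself and a GPU discovery script configured for Spark; those cluster-side settings are omitted from this sketch.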


Right-size Server Configurations

Cloudera recommends deploying three or four machine types into production:

  • Master Node Runs the Hadoop master daemons: NameNode, Standby NameNode, YARN ResourceManager and History Server, the HBase Master daemon, the Sentry server, and the Impala StateStore Server and Catalog Server. Master nodes are also where ZooKeeper and the JournalNodes are installed. These daemons can often share a single pool of servers; depending on the cluster size, each role can instead run on a dedicated server. Kudu Master Servers should also be deployed on master nodes.

  • Worker Node Runs the HDFS DataNode, YARN NodeManager, HBase RegionServer, Impala impalad, Search worker daemons and Kudu Tablet Servers. GPUs are installed in worker nodes.

  • Utility Node Runs Cloudera Manager and the Cloudera Management Services. It can also host a MySQL (or other supported) database instance, which is used by Cloudera Manager, Hive, Sentry, and other Hadoop-related projects.

  • Edge Node Contains all client-facing configurations and services, including gateway configurations for HDFS, YARN, Impala, Hive, and HBase. The edge node is also a good place for Hue, Oozie, HiveServer2, and the Impala HAProxy. HiveServer2 and Impala HAProxy serve as gateways to external applications such as business intelligence (BI) tools.

Deployment Topology

The graphic below depicts a cluster deployed across several racks (Rack 1, Rack 2, … Rack n).

[Figure: Deployment topology]

Each host is networked to two top-of-rack (TOR) switches, which are in turn connected to a collection of spine switches that connect to the enterprise network. This deployment model gives each host maximum throughput and minimal latency while remaining easy to scale.

Network Topology Considerations

The preferred network topology is spine-leaf, with as close to 1:1 oversubscription between leaf and spine switches as possible, ideally with no oversubscription. This ensures full line rate between any combination of storage and compute nodes. Because this architecture calls for disaggregation of compute and storage, the network design must be more aggressive to ensure the best performance. GPU worker nodes can be added to the network topology by adding new servers in racks with 100 Gb+ TOR switching infrastructure.

There are two aspects to the minimum network throughput required, which will also determine the compute-to-storage node ratio; a small worked example follows the list below.

  • The network throughput and IO throughput capabilities of the storage backend.

  • The network throughput and network oversubscription between compute and storage tiers (NS).
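
As a back-of-the-envelope illustration of the 1:1 target above, the sketch below computes the leaf-to-spine oversubscription ratio for a hypothetical rack. All port counts and speeds are assumptions for illustration, not part of this reference build.

```python
# Illustrative leaf-to-spine oversubscription check for one rack.
servers_per_rack = 16
nics_per_server = 2          # dual ConnectX-6 / BlueField-2, as in the table above
nic_speed_gbps = 100

uplinks_per_leaf = 8         # hypothetical spine-facing ports per TOR switch
uplink_speed_gbps = 400

downlink_bw = servers_per_rack * nics_per_server * nic_speed_gbps  # 3200 Gb/s
uplink_bw = uplinks_per_leaf * uplink_speed_gbps                   # 3200 Gb/s

oversubscription = downlink_bw / uplink_bw
print(f"Oversubscription ratio: {oversubscription:.1f}:1")  # 1.0:1 -> full line rate
```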

HDFS [1]

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

[Figure: HDFS architecture]

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.
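
As a concrete (and hedged) illustration of this division of labor, the sketch below uses pyarrow's HDFS bindings: metadata operations such as directory creation and listing go to the NameNode, while the file bytes stream to and from DataNodes. The host name, port, and paths are placeholders, and the client machine is assumed to have a Hadoop installation with the usual CLASSPATH / libhdfs environment configured.

```python
# Minimal HDFS client sketch using pyarrow's libhdfs bindings.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)  # placeholder NameNode

# Namespace operation handled by the NameNode.
hdfs.create_dir("/datasets/demo")

# Writing data: block placement is decided by the NameNode,
# but the data itself is sent directly to DataNodes.
with hdfs.open_output_stream("/datasets/demo/sample.txt") as out:
    out.write(b"hello hdfs\n")

# Reading back streams the blocks from the DataNodes that hold them.
with hdfs.open_input_stream("/datasets/demo/sample.txt") as src:
    print(src.read())

# Listing is again a pure NameNode (metadata) operation.
print(hdfs.get_file_info(fs.FileSelector("/datasets/demo")))
```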

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

[Figure: HDFS DataNodes and block replication]
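
To make the block and replication arithmetic concrete, the following sketch uses the common HDFS defaults of 128 MiB blocks and 3 replicas (both configurable per file, as noted above); the file size is an arbitrary example.

```python
# Worked example of the storage overhead implied by block splitting and replication.
import math

file_size_gib = 10
block_size_mib = 128        # common HDFS default block size
replication_factor = 3      # common HDFS default replication factor

num_blocks = math.ceil(file_size_gib * 1024 / block_size_mib)
raw_storage_gib = file_size_gib * replication_factor

print(f"{num_blocks} blocks, {raw_storage_gib} GiB of raw capacity consumed")
# -> 80 blocks, 30 GiB: usable HDFS capacity is therefore roughly the raw
#    device capacity divided by the replication factor.
```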

Disaggregated Storage [2]

Disaggregated storage is a form of scale-out storage built with a number of storage devices that function as a logical pool of storage, which can be allocated to any server on the network over a very high performance network fabric. Disaggregated storage addresses the limitations of storage area networks and direct-attached storage. It is dynamically reconfigurable, so physical resources can be reallocated to maximize performance and limit latency. Disaggregated storage provides the performance of local storage with the flexibility of storage area networks.

Storage area networks (SAN)

A storage area network (SAN) or storage network is a computer network which provides access to consolidated, block-level data storage. SANs are primarily used to access data storage devices, such as disk arrays and tape libraries, from servers so that the devices appear to the operating system as direct-attached storage. A SAN is typically a dedicated network of storage devices not accessible through the local area network (LAN).

Network-attached storage (NAS)

A NAS unit is a computer connected to a network that provides only file-based data storage services to other devices on the network. Although it may technically be possible to run other software on a NAS unit, it is usually not designed to be a general-purpose server. For example, NAS units usually do not have a keyboard or display, and are controlled and configured over the network, often using a browser.

A full-featured operating system is not needed on a NAS device, so often a stripped-down operating system is used. For example, FreeNAS or NAS4Free, both open source NAS solutions designed for commodity PC hardware, are implemented as a stripped-down version of FreeBSD.

NAS systems contain one or more hard disk drives, often arranged into logical, redundant storage containers or RAID.

NAS uses file-based protocols such as NFS (popular on UNIX systems), SMB (Server Message Block) (used with MS Windows systems), AFP (used with Apple Macintosh computers), or NCP (used with OES and Novell NetWare). NAS units rarely limit clients to a single protocol.

Object storage (Private Cloud)

Object storage (also known as object-based storage) is a computer data storage architecture that manages data as objects, as opposed to other storage architectures such as file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier. Object storage can be implemented at multiple levels, including the device level (object-storage device), the system level, and the interface level. In each case, object storage seeks to enable capabilities not addressed by other storage architectures, such as interfaces that are directly programmable by the application, a namespace that can span multiple instances of physical hardware, and data-management functions like data replication and data distribution at object-level granularity.
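
A brief sketch of the access pattern this implies, using the S3 API (a common object-storage interface) via boto3 against a hypothetical S3-compatible private-cloud endpoint; the endpoint, bucket, credentials, keys, and metadata below are all placeholders.

```python
import boto3

# Hypothetical S3-compatible private object store; all values are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.internal.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Each object bundles its data, user-defined metadata, and a unique key
# within a flat namespace (no directory hierarchy is imposed by the store).
s3.put_object(
    Bucket="analytics",
    Key="raw/2021/09/events.json",
    Body=b'{"event": "login", "user": 42}',
    Metadata={"source": "clickstream", "schema-version": "2"},
)

obj = s3.get_object(Bucket="analytics", Key="raw/2021/09/events.json")
print(obj["Metadata"])      # user-defined metadata travels with the object
print(obj["Body"].read())   # the object data itself
```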

Predicting Customer Churn

[Figure: Predicting customer churn]

TCO - Performance / Cost

Benchmark | Platform | Speedup | Cost Relative
Data Preparation / Analytics (CPU vs GPU) | Spark Batch Job / Cloudera Data Platform on NVIDIA-Certified EGX servers | 4.3x | 3.2x

Note

Performance results are subject to change and are improving regularly as joint development of this solution continues. Speedup is measured by comparing the same workload on servers with and without GPUs, and the Cost Relative speedup accounts for the incremental cost of including the GPUs in the server. The results above were obtained with 8x NVIDIA-Certified EGX Dell R7525 servers (1x A100 GPU and 1x ConnectX-6 NIC per server) on a 100 GbE network with RoCE.
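
One way to read the relationship between the two figures above, with purely illustrative cost numbers (the measured costs behind the benchmark are not published here):

```python
# Illustrative only: the cost premium below is an assumption chosen to show the
# arithmetic, not a measured price.
raw_speedup = 4.3            # same workload, servers with vs. without GPUs
relative_server_cost = 1.34  # hypothetical ~34% premium for adding GPU + NIC

cost_adjusted_speedup = raw_speedup / relative_server_cost
print(f"Cost-adjusted speedup: {cost_adjusted_speedup:.1f}x")  # ~3.2x
```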


[1] The HDFS content was sourced from http://hadoop.apache.org.

[2] The disaggregated storage content was sourced from http://en.wikipedia.org.

© Copyright 2019-2021, NVIDIA. Last updated on Sep 21, 2021.