Cloudera Data Platform (CDP)

NVIDIA + Cloudera Platform Components

Why NVIDIA + Cloudera

Today, data processing and data engineering have become the world’s largest computing segment. Modest improvements in the accuracy of analytics models translate into billions to the bottom line. To build the best models, data scientists toil to train, evaluate, iterate, and retrain for highly accurate results and performant models. With NVIDIA RAPIDS, processes that took days now take minutes, making it easier and faster to build and deploy value-generating models. Enterprises can use GPU-accelerated Apache Spark 3.0 on CDP to remove bottlenecks and quickly improve performance, significantly improving time to insight and return on investment.

Joint Solution Overview

Running data science workloads on an accelerated Cloudera Data Platform greatly improves time to value by enabling data scientists to collaborate in a single unified platform. With the latest release, accelerated Apache Spark 3.0 workloads now run seamlessly on CDP. With GPU acceleration, data science teams can use purpose-built tooling to run agile experimentation, data analytics, and machine learning 10x faster and at lower cost.

Cost-effective NVIDIA infrastructure empowers IT teams to deliver an accelerated CDP solution for intuitive, self-service ML, now and into the future. NVIDIA-Certified servers are available from leading OEM server vendors.

For companies looking to jumpstart their AI journey, Accelerated CDP Starter Solutions are available to confidently deploy scalable hardware and software solutions that securely and optimally run accelerated workloads.

Joint Solution Benefits

NVIDIA and Cloudera have tested and benchmarked workloads across a wide range of infrastructure configurations and distilled the results into two simple recommendations:

  • For companies buying servers dedicated to running Apache Spark for data analytics and ETL in CDP, choose a CDP-Ready configuration consisting of four NVIDIA-Certified servers with two NVIDIA A30 GPUs per server. This configuration offers over five times the performance at less than 50% incremental cost relative to modern CPU-only alternatives.

  • For companies buying servers to run not just Apache Spark but also machine learning in CDP, or whose servers may be used for other AI-related applications during their lifetime, upgrade to an AI-Ready configuration consisting of four NVIDIA-Certified servers with one NVIDIA A100 GPU per server. This configuration offers over eight times the performance at less than 50% incremental cost relative to modern CPU-only alternatives. These numbers reflect the Apache Spark benchmarks only; acceleration on ML and AI training is even more significant.
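The two recommendations above each quote a speedup and an incremental cost, and the implied performance per unit of spend can be worked out directly. The sketch below is illustrative arithmetic only: the function name `perf_per_cost` is hypothetical, and the inputs are the boundary figures from the text (5x and 8x speedup, 50% incremental cost). Because the text says "over" and "less than", the results should be read as approximate lower bounds, not benchmark data.

```python
def perf_per_cost(speedup: float, incremental_cost: float) -> float:
    """Performance per unit cost relative to a CPU-only baseline of 1.0.

    speedup:          performance multiple versus CPU-only (5.0 means 5x)
    incremental_cost: extra cost as a fraction of the CPU-only cost (0.5 means +50%)
    """
    if speedup <= 0 or incremental_cost < 0:
        raise ValueError("speedup must be positive and incremental_cost non-negative")
    return speedup / (1.0 + incremental_cost)


# Boundary figures quoted in the text; actual values are stated as better than these.
cdp_ready = perf_per_cost(speedup=5.0, incremental_cost=0.5)  # four servers, two A30 each
ai_ready = perf_per_cost(speedup=8.0, incremental_cost=0.5)   # four servers, one A100 each

print(f"CDP-Ready: at least {cdp_ready:.2f}x performance per unit cost")
print(f"AI-Ready:  at least {ai_ready:.2f}x performance per unit cost")
```

Under these assumptions, the CDP-Ready configuration delivers at least about 3.3 times the Spark performance per unit of spend, and the AI-Ready configuration at least about 5.3 times.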

NVIDIA-Certified Systems™ bring together NVIDIA GPUs and NVIDIA networking in servers from leading vendors. These systems conform to NVIDIA’s design best practices and have passed a set of certification tests that validate the best system configurations for performance, manageability, scalability, and security. With NVIDIA-Certified Systems, enterprises can confidently choose performance-optimized servers to power their Cloudera Data Platform workloads, both in smaller configurations and at scale.

These systems include:

  • NVIDIA Ampere architecture-based GPUs such as the NVIDIA A100 and A30 Tensor Core GPUs. The Tensor Core technology included in the Ampere architecture has brought dramatic speedups to AI operations, bringing down training times from weeks to hours and providing massive acceleration to inference.

  • NVIDIA® Mellanox® ConnectX® SmartNICs and the NVIDIA BlueField® data processing unit (DPU) provide a host of software-defined hardware engines for accelerating networking and security. These enable the best of both worlds: best-in-class AI training and inference performance, with all the necessary levels of enterprise data privacy, integrity, and reliability.

Cloudera Data Platform (CDP) is a data cloud built for the enterprise. With CDP, businesses manage and secure the end-to-end data lifecycle (collecting, enriching, analyzing, experimenting, and predicting with their data) to drive actionable insights and data-driven decision making. The most valuable and transformative business use cases require multi-stage analytic pipelines to process enterprise data sets. CDP empowers businesses to unlock value from large-scale, complex, distributed, and rapidly changing data and compete in the age of digital transformation.

An Integrated Data Platform

CDP provides an integrated data platform that creates agility along lines-of-business while facilitating efficiency and security within IT, enabling the entire organization to be more productive. As organizations react quickly to changing business requirements, CDP delivers mission-critical advantages:

  • Designed for data engineers, data scientists, BI analysts, developers and enterprise IT

  • Simple-to-use, cloud-native service that is automatically secure by design

  • Best-in-class analytics and integrated data lifecycle

  • Self-serve and custom analytics

  • Public cloud and on-premises

CDP enables enterprise IT to embrace these seemingly opposing forces because it delivers the capabilities of an enterprise data cloud.

[Figure: CDP marketecture diagram]

Additional information can be found in the Cloudera Data Platform Datasheet.

CDP Private Cloud is built for hybrid cloud, seamlessly connecting on-premises environments to public clouds with consistent, built-in security and governance.

The platform extends cloud-native speed, scale, and economics for the connected data lifecycle, enabling IT to:

  • Easily deliver analytics and machine learning services up to 10 times faster than traditional data management solutions and cloud services, so teams can react faster to changing business requirements and eliminate shadow IT

  • Meet the exponential demand for analytics and machine learning services with a petabyte-scale hybrid data architecture that can flex to use private and public clouds, delivering faster time to value and supporting critical workloads at scale

  • Optimize and share compute infrastructure across the data lifecycle, increasing efficiency and lowering cost by reducing compute infrastructure requirements for analytics and eliminating data duplication

  • Consistently and easily enforce security and governance policies across hybrid and multi-cloud deployments to ensure regulatory compliance

  • Invest in a platform powered by open source, ensuring continual and rapid innovation to address evolving business requirements

[Figure: CDP Private Cloud diagram]

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

The RAPIDS Accelerator for Apache Spark leverages GPUs to accelerate processing via the RAPIDS libraries.

As data scientists shift from using traditional analytics to leveraging AI applications that better model complex market demands, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.

Cloudera and NVIDIA have collaborated to integrate and optimize the RAPIDS Accelerator for Apache Spark in CDP Private Cloud Base. With Private Cloud Base 7.1.6 or later, Cloudera customers running Apache Spark 3.0 applications can benefit from transparent acceleration of their Spark jobs by simply running them on NVIDIA GPUs in NVIDIA-Certified servers. No application code change is required.
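Because acceleration is enabled through configuration and not through application changes, the settings involved can be sketched as a set of Spark properties. The example below is a minimal sketch, not a CDP-specific procedure: the property keys are the standard Spark 3.x GPU scheduling settings and the RAPIDS Accelerator plugin settings, while the helper names (`rapids_conf`, `to_submit_args`), the default values, and the job file `my_job.py` are assumptions made for illustration. Jar locations, GPU discovery scripts, and tuning options depend on the deployment and are omitted.

```python
from typing import Dict, List


def rapids_conf(gpus_per_executor: int = 1, tasks_per_gpu: int = 4) -> Dict[str, str]:
    """Build Spark properties that turn on the RAPIDS Accelerator.

    The values are illustrative assumptions, not tuned recommendations.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Load the RAPIDS Accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL/DataFrame operations to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Spark 3.x GPU scheduling: GPUs requested per executor.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU per task, so several tasks can share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# The application file is hypothetical and needs no GPU-specific code.
command = ["spark-submit", *to_submit_args(rapids_conf()), "my_job.py"]
print(" ".join(command))
```

The same properties could instead be set cluster-wide by an administrator, which keeps individual job submissions unchanged.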

To get started building your NVIDIA-enabled Cloudera Data Platform Private Cloud solution, contact Cloudera Sales.

© Copyright 2019-2021, NVIDIA. Last updated on Sep 21, 2021.