The Customer Churn lab is derived from the data-science-blueprints repo, where customer churn is modeled from federating data and performing exploratory query-based analytics to feature engineering, model training, and model operationalization. The data engineering portions of the blueprint are accelerated with the RAPIDS Accelerator for Apache Spark and the machine learning portions are accelerated with the RAPIDS Python libraries.This lab shows a realistic ETL workflow based on synthetic normalized data. It consists of two pieces:
An augmentation script, which synthesizes normalized (long-form) data from a wide-form input file, optionally augmenting it by duplicating records.
An ETL script, which performs joins and aggregations in order to generate wide-form data from the synthetic long-form data.
Connect to System Console using the left-hand menu link.
Find current IP address of Find current IP address of Accelerate Spark 3 pod.
kubectl describe pod sparkrunner-0 | grep IP
Connect to the sparkrunner pod.
kubectl exec --stdin --tty sparkrunner-0 -- /bin/bash
/home/spark/lp-runjupyter-etl-cpu.shscript and replace the following line with the correct IP.
Copy and paste is available on the Desktop VNC connection. You will see a sidebar on the left of the screen and once that is opened you can paste into the clipboard. Once you have pasted something it is immediately available to paste within the VNC desktop