In the previous step you should have updated the lp-runjupyter-etl-gpu.sh and lp-runjupyter-etl-cpu.sh
files with the correct IP address for your LaunchPad instance. If you have not done this, refer to Setting up the environment.
First, cd to /home/spark and execute either
/home/spark/lp-runjupyter-etl-gpu.sh
or
/home/spark/lp-runjupyter-etl-cpu.sh
in the System Console.
In the left menu, open the Desktop link and click the VNC connect button.
Open the web browser in the Linux desktop.
Browse to 172.16.0.10:30002.
You should see the list above.
Create the dataset for use with the ETL job.
Open another System Console and run the following commands.
mkdir -p /data/churn/input
mkdir -p /data/churn/output
chmod 777 /data/churn/*
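The directory setup above can be wrapped in a small helper so it is easy to repeat. This is a minimal sketch and not part of the lab scripts: the function name prepare_churn_dirs and its optional base-directory argument are hypothetical, and the default path /data/churn is taken from the commands above.

```shell
# Hypothetical helper: create the churn input and output directories and
# open their permissions, following the mkdir/chmod commands in this step.
prepare_churn_dirs() {
  churn_base="${1:-/data/churn}"   # base directory; defaults to the lab path
  mkdir -p "$churn_base/input" "$churn_base/output" || return 1
  chmod 777 "$churn_base/input" "$churn_base/output"
}
```

Calling prepare_churn_dirs with no argument targets /data/churn. Passing another path, such as a scratch directory, lets you try the helper without touching the lab data.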
Open a bash session into the running container.
kubectl exec --stdin --tty sparkrunner-0 -- /bin/bash
Copy the seed file.
cp /home/spark/WA_Fn-UseC_-Telco-Customer-Churn-.csv /data/churn/input
exit
Click the
lp-churn-augment.ipynb
link to start the Jupyter notebook.
Validate the creation of the Churn Benchmark pods. Open another System Console and run the following command.
kubectl get pods | grep app-name
The output should look similar to this.
app-name-79d837808b2d2ba5-exec-1   1/1   Running   0   31m
app-name-79d837808b2d2ba5-exec-2   1/1   Running   0   31m
app-name-79d837808b2d2ba5-exec-3   1/1   Running   0   31m
If your pods are in a PENDING status, the pods from a previous run did not close properly. You can remove those pods with the following command, substituting the actual pod name:
kubectl delete pod app-name-XXXX
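To find which pods are stuck without reading the list by eye, the pod listing can be filtered on its status column. The sketch below assumes the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE), as in the sample output above. The function name pending_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# names of pods whose name starts with the given prefix and whose STATUS
# column (third field) is Pending, compared case-insensitively.
pending_pods() {
  awk -v prefix="$1" \
    'index($1, prefix) == 1 && tolower($3) == "pending" { print $1 }'
}
```

For example, kubectl get pods | pending_pods app-name prints one pod name per line. Review the list before passing any of the names to kubectl delete pod.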
Run the notebook by clicking Cell -> Run All.
Confirm the creation of the Churn dataset.
Review the output of the notebook.
In the System Console, run the following command. You should see approximately 21G of data.
du -hs /data/churn/output
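The size check can also be scripted so that it succeeds or fails instead of requiring you to read the number. This is a sketch under stated assumptions: the function name check_output_size is hypothetical, and the minimum size you pass is your own choice, not a value defined by the lab.

```shell
# Hypothetical check: succeed only if the directory holds at least
# min_gib GiB of data. du -sk reports usage in 1024-byte blocks.
check_output_size() {
  out_dir="$1"
  min_gib="$2"
  used_kib=$(du -sk "$out_dir" | awk '{ print $1 }')
  [ -n "$used_kib" ] || return 1          # du produced no size
  [ "$used_kib" -ge $((min_gib * 1024 * 1024)) ]
}
```

For example, check_output_size /data/churn/output 20 && echo "dataset size looks right" uses a threshold of 20, an assumed value set slightly below the approximately 21G mentioned above.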
Click the
lp-churn-etl.ipynb
link to start the Jupyter notebook.
Run the notebook by clicking Cell -> Run All.
Note the timing for the benchmark so you can compare to your CPU run time.
Stop the notebook server you started in step 1 by pressing Ctrl-C in the System Console window where you started it. Answer Y when asked if you want to “Shutdown this notebook server?”.
Run the same Churn Benchmark using only CPUs.
Execute the
lp-runjupyter-etl-cpu.sh
script.
Compare the differences between the two outputs.
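One simple way to compare the two runs is to divide the CPU benchmark time by the GPU benchmark time. The sketch below assumes you noted both timings in seconds. The function name speedup is hypothetical, and the numbers in the usage note are placeholders, not measured results.

```shell
# Hypothetical helper: print CPU time divided by GPU time, to two decimals.
# Both arguments are elapsed times in the same unit (for example, seconds).
speedup() {
  awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (gpu <= 0) exit 1            # refuse to divide by zero
    printf "%.2f\n", cpu / gpu
  }'
}
```

With placeholder timings, speedup 600 150 prints 4.00, meaning the GPU run would have been four times faster than the CPU run.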