Accelerating Apache Spark with Zero Code Changes

Running Mortgage Benchmark

  1. Connect to System Console using the left-hand menu link.

  2. Connect to the sparkrunner pod.


    kubectl exec --stdin --tty sparkrunner-0 -- /bin/bash


  3. In the System Console, change to /home/spark/spark-scripts and run lp-runjupyter-etl-gpu.sh for the GPU run, or lp-runjupyter-etl-cpu.sh for the CPU run.

  4. In the left-hand menu, open the Desktop link and click the VNC connect button.

    spark-rapids-004.png

  5. Open the web browser in the Linux desktop.

    spark-rapids-005.png

  6. Browse to 172.16.0.10:30002.

    spark-rapids-006.png

  7. The browser should display the file list shown in the screenshot above.

  8. Click the lp-mortgageETL.ipynb link to open the Jupyter notebook.

    spark-rapids-014.png

    Note

    Please “trust” the notebook before you run it.

    spark-rapids-011.png

  9. Validate the creation of the Mortgage Benchmark pods with the following command.

    • Open another System Console.


    kubectl get pods | grep app-name

    • The output should look similar to this.


    app-name-79d837808b2d2ba5-exec-1 1/1 Running 0 31m
    app-name-79d837808b2d2ba5-exec-2 1/1 Running 0 31m
    app-name-79d837808b2d2ba5-exec-3 1/1 Running 0 31m
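
To check the executor pods without reading the listing by eye, the status column can be counted with a small filter. This is a minimal sketch, not part of the LaunchPad scripts: the function name `count_running_executors` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...) and the `app-name` prefix used in this guide.

```shell
# Count executor pods whose STATUS column is "Running".
# Reads `kubectl get pods` output on stdin; assumes NAME is column 1 and STATUS is column 3.
count_running_executors() {
  awk '$1 ~ /^app-name-.*-exec-[0-9]+$/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Offline check against the sample output shown in this step:
sample='app-name-79d837808b2d2ba5-exec-1 1/1 Running 0 31m
app-name-79d837808b2d2ba5-exec-2 1/1 Running 0 31m
app-name-79d837808b2d2ba5-exec-3 1/1 Running 0 31m'
printf '%s\n' "$sample" | count_running_executors
# prints 3

# Live usage (requires cluster access):
#   kubectl get pods | count_running_executors
```

A count lower than the number of executors you expect suggests that some pods are still starting or have failed.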

  10. Create two directories for the Mortgage Dataset in the console session from the previous step.


    cd `mount | awk -F ':' '/spark-rapids-claim/ {print $2}'|grep var | awk '{print $1}'`
    mkdir -p mortgage/input
    mkdir -p mortgage/output
    chmod 777 mortgage/output
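
The `cd` command in this step packs three filters into one line: the first `awk` splits each `mount` line at colons and prints the second field for lines that mention spark-rapids-claim, `grep var` keeps the lines containing "var", and the last `awk` keeps the first whitespace-separated word. The sketch below wraps the same pipeline in a hypothetical function, `claim_dir`, so it can be tried on a sample line. The sample mount line is invented for illustration; the real line on your system will differ.

```shell
# Extract the directory of the spark-rapids-claim volume from `mount` output.
# Same pipeline as in the step above, but reading the mount table from stdin.
claim_dir() {
  awk -F ':' '/spark-rapids-claim/ {print $2}' | grep var | awk '{print $1}'
}

# HYPOTHETICAL mount line (paths and options are invented for illustration):
sample='172.16.0.10:/data/default-spark-rapids-claim-pvc-1234 on /var/lib/kubelet/pods/abcd/volumes/kubernetes.io~nfs/pvc-1234 type nfs4 (rw,relatime)'

printf '%s\n' "$sample" | claim_dir
# prints /data/default-spark-rapids-claim-pvc-1234

# Live usage: stop if nothing matched, because a bare `cd` with no argument
# silently changes to the home directory instead of failing.
#   dir=$(mount | claim_dir); [ -n "$dir" ] && cd "$dir"
```

Checking that the result is non-empty before changing directory avoids creating the mortgage directories in the wrong place.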

  11. From the LaunchPad Desktop, download the input dataset from the Fannie Mae website.

    • Go to the Single-Family Loan Performance Data page.

      • Log in or register as a new user.

    • Select HP.

      • Click Download Data and choose Single-Family Loan Performance Data. You will find a tabular list of Acquisition and Performance files sorted by year and quarter. Click a file to download it, for example 2017Q1.zip.

      • Unzip the downloaded file to extract the CSV file, for example 2017Q1.csv.

      • Copy the csv files to the GPU node.


      scp 2017Q1.csv nvidia@172.16.0.10:/data/${your-default-spark-rapids-claim-path}/mortgage/input/
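
The `${your-default-spark-rapids-claim-path}` part of the command above is a placeholder to replace by hand. If it is typed literally, bash reads it as the default-value expansion `${your-...}` and substitutes the text `default-spark-rapids-claim-path` whenever the variable `your` is unset, which is unlikely to be a real directory. The sketch below builds the destination from an explicit argument instead. The function name `scp_dest` and the example claim path are hypothetical; the user and host are the ones used in this guide.

```shell
# Build the scp destination for the mortgage input directory.
# The claim path is passed explicitly so that a missing value fails loudly.
scp_dest() {
  claim_path=${1:?usage: scp_dest <claim-path>}
  printf 'nvidia@172.16.0.10:/data/%s/mortgage/input/\n' "$claim_path"
}

# HYPOTHETICAL claim path for illustration; substitute the directory from your system.
scp_dest "default-spark-rapids-claim-pvc-1234"
# prints nvidia@172.16.0.10:/data/default-spark-rapids-claim-pvc-1234/mortgage/input/

# Live usage, with CLAIM_PATH set to your claim directory name:
#   scp 2017Q1.csv "$(scp_dest "$CLAIM_PATH")"
```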


  12. Run the notebook by clicking Cell -> Run All.

    spark-rapids-015.png

  13. Note the timing for the benchmark so you can compare to your CPU run time.

    spark-rapids-016.png

  14. Stop the notebook server you started in step 3 by pressing Ctrl-C in the System Console window where you started it. Answer Y when asked “Shutdown this notebook server?”.

  15. Run the same Mortgage Benchmark using only CPUs.

    • Execute the lp-runjupyter-etl-cpu.sh script.

  16. Compare the differences between the two outputs.
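
One simple way to summarize the comparison is the ratio of the two run times. The sketch below is an illustration only: `speedup` is a hypothetical helper, and the timings passed to it are placeholders, not measured results. Substitute the wall-clock times, in seconds, that you noted for the CPU and GPU runs.

```shell
# Compute the CPU-to-GPU speedup from two wall-clock times in seconds.
# Usage: speedup <cpu_seconds> <gpu_seconds>
speedup() {
  awk -v cpu="$1" -v gpu="$2" 'BEGIN {
    if (gpu <= 0) exit 1          # guard against division by zero
    printf "%.2fx\n", cpu / gpu
  }'
}

# Placeholder timings, NOT measured results:
speedup 600 150
# prints 4.00x
```

A value above 1.00x means the GPU run finished faster than the CPU run.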

Note

You must close the notebook tab and then shut down the notebook server before starting another session. Otherwise, you will not be able to start another Spark session.

© Copyright 2022-2023, NVIDIA. Last updated on Jun 23, 2023.