Next Item Prediction

Step #1: Run the Workflow

In this lab, all of the workflow components have already been deployed, but a few manual steps are still needed to replicate a production use case. In production, a user would not run these steps explicitly; they would run on a schedule.

  1. First, open the SSH Console from the left pane and check that the synthetic data generation and preparation jobs have finished, using the command below. An optional blocking alternative is sketched at the end of this step.

    kubectl get jobs -n merlin

    image5.png
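
    Optionally, rather than polling the command above, you can block until the data-preparation jobs report completion. This is a convenience sketch only; the 10-minute timeout is an arbitrary assumption.

    # Block until all jobs in the namespace report the Complete condition (timeout is an assumption)
    kubectl wait --for=condition=complete jobs --all -n merlin --timeout=10m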

  2. Next, we’ll manually create a training job from the cronjob. Training is scheduled to run weekly, each Sunday (a cronjob is set up for this), but for the purposes of running the workflow as an example, we will trigger the cronjob manually. A consolidated sketch of these sub-steps appears at the end of this step.

    • First, get the name of the cronjob.

      kubectl get cronjobs -n merlin


    • Then, using the name of the cronjob run:

      kubectl create job --from=cronjob/<name-of-cronjob> train-job -n merlin


      The train job will take a few minutes to run. Run the following command to check that the job completed successfully:

      kubectl get jobs -n merlin


    image6.png
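
    If you prefer, the sub-steps above can be run as one short sequence in the SSH Console. This is a convenience sketch only: it assumes the merlin namespace contains exactly one cronjob and that train-job has not already been created.

      # Capture the (assumed single) cronjob name, create a one-off job from it, then wait for it to finish
      CRONJOB=$(kubectl get cronjobs -n merlin -o jsonpath='{.items[0].metadata.name}')
      kubectl create job --from=cronjob/${CRONJOB} train-job -n merlin
      kubectl wait --for=condition=complete job/train-job -n merlin --timeout=15m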

  3. Once the train job has completed, you can open MinIO to view the models that were created. To do so, click the MinIO link in the left pane, and log in using the credentials below:

    MinIO:

    • User: minioadmin

    • Password: {{ minio_password }}

  4. Once you’ve logged in, click the Browse button for bucket-2. You should see a few model files inside the bucket (an optional command-line check is sketched after this step):

    image7.png
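
    Optionally, the bucket contents can also be checked from the SSH Console with the MinIO client. This sketch assumes the mc binary is available and that <minio-endpoint> is replaced with the lab’s MinIO service URL.

    # Register the lab's MinIO endpoint (placeholder) under an alias, then list the model bucket
    mc alias set labminio <minio-endpoint> minioadmin {{ minio_password }}
    mc ls labminio/bucket-2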

  5. Now we’re ready to run the models using Triton Inference Server. Follow the steps below to load the models into Triton, using the SSH Console; a consolidated sketch follows the sub-steps.

    • First, get the name of the inference deployment.

      kubectl get deployments -o=jsonpath='{range .items[*]}{@.metadata.name}{"\n"}{end}' -n merlin | grep 'infer$'


    • Next, restart the deployment.

      kubectl rollout restart deployment <deployment_name> -n merlin


    • You can see if Triton loaded the models by looking at the logs of the pod that was deployed. First, get the name of the infer pod:

      kubectl get pods -n merlin

      image8.png

    • Next, run the command below using the name of the infer pod.

      kubectl logs <pod-name> -n merlin


      You should see logs indicating the model was loaded successfully.

      image9.png
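
    As with the training job, these sub-steps can be condensed into a short sequence. This convenience sketch captures the deployment name in a shell variable and waits for the rollout to finish before you inspect the Triton logs.

      # Find the infer deployment, restart it, and wait for the rollout to complete
      INFER_DEPLOY=$(kubectl get deployments -o=jsonpath='{range .items[*]}{@.metadata.name}{"\n"}{end}' -n merlin | grep 'infer$')
      kubectl rollout restart deployment ${INFER_DEPLOY} -n merlin
      kubectl rollout status deployment ${INFER_DEPLOY} -n merlin --timeout=10m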

  6. Next, we’ll need to generate an access token that the client will use to authenticate with the Triton service. Run the following command in the SSH Console:

    curl -k -L -X POST '{{ keycloak_token_endpoint }}' -H 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'client_id={{ keycloak_client_id }}' --data-urlencode 'grant_type=password' --data-urlencode 'client_secret={{ keycloak_client_secret }}' --data-urlencode 'scope=openid' --data-urlencode 'username={{ keycloak_realm_user }}' --data-urlencode 'password={{ keycloak_realm_user_password }}'

    If needed, the information below can be used to manually configure Keycloak and/or build the curl request from the following template. A link to Keycloak is also provided in the left pane.

    curl -k -L -X POST 'TOKEN_ENDPOINT' -H 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'client_id=CLIENT_ID' --data-urlencode 'grant_type=password' --data-urlencode 'client_secret=CLIENT_SECRET' --data-urlencode 'scope=openid' --data-urlencode 'username=USERNAME' --data-urlencode 'password=PASSWORD'

    • Admin User: admin

    • Admin Password: {{ keycloak_admin_password }}

    • Realm Name: {{ keycloak_realm }}

    • Realm User: {{ keycloak_realm_user }}

    • Realm Password: {{ keycloak_realm_user_password }}

    • Client ID: {{ keycloak_client_id }}

    • Client Secret: {{ keycloak_client_secret }}

    • Token Endpoint: {{ keycloak_token_endpoint }}

  7. Once you’ve run the curl request, you should receive a JSON response like the one below:

    {"access_token":"eyJhbGc...","expires_in":54000,"refresh_expires_in":108000,"refresh_token":"eyJhbGci...","not-before-policy":0,"session_state":"e7e23016-2307-4290-af45-2c79ee79d0a1","scope":"openid email profile"}

    Note down the access_token; it will be required later in the workflow, within the Jupyter notebook.
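
    If jq is installed in the SSH Console (an assumption for this sketch), the token can also be captured directly by piping the curl output:

    # Same request as above, with the access_token field extracted by jq
    curl -s -k -L -X POST '{{ keycloak_token_endpoint }}' -H 'Content-Type: application/x-www-form-urlencoded' --data-urlencode 'client_id={{ keycloak_client_id }}' --data-urlencode 'grant_type=password' --data-urlencode 'client_secret={{ keycloak_client_secret }}' --data-urlencode 'scope=openid' --data-urlencode 'username={{ keycloak_realm_user }}' --data-urlencode 'password={{ keycloak_realm_user_password }}' | jq -r .access_token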

  8. Now we’ll use the Jupyter notebook client to send a request to Triton. Open the Inference Client link from the left pane, then run through the steps in the notebook, supplying the access token generated in the previous step where the notebook requires it.
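
    For reference, the notebook’s request follows Triton’s HTTP/REST (KServe v2) inference protocol, with the access token passed as a Bearer header. The sketch below is illustrative only: the endpoint, model name, input name, shape, and datatype are placeholders rather than the workflow’s actual values; use the values shown in the notebook.

    # Hypothetical request: replace every <placeholder> with the values from the notebook
    curl -k -X POST '<triton-endpoint>/v2/models/<model-name>/infer' \
      -H 'Authorization: Bearer <access-token>' \
      -H 'Content-Type: application/json' \
      -d '{"inputs": [{"name": "<input-name>", "shape": [1, 1], "datatype": "INT64", "data": [42]}]}'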

  9. After completing the notebook, we can check our monitoring dashboards to see the requests that were submitted. Click the Grafana link in the left pane, and log in using the credentials below.

    Grafana:

    • User: admin

    • Password: {{ grafana_password }}

  10. Select the Dashboards icon in the left navigation pane, select Browse, expand the merlin folder, and click NVIDIA Triton Inference Server Dashboard.

    image10.png

  11. Review the Triton metrics reported on the dashboard (a sketch for inspecting the raw metrics endpoint follows this list). Note the following metrics:

    • Average (per minute) queue time (in ms): Average cumulative time requests spend waiting in the scheduling queue (includes cached requests). Averaged across 1 minute.

    • Successful inference requests per minute: Number of successful inference requests received by Triton (each request is counted as 1, even if the request contains a batch).

    • # failed inference requests per minute: Number of failed inference requests received by Triton (each request is counted as 1, even if the request contains a batch).

    • P99 latency (per minute) (in seconds): 99th percentile request latency. Latency is computed for the total time spent in model inference backends.

    • P95 latency (per minute) (in seconds): 95th percentile request latency. Latency is computed for the total time spent in model inference backends.

    • # Triton Instances used

    • GPU memory (GB)

    • GPU power utilization (watts)

      image11.png
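
    The dashboard panels are built from Triton’s Prometheus metrics. To see the raw counters behind them, you can port-forward the infer pod’s metrics port and query it directly. This is an optional sketch and assumes Triton’s default metrics port of 8002.

      # Forward the pod's metrics port locally (8002 is Triton's default), then pull the request counters
      kubectl port-forward <infer-pod-name> 8002:8002 -n merlin &
      sleep 2
      curl -s localhost:8002/metrics | grep -E 'nv_inference_request_(success|failure)'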

  12. Congrats! You’ve just completed running the Next Item Prediction AI Workflow. Feel free to explore the components deployed as a part of the solution, or proceed to the next section for next steps.
