Step #6: Inference with BigNLP Model

The Triton Inference Server will be used to run a few inference examples.

We will convert the .nemo format model created in the previous step into FasterTransformer format for Triton Inference Server deployment.

  1. Convert the model to FasterTransformer format and configure it for Triton server deployment.

    This is done using two commands: model-navigator convert and model-navigator triton-config-model. The ngc batch run command below runs both.


ngc batch run \
  --name "bignlp_triton_config_model_126m" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 30m \
  --commandline "\
set -x && \
touch /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml && \
tee /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml << EOF
launch_mode: local
use_symlinks: true
override_conversion_container: false
framework_docker_image: nvcr.io/nvidia/pytorch:21.09-py3
target_formats:
- fastertransformer
onnx_opsets: []
tensorrt_precisions: []
tensorrt_precisions_mode: hierarchy
tensorrt_explicit_precision: false
tensorrt_strict_types: false
tensorrt_sparse_weights: false
ft_gpu_counts:
- 4
atol:
  '': 1.0e-05
rtol:
  '': 1.0e-05
tensorrt_capture_cuda_graph: false
max_batch_size: 256
preferred_batch_sizes:
- 256
max_queue_delay_us: 0
engine_count_per_device:
  cpu: 1
triton_backend_parameters:
  is_half: '1'
  pipeline_para_size: '1'
  max_input_len: '200'
  max_seq_len: '400'
triton_docker_image: nvcr.io/nvidia/tritonserver:21.09-py3
config_search_max_concurrency: 1024
config_search_max_instance_count: 5
config_search_max_preferred_batch_size: 32
config_search_concurrency:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
- 512
config_search_batch_sizes:
- 1
config_search_instance_counts:
  cpu:
  - 1
triton_disable_model_reload: false
max_shapes:
  INPUT_ID:
  - -1
  - 1
  - 200
  REQUEST_INPUT_LEN:
  - -1
  - 1
  REQUEST_OUTPUT_LEN:
  - -1
  - 1
value_ranges:
  INPUT_ID:
  - 50256
  - 50256
  REQUEST_INPUT_LEN:
  - 200
  - 200
  REQUEST_OUTPUT_LEN:
  - 200
  - 200
dtypes:
  INPUT_ID: uint32
  REQUEST_INPUT_LEN: uint32
  REQUEST_OUTPUT_LEN: uint32
perf_analyzer_timeout: 1200
perf_measurement_mode: count_windows
perf_measurement_request_count: 50
perf_measurement_interval: 10000
perf_measurement_relative_step: 0.5
EOF
model-navigator convert \
  --override-workspace \
  --workspace-path /mount_workspace/infer_workspace/126m \
  --config-path /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml \
  --model-name test_126m \
  --model-path /mount_workspace/results/gpt3_126m/convert_nemo/megatron_gpt_126m_bf16.nemo \
  --model-format nemo \
  --output-path /mount_workspace/infer_workspace/126m/test_126m-converted.ft \
  --verbose \
  --use-symlinks \
  > >(tee -a /results/model-navigator_convert.log) \
  2> >(tee -a /results/model-navigator_convert_stderr.log >&2) && \
cp /results/model-navigator_convert.log /mount_workspace/infer_workspace/126m && \
cp /results/model-navigator_convert_stderr.log /mount_workspace/infer_workspace/126m && \
model-navigator triton-config-model \
  --override-workspace \
  --workspace-path /mount_workspace/infer_workspace/126m \
  --config-path /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml \
  --model-name test_126m \
  --model-path /mount_workspace/infer_workspace/126m/converted/megatron_gpt_126m_bf16-nemo2ft_4.ft \
  --model-repository /mount_workspace/infer_workspace/triton_repo \
  --use-symlinks \
  --verbose \
  --max-batch-size 16 \
  --preferred-batch-sizes 16 \
  --triton-backend-parameters is_half=1 pipeline_para_size=2 max_input_len=60 max_seq_len=80 \
  > >(tee -a /results/model-navigator_triton-config.log) \
  2> >(tee -a /results/model-navigator_triton-config_stderr.log >&2) && \
cp /results/model-navigator_triton-config.log /mount_workspace/infer_workspace/126m && \
cp /results/model-navigator_triton-config_stderr.log /mount_workspace/infer_workspace/126m"

    The Triton model repo is set up in /mount_workspace/infer_workspace/triton_repo.
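
    If you want to sanity-check the repository before starting the server, a short Python sketch such as the one below (illustrative only; run it in a job that mounts the same workspace) walks the repository and prints each generated config.pbtxt, which is also where you can confirm the model's input and output tensor names.

import os

# Triton model repository created by model-navigator triton-config-model above.
REPO = "/mount_workspace/infer_workspace/triton_repo"

for root, _dirs, files in os.walk(REPO):
    for name in files:
        path = os.path.join(root, name)
        print(path)
        # The generated model configuration lists the input/output tensor
        # names and data types used by the inference clients below.
        if name == "config.pbtxt":
            with open(path) as f:
                print(f.read())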

  2. Start the Triton server using the command below. Note that we also download the GPT-2 vocabulary and merges files needed by the inference workload we will run.


ngc batch run \
  --name "bignlp_triton_inference_server" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.8.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-inference:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 2.5h \
  --commandline "\
set -x && \
hostname -I && \
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P /mount_workspace/data/bpe && \
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P /mount_workspace/data/bpe && \
tritonserver --model-repository /mount_workspace/infer_workspace/triton_repo"

    The inference container nvcr.io/nv-launchpad-bc/bignlp-inference:22.02-py3 is being used because it has the tritonserver application installed. Once the bignlp_triton_inference_server job starts, note its job-id in the console output, or look it up in the web UI.

    Note

    The previous command, which started the Triton Inference Server, is intended to run in the background and will accept inference requests until it automatically terminates after 2.5 hours. The remaining commands in this step need to be run during this window. If the inference server terminates before you have run all of the commands, launch the server again using the command above and capture the new IP address for the server before running the inference commands.


    You need to determine the IP address, internal to BCP, on which the inference server is running. We have not exposed the server ports as would normally be done: on BCP, exposed ports are served via HTTPS, which requires setting up the Triton server with SSL authentication. Refer to the Triton SSL/TLS documentation. For this demo, the inference client applications will run on BCP over the internal BCP network. Run the following command to obtain the Triton Inference Server IP address.


# From the tritonserver job above, get TRS_JOBID
export TRS_JOBID=2714872
TRITONSERVER=($(ngc batch exec \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --commandline "hostname -I" \
  ${TRS_JOBID})) && \
export TRITONSERVER=${TRITONSERVER[0]}

    Alternatively, look up the IP address in the job log in the web UI. To do so, view the first 10 lines of the job log. You should see something like the following:


+ hostname -I
10.233.68.237


    Then export it via export TRITONSERVER=10.233.68.237 (use the correct IP address for your running Triton server). If you get the error message “export: TRITONSERVER: inconsistent type for assignment”, run unset TRITONSERVER and export the variable again. To verify the variable is set, run echo $TRITONSERVER.
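
    If you prefer to check readiness from Python instead of curl, the following minimal sketch queries Triton's standard HTTP readiness endpoint on port 8000; it assumes the requests package is available in the client environment and that TRITONSERVER is set as described above.

import os

import requests  # assumed to be available in the client environment

# IP address of the Triton server captured above.
server_ip = os.environ["TRITONSERVER"]

# Same readiness endpoint that the curl check below uses.
resp = requests.get(f"http://{server_ip}:8000/v2/health/ready", timeout=10)
print("Triton ready:", resp.status_code == 200)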

    We are now ready to run the inference applications, which are written in Python. The following command runs a chatbot application. Be sure to replace <IP_ADDRESS_OF_TRITON_SERVER> in the command below with your Triton server's IP address.


ngc batch run \
  --name "bignlp_triton_inference_126m_apps" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 1h \
  --commandline "\
set -x && \
export TRITONSERVER=<IP_ADDRESS_OF_TRITON_SERVER>; \
curl -v http://${TRITONSERVER}:8000/v2/health/ready && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Life\" && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Liberty\" && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Happiness\""

    You can find the output in joblog.log (the curl command output is abbreviated below).


File: joblog.log

+ curl -v http://10.233.68.237:8000/v2/health/ready
. . .
< HTTP/1.1 200 OK
. . .
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Life
. The first of the three, the first of the three, was a story of a young woman who was sent to a convent in the countryside of the Netherlands. She was a nun,
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Liberty
, the court held that the trial court erred in denying the motion to dismiss. The court noted that the plaintiff had failed to establish that the defendants had acted with malice or reckless disregard for the plaintiff
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Happiness
and Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness

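    The chatbot.py and dialogue.py scripts are the supported clients, but for illustration the model can also be queried directly with the Triton gRPC client. The sketch below is an assumption-laden example: the tensor names, data types, and shapes follow the max_shapes and dtypes entries of the step 1 configuration, while the token IDs and the output tensor name (OUTPUT0) are placeholders. Confirm the real names in the generated config.pbtxt, and note that real prompts must be tokenized with the GPT-2 vocab/merges files downloaded in step 2.

import os

import numpy as np
import tritonclient.grpc as grpcclient

# gRPC endpoint of the Triton server started in step 2.
client = grpcclient.InferenceServerClient(url=f"{os.environ['TRITONSERVER']}:8001")

# Hypothetical, already-tokenized prompt: shapes follow the max_shapes section
# of the step 1 configuration ([batch, 1, seq_len] for INPUT_ID).
input_ids = np.full((1, 1, 8), 50256, dtype=np.uint32)
input_len = np.array([[8]], dtype=np.uint32)
output_len = np.array([[40]], dtype=np.uint32)

inputs = [
    grpcclient.InferInput("INPUT_ID", list(input_ids.shape), "UINT32"),
    grpcclient.InferInput("REQUEST_INPUT_LEN", list(input_len.shape), "UINT32"),
    grpcclient.InferInput("REQUEST_OUTPUT_LEN", list(output_len.shape), "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(input_len)
inputs[2].set_data_from_numpy(output_len)

result = client.infer(model_name="test_126m", inputs=inputs)
# The output tensor name is an assumption; confirm it in the generated config.pbtxt.
print(result.as_numpy("OUTPUT0"))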

    You could also run an interactive session and run the dialogue application there (or prepare an input file to supply via stdin; see the sketch after step 4). Use the --shell option to start an interactive session.

  3. Start a dialogue in the Base Command Platform shell.


ngc batch run \
  --name "dialogue_126m_apps" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --shell


  4. At the resulting prompt (which may take a minute or two to appear), type in questions as the customer; the model responds as the support persona.


export TRITONSERVER=<IP_ADDRESS_OF_TRITON_SERVER>
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/dialogue.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --customer "NVIDIA customer:" \
  --support "NVIDIA machine learning expert:"

NVIDIA customer: (END to FINISH): How big are you?
NVIDIA machine learning expert: I'm a big guy.
NVIDIA customer: (END to FINISH): Are you a cat?
NVIDIA machine learning expert: I'm a cat.
NVIDIA customer: (END to FINISH): Are you sure?
NVIDIA machine learning expert: I don't understand. Can you rephrase your request?
NVIDIA customer: (END to FINISH): How old are you?
NVIDIA machine learning expert: I'm a little older than you.
NVIDIA customer: (END to FINISH): Can AI do something?
NVIDIA machine learning expert: I'm a little older than you.
NVIDIA customer: (END to FINISH): Are you ready for prime time?
NVIDIA machine learning expert: I'm ready for prime time.
NVIDIA customer: (END to FINISH): END
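
    As noted before step 3, you can also prepare the customer turns in a file and feed them to dialogue.py via stdin instead of typing them interactively. The following is a minimal sketch using the same flags as the interactive command above; the file path and questions are illustrative.

import os
import subprocess

# Illustrative customer turns; the dialogue ends with END, as in the transcript above.
questions = ["How big are you?", "Are you ready for prime time?", "END"]
with open("/tmp/customer_turns.txt", "w") as f:
    f.write("\n".join(questions) + "\n")

# Same dialogue.py invocation as above, driven from a file instead of the keyboard.
cmd = [
    "python3", "/opt/bignlp/bignlp-scripts/bignlp/infer_scripts/dialogue.py",
    "--url", f"{os.environ['TRITONSERVER']}:8001", "--protocol", "grpc",
    "--datasets-dir", "/mount_workspace/data/bpe/",
    "--model-name", "test_126m",
    "--output-len", "40",
    "--customer", "NVIDIA customer:",
    "--support", "NVIDIA machine learning expert:",
]
with open("/tmp/customer_turns.txt") as stdin_file:
    subprocess.run(cmd, stdin=stdin_file, check=True)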
