Step #6: Inference with BigNLP Model

The Triton Inference Server will be used to run a few inference examples.

We will convert the .nemo format model created in the previous step into FasterTransformer format for Triton Inference Server deployment.

  1. Convert the model to FasterTransformer format and configure it for Triton server deployment.

    This is done using two commands: model-navigator convert and model-navigator triton-config-model. The ngc batch run command below runs both.


ngc batch run \
  --name "bignlp_triton_config_model_126m" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 30m \
  --commandline "\
set -x && \
touch /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml && \
tee /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml << EOF
launch_mode: local
use_symlinks: true
override_conversion_container: false
framework_docker_image: nvcr.io/nvidia/pytorch:21.09-py3
target_formats:
- fastertransformer
onnx_opsets: []
tensorrt_precisions: []
tensorrt_precisions_mode: hierarchy
tensorrt_explicit_precision: false
tensorrt_strict_types: false
tensorrt_sparse_weights: false
ft_gpu_counts:
- 4
atol:
  '': 1.0e-05
rtol:
  '': 1.0e-05
tensorrt_capture_cuda_graph: false
max_batch_size: 256
preferred_batch_sizes:
- 256
max_queue_delay_us: 0
engine_count_per_device:
  cpu: 1
triton_backend_parameters:
  is_half: '1'
  pipeline_para_size: '1'
  max_input_len: '200'
  max_seq_len: '400'
triton_docker_image: nvcr.io/nvidia/tritonserver:21.09-py3
config_search_max_concurrency: 1024
config_search_max_instance_count: 5
config_search_max_preferred_batch_size: 32
config_search_concurrency:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
- 512
config_search_batch_sizes:
- 1
config_search_instance_counts:
  cpu:
  - 1
triton_disable_model_reload: false
max_shapes:
  INPUT_ID:
  - -1
  - 1
  - 200
  REQUEST_INPUT_LEN:
  - -1
  - 1
  REQUEST_OUTPUT_LEN:
  - -1
  - 1
value_ranges:
  INPUT_ID:
  - 50256
  - 50256
  REQUEST_INPUT_LEN:
  - 200
  - 200
  REQUEST_OUTPUT_LEN:
  - 200
  - 200
dtypes:
  INPUT_ID: uint32
  REQUEST_INPUT_LEN: uint32
  REQUEST_OUTPUT_LEN: uint32
perf_analyzer_timeout: 1200
perf_measurement_mode: count_windows
perf_measurement_request_count: 50
perf_measurement_interval: 10000
perf_measurement_relative_step: 0.5
EOF
model-navigator convert \
  --override-workspace \
  --workspace-path /mount_workspace/infer_workspace/126m \
  --config-path /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml \
  --model-name test_126m \
  --model-path /mount_workspace/results/gpt3_126m/convert_nemo/megatron_gpt_126m_bf16.nemo \
  --model-format nemo \
  --output-path /mount_workspace/infer_workspace/126m/test_126m-converted.ft \
  --verbose \
  --use-symlinks \
  > >(tee -a /results/model-navigator_convert.log) \
  2> >(tee -a /results/model-navigator_convert_stderr.log >&2) && \
cp /results/model-navigator_convert.log /mount_workspace/infer_workspace/126m && \
cp /results/model-navigator_convert_stderr.log /mount_workspace/infer_workspace/126m && \
model-navigator triton-config-model \
  --override-workspace \
  --workspace-path /mount_workspace/infer_workspace/126m \
  --config-path /opt/bignlp/bignlp-scripts/conf/inference/optimal_configurations/126m.yaml \
  --model-name test_126m \
  --model-path /mount_workspace/infer_workspace/126m/converted/megatron_gpt_126m_bf16-nemo2ft_4.ft \
  --model-repository /mount_workspace/infer_workspace/triton_repo \
  --use-symlinks \
  --verbose \
  --max-batch-size 16 \
  --preferred-batch-sizes 16 \
  --triton-backend-parameters is_half=1 pipeline_para_size=2 max_input_len=60 max_seq_len=80 \
  > >(tee -a /results/model-navigator_triton-config.log) \
  2> >(tee -a /results/model-navigator_triton-config_stderr.log >&2) && \
cp /results/model-navigator_triton-config.log /mount_workspace/infer_workspace/126m && \
cp /results/model-navigator_triton-config_stderr.log /mount_workspace/infer_workspace/126m"

    The Triton model repo is set up in /mount_workspace/infer_workspace/triton_repo.
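
    If you want to sanity-check the repository before starting the server, a short Python sketch such as the one below (illustrative only; run it in a job that mounts the same workspace) walks the repository and prints each generated config.pbtxt, which is also where you can confirm the model's input and output tensor names.

import os

# Triton model repository created by model-navigator triton-config-model above.
REPO = "/mount_workspace/infer_workspace/triton_repo"

for root, _dirs, files in os.walk(REPO):
    for name in files:
        path = os.path.join(root, name)
        print(path)
        # The generated model configuration lists the input/output tensor
        # names and data types used by the inference clients below.
        if name == "config.pbtxt":
            with open(path) as f:
                print(f.read())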

  2. Start the Triton server using the command below. Note that we also download the GPT-2 vocabulary and merges files needed by the inference workload we will run.


ngc batch run \
  --name "bignlp_triton_inference_server" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.8.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-inference:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 2.5h \
  --commandline "\
set -x && \
hostname -I && \
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P /mount_workspace/data/bpe && \
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P /mount_workspace/data/bpe && \
tritonserver --model-repository /mount_workspace/infer_workspace/triton_repo"

    The inference container nvcr.io/nv-launchpad-bc/bignlp-inference:22.02-py3 is being used because it has the tritonserver application installed. Once the bignlp_triton_inference_server job starts, note its job-id in the console output, or look it up in the web UI.

    Note

    The previous command, which started the Triton Inference Server, is intended to run in the background and will accept inference requests until it automatically terminates after 2.5 hours. The remaining commands in this step need to be run during this window. If the inference server terminates before you have run all of the commands, launch the server again using the command above and capture the new IP address for the server before running the inference commands.


    You need to determine the IP address, internal to BCP, on which the inference server is running. We have not exposed the server ports as would normally be done: on BCP, exposed ports are served via HTTPS, which requires setting up the Triton server with SSL authentication. Refer to the Triton SSL/TLS documentation. For this demo, the inference client applications will run on BCP over the internal BCP network. Run the following command to obtain the Triton Inference Server IP address.


# From the tritonserver job above, get TRS_JOBID
export TRS_JOBID=2714872
TRITONSERVER=($(ngc batch exec \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --commandline "hostname -I" \
  ${TRS_JOBID})) && \
export TRITONSERVER=${TRITONSERVER[0]}

    Alternatively, look up the IP address in the job log in the web UI. To do so, view the first 10 lines of the job log. You should see something like the following:


+ hostname -I
10.233.68.237


    Then export it via export TRITONSERVER=10.233.68.237 (use the correct IP address for your running Triton server). If you get the error message “export: TRITONSERVER: inconsistent type for assignment”, run unset TRITONSERVER and export the variable again. To verify the variable is set, run echo $TRITONSERVER.
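
    If you prefer to check readiness from Python instead of curl, the following minimal sketch queries Triton's standard HTTP readiness endpoint on port 8000; it assumes the requests package is available in the client environment and that TRITONSERVER is set as described above.

import os

import requests  # assumed to be available in the client environment

# IP address of the Triton server captured above.
server_ip = os.environ["TRITONSERVER"]

# Same readiness endpoint that the curl check below uses.
resp = requests.get(f"http://{server_ip}:8000/v2/health/ready", timeout=10)
print("Triton ready:", resp.status_code == 200)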

    We are now ready to run the inference applications, which are written in Python. The following command runs a chatbot application. Be sure to replace <IP_ADDRESS_OF_TRITON_SERVER> in the command below with your Triton server's IP address.


ngc batch run \
  --name "bignlp_triton_inference_126m_apps" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --total-runtime 1h \
  --commandline "\
set -x && \
export TRITONSERVER=<IP_ADDRESS_OF_TRITON_SERVER>; \
curl -v http://${TRITONSERVER}:8000/v2/health/ready && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Life\" && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Liberty\" && \
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --query \"Happiness\""

    You can find the output in joblog.log (the curl command output is abbreviated below).


File: joblog.log

+ curl -v http://10.233.68.237:8000/v2/health/ready
. . .
< HTTP/1.1 200 OK
. . .
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Life
. The first of the three, the first of the three, was a story of a young woman who was sent to a convent in the countryside of the Netherlands. She was a nun,
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Liberty
, the court held that the trial court erred in denying the motion to dismiss. The court noted that the plaintiff had failed to establish that the defendants had acted with malice or reckless disregard for the plaintiff
+ python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/chatbot.py --url 10.233.68.237:8001 --protocol grpc --datasets-dir /mount_workspace/data/bpe/ --model-name test_126m --output-len 40 --query Happiness
and Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness The Happiness of Happiness

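    The chatbot.py and dialogue.py scripts are the supported clients, but for illustration the model can also be queried directly with the Triton gRPC client. The sketch below is an assumption-laden example: the tensor names, data types, and shapes follow the max_shapes and dtypes entries of the step 1 configuration, while the token IDs and the output tensor name (OUTPUT0) are placeholders. Confirm the real names in the generated config.pbtxt, and note that real prompts must be tokenized with the GPT-2 vocab/merges files downloaded in step 2.

import os

import numpy as np
import tritonclient.grpc as grpcclient

# gRPC endpoint of the Triton server started in step 2.
client = grpcclient.InferenceServerClient(url=f"{os.environ['TRITONSERVER']}:8001")

# Hypothetical, already-tokenized prompt: shapes follow the max_shapes section
# of the step 1 configuration ([batch, 1, seq_len] for INPUT_ID).
input_ids = np.full((1, 1, 8), 50256, dtype=np.uint32)
input_len = np.array([[8]], dtype=np.uint32)
output_len = np.array([[40]], dtype=np.uint32)

inputs = [
    grpcclient.InferInput("INPUT_ID", list(input_ids.shape), "UINT32"),
    grpcclient.InferInput("REQUEST_INPUT_LEN", list(input_len.shape), "UINT32"),
    grpcclient.InferInput("REQUEST_OUTPUT_LEN", list(output_len.shape), "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(input_len)
inputs[2].set_data_from_numpy(output_len)

result = client.infer(model_name="test_126m", inputs=inputs)
# The output tensor name is an assumption; confirm it in the generated config.pbtxt.
print(result.as_numpy("OUTPUT0"))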

    You could also run an interactive session and run the dialogue application there (or prepare an input file to supply via stdin; see the sketch after step 4). Use the --shell option to start an interactive session.

  3. Start a dialogue in the Base Command Platform shell.


ngc batch run \
  --name "dialogue_126m_apps" \
  --org nv-launchpad-bc \
  --team no-team \
  --ace nv-launchpad-bc-iad1 \
  --instance dgxa100.80g.1.norm \
  --image "nvcr.io/nv-launchpad-bc/bignlp-training:22.02-py3" \
  --result /results \
  --workspace jdoe_workspace:/mount_workspace:RW \
  --shell


  4. At the resulting prompt (which may take a minute or two to appear), type in questions as the customer; the model responds as the support persona.


export TRITONSERVER=<IP_ADDRESS_OF_TRITON_SERVER>
python3 /opt/bignlp/bignlp-scripts/bignlp/infer_scripts/dialogue.py \
  --url ${TRITONSERVER}:8001 --protocol grpc \
  --datasets-dir /mount_workspace/data/bpe/ \
  --model-name test_126m \
  --output-len 40 \
  --customer "NVIDIA customer:" \
  --support "NVIDIA machine learning expert:"

NVIDIA customer: (END to FINISH): How big are you?
NVIDIA machine learning expert: I'm a big guy.
NVIDIA customer: (END to FINISH): Are you a cat?
NVIDIA machine learning expert: I'm a cat.
NVIDIA customer: (END to FINISH): Are you sure?
NVIDIA machine learning expert: I don't understand. Can you rephrase your request?
NVIDIA customer: (END to FINISH): How old are you?
NVIDIA machine learning expert: I'm a little older than you.
NVIDIA customer: (END to FINISH): Can AI do something?
NVIDIA machine learning expert: I'm a little older than you.
NVIDIA customer: (END to FINISH): Are you ready for prime time?
NVIDIA machine learning expert: I'm ready for prime time.
NVIDIA customer: (END to FINISH): END
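
    As noted before step 3, you can also prepare the customer turns in a file and feed them to dialogue.py via stdin instead of typing them interactively. The following is a minimal sketch using the same flags as the interactive command above; the file path and questions are illustrative.

import os
import subprocess

# Illustrative customer turns; the dialogue ends with END, as in the transcript above.
questions = ["How big are you?", "Are you ready for prime time?", "END"]
with open("/tmp/customer_turns.txt", "w") as f:
    f.write("\n".join(questions) + "\n")

# Same dialogue.py invocation as above, driven from a file instead of the keyboard.
cmd = [
    "python3", "/opt/bignlp/bignlp-scripts/bignlp/infer_scripts/dialogue.py",
    "--url", f"{os.environ['TRITONSERVER']}:8001", "--protocol", "grpc",
    "--datasets-dir", "/mount_workspace/data/bpe/",
    "--model-name", "test_126m",
    "--output-len", "40",
    "--customer", "NVIDIA customer:",
    "--support", "NVIDIA machine learning expert:",
]
with open("/tmp/customer_turns.txt") as stdin_file:
    subprocess.run(cmd, stdin=stdin_file, check=True)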
