Triton Inference Server Linear model deployment#
This example shows how to optimize a simple linear model and deploy it to Triton Inference Server.
Requirements#
The example requires the torch package. It can be installed in your current environment using pip:
pip install torch
Or you can use NVIDIA Torch container:
docker run -it --gpus 1 --shm-size 8gb -v ${PWD}:${PWD} -w ${PWD} nvcr.io/nvidia/pytorch:23.01-py3 bash
If you choose to use the container, we recommend installing the NVIDIA Container Toolkit.
Run model optimization#
In the next step, the optimization process is performed for the model:
python examples/triton/optimize.py
Once the process finishes, the model_repository directory is created in the current working directory.
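The optimize.py script is not reproduced here; the sketch below only illustrates the kind of model and optimization call such a script wraps, assuming the Triton Model Navigator Python API (model_navigator). The model dimensions, the dataloader, and the nav.torch.optimize call are assumptions rather than the exact example code; the real script also writes out the model_repository directory used below.
import torch
import model_navigator as nav  # assumption: Triton Model Navigator Python package

# A simple linear model: y = W x + b over 16-element float vectors (sizes are illustrative).
model = torch.nn.Linear(16, 8).eval()

# Sample inputs the optimizer uses to trace and profile the model.
dataloader = [torch.randn(2, 16) for _ in range(10)]

# Assumed API: convert and optimize the model across the available formats and runtimes.
package = nav.torch.optimize(model=model, dataloader=dataloader)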
At this point, you can exit the container:
exit
Start Triton Inference Server#
Based on the deployment created in the model repository, the Triton Inference Server can be started. The following command runs the server in the background and exposes the HTTP (8000), gRPC (8001), and metrics (8002) ports.
docker run --gpus=1 --rm -d \
--name tritonserver \
-p8000:8000 \
-p8001:8001 \
-p8002:8002 \
-v ${PWD}/model_repository:/models \
nvcr.io/nvidia/tritonserver:23.01-py3 \
tritonserver --model-repository=/models
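Before profiling, you can verify that the server has loaded the model. A minimal readiness check with the tritonclient Python package (already available in the SDK container used below) might look like this; it assumes the model name linear produced by the optimization step:
import tritonclient.http as httpclient

# Connect to the HTTP endpoint exposed on port 8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Both calls return True once the server and the "linear" model are ready.
print(client.is_server_ready())
print(client.is_model_ready("linear"))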
Use Perf Analyzer to profile the model#
Finally, you can run a container with Perf Analyzer:
docker run -it --network=host nvcr.io/nvidia/tritonserver:23.01-py3-sdk bash
And profile the model:
perf_analyzer -m linear --concurrency-range 2:32:2
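Besides load testing, you can send a single request from the same SDK container with the tritonclient Python package. The input and output tensor names (input__0, output__0) and the shape below follow the PyTorch backend's default naming convention and are assumptions; check the generated config.pbtxt under model_repository/linear for the actual values:
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Assumed input name and shape; verify against model_repository/linear/config.pbtxt.
data = np.random.randn(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Run a single inference against the "linear" model and read the (assumed) output tensor.
result = client.infer(model_name="linear", inputs=[infer_input])
print(result.as_numpy("output__0"))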
Remove containers#
After you finish running the example, stop the Triton container running in the background:
docker stop tritonserver