Building Complex Pipelines: Stable Diffusion#
Watch this explainer video, which discusses the pipeline, before proceeding with the example. This example focuses on showcasing two of Triton Inference Server's features:
Using multiple frameworks in the same inference pipeline. Refer to this for more information about the supported frameworks.
Using the Python Backend's Business Logic Scripting (BLS) API to build complex, non-linear pipelines.
Using Multiple Backends#
Building a pipeline powered by deep learning models is a collaborative effort that often involves multiple contributors. Contributors often have differing development environments, which can lead to issues when assembling a single pipeline from their combined work. Triton users can solve this challenge by using the Python or C++ backend along with the Business Logic Scripting (BLS) API to trigger model execution.
In this example, the models are run on the following backends (a model configuration sketch follows the list):
ONNX Backend
TensorRT Backend
Python Backend
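Each model in the repository declares its backend through Triton's model configuration (config.pbtxt). As a rough sketch only, the TensorRT-accelerated VAE could be configured along these lines; the exact tensor names, data types, and dimensions are assumptions here and must match the exported model:
# Hypothetical config.pbtxt sketch for the TensorRT VAE model
name: "vae"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "latent_sample"
    data_type: TYPE_FP32
    dims: [ 4, 64, 64 ]
  }
]
output [
  {
    name: "sample"
    data_type: TYPE_FP32
    dims: [ 3, 512, 512 ]
  }
]
The text encoder would analogously declare platform: "onnxruntime_onnx", and the orchestrating pipeline model backend: "python".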
Both models deployed on framework backends can be triggered from the Python pipeline model using the following API:
encoding_request = pb_utils.InferenceRequest(
    model_name="text_encoder",
    requested_output_names=["last_hidden_state"],
    inputs=[input_ids_1],
)
response = encoding_request.exec()
text_embeddings = pb_utils.get_output_tensor_by_name(response, "last_hidden_state")
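The inputs passed to pb_utils.InferenceRequest are pb_utils.Tensor objects wrapping NumPy arrays. Below is a minimal sketch of how input_ids_1 could be built; the tensor name, shape, and dtype are assumptions and must match the text encoder's configuration:
import numpy as np
import triton_python_backend_utils as pb_utils

# Hypothetical tokenized prompt; the real pipeline produces these ids
# with the CLIP tokenizer. Shape and dtype must match the model config.
token_ids = np.zeros((1, 77), dtype=np.int32)

# Wrap the NumPy array in a Triton tensor; the name must match the
# input name declared in the text_encoder model configuration.
input_ids_1 = pb_utils.Tensor("input_ids", token_ids)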
Refer to model.py in the pipeline model for a complete example.
Stable Diffusion Example#
Before starting, clone this repository and navigate to the root folder. Use three different terminals for an easier user experience.
Step 1: Prepare the Server Environment#
First, run the Triton Inference Server container.
# Replace yy.mm with the year and month of the release, e.g. 22.08
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
Next, install all the dependencies required by the models running in the Python backend and log in with your Hugging Face token (a Hugging Face account is required).
# PyTorch & Transformers Lib
pip install torch torchvision torchaudio
pip install transformers ftfy scipy accelerate
pip install diffusers==0.9.0
pip install transformers[onnxruntime]
huggingface-cli login
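Optionally, verify the environment before moving on. This is a quick check, not part of the pipeline itself, that the libraries import and a GPU is visible:
python3 -c "import torch, transformers, diffusers; print(torch.cuda.is_available(), diffusers.__version__)"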
Step 2: Exporting and converting the models#
Use the NGC PyTorch container to export and convert the models.
docker run -it --gpus all -p 8888:8888 -v ${PWD}:/mount nvcr.io/nvidia/pytorch:yy.mm-py3
pip install transformers ftfy scipy
pip install transformers[onnxruntime]
pip install diffusers==0.9.0
huggingface-cli login
cd /mount
python export.py
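# export.py is expected to write encoder.onnx and vae.onnx into /mount;
# the steps below convert them and copy them into the model repository.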
# Accelerating VAE with TensorRT
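# The engine below is built with a dynamic batch dimension for latent_sample:
# 1 (min), 4 (opt) and 8 (max); --fp16 enables half-precision execution.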
trtexec --onnx=vae.onnx --saveEngine=vae.plan --minShapes=latent_sample:1x4x64x64 --optShapes=latent_sample:4x4x64x64 --maxShapes=latent_sample:8x4x64x64 --fp16
# Place the models in the model repository
mkdir model_repository/vae/1
mkdir model_repository/text_encoder/1
mv vae.plan model_repository/vae/1/model.plan
mv encoder.onnx model_repository/text_encoder/1/model.onnx
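After these copies, the model repository should look roughly like the sketch below, assuming the pipeline model's model.py and the config.pbtxt files ship with the cloned repository:
model_repository/
├── pipeline/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── text_encoder/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── vae/
    ├── 1/
    │   └── model.plan
    └── config.pbtxt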
Step 3: Launch the Server#
From the server container, launch the Triton Inference Server.
tritonserver --model-repository=/models
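Once the server reports that the models are ready, you can confirm it is reachable from another terminal. This is a quick check assuming the default HTTP port 8000 mapped in Step 1; the model name pipeline is an assumption, so use the name shown in the server log:
curl -v localhost:8000/v2/health/ready
curl -v localhost:8000/v2/models/pipeline/ready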
Step 4: Run the client#
Use the client container and run the client.
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:yy.mm-py3-sdk bash
# Client with no GUI
python3 client.py
# Client with GUI
pip install gradio packaging
python3 gui/client.py --triton_url="localhost:8001"
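For reference, a minimal gRPC client along the lines of client.py might look like the sketch below; the model name, input/output tensor names, and data types are assumptions, so check the model configuration and client.py for the real ones.
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the server's gRPC endpoint (port 8001 mapped in Step 1).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Hypothetical input/output names; the actual names are defined in the
# pipeline model's config.pbtxt.
prompt = np.array(["a photo of an astronaut riding a horse on mars"], dtype=object)
text_input = grpcclient.InferInput("prompt", [1], "BYTES")
text_input.set_data_from_numpy(prompt)

result = client.infer(model_name="pipeline", inputs=[text_input])
image = result.as_numpy("generated_image")
print(image.shape)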
Note: The first inference query may take more time than successive queries.