NLP Server

The NLP server provides a single unified interface for integrating different NLP models into the dialog pipeline. It leverages production-tested model servers such as the NVIDIA Triton Inference Server, while also making it easy to integrate experimental custom models into the pipeline.

NLP Server Architecture

API Server

The API server has separate endpoints for different NLP tasks, each with a predefined input and output schema. It is implemented with FastAPI as the asynchronous web framework, along with Gunicorn for managing multiple Uvicorn workers. You can check the list of all available task endpoints in the NLP Server API documentation.

The following tasks are supported by the NLP server:

  • Joint Intent & Slot Classification

  • Named Entity Recognition (NER)

  • Embedding Generation

  • Extractive QA

  • Language Translation (NMT)

  • Text to Speech (TTS)

More details on each task endpoint are available in the NLP Server API documentation; a minimal example request is sketched below.
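
For illustration, a task endpoint can be called like any other HTTP API. In the sketch below the host, port, and payload fields are assumptions, and the /nlp/model/text_classification path is just one example; refer to the NLP Server API documentation for the actual request and response schemas.

import requests

# Minimal sketch of calling a task endpoint; host, port, and payload fields are
# assumptions. See the NLP Server API documentation for the real schema.
NLP_SERVER_URL = "http://localhost:9000"  # assumed address of the NLP server

payload = {
    "query": "Book a table for two at 7 pm",  # hypothetical input field
    "model_name": "test_model",               # a model registered in the model registry
}

response = requests.post(f"{NLP_SERVER_URL}/nlp/model/text_classification", json=payload)
print(response.json())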

Model Registry

The model registry maintains a list of all models currently available in the NLP server for inference. It also keeps all the metadata required for managing the model lifecycle. A few required fields are listed below, with an illustrative entry sketched after the list:

  • model_name

  • model_version

  • api endpoint

  • model_server_url and type

  • metadata such as maximum_batch_size, language, and so on
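
As an illustration only, a single registry entry could carry these fields roughly as follows (shown here as a Python dictionary; all values are hypothetical and the exact schema is defined by the NLP server):

# Hypothetical model registry entry; field names mirror the list above and
# all values are illustrative only.
registry_entry = {
    "model_name": "test_model",
    "model_version": "1",
    "api_endpoint": "/nlp/model/text_classification",
    "model_server_url": "localhost:8001",  # e.g. a Triton gRPC endpoint
    "type": "triton",
    "parameters": {
        "maximum_batch_size": 8,
        "language": "en-US",
    },
}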

Model Servers

Model servers provide an interface to externally deployed inference servers such as the NVIDIA Triton Inference Server and NVIDIA Riva Skills. Each model server class has prebuilt helper functions for tasks such as monitoring model health, listing available models, and fetching model configurations. For the model servers supported by the NLP server, generic model inference clients for different models are already included.
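
As a rough sketch of what such helpers typically wrap in the Triton case, Triton's own gRPC client exposes equivalent calls; the URL and model name below are assumptions:

import tritonclient.grpc as grpcclient

# Sketch of the calls a Triton model server class would typically wrap;
# the URL and model name are assumptions for illustration.
client = grpcclient.InferenceServerClient(url="localhost:8001")

print(client.is_server_live())                # check model server health
print(client.get_model_repository_index())    # list available models
print(client.get_model_config("test_model"))  # fetch a model's configuration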

Config Manager

The Config Manager manages the different model servers and registers the available models in the model registry using the model_config.yaml file and the custom model clients directory. If a model server is specified in model_config.yaml but is not ready at server startup, the Config Manager skips registration of that model server. Check the Model Configurations section for more information on the supported model servers and the schema for model_config.yaml.

Model Manager

Each NLP server task API endpoint verifies the availability of the requested model in the model registry and forwards the request to the Model Manager for execution. The Model Manager identifies the model-specific inference client using the model registry, executes the inference code, and returns the response.

Custom Model Client

The NLP server allows you to easily deploy any Hugging Face, NeMo, or other custom model by creating a model inference client with the same input and output specifications as the corresponding NLP server task API endpoint. This mechanism mainly relies on the @model_api and @pytriton decorator functions.

@model_api Decorator

You can go through the tutorial Custom Model Integration using the @model_api Decorator to understand how to use the @model_api decorator. The @model_api decorator accepts the following keyword arguments:

  • endpoint (String): NLP server API endpoint path. [Mandatory]

  • model_name (String or List of Strings): Model name or list of model names supported by the inference client. [Mandatory]

  • model_type (String): Name of the model server. Available options are triton, riva, nemo_llm, openai_llm, and custom. [Default value: custom]

  • model_version (String): Version tag for the model. [Default value: blank string]

  • parameters (Dictionary): Dictionary of parameters and metadata that may be used by inference clients. [Optional]

Another feature of the @model_api decorator is easy access, inside the inference function, to all metadata related to the model servers and the model itself. For example, to access the Triton model server gRPC URL in the inference function:

from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/text_classification", model_name="test_model", model_type="triton")
def example_infer_func(input_request):
    """
    example_infer_func.model_info returns all metadata for the model_name given in the input request,
    in the same format as returned by the /model/list_models API endpoint. The NLP server identifies
    the Triton URL during the model registration process via the model_name and model_type provided
    in the @model_api decorator.
    """
    url = example_infer_func.model_info.url  # gRPC endpoint of the Triton server hosting the model
    parameters = example_infer_func.model_info.parameters  # Any parameters passed via the @model_api decorator
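
For instance, a Hugging Face model could be wrapped in a minimal custom inference client along the following lines. This is only a sketch: the Hugging Face model, the registered model_name, and the way input_request and the return value are structured are assumptions; the real input and output specifications must match the corresponding task endpoint in the NLP Server API documentation.

from transformers import pipeline
from nlp_server.decorators import model_api

# Hypothetical Hugging Face classifier; the model and request/response fields
# are assumptions. Match them to the actual task endpoint schema.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

@model_api(
    endpoint="/nlp/model/text_classification",
    model_name="hf_sentiment_model",  # hypothetical registered model name
    model_type="custom",
)
def hf_classifier_infer(input_request):
    # Assumed: input_request carries the input text in a "query" field.
    result = classifier(input_request.query)[0]
    return {"class_name": result["label"], "score": result["score"]}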

@pytriton Decorator

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. The library allows serving machine learning models directly from Python through NVIDIA's Triton Inference Server. You can check the tutorial Custom Model Integration using the @pytriton Decorator to understand how to use the PyTriton library in the NLP server. You will need to embed your code in a @pytriton decorated function similar to the following format; the NLP server creates a single Triton context and calls triton.serve after loading all @pytriton decorated functions.

from pytriton.triton import Triton
from nlp_server.decorators import pytriton

@pytriton()
def custom_pytriton_model(triton: Triton):
    # Embed your code in this function as required for hosting the model with the PyTriton library
    ...

You will also need to write an @model_api decorated inference function to integrate the PyTriton-hosted model.
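
As a rough sketch, the body of a @pytriton decorated function would typically bind an inference callable to the Triton context supplied by the NLP server. The model name, tensor names, and trivial inference logic below are assumptions for illustration:

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton
from nlp_server.decorators import pytriton

@batch
def _infer_fn(text):
    # Hypothetical inference logic: echo the input text back as the label.
    return {"label": text}

@pytriton()
def custom_pytriton_model(triton: Triton):
    # Bind the inference function; the NLP server creates the Triton context
    # and calls triton.serve after loading all @pytriton decorated functions.
    triton.bind(
        model_name="my_custom_model",  # hypothetical model name
        infer_func=_infer_fn,
        inputs=[Tensor(name="text", dtype=bytes, shape=(1,))],
        outputs=[Tensor(name="label", dtype=bytes, shape=(1,))],
        config=ModelConfig(max_batch_size=8),
    )

The matching @model_api decorated inference client can then reach this PyTriton-hosted model through the model server URL recorded in the model registry.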