NLP Server#
The NLP server provides a single unified interface for integrating different NLP models into the dialog pipeline. It utilizes production-tested model servers such as the NVIDIA Triton Inference Server, while also allowing you to easily integrate experimental custom models into the pipeline.
API Server#
The API server exposes separate endpoints for different NLP tasks, each with a predefined input and output schema. It is implemented using FastAPI as the async web framework, with Gunicorn managing multiple Uvicorn workers. You can check the list of all available task endpoints in the NLP Server API documentation.
The following tasks are supported by the NLP server:
Joint Intent & Slot Classification
Named Entity Recognition (NER)
Extractive QA
Language Translation (NMT)
Text to Speech (TTS)
More details on each task endpoint are available in the NLP Server API documentation.
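As a rough illustration, the snippet below queries a locally running NLP server for its registered models. The base URL and the use of an HTTP GET request are assumptions made for this sketch; the authoritative endpoints and request/response schemas are in the NLP Server API documentation.

```python
# Illustrative only: the base URL and HTTP method are assumptions; consult the
# NLP Server API documentation for the actual endpoints and schemas.
import requests

BASE_URL = "http://localhost:9000"  # assumed address of a locally running NLP server

# Query the models currently registered with the NLP server.
response = requests.get(f"{BASE_URL}/model/list_models")
response.raise_for_status()
print(response.json())
```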
Model Registry#
The model registry maintains a list of all models currently available in the NLP server for inference. It also keeps all metadata required for managing the model lifecycle. A few required fields are listed below:
model_name
model_version
api endpoint
model_server_url
type
metadata such as maximum_batch_size, language, and so on
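For illustration, a single registry entry with the fields above might look like the following Python dictionary. The values (and exact field spellings) are assumptions for this sketch, not the server's actual schema.

```python
# Hypothetical registry entry built from the fields listed above; values are
# illustrative and the real schema is defined by the NLP server.
registry_entry = {
    "model_name": "test_model",
    "model_version": "1",
    "api_endpoint": "/nlp/model/text_classification",
    "model_server_url": "localhost:8001",
    "type": "triton",
    "parameters": {
        "maximum_batch_size": 8,
        "language": "en",
    },
}
```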
Model Servers#
Model servers provide an interface to externally deployed inference servers such as the NVIDIA Triton Inference Server and NVIDIA Riva Skills. Each model server class has prebaked helper functions for monitoring model health, listing available models, getting model configs, and so on. For the model servers supported by the NLP server, generic model inference clients for different models are already included.
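The sketch below shows the kind of calls such helper functions typically wrap, using the standard Triton gRPC client directly; the server address and model name are assumptions for illustration.

```python
# A minimal sketch of what the prebaked helpers wrap, using the Triton gRPC client;
# the URL and model name are assumptions for illustration.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

print(client.is_server_live())                # model server health
print(client.get_model_repository_index())    # list available models
print(client.get_model_config("test_model"))  # fetch a model's configuration
```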
Config Manager#
The Config Manager manages the different model servers and registers the available models in the model registry using the model_config.yaml file and the custom model clients directory. If a model server is specified in model_config.yaml but is not ready at server startup, the Config Manager skips registration of that model server. Check the Model Configurations section for more information on supported model servers and the schema for model_config.yaml.
Model Manager#
The NLP server task API endpoints verify the availability of the model in the model registry and forward requests to the Model Manager for execution. The Model Manager identifies the model-specific inference client using the model registry, executes the inference code, and returns a response.
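A minimal, self-contained sketch of this flow is shown below; the names and data structures are illustrative and are not the NLP server's actual internals.

```python
# Illustrative dispatch flow: a task endpoint checks the registry, then the
# Model Manager resolves and runs the model-specific inference client.
from typing import Callable, Dict, Tuple

# (endpoint, model_name) -> inference client; stands in for the model registry
MODEL_REGISTRY: Dict[Tuple[str, str], Callable[[dict], dict]] = {}

def register_model(endpoint: str, model_name: str, infer_func: Callable[[dict], dict]) -> None:
    MODEL_REGISTRY[(endpoint, model_name)] = infer_func

def handle_task_request(endpoint: str, model_name: str, payload: dict) -> dict:
    client = MODEL_REGISTRY.get((endpoint, model_name))
    if client is None:
        # The real server rejects requests for models that are not registered.
        raise ValueError(f"Model '{model_name}' is not registered for {endpoint}")
    return client(payload)  # execute the inference client and return its response
```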
Custom Model Client#
The NLP server allows you to easily deploy any Hugging Face, NeMo, or other custom model by creating a model inference client with the same input and output specifications as the NLP server task API endpoint. It mainly relies on the @model_api and @pytriton decorator functions.
@model_api Decorator#
You can go through the tutorial Custom Model Integration using the @model_api Decorator to understand how to utilize the @model_api decorator. The @model_api decorator function accepts the following keyword arguments:
| Argument | Supported Types | Description |
|---|---|---|
| `endpoint` | `str` | NLP server API endpoint path. [Mandatory] |
| `model_name` | `str`, `List[str]` | List of model names which are supported by the inference client. [Mandatory] |
| `model_type` | `str` | Name of the model server, for example `triton`. |
| `model_version` | `str` | Version tag for the model. [Default value: blank string] |
| `parameters` | `dict` | Dictionary of parameters and metadata which may get used by inference clients. [Optional] |
Another feature of the @model_api decorator is that it gives you easy access to all metadata related to the model server and the model itself inside the inference function. For example, to access the Triton model server gRPC URL in the inference function:
```python
from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/text_classification", model_name="test_model", model_type="triton")
def example_infer_func(input_request):
    """
    example_infer_func.model_info returns all metadata for the given model_name in the
    input request, in the same format as returned by the /model/list_models API endpoint.
    The NLP server identifies the Triton URL during the model registration process via the
    model_name and model_type provided in the @model_api decorator.
    """
    url = example_infer_func.model_info.url  # gRPC endpoint of the Triton server hosting the model
    parameters = example_infer_func.model_info.parameters  # Any parameters passed via the @model_api decorator
    ...
```
@pytriton Decorator#
PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments. The library allows serving machine learning models directly from Python through NVIDIA's Triton Inference Server. You can check the tutorial Custom Model Integration using the @pytriton Decorator to understand how to utilize the PyTriton library in the NLP server. You will need to embed your code in a @pytriton decorated function similar to the following format; the NLP server will create a single Triton context and make a triton.serve call after loading all @pytriton decorated functions.
```python
from pytriton.triton import Triton
from nlp_server.decorators import pytriton

@pytriton()
def custom_pytriton_model(triton: Triton):
    # Embed your code in this function as required for hosting the model with the PyTriton library
    ...
```
You will also need to write a @model_api decorated inference function to integrate the PyTriton hosted model.
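A minimal sketch of such a client is shown below, assuming the PyTriton-hosted model is registered as custom_model and that model_type="triton" is the appropriate value; the endpoint path and request handling are illustrative assumptions.

```python
# Hedged sketch: endpoint, model name, and model_type are assumptions; the body
# would call the PyTriton-hosted model via the Triton URL exposed in model_info.
from nlp_server.decorators import model_api

@model_api(endpoint="/nlp/model/text_classification",
           model_name="custom_model",
           model_type="triton")
def custom_model_infer(input_request):
    url = custom_model_infer.model_info.url  # Triton gRPC URL for the PyTriton-hosted model
    # Issue the inference request against `url` (for example with tritonclient) and
    # map the result to the task endpoint's output schema.
    ...
```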