Technical Brief

Recommender systems grow and scale customer engagement, improving customer retention and boosting revenue. They are among the most important machine learning pipelines in use today: as the engine of the internet, they drive much of what you do online, from the selection of the web page that you’re reading now to more obvious examples like online shopping. They play a critical role in driving user engagement on online platforms by predicting the next good or service a user will want from an exponentially growing number of available options. On some of the largest commercial platforms, recommendations account for as much as 30% of the revenue.

This reference AI workflow for session-based next-item prediction shows how to use NVIDIA Merlin, an end-to-end framework for building high-performing recommender systems at scale. Session-based recommendation is a next-generation AI method that predicts the next action: it infers preferences from contextual user interactions for first-time, early, or anonymous online users. One of the biggest challenges of building a recommender is the lack of historical interaction data. Session-based recommender systems require little or no historical user data. Because recommendations are based on very recent user interactions from the current session, it is much easier to provide accurate predictions, solve user cold-start, comply with privacy restrictions, and address real-time trends.

Building a recommendation system that suggests a next “item” for an end user (for example, suggesting a next workout on Peloton or the next advertisement to show on Snapchat) is much more complex than training and deploying a single model. Full recommendation systems require a variety of components. In this reference workflow, we implement the pieces needed to build a session-based recommender system that serves users appropriate items.

This Next Item Prediction NVIDIA AI workflow contains:

  • NVIDIA Merlin for preprocessing and training

  • NVIDIA Merlin and Triton Inference Server for inference

  • MLflow Registry for model tracking

  • Prometheus and Grafana for metric collection and reporting


The components and instructions used in the workflow are intended as examples for integration and may not be sufficiently production-ready on their own. The workflow should be customized and integrated into your own infrastructure, using the workflow as a reference.

Using the above assets, this NVIDIA AI Workflow provides a reference for you to get started and build your own AI solution with minimal preparation. It includes enterprise-ready implementation best practices, ranging from authentication and monitoring to reporting and load balancing, helping you achieve the desired AI outcome more quickly while still leaving room for you to deviate.

NVIDIA AI Workflows are designed as microservices, which means they can be deployed on Kubernetes alone or with other microservices to create a production-ready application that scales seamlessly in your enterprise environment.

The following cloud-native Kubernetes services are used with this workflow:

  • NVIDIA Merlin

  • MLflow

  • Prometheus

  • Grafana

  • MinIO for S3-compatible object storage

The workflow components are packaged together into a deployable solution described in the diagram below:


These components are used to build and deploy training and inference pipelines, integrated together with the additional components as indicated in the diagram below:


More information about the components used can be found in the Next Item Prediction and the NVIDIA Cloud Native Service Add-on Pack Deployment Guide.

The NVIDIA Next Item Prediction workflow uses a variety of NVIDIA AI components to both train and deploy the recommendation system:

The following sections describe these NVIDIA AI components further.

Training the Recommendation System

The next-item prediction training pipeline includes both preprocessing with Merlin NVTabular and model training with Merlin Transformers4Rec. Both the preprocessing workflow and the trained model are stored within a model repository using MLflow, which will be used later for inference.


Data Preprocessing

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate the terabyte-scale datasets used to train deep learning-based recommender systems. In this workflow, the preprocessing and feature engineering (ETL) pipeline is executed with NVTabular.
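To make the preprocessing step more concrete, the sketch below shows the kind of transformation a session-based pipeline typically needs: grouping raw interaction rows into time-ordered item lists per session and mapping item identifiers to integer ids. It is written in plain Python for illustration only and does not use the NVTabular API; the function names, the `min_len` parameter, the sample rows, and the choice to reserve id 0 are all assumptions.

```python
from collections import defaultdict


def build_sessions(interactions, min_len=2):
    """Group (session_id, item_id, timestamp) rows into time-ordered item lists.

    Sessions shorter than min_len are dropped, since they give the model
    no "previous item -> next item" signal to learn from.
    """
    by_session = defaultdict(list)
    for session_id, item_id, timestamp in interactions:
        by_session[session_id].append((timestamp, item_id))

    sessions = {}
    for session_id, events in by_session.items():
        events.sort()  # order each session's events by timestamp
        items = [item_id for _, item_id in events]
        if len(items) >= min_len:
            sessions[session_id] = items
    return sessions


def encode_items(sessions):
    """Map item ids to contiguous integers starting at 1 (0 is left free, e.g. for padding)."""
    vocab = {}
    encoded = {}
    for session_id in sorted(sessions):
        encoded[session_id] = [
            vocab.setdefault(item, len(vocab) + 1) for item in sessions[session_id]
        ]
    return encoded, vocab


# Hypothetical interaction log: (session_id, item_id, timestamp)
rows = [
    ("s1", "shoes", 3),
    ("s1", "socks", 1),
    ("s2", "hat", 5),
    ("s1", "laces", 2),
]

sessions = build_sessions(rows)
encoded, vocab = encode_items(sessions)
print(sessions)  # session "s2" is dropped because it has a single interaction
print(encoded)
```

A GPU-accelerated library applies the same idea to much larger datasets; the point here is only the shape of the data that the model training step consumes.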


Merlin Transformers4Rec is designed to make state-of-the-art session-based recommendations available for recommender systems. It leverages Hugging Face’s Transformers NLP library to make it easy to use cutting-edge implementations of the latest NLP Transformer architectures in your recommendation systems. In this workflow, a session-based recommendation model is trained with a Transformer architecture (XLNet) using masked language modeling (MLM) through Merlin’s Transformers4Rec library.
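The idea behind masked language modeling applied to sessions can be sketched as follows: some item positions in a session are replaced with a mask token, and the model is trained to recover the original items at those positions. This is a minimal plain-Python illustration, not the Transformers4Rec implementation; the `MASK` id, the `mask_prob` default, and the function name are assumptions.

```python
import random

MASK = 0  # hypothetical id reserved for the mask token


def mask_session(item_ids, mask_prob=0.2, rng=None):
    """Return (inputs, labels) for one session under a simple MLM scheme.

    inputs: the session with some positions replaced by MASK.
    labels: the original item at each masked position, None elsewhere.
    """
    if not item_ids:
        raise ValueError("session must contain at least one item")
    rng = rng or random.Random()

    positions = [i for i in range(len(item_ids)) if rng.random() < mask_prob]
    if not positions:
        # Guarantee at least one training target per session.
        positions = [rng.randrange(len(item_ids))]

    inputs = list(item_ids)
    labels = [None] * len(item_ids)
    for i in positions:
        labels[i] = item_ids[i]
        inputs[i] = MASK
    return inputs, labels


session = [11, 42, 7, 19, 23]
inputs, labels = mask_session(session, mask_prob=0.2, rng=random.Random(0))
print(inputs)
print(labels)
```

The model only receives `inputs`; the loss is computed at the positions where `labels` is not `None`.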

Deploying the Recommendation System


Data Processing and Inference

Merlin simplifies the deployment of recommender systems to the Triton Inference Server. Both the preprocessing workflow and the trained model are deployed to NVIDIA Triton Inference Server as an ensemble, so features can be transformed in real time during inference requests. This simplifies deployment, and Triton can also be configured to balance latency against GPU utilization in order to maximize request throughput.
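As a rough illustration of what a client sends to such an ensemble, the sketch below builds a request body in the JSON format used by Triton's HTTP/REST inference API (`POST /v2/models/<model>/infer`). Only the payload is constructed; nothing is sent. The model name, the input tensor name `item_id-list`, the datatype, and the base URL are placeholders, since the actual names depend on the exported preprocessing workflow and model.

```python
import json


def build_infer_request(model_name, item_ids, base_url="http://localhost:8000"):
    """Build the URL and JSON body for one inference request.

    item_ids is the list of items seen so far in the current session.
    """
    url = f"{base_url}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": "item_id-list",        # hypothetical input tensor name
                "shape": [1, len(item_ids)],   # one session, variable length
                "datatype": "INT64",           # assumed datatype
                "data": item_ids,              # flattened, row-major values
            }
        ]
    }
    return url, json.dumps(body)


url, payload = build_infer_request("example_ensemble", [11, 42, 7])
print(url)
print(payload)
```

Because the preprocessing workflow is part of the ensemble, the client sends raw session features and the server applies the same transformations that were used during training.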


The Transformers4Rec library comes with utility functions to use not only during training, but also in inference pipelines (for example, viewing the top-k recommendations for easy parsing of results).
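To show what "top-k recommendations" means in practice, here is a minimal sketch that selects the k highest-scoring items from a model's output scores. It is not the Transformers4Rec utility itself; the function name and the sample scores are made up for illustration.

```python
import heapq


def top_k(scores, k=3):
    """Return the k (item, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


# Hypothetical per-item scores for one session
scores = {"socks": 0.10, "laces": 0.70, "hat": 0.20, "belt": 0.05}
recommendations = top_k(scores, k=2)
print(recommendations)
```

Using `heapq.nlargest` avoids sorting the full catalog when only a handful of recommendations are needed.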

Model Storage

The MLflow open-source platform is a key element of both the training and inference pipelines. This MLOps platform enables organizations to easily manage their end-to-end machine learning lifecycle. MLflow uses a centralized model store and has its own set of APIs and user interface for manageability. In this workflow, the trained model is managed with MLflow.


Prometheus is an open-source monitoring and alerting solution. In this workflow, it stores pipeline performance metrics from Triton, which enables system administrators to understand the health and throughput of the system.

While the metrics are available in plain text, Grafana is also used for visualization of the metrics via a dashboard.
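The plain-text form mentioned above is the Prometheus text exposition format, which is simple enough to parse by hand. The sketch below parses a small sample and computes a request success ratio. The sample values, the model label, and the HELP text are invented for illustration; `nv_inference_request_success` and `nv_inference_request_failure` are counters that Triton exposes, but check your deployment for the exact set of metrics. Lines carrying the optional trailing timestamp are not handled.

```python
import re

# Illustrative sample only; the values are not real measurements.
SAMPLE = """\
# HELP nv_inference_request_success Successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="example_model",version="1"} 120
nv_inference_request_failure{model="example_model",version="1"} 3
"""

LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse metric lines into (name, labels, value) tuples, skipping comments."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # unsupported line shape (e.g. with a timestamp)
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        parsed.append((name, labels, float(value)))
    return parsed


metrics = parse_metrics(SAMPLE)
totals = {name: value for name, _, value in metrics}
success = totals["nv_inference_request_success"]
failure = totals["nv_inference_request_failure"]
success_ratio = success / (success + failure)
print(metrics)
print(f"success ratio: {success_ratio:.3f}")
```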

Examples of the available metrics are shown below. Depending on the usage metrics, the Merlin pods can be scaled manually or automatically.
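One common rule for automatic scaling is the one documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. Whether this workflow uses that autoscaler is an assumption; the sketch below only illustrates the arithmetic, with hypothetical replica bounds and metric values, and omits the autoscaler's tolerance and stabilization behavior.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply ceil(current_replicas * current_metric / target_metric), then clamp."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Two pods at 90% GPU utilization with a 60% target: scale out.
print(desired_replicas(2, current_metric=90, target_metric=60))
# Four pods at 10% utilization with a 60% target: scale in.
print(desired_replicas(4, current_metric=10, target_metric=60))
```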



The Next Item Prediction workflow shows how to integrate with a cloud-native, application-level load balancer (Envoy) and an identity provider (Keycloak) for JSON Web Token (JWT) authentication, as an example of how you might deploy and interact with Triton securely in your own environment. For more information about the authentication portion of the workflow, refer to the Authentication section in the Appendix.
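From the client's point of view, JWT authentication usually means attaching the token issued by the identity provider as a bearer token on each request. The sketch below builds such a request with the Python standard library and decodes the token's claims for inspection. Nothing is sent over the network, the token is a locally built dummy with a fake signature, and the URL and claim values are placeholders. Decoding claims this way does not verify the signature; verification is the job of the gateway or server.

```python
import base64
import json
from urllib.request import Request


def b64url_encode(data):
    """Base64url-encode bytes without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment):
    """Decode a base64url segment, restoring the stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def jwt_claims(token):
    """Return the claims of a JWT WITHOUT verifying its signature (inspection only)."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


def authorized_request(url, token, body):
    """Build a POST request that carries the token as a bearer credential."""
    return Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Dummy token for illustration; a real one is issued by the identity provider.
token = ".".join([
    b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode()),
    b64url_encode(json.dumps({"sub": "demo-user", "exp": 1700000000}).encode()),
    "fake-signature",
])

request = authorized_request(
    "https://triton.example.com/v2/models/example_model/infer",  # placeholder URL
    token,
    body=b"{}",
)
print(request.get_header("Authorization")[:20] + "...")
print(jwt_claims(token))
```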

© Copyright 2022-2023, NVIDIA. Last updated on May 23, 2023.