Sentiment Analysis Overview

Data Processing, Tokenization, & Sentiment Analysis (Latest Version)

At the end of this lab, you will fully understand how to train and deploy an end-to-end sentiment analysis model using NVIDIA AI Enterprise frameworks and applications.

Applications used in this lab:

  • RAPIDS

  • PyTorch/Hugging Face

  • Triton Inference Server

What takes place in the lab:

  • Preprocessing with Pandas and RAPIDS

  • Tokenization with RAPIDS subword tokenizer

  • Training a Hugging Face PyTorch sentiment analysis model

  • Deploying an ensemble model on Triton Inference Server

  • Writing a client application to send inference requests to the server

Sentiment analysis is the mining of text to extract subjective information, such as opinions, emotions, or attitudes about a brand or a product. Expressions can be classified as positive, neutral, or negative.

Many companies gauge product feedback through customer reviews. Traditionally, sentiment analysis is performed by splitting a sentence into words and comparing them against a dictionary of “positive” and “negative” words. The positive and negative matches are counted; if the positive count exceeds the negative count, the whole sentence is considered to have a positive sentiment, and vice versa. This process can pose numerous problems. Most notably, the context of the sentence is lost. For example, “This has not been the greatest experience” expresses a negative sentiment, but a word-counting model would see the term “greatest” and classify the entire sentence as “positive”.
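The dictionary-counting approach described above can be sketched in a few lines of Python. This is a minimal illustration only: the word lists and the function name lexicon_sentiment are hypothetical and are not part of the lab code.

```python
# Toy word lists for illustration; hypothetical, not a real sentiment lexicon.
POSITIVE_WORDS = {"great", "greatest", "good", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "worst", "poor", "hate", "terrible"}


def lexicon_sentiment(text: str) -> str:
    """Classify text by counting positive and negative dictionary words."""
    # Split on whitespace, strip surrounding punctuation, and lowercase.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


# The negation "not" is ignored, so the sentence is misclassified.
print(lexicon_sentiment("This has not been the greatest experience"))  # positive
```

Because each word is scored in isolation, the “not” in front of “greatest” has no effect on the result, which is exactly the loss of context described above.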

With recent advances in Deep Learning for Natural Language Processing (NLP) with Transformer models like BERT, the context of a text is taken into account to better classify the sentences for sentiment analysis and provide companies an AI-enhanced view of product feedback.

Within this guide, we will use a sample Amazon customer review dataset. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). We will use the 5-core version of the data available here. The 5-core data is a subset (75.26 million reviews) in which all users and items have at least 5 reviews.

We will train a model with the customer reviews as inputs and the ratings as outputs. If the rating is low, the review is treated as having a negative sentiment; otherwise, the sentiment is positive.
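As a rough sketch of how ratings could be turned into training labels, the snippet below maps each rating to a binary sentiment label. The cutoff for a “low” rating is not stated here, so the threshold of 3 stars is an assumption, and the field names and sample records are hypothetical.

```python
# Assumption: ratings are 1-5 stars and anything below 3 counts as "low".
LOW_RATING_THRESHOLD = 3


def rating_to_label(rating: float) -> int:
    """Return 0 (negative) for low ratings and 1 (positive) otherwise."""
    return 0 if rating < LOW_RATING_THRESHOLD else 1


# Hypothetical sample records; the real dataset has many more fields.
reviews = [
    {"review_text": "Stopped working after a week.", "rating": 1.0},
    {"review_text": "Does the job.", "rating": 3.0},
    {"review_text": "Exactly what I needed.", "rating": 5.0},
]

# Pair each review text (model input) with its label (model output).
examples = [(r["review_text"], rating_to_label(r["rating"])) for r in reviews]
print(examples)
```

A different threshold, or a separate neutral class for middling ratings, would be an equally reasonable choice depending on the application.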

Preprocessing for NLP pipelines involves data ingestion, filtering, and reformatting. With the NVIDIA AI Enterprise RAPIDS ecosystem, each piece of the workflow is accelerated on GPUs within the NVIDIA-Certified System.

A deep learning model can’t consume text/sentences directly. The text first needs to be converted into a format that the model can understand. Tokenization is the process of breaking down the text into standard units that a model can understand. Traditional tokenization algorithms would split a sentence by a delimiter and assign each word a numerical value.

Example: “A quick fox jumps over a lazy dog” can be split into [“A”, “quick”, “fox”, “jumps”, “over”, “a”, “lazy”, “dog”] and assigned the numerical values [1, 2, 3, 4, 5, 6, 7, 8]. This vector can then be sent to a model as an input. The numeric values can be assigned by keeping a dictionary of all the words in the English language and giving each of them an ID. In NLP jargon, this dictionary is called a vocabulary.
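The example above can be reproduced with a minimal sketch of whitespace tokenization. The helper names build_vocabulary and encode are hypothetical, and the vocabulary is built from the single example sentence rather than from the whole English language.

```python
def build_vocabulary(sentences):
    """Assign each distinct word an ID, starting at 1, in order of appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, unknown_id=0):
    """Convert a sentence into a list of IDs; unseen words map to unknown_id."""
    return [vocab.get(word, unknown_id) for word in sentence.split()]


sentence = "A quick fox jumps over a lazy dog"
vocab = build_vocabulary([sentence])

print(encode(sentence, vocab))       # [1, 2, 3, 4, 5, 6, 7, 8]
print(encode("A quick cat", vocab))  # [1, 2, 0]
```

Note that “A” and “a” receive different IDs because the split is case-sensitive, and that “cat” falls back to the unknown ID because it was never added to the vocabulary.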

Tokenizing words in this way (splitting by spaces) can pose the following issues:

  • A large vocabulary is needed as you will need to store all words in the dictionary.

  • What exactly constitutes a word is often ambiguous, as with combined words like “check-in”.

  • Certain languages don’t segment well by spaces.

A potential solution is to use subword tokenization, which breaks unknown words down into “subword units” so that models can make intelligent decisions about words that aren’t recognized. For example, “check-in” is further split into “check” and “in”, and “cycling” into “cycle” and “ing”, thereby reducing the number of words in the vocabulary.
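To make the idea concrete, here is a toy sketch of one common subword scheme: greedy longest-match splitting in the style of WordPiece, where a “##” prefix marks a piece that continues a word. The vocabulary is hypothetical and tiny, and this is not the RAPIDS tokenizer implementation. Because the sketch matches literal characters, “cycling” comes out as “cycl” plus “##ing” rather than “cycle” plus “ing”.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"check", "in", "cycl", "walk", "##ing", "##ed", "##s"}
UNKNOWN = "[UNK]"


def split_word(word, vocab=VOCAB):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word is treated as unknown.
            return [UNKNOWN]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, then split each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [piece for word in words for piece in split_word(word)]


print(tokenize("check-in"))  # ['check', 'in']
print(tokenize("cycling"))   # ['cycl', '##ing']
print(tokenize("walked"))    # ['walk', '##ed']
```

With only seven entries, this vocabulary covers “walk”, “walks”, “walked”, and “walking” without storing each form separately, which is the vocabulary saving described above.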

The AI deployment pipeline includes the preprocessing (tokenization) step described above, which runs before the input is sent to the deep learning model for inference. Traditionally, this step ran on CPUs while GPUs handled inference. As GPUs got faster, the preprocessing step became the bottleneck. RAPIDS can now perform tokenization on the GPU itself, thereby removing the bottleneck. The current RAPIDS tokenizer is 270 times faster than current CPU-based implementations.

Let’s get started!

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.