Synthetic Data Generation

At the end of this lab, you will fully understand how to (1) construct, (2) train, and (3) evaluate a GPT model on tabular time series data using NeMo, which incorporates the Megatron framework. Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. It provides efficient model-parallel (tensor, sequence, and pipeline) and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.

Your LaunchPad lab comes preloaded with the NVIDIA AI Enterprise components.

Applications used in this lab:

  • RAPIDS

  • PyTorch

  • Triton Inference Server

What takes place in the lab:

  • Examine the dataset

  • Tokenize and preprocess the tabular data using RAPIDS

  • Train a GPT model

  • Visualize training using TensorBoard

  • Generate synthetic data by sending inference requests to the model

  • Evaluate the quality of the generated data by comparing it to real data

Architecting models on tabular data gathered over time can be challenging because a user’s actions and behaviors change over time. Time series models constantly need more data to improve the accuracy of their predictions. Several methods are used today to increase the time-dependent diversity a model is exposed to, such as bagging, Monte Carlo simulation, oversampling, and K-folds, as well as deep learning models such as LSTMs, VAEs, and GANs. All of these methods, either independently or as an ensemble, have been used to generate time series data with varying results.

Preprocessing for AI pipelines involves data ingestion, filtration, and general reformatting. With the NVIDIA AI Enterprise RAPIDS ecosystem, each piece of the workflow is accelerated on GPUs within the NVIDIA-Certified System.
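As a minimal sketch of what this GPU-accelerated preprocessing can look like, the snippet below uses cuDF from RAPIDS to load a transaction table, filter and fill missing values, and sort each user's history chronologically. The file name and column names are illustrative assumptions, not the lab's actual dataset schema.

```python
import cudf

# Load the raw transaction table onto the GPU (file and column names are illustrative).
df = cudf.read_csv("card_transactions.csv")

# Basic filtration and reformatting: drop rows missing a user id,
# fill missing merchant fields, and cast the amount to float32.
df = df.dropna(subset=["user_id"])
df["merchant_name"] = df["merchant_name"].fillna("UNKNOWN")
df["amount"] = df["amount"].astype("float32")

# Sort by user and timestamp so each user's history is contiguous,
# which the later user-level tokenization step relies on.
df["timestamp"] = cudf.to_datetime(df["timestamp"])
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)
```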

Tokenization is the process of breaking data down into standard units that a model can understand. The authors of the original tabular Transformer work constructed a tabular tokenizer designed to generate time series data conditioned on the table’s structural information. However, because that tokenizer uses a single token for each column, it can cause either accuracy loss if the number of tokens allotted to a column is too small, or weak generalization if the number of tokens is too large. Within this lab, we improve on this by using multiple tokens to encode each column.
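To illustrate the idea of using multiple tokens per column rather than one, the sketch below quantizes a numeric column into a coarse token and a fine token, so a small vocabulary can still cover a wide value range with good resolution. This is a simplified illustration of the concept under assumed ranges and bin counts, not the lab's actual tokenizer implementation.

```python
import numpy as np

def encode_amount(value, lo=0.0, hi=10_000.0, n_bins=100):
    """Encode a numeric value with two tokens: a coarse bin and a fine bin.

    With a single token, the column would need n_bins**2 ids for the same
    resolution; with two tokens it needs at most 2 * n_bins ids.
    """
    # Clip to the modeled range and scale to [0, 1).
    x = np.clip((value - lo) / (hi - lo), 0.0, 1.0 - 1e-9)
    coarse = int(x * n_bins)                    # which coarse bin
    fine = int((x * n_bins - coarse) * n_bins)  # position inside that bin
    return coarse, fine

def decode_amount(coarse, fine, lo=0.0, hi=10_000.0, n_bins=100):
    """Invert encode_amount to the center of the (coarse, fine) cell."""
    x = (coarse + (fine + 0.5) / n_bins) / n_bins
    return lo + x * (hi - lo)

tokens = encode_amount(1234.56)   # (12, 34)
approx = decode_amount(*tokens)   # close to 1234.56, within one fine bin
```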

Developing time series models is difficult because the observed data follows only one realized path. To make models more robust, traditional methods such as bagging, Monte Carlo simulation, boosting, bootstrapping, stacking, oversampling, and K-folds have been employed to increase the variety of paths a model is exposed to. However, oversampling does not enhance the variety of the data, and K-folds, while increasing path variety, loses important user-level time-dependent information. Novel deep learning architectures such as LSTMs, VAEs, and GANs have also been developed to generate time series data, with mixed results. While state of the art, these deep learning approaches are challenging to implement in practice: they suffer from (1) mode collapse, where all the generated data looks the same; (2) long architecture-discovery times, iterating on both the model and the loss function to obtain the desired output; (3) a consequent need for hand tuning and expertise; and (4) an inability to retain long-term memory of the patterns in the data.

Within this lab, we will take advantage of the long attention span of transformer models to model time series accurately. We will use the NVIDIA Megatron-LM framework in conjunction with a special tabular tokenizer that uses the inherent structure of a table to generate user-level time series data.

Let’s look at an analogy to better understand what user-level time series data means. Imagine generating payment data for yourself, your parents, and a friend. Each person has a different buying profile, and those profiles change over time. Methods like K-folds therefore don’t make sense, because a younger version of yourself is less likely to purchase an expensive bottle of wine than your present, older self. Generating user-level time series data with other approaches would require creating multimodal features conditioned on the user (to account for age, location, etc.) and on the merchant (merchant category and the prices of the items the merchant sells). Instead, we split the transactions by user and treat each user’s transactions as one document in the corpus used to train the GPT model.
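A minimal sketch of this user-level split is shown below, assuming a DataFrame with illustrative column names: each user's chronologically ordered transactions are serialized into one "document", and the documents together form the GPT training corpus. The serialization format here is purely illustrative; the lab's tabular tokenizer emits per-column tokens rather than raw text.

```python
import pandas as pd

# Illustrative schema: one row per transaction, already sorted by time.
df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2],
    "merchant": ["grocer", "cafe", "airline", "hotel", "cafe"],
    "amount":   [54.20, 4.75, 310.00, 189.99, 6.10],
})

def row_to_text(row):
    # Serialize one transaction as a short text record.
    return f"{row.merchant} {row.amount:.2f}"

# One document per user: that user's transactions in order.
documents = (
    df.groupby("user_id", sort=False)
      .apply(lambda g: " | ".join(row_to_text(r) for r in g.itertuples()))
      .tolist()
)
# documents == ["grocer 54.20 | cafe 4.75",
#               "airline 310.00 | hotel 189.99 | cafe 6.10"]
```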

The GPT-based Transformer model built with NeMo and the Megatron framework is a straightforward way to generate realistic user-level time series data: it requires only supplying the column data types to preprocess the data. Megatron helps scale training to a GPU cluster of any size in order to model large quantities of data.
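The exact configuration schema depends on the NeMo version you are running, but conceptually the only tabular-specific input is a mapping from each column to its data type. The hypothetical specification below illustrates the shape of that input; the key names are assumptions, not the real config fields.

```python
# Hypothetical column-type specification for the tabular tokenizer.
# The real NeMo/Megatron config keys may differ; this only illustrates
# that column names and their data types are what you must supply.
column_types = {
    "user_id":   "category",
    "timestamp": "datetime",
    "merchant":  "category",
    "mcc":       "category",   # merchant category code
    "amount":    "float",
}
```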
