Tabular Fine-Tuning#

Learn how NeMo Safe Synthesizer uses LLM-based tabular fine-tuning to generate private synthetic data that maintains statistical properties and business logic.

Overview#

Tabular Fine-Tuning is the core technology behind NeMo Safe Synthesizer’s synthetic data generation. It adapts large language models to understand and generate tabular data by converting structured data into text sequences, fine-tuning the model on these sequences, and then generating new structured data.

Tabular Fine-Tuning is an AI system that combines a large language model pre-trained specifically on tabular datasets with learned schema-based rules. It can train on datasets of various sizes (we recommend 1,000 or more records) and generate synthetic datasets with an unlimited number of records.

The model excels at matching the correlations (both within a single record and across multiple records) and distributions in its training data across multiple tabular modalities, including numeric, categorical, free text, and event-driven values.
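For intuition, the sketch below shows one simple way a structured record can be serialized into a text sequence for LLM fine-tuning. It is illustrative only; the encoding NeMo Safe Synthesizer uses internally may differ.

```python
# Illustrative only: one simple way to serialize a tabular record into a
# text sequence for LLM fine-tuning. The encoding used internally by
# NeMo Safe Synthesizer may differ.
import pandas as pd

df = pd.DataFrame(
    {
        "age": [34, 58],
        "state": ["CA", "NY"],
        "plan": ["premium", "basic"],
    }
)

def serialize_record(row: pd.Series) -> str:
    """Turn one record into a flat 'column is value' text sequence."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

for _, row in df.iterrows():
    print(serialize_record(row))
# age is 34, state is CA, plan is premium
# age is 58, state is NY, plan is basic
```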

Limitations and Biases#

  1. The default context length for the underlying model in Tabular Fine-Tuning can handle datasets with roughly 50 columns (fewer if modeling inter-row correlations using group_training_examples_by). Similarly, the default context length can handle event-driven data with sequences of up to roughly 20 rows. To go beyond that, increase rope_scaling_factor. Note that the exact threshold (beyond which the job will crash) depends on the number of tokens needed to encode each row, so shortening column names, abbreviating values, or reducing the number of columns can also help.

  2. Because the model is an LLM, mappings from the training data often persist in the synthetic output, but there is no guarantee. If you require mappings across columns to persist, we recommend pre-processing to concatenate the columns or post-processing to filter out rows where the mappings did not persist (see the sketch after this list).

  3. Pre-trained models such as the underlying model in Tabular Fine-Tuning may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
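The following sketch illustrates the two workarounds mentioned in item 2: concatenating mapped columns before training, and filtering generated rows where a known mapping did not persist. The column names (zip_code, city) and the pandas-based approach are hypothetical examples, not part of the Safe Synthesizer API.

```python
# Minimal sketch of the two workarounds for column mappings (hypothetical
# column names; plain pandas, not a Safe Synthesizer API).
import pandas as pd

real = pd.DataFrame(
    {"zip_code": ["94103", "10001"], "city": ["San Francisco", "New York"]}
)

# Pre-processing: concatenate the mapped columns so the model learns them
# as a single value.
real["zip_and_city"] = real["zip_code"] + " | " + real["city"]
train_df = real.drop(columns=["zip_code", "city"])

# Post-processing: keep only synthetic rows whose zip -> city mapping was
# observed in the real data.
valid_pairs = set(zip(real["zip_code"], real["city"]))
synthetic = pd.DataFrame(
    {"zip_code": ["94103", "10001"], "city": ["San Francisco", "Boston"]}
)
synthetic = synthetic[
    [pair in valid_pairs for pair in zip(synthetic["zip_code"], synthetic["city"])]
]
```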

Troubleshooting#

Tabular Fine-Tuning is recommended when:

  • You want to generate synthetic data and have a sample of at least 500 rows of real data

  • Your dataset has (or can be reduced to) relatively few columns (<30)

  • There are relatively few events per sequence (<10) if you have event-driven data
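If you are unsure whether your dataset fits these guidelines, a quick check like the following can help. The file name and the user_id grouping column are hypothetical placeholders.

```python
# Quick, illustrative sanity check against the rough guidelines above.
# "my_data.csv" and "user_id" are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("my_data.csv")

print(f"rows: {len(df)} (recommended: at least 500)")
print(f"columns: {df.shape[1]} (recommended: fewer than 30)")

# For event-driven data, check events per sequence (grouped here by a
# hypothetical "user_id" column).
if "user_id" in df.columns:
    events_per_sequence = df.groupby("user_id").size()
    print(f"max events per sequence: {events_per_sequence.max()} (recommended: fewer than 10)")
```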

Common errors#

One common error you might face when running Tabular Fine-Tuning is a limited context window.

Limited context window#

When you train the Tabular Fine-Tuning model, data is passed into the LLM’s context window repeatedly. All of the data related to a single example (that is, one record, or all records in a sequence for event-driven data) must fit inside the context window so that it can be passed in together.

When your data has many columns (>30), a single row can often exceed the context window, especially if any columns contain long free text. Similarly, if you have many events per sequence (>10) for event-driven data, passing in all of those at one time (required to learn the event sequences) can exceed the context window.

It is important to highlight that these are all rough guidelines. If your data has fewer columns, you may be able to fit more events per sequence. If your data has long free text in some columns, you may be limited at far fewer than 30 columns.

If any record (or set of records within a sequence for event-driven data) exceeds the context window, the job will not even start fine-tuning.
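A rough pre-check like the one below can flag rows that are likely to exceed the context window before you submit a job. The character-per-token ratio and the token budget are assumptions; the real limit depends on the underlying model’s tokenizer and your configuration.

```python
# Rough heuristic for spotting oversized rows before fine-tuning.
# CHARS_PER_TOKEN and CONTEXT_BUDGET_TOKENS are assumptions, not values
# taken from NeMo Safe Synthesizer.
import pandas as pd

df = pd.read_csv("my_data.csv")

CHARS_PER_TOKEN = 4           # rule of thumb, not the real tokenizer
CONTEXT_BUDGET_TOKENS = 2048  # hypothetical budget for a single example

def estimate_row_tokens(row: pd.Series) -> int:
    text = ", ".join(f"{col} is {val}" for col, val in row.items())
    return len(text) // CHARS_PER_TOKEN

token_estimates = df.apply(estimate_row_tokens, axis=1)
print(f"largest row (approx. tokens): {token_estimates.max()}")
print(f"rows over budget: {(token_estimates > CONTEXT_BUDGET_TOKENS).sum()}")
```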

Configuration and data changes that can help resolve this error:

  • You can increase rope_scaling_factor to scale up the context window size (an integer between 1 and 6).

    • This can be effective, but note that it typically increases the runtime.

  • If you have sequenced data, you can try reducing the number of rows in each sequence (<8-10). Each sequence, with all of its rows, is treated as a single example for the LLM, so the more rows a sequence contains, the more likely you are to exceed the context window limit. (Note: a high number of columns makes this situation worse.)

  • You can reduce the number of columns (try <20). In particular, columns with long free text tend to consume a large share of the context window and are strong candidates for removal. If there are columns that are not correlated with any other column, consider removing them as a pre-processing step and adding them back as a post-processing step if needed (see the sketch after this list).
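The sketch below shows the data-side mitigations: dropping long free-text columns before training and setting them aside to re-attach later. The config dict at the end only illustrates where rope_scaling_factor fits; the exact job configuration schema may differ from what is shown here.

```python
# Data-side mitigations for a limited context window. The length threshold
# and the config dict structure are illustrative assumptions.
import pandas as pd

df = pd.read_csv("my_data.csv")

# Identify columns whose values are long free text (threshold is arbitrary).
long_text_cols = [
    col
    for col in df.select_dtypes(include="object").columns
    if df[col].astype(str).str.len().mean() > 100
]

train_df = df.drop(columns=long_text_cols)  # use this for fine-tuning
held_out = df[long_text_cols]               # re-attach or handle separately later

# Hypothetical illustration of scaling up the context window;
# rope_scaling_factor is an integer between 1 and 6.
config = {"rope_scaling_factor": 2}
```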