Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Language Modeling
A language model (LM) estimates the joint probability of a given text corpus \((x_1,\dots,x_T)\) by factorizing it with the chain rule \(P(x_1,\dots,x_T) = \prod_{t=1}^T P(x_t|x_1,\dots,x_{t-1})\) and sequentially modeling each conditional term in the product. To simplify modeling, it is often assumed that the context size (the number of preceding words) necessary to predict each word \(x_t\) in the corpus is limited to \(N\): \(P(x_t|x_1,\dots,x_{t-1}) \approx P(x_t|x_{t-N},\dots,x_{t-1})\). This approximation is commonly referred to as an N-gram LM.
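For example, for a sequence of four tokens the exact chain-rule factorization and its approximation with a context of \(N = 2\) preceding words are:
\[P(x_1,x_2,x_3,x_4) = P(x_1)\,P(x_2|x_1)\,P(x_3|x_1,x_2)\,P(x_4|x_1,x_2,x_3)\]
\[P(x_1,x_2,x_3,x_4) \approx P(x_1)\,P(x_2|x_1)\,P(x_3|x_1,x_2)\,P(x_4|x_2,x_3)\]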
Currently, we mainly support sentence-level LMs which do not consider long-term dependencies and model all sentences independently of each other. Our models are based on the Transformer sequence-to-sequence architecture [nlp-language_modeling2].
Data Format
Unsupervised LMs require a corpus that comprises many example sentences from a particular domain (Wikipedia, news, PubMed abstracts, etc.). We assume that the data is formatted as a text file where each line corresponds to a separate sentence:
Sentence-level LM corpus |
---|
in a silver cake basket as the panins had at their party |
let us pretermit that long comparison |
poverty contempt and sickness treading on my heels i easily resolve not to be affrighted |
It is common practice to apply data cleaning, normalization, and tokenization prior to training an LM, and NeMo expects data that has already been cleaned, normalized, and tokenized. The only data pre-processing NeMo performs is subword tokenization with BPE [nlp-language_modeling1].
Note
If the LM is intended to be used in conjunction with another model (e.g., re-scoring of ASR outputs, shallow fusion with NMT), make sure that the training data is preprocessed accordingly (lower-cased with no punctuation for ASR, Moses tokenization/normalization for NMT). Otherwise, the LM might produce inadequate scores.
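For instance, a minimal, illustrative normalization pass for the ASR case (lower-casing and punctuation removal); the file names and the exact character set here are placeholders and should mirror whatever preprocessing the downstream ASR model uses:
import re

def normalize_for_asr(line: str) -> str:
    # Lower-case and keep only letters, apostrophes, and spaces.
    line = line.lower()
    return re.sub(r"[^a-z' ]+", " ", line).strip()

with open("corpus.txt") as src, open("corpus.normalized.txt", "w") as dst:
    for sentence in src:
        dst.write(normalize_for_asr(sentence) + "\n")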
Tokenizer Training
Our LMs support all tokenizers available in NeMo, but require special beginning-of-string <bos> and end-of-string <eos> tokens.
Below is an example of training a YouTokenToMe BPE tokenizer:
import youtokentome as yttm

data = "/path/to/train.txt"          # string, path to the file with training data
model = "/path/to/tokenizer.model"   # string, path to where the trained model will be saved
vocab_size = 8192                    # int, number of tokens in the final vocabulary (example value)

yttm.BPE.train(data=data, model=model, vocab_size=vocab_size)
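Once trained, the tokenizer can be loaded and used to encode sentences. The snippet below (continuing the example above) is an illustrative sketch with a placeholder model path; note the bos and eos flags, which add the beginning-of-string and end-of-string tokens required by the LM:
bpe = yttm.BPE(model="/path/to/tokenizer.model")
ids = bpe.encode(
    ["let us pretermit that long comparison"],
    output_type=yttm.OutputType.ID,
    bos=True,  # prepend the beginning-of-string token
    eos=True,  # append the end-of-string token
)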
Sentence Dataset Construction
Given a BPE tokenizer and a cleaned sentence-level text corpus, the following steps are applied to create a SentenceDataset object.
- Text to IDs - Performs tokenization with the specified tokenizer model on an input sentence and maps it to a sequence of token IDs.
- Bucketing - Sentences vary in length, and when creating minibatches we'd like the sentences in them to have roughly the same length to minimize the number of <pad> tokens and to maximize computational efficiency. This step groups sentences of roughly the same length into buckets.
- Batching and padding - Creates minibatches with a maximum number of tokens specified by model.{train_ds,validation_ds,test_ds}.tokens_in_batch from the buckets and pads them so they can be packed into a tensor (a simplified sketch of these two steps follows this list).
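Below is a simplified, self-contained sketch of the bucketing and token-budget batching described above. It is illustrative only; the names bucket_and_batch and pad_batch are made up here, and NeMo's actual SentenceDataset implementation differs in its details.
def bucket_and_batch(tokenized_sentences, tokens_in_batch, bucket_width=10):
    # Group token-ID sequences into buckets of similar length.
    buckets = {}
    for ids in tokenized_sentences:
        buckets.setdefault(len(ids) // bucket_width, []).append(ids)

    # Within each bucket, greedily fill minibatches up to the token budget.
    batches = []
    for key in sorted(buckets):
        batch, batch_tokens = [], 0
        for ids in buckets[key]:
            if batch and batch_tokens + len(ids) > tokens_in_batch:
                batches.append(batch)
                batch, batch_tokens = [], 0
            batch.append(ids)
            batch_tokens += len(ids)
        if batch:
            batches.append(batch)
    return batches

def pad_batch(batch, pad_id=0):
    # Pad every sequence in the minibatch to the length of the longest one.
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]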
To use SentenceDataset, specify the path to the training data in file_name in the experiment config file. Below is the list of all available configuration options:
Parameter | Data Type | Default | Description |
---|---|---|---|
model.{train_ds,validation_ds,test_ds}.file_name | str | | Path to the file with sentences. |
model.{train_ds,validation_ds,test_ds}.tokens_in_batch | int | | Maximum number of tokens per minibatch. |
model.{train_ds,validation_ds,test_ds}.max_seq_length | int | | Maximum sequence length, to be used with the clean argument. |
model.{train_ds,validation_ds,test_ds}.clean | bool | | Whether to clean the dataset by discarding examples that are greater than max_seq_length. |
model.{train_ds,validation_ds,test_ds}.shuffle | bool | | Whether to shuffle minibatches in the PyTorch DataLoader. |
model.{train_ds,validation_ds,test_ds}.num_samples | int | | Number of samples to use. |
model.{train_ds,validation_ds,test_ds}.pin_memory | bool | | Whether to pin memory in the PyTorch DataLoader. |
model.{train_ds,validation_ds,test_ds}.num_workers | int | | Number of workers for the PyTorch DataLoader. |
Model Configuration and Training
The overall model consists of an encoder and a classification head with the following configuration options:
Parameter | Data Type | Default | Description |
---|---|---|---|
model.encoder.max_sequence_length | int | | Maximum allowed sequence length. |
model.encoder.learn_positional_encodings | bool | | If True, this is a regular learnable embedding layer. If False, fixes position encodings to sinusoidal. |
model.encoder.hidden_size | int | | Size of the transformer hidden states. |
model.encoder.num_layers | int | | Number of transformer layers. |
model.encoder.inner_size | int | | Size of the hidden states within the feedforward layers. |
model.encoder.num_attention_heads | int | | Number of attention heads. |
model.encoder.embedding_dropout | float | | Dropout probability of the embedding layer. |
model.encoder.ffn_dropout | float | | Dropout probability within the feedforward layers. |
model.encoder.attn_score_dropout | float | | Dropout probability of the attention scores before softmax normalization. |
model.encoder.attn_layer_dropout | float | | Dropout probability of the attention query, key, and value projection activations. |
model.encoder.hidden_act | str | | Activation function throughout the network. |
model.encoder.mask_future | bool | | Whether to mask future timesteps for attention. Defaults to True for language modeling. |
model.encoder.pre_ln | bool | | Whether to apply layer normalization before (True) or after (False) each sub-layer (pre-LN vs. post-LN). |
Parameter | Data Type | Default | Description |
---|---|---|---|
model.head.num_layers | int | | Number of layers in the head network. |
model.head.activation | str | | Activation function used after each layer. |
model.head.log_softmax | bool | | Whether to apply log_softmax to the output of the head. |
model.head.dropout | float | | Dropout probability after each layer. |
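To make the encoder options listed above more concrete, the sketch below shows how they roughly map onto a standard PyTorch transformer encoder. This is only an analogy with assumed example hyperparameter values; it is not NeMo's actual implementation, and the single dropout argument here only approximates the finer-grained dropout options in the table:
import torch
import torch.nn as nn

hidden_size = 512         # model.encoder.hidden_size
num_layers = 6            # model.encoder.num_layers
inner_size = 2048         # model.encoder.inner_size (feedforward dimension)
num_attention_heads = 8   # model.encoder.num_attention_heads
ffn_dropout = 0.1         # model.encoder.ffn_dropout
hidden_act = "relu"       # model.encoder.hidden_act
pre_ln = False            # model.encoder.pre_ln

layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_attention_heads,
    dim_feedforward=inner_size,
    dropout=ffn_dropout,
    activation=hidden_act,
    norm_first=pre_ln,    # pre-LN (True) vs. post-LN (False)
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

# model.encoder.mask_future corresponds to a causal mask that hides future timesteps.
seq_len = 16
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
hidden_states = encoder(torch.randn(2, seq_len, hidden_size), mask=causal_mask)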
Our pre-trained models are optimized with Adam, with a maximum learning rate of 0.001, betas of (0.9, 0.98), and an inverse square root learning rate schedule from [nlp-language_modeling2]. The model.optim section sets the optimization parameters.
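A minimal sketch of an inverse square root schedule of the kind described above (the warmup_steps value and the exact functional form are assumptions for illustration; in NeMo this is configured through the model.optim section rather than written by hand):
def inv_sqrt_lr(step, max_lr=1e-3, warmup_steps=4000):
    # Linear warmup to max_lr, then decay proportional to 1/sqrt(step).
    step = max(step, 1)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5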
The following script trains a 6-layer Transformer LM:
python examples/nlp/language_modeling/transformer_lm.py \
-cn transformer_lm_config \
trainer.devices=2 \
trainer.accelerator='gpu' \
+exp_manager.exp_dir=/path/to/store/results \
+exp_manager.create_checkpoint_callback=True \
+exp_manager.checkpoint_callback_params.monitor=val_PPL \
+exp_manager.checkpoint_callback_params.mode=min \
+exp_manager.checkpoint_callback_params.save_top_k=5 \
model.train_ds.file_name=/path/to/train.txt \
model.validation_ds.file_name=/path/to/valid.txt \
model.tokenizer.tokenizer_model=/path/to/yttm_tokenizer_model
The trainer keeps track of the LM perplexity (PPL) on the provided validation set and saves the checkpoints with the best PPL (the top 5 by default). At the end of training, a .nemo file is written to the results directory, which allows you to run inference on a test set.
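For example, the exported checkpoint can be restored in Python for evaluation. This is a hedged sketch: the checkpoint path is a placeholder, and TransformerLMModel is the NeMo NLP model class assumed to back this example:
from nemo.collections.nlp.models import TransformerLMModel

# Restore the trained LM from the .nemo file written by the training run.
model = TransformerLMModel.restore_from(restore_path="/path/to/store/results/checkpoints/model.nemo")
model.eval()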
Tarred Datasets for Large Corpora
When training with DistributedDataParallel, each process has its own copy of the dataset, which for large datasets may not fit in CPU memory. WebDataset circumvents this problem by efficiently iterating over tar files stored on disk. Each tar file can contain hundreds to thousands of pickle files, each containing a single minibatch. We recommend using this method when working with datasets of more than 5 million sentences.
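Conceptually, each shard can be streamed one minibatch at a time without ever materializing the full dataset in memory. The sketch below illustrates this idea with Python's tarfile module only (the shard path is a placeholder); NeMo's TarredSentenceDataset relies on WebDataset to do the equivalent across shards and data-loader workers:
import pickle
import tarfile

# Stream pre-batched minibatches out of a single tar shard.
with tarfile.open("/path/to/tarred_dataset/lm-batches.tokens.2048.1.tar") as tar:
    for member in tar:
        if not member.isfile():
            continue
        with tar.extractfile(member) as f:
            batch = pickle.load(f)  # each member is one pickled minibatch
            # ... feed `batch` to the training step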
To use an existing TarredSentenceDataset instead of a non-tarred SentenceDataset, set use_tarred_dataset: true in the experiment config file. Then, pass in the path to the metadata file in metadata_file and the paths to all of the text tarballs in tar_files, either as a list of filepaths, e.g. ['/data/shard1.tar', '/data/shard2.tar'], or as a single brace-expandable string, e.g. '/data/shard_{1..64}.tar' or '/data/shard__OP_1..64_CL_.tar' (recommended, see the note below).
Note
For brace expansion, there may be cases where the {x..y} syntax cannot be used due to shell interference. This occurs most commonly inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to {) are (, [, <, and the special tag _OP_. Supported closing braces (equivalent to }) are ), ], >, and the special tag _CL_. For SLURM-based tasks, we suggest the use of the special tags for ease of use.
Tarred datasets for sentence-level LMs can be created with the following script:
python examples/nlp/machine_translation/create_tarred_monolingual_dataset.py \
--pkl_file_prefix lm \
--tokenizer_model /path/to/tokenizer_model \
--fname /path/to/training_data \
--out_dir /path/to/tarred_dataset \
--tokens_in_batch 2048 \
--num_batches_per_tarfile 250
For example, if your dataset contains 10000 batches, the script above will create 40 tarballs and the output directory will look similar to the following:
/path/to/tarred_dataset
├── lm-batches.tokens.2048.1.tar
├── lm-batches.tokens.2048.2.tar
├── ...
├── lm-batches.tokens.2048.40.tar
└── metadata.json
To train the model on this dataset, the following parameters have to be specified in the model.train_ds section:
use_tarred_dataset: true
tar_files: /path/to/tarred_dataset/lm-batches.tokens.2048._OP_1..40_CL_.tar
metadata_file: /path/to/tarred_dataset/metadata.json
Below is the full list of available configuration options for TarredSentenceDataset:
Parameter | Data Type | Default | Description |
---|---|---|---|
model.{train_ds,validation_ds,test_ds}.use_tarred_dataset | bool | | Whether to use tarred datasets. |
model.{train_ds,validation_ds,test_ds}.tar_files | str | | Path to all tar files. Either a list or a single brace-expandable string. |
model.{train_ds,validation_ds,test_ds}.metadata_file | str | | Path to the JSON metadata file that contains only a single entry for the total number of batches in the dataset. |
model.{train_ds,validation_ds,test_ds}.tar_shuffle_n | int | | How many samples to look ahead and load to be shuffled. |
model.{train_ds,validation_ds,test_ds}.shard_strategy | str | | How the shards are distributed between multiple workers. Either scatter (each worker gets a unique subset of the shards) or replicate (each worker gets all of the shards). |
References
- nlp-language_modeling1
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- nlp-language_modeling2
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 6000–6010. 2017.