Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Training
Prepare the Data
If your data is in JSONL (JSON Lines) format, it might look like this:
{"id": 0, "text": "Hi how are you?", "language": "en"}
...
{"id": 168344, "text": "Today was a great day!", "language": "en"}
You need to extract only the text field from each JSON object and save it to a separate file. The output file should contain one sentence per line, like this:
Hi how are you?
...
Today was a great day!
It is also possible to train the SentencePiece tokenizer on multiple data files. Simply repeat the extraction process for each file, then either concatenate the results into a single text file or pass multiple text files to the training command:
spm_train --input=datafile_1.txt,datafile_2.txt,datafile_3.txt
Example of a data reformatting command on Ubuntu:
jq .text input_file.jsonl | sed 's/^.\(.*\).$/\1/' > output_file.txt
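The same extraction can also be done in Python, which automatically unescapes the JSON strings. The following is a minimal sketch that assumes the same input_file.jsonl and output_file.txt names as above:

import json

# Read one JSON object per line, keep only the "text" field,
# and write one sentence per line (file names match the jq example above).
with open("input_file.jsonl", "r", encoding="utf-8") as fin, \
        open("output_file.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # json.loads unescapes sequences such as \" and \n;
        # replace embedded newlines so each sentence stays on a single line
        fout.write(record["text"].replace("\n", " ") + "\n")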
Training the Tokenizer
The following example shows a script used by the NeMo team to train the SentencePiece tokenizer for the Red Pajama v2 dataset:
spm_train --input=datafile_1.txt,datafile_2.txt \
--model_prefix=rp2_tokenizer \
--vocab_size=32000 \
--character_coverage=0.9995 \
--model_type=bpe \
--num_threads=222 \
--split_digits \
--byte_fallback \
--pad_id=0 \
--unk_id=1 \
--bos_id=2 \
--eos_id=3 \
--max_sentence_length=10000 \
--normalization_rule_name=identity \
--allow_whitespace_only_pieces=true \
--remove_extra_whitespaces=false
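The same training run can also be launched from Python, since the sentencepiece package accepts the spm_train options as keyword arguments. This sketch mirrors the command above (the data file names are placeholders):

import sentencepiece as spm

# Equivalent of the spm_train command above, using the Python API.
spm.SentencePieceTrainer.train(
    input="datafile_1.txt,datafile_2.txt",
    model_prefix="rp2_tokenizer",
    vocab_size=32000,
    character_coverage=0.9995,
    model_type="bpe",
    num_threads=222,
    split_digits=True,
    byte_fallback=True,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    max_sentence_length=10000,
    normalization_rule_name="identity",
    allow_whitespace_only_pieces=True,
    remove_extra_whitespaces=False,
)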
Parameters
input: Comma-separated list of input text files containing the sentences for training.
model_prefix: Prefix for the output model files.
vocab_size: Desired vocabulary size.
character_coverage: Fraction of characters in the training data that the model must cover; this determines the minimum symbol set.
model_type: Type of model to train. Options include unigram, bpe, word, or char.
num_threads: Number of threads to use during training. In our case it is set to the number of available CPU cores minus two.
split_digits: Split all digits (0-9) into separate pieces.
byte_fallback: Decompose unknown pieces into UTF-8 byte pieces.
pad_id=0: ID for the padding token.
unk_id=1: ID for the unknown token.
bos_id=2: ID for the beginning-of-sequence token.
eos_id=3: ID for the end-of-sequence token.
max_sentence_length: Maximum sentence length, in bytes; longer lines are skipped during training.
normalization_rule_name: Normalization rule to apply; identity leaves the input text unchanged.
allow_whitespace_only_pieces: Allow pieces that consist only of whitespace.
remove_extra_whitespaces: Remove leading, trailing, and duplicate internal whitespace (set to false here to preserve the original spacing).
For more information on all available parameters for SentencePiece tokenizer training, please refer to the SentencePiece documentation.
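Once training finishes, you can sanity-check the resulting rp2_tokenizer.model by loading it with the SentencePiece Python API and running a quick round trip; this sketch assumes the model prefix used above:

import sentencepiece as spm

# Load the trained model and verify the special token IDs and a round trip.
sp = spm.SentencePieceProcessor(model_file="rp2_tokenizer.model")

print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())  # expected: 0 1 2 3

ids = sp.encode("Today was a great day!", out_type=int)
print(ids)
print(sp.decode(ids))  # should reproduce the original sentence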