Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Training
Prepare the Data
If your data is in JSONL (JSON Lines) format, it might look like this:
{"id": 0, "text": "Hi how are you?", "language": "en"}
...
{"id": 168344, "text": "Today was a great day!", "language": "en"}
You need to extract only the text field from each JSON object and save it to a separate file. The output file should contain one sentence per line, like this:
Hi how are you?
...
Today was a great day!
It is also possible to train the SentencePiece tokenizer on multiple data files. Simply repeat the extraction process for each file, then either concatenate the results into a single text file or pass multiple text files to the training command:
spm_train --input=datafile_1.txt,datafile_2.txt,datafile_3.txt
Example of a data reformatting command on Ubuntu:
jq .text input_file.jsonl | sed 's/^.\(.*\).$/\1/' > output_file.txt
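The same extraction can also be done in Python, which automatically unescapes the JSON strings. The following is a minimal sketch that assumes the same input_file.jsonl and output_file.txt names as above:

import json

# Read one JSON object per line, keep only the "text" field,
# and write one sentence per line (file names match the jq example above).
with open("input_file.jsonl", "r", encoding="utf-8") as fin, \
        open("output_file.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # json.loads unescapes sequences such as \" and \n;
        # replace embedded newlines so each sentence stays on a single line
        fout.write(record["text"].replace("\n", " ") + "\n")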
Training the Tokenizer
The following example shows a script used by the NeMo team to train the SentencePiece tokenizer for the Red Pajama v2 dataset:
spm_train --input=datafile_1.txt,datafile_2.txt \
--model_prefix=rp2_tokenizer \
--vocab_size=32000 \
--character_coverage=0.9995 \
--model_type=bpe \
--num_threads=222 \
--split_digits \
--byte_fallback \
--pad_id=0 \
--unk_id=1 \
--bos_id=2 \
--eos_id=3 \
--max_sentence_length=10000 \
--normalization_rule_name=identity \
--allow_whitespace_only_pieces=true \
--remove_extra_whitespaces=false
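The same training run can also be launched from Python, since the sentencepiece package accepts the spm_train options as keyword arguments. This sketch mirrors the command above (the data file names are placeholders):

import sentencepiece as spm

# Equivalent of the spm_train command above, using the Python API.
spm.SentencePieceTrainer.train(
    input="datafile_1.txt,datafile_2.txt",
    model_prefix="rp2_tokenizer",
    vocab_size=32000,
    character_coverage=0.9995,
    model_type="bpe",
    num_threads=222,
    split_digits=True,
    byte_fallback=True,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    max_sentence_length=10000,
    normalization_rule_name="identity",
    allow_whitespace_only_pieces=True,
    remove_extra_whitespaces=False,
)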
Parameters
input: Comma-separated list of input text files containing the sentences for training.
model_prefix: Prefix for the output model files.
vocab_size: Desired vocabulary size.
character_coverage: Fraction of characters in the training data that the model must cover; this determines the minimum symbol set.
model_type: Type of model to train. Options include unigram, bpe, word, or char.
num_threads: Number of threads to use during training. In our case it is set to the number of available CPU cores minus two.
split_digits: Split all digits (0-9) into separate pieces.
byte_fallback: Decompose unknown pieces into UTF-8 byte pieces.
pad_id=0: ID for the padding token.
unk_id=1: ID for the unknown token.
bos_id=2: ID for the beginning-of-sequence token.
eos_id=3: ID for the end-of-sequence token.
max_sentence_length: Maximum sentence length, in bytes; longer lines are skipped during training.
normalization_rule_name: Normalization rule to apply; identity leaves the input text unchanged.
allow_whitespace_only_pieces: Allow pieces that consist only of whitespace.
remove_extra_whitespaces: Remove leading, trailing, and duplicate internal whitespace (set to false here to preserve the original spacing).
For more information on all available parameters for SentencePiece tokenizer training, please refer to the SentencePiece documentation.
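Once training finishes, you can sanity-check the resulting rp2_tokenizer.model by loading it with the SentencePiece Python API and running a quick round trip; this sketch assumes the model prefix used above:

import sentencepiece as spm

# Load the trained model and verify the special token IDs and a round trip.
sp = spm.SentencePieceProcessor(model_file="rp2_tokenizer.model")

print(sp.pad_id(), sp.unk_id(), sp.bos_id(), sp.eos_id())  # expected: 0 1 2 3

ids = sp.encode("Today was a great day!", out_type=int)
print(ids)
print(sp.decode(ids))  # should reproduce the original sentence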