Training

Prepare the Data

If your data is in JSONL (JSON Lines) format, it might look like this:

{"id": 0, "text": "Hi how are you?", "language": "en"}
...
{"id": 168344, "text": "Today was a great day!", "language": "en"}

You need to extract only the text field from each JSON object and save it to a separate file. The output file should contain one sentence per line, like this:

Hi how are you?
...
Today was a great day!

It is also possible to train the SentencePiece tokenizer on multiple data files. Simply repeat the extraction process for each file, then either concatenate the results into a single text file or pass multiple text files to the training command.

For example, a data reformatting script on Ubuntu might look like the following (a minimal sketch: it assumes jq is installed and uses placeholder input names data_1.jsonl and data_2.jsonl; the output names match the training command below):
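# Extract only the "text" field from each JSON line, one sentence per line.
jq -r '.text' data_1.jsonl > datafile_1.txt
jq -r '.text' data_2.jsonl > datafile_2.txt

# Optionally, concatenate the results into a single training file instead:
cat datafile_1.txt datafile_2.txt > datafile_all.txt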

Training the Tokenizer

The following example shows a script used by the NeMo team to train the SentencePiece tokenizer for the Red Pajama v2 dataset:

spm_train --input=datafile_1.txt,datafile_2.txt \
        --model_prefix=rp2_tokenizer \
        --vocab_size=32000 \
        --character_coverage=0.9995 \
        --model_type=bpe \
        --num_threads=222 \
        --split_digits \
        --byte_fallback \
        --pad_id=0 \
        --unk_id=1 \
        --bos_id=2 \
        --eos_id=3 \
        --max_sentence_length=10000 \
        --normalization_rule_name=identity \
        --allow_whitespace_only_pieces=true \
        --remove_extra_whitespaces=false

Parameters

  • input: Comma-separated list of input text files containing the sentences for training.

  • model_prefix: Prefix for the output model files; the trainer writes <prefix>.model and <prefix>.vocab.

  • vocab_size: Desired vocabulary size.

  • character_coverage: Fraction of characters in the training data that the model must cover; rare characters outside this coverage are treated as unknown.

  • model_type: Type of model to train. Options include unigram, bpe, word, or char.

  • num_threads: Number of threads to use during training. In our case it is set to the number of available CPU cores minus two.

  • split_digits: Split all digits (0-9) into separate pieces.

  • byte_fallback: Decompose unknown pieces into UTF-8 byte pieces.

  • pad_id=0: ID for the padding token.

  • unk_id=1: ID for the unknown token.

  • bos_id=2: ID for the beginning-of-sequence token.

  • eos_id=3: ID for the end-of-sequence token.

  • max_sentence_length: Maximum sentence length in bytes; longer lines are skipped during training.

  • normalization_rule_name: Normalization rule to apply; identity applies no normalization and leaves the input text unchanged.

  • allow_whitespace_only_pieces: Allow pieces that consist solely of whitespace.

  • remove_extra_whitespaces: Remove leading, trailing, and duplicate internal whitespace (disabled in the example above).

For more information on all available parameters for SentencePiece tokenizer training, please refer to the SentencePiece documentation.
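
Once training finishes, you can quickly sanity-check the tokenizer from the command line. The snippet below is a minimal sketch: it assumes the spm_encode binary that ships with SentencePiece is on your PATH and that the run above produced rp2_tokenizer.model.

echo "Today was a great day!" | spm_encode --model=rp2_tokenizer.model --output_format=piece
echo "Today was a great day!" | spm_encode --model=rp2_tokenizer.model --output_format=id

The piece output shows the subword segmentation, while the id output shows the corresponding vocabulary IDs; IDs 0 through 3 are reserved for the pad, unk, bos, and eos tokens configured above.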