Save and Export Text Data
After processing your text datasets with NeMo Curator, use writer stages to export curated data for downstream use. Curator provides writers for common formats (JSONL, Parquet) as well as specialized writers for training frameworks.
Megatron Tokenization
MegatronTokenizerWriter tokenizes text documents and writes the .bin and .idx files required by Megatron-LM for data loading during pretraining. This replaces the need to run Megatron’s preprocess_data.py script separately and integrates tokenization directly into your curation pipeline.
How It Works
- Tokenizer loading: Downloads and loads a Hugging Face tokenizer specified by model_identifier. The tokenizer is downloaded once per node and loaded once per worker.
- Batched tokenization: Documents are tokenized in batches (controlled by tokenization_batch_size) to avoid out-of-memory issues on large datasets.
- Binary output: Tokenized data is written to a .bin file containing packed token IDs. Vocabulary sizes above 65,536 use 4 bytes per token (int32); smaller vocabularies use 2 bytes (uint16).
- Index output: A .idx file stores metadata including sequence lengths, byte offsets, and document boundaries for efficient random access during training.
Quick Start
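The sketch below shows the writer at the end of a small pipeline. Only model_identifier, tokenization_batch_size, and append_eod are documented on this page; the import paths, the Pipeline and JsonlReader usage, and the path argument are assumptions for illustration.

```python
# Minimal sketch, assuming these import paths and a `path` output argument.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import MegatronTokenizerWriter

pipeline = Pipeline(name="tokenize_for_megatron")
pipeline.add_stage(JsonlReader(file_paths="curated_output/*.jsonl"))
pipeline.add_stage(
    MegatronTokenizerWriter(
        path="tokenized/",                         # assumed output-directory argument
        model_identifier="openai-community/gpt2",  # any Hugging Face tokenizer
    )
)
pipeline.run()
```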
Configuration
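The parameters discussed on this page can be set when constructing the writer. In the hedged example below, model_identifier, tokenization_batch_size, and append_eod come from this page, while the path argument name and the values shown are assumptions.

```python
writer = MegatronTokenizerWriter(
    path="tokenized/",                         # assumed output-directory argument
    model_identifier="openai-community/gpt2",  # Hugging Face tokenizer to download and load
    tokenization_batch_size=1024,              # documents per tokenization batch (illustrative value)
    append_eod=True,                           # append the tokenizer's EOS token to each document
)
```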
Output Format
The writer produces paired files for each input partition:
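For example, a run over two input partitions might produce a layout like the following (file names are illustrative; only the paired .bin/.idx structure is documented):

```text
tokenized/
  part_000000.bin    # packed token IDs
  part_000000.idx    # sequence/document index
  part_000001.bin
  part_000001.idx
```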
File format details
.bin file: Contains concatenated token IDs for all documents in the partition. Token IDs are stored as int32 (4 bytes) when the tokenizer vocabulary exceeds 65,536 tokens, or uint16 (2 bytes) for smaller vocabularies such as GPT-2.
.idx file: Contains a fixed header followed by per-sequence metadata:
- 9-byte magic header (MMIDIDX\x00\x00)
- 8-byte version number
- 1-byte dtype code
- 8-byte sequence count
- 8-byte document count
- Per-sequence lengths: 4-byte int32 array (one entry per sequence)
- Per-sequence byte offsets: 8-byte int64 array (one entry per sequence)
- Document boundary indices: 8-byte int64 array (sequence count + 1 entries)
These files are directly compatible with Megatron-LM’s MMapIndexedDataset data loader.
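As a worked illustration of this layout, the standalone sketch below parses the fixed .idx header with Python's struct module. It assumes little-endian encoding and is an illustrative reader, not an official one.

```python
import struct

def read_idx_header(path: str) -> dict:
    """Parse the fixed header of a .idx file, following the layout above."""
    with open(path, "rb") as f:
        magic = f.read(9)                                  # 9-byte magic header
        assert magic == b"MMIDIDX\x00\x00", "not a Megatron index file"
        (version,) = struct.unpack("<Q", f.read(8))        # 8-byte version number
        dtype_code = f.read(1)[0]                          # 1-byte dtype code
        (num_sequences,) = struct.unpack("<Q", f.read(8))  # 8-byte sequence count
        (num_documents,) = struct.unpack("<Q", f.read(8))  # 8-byte document count
        # The int32 length array, int64 offset array, and int64 document
        # boundary array described above follow immediately after this header.
    return {
        "version": version,
        "dtype_code": dtype_code,
        "num_sequences": num_sequences,
        "num_documents": num_documents,
    }
```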
End-of-Document Tokens
When append_eod=True, the tokenizer’s EOS token is appended to the end of each document’s token sequence. This is consistent with the behavior of Megatron’s preprocess_data.py and is required for some training configurations that use document boundaries for attention masking.
If the tokenizer does not define an EOS token, append_eod is automatically disabled with a warning.
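Conceptually, the behavior matches the plain-transformers snippet below (an illustration of the semantics, not the writer's internal code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
token_ids = tokenizer("Once upon a time.")["input_ids"]

# With append_eod=True, the EOS token is appended per document.
if tokenizer.eos_token_id is not None:
    token_ids.append(tokenizer.eos_token_id)
# A tokenizer with no EOS token would instead trigger the warning described above.
```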
Using Different Tokenizers
MegatronTokenizerWriter supports any tokenizer available through Hugging Face’s AutoTokenizer:
Standard Model
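Any public Hugging Face identifier works as model_identifier; the other argument names in this sketch are assumptions:

```python
writer = MegatronTokenizerWriter(
    path="tokenized/",
    model_identifier="openai-community/gpt2",  # public model, no authentication needed
)
```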
Gated Model
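Gated models (for example, Llama tokenizers) require Hugging Face authentication. Whether the writer accepts a token argument directly is an assumption here; authenticating beforehand with huggingface-cli login is the safer route.

```python
# Run `huggingface-cli login` first, or pass a token if the writer supports it
# (the hf_token argument shown here is hypothetical).
writer = MegatronTokenizerWriter(
    path="tokenized/",
    model_identifier="meta-llama/Llama-3.1-8B",
    hf_token="hf_...",  # hypothetical argument; prefer environment-based auth
)
```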
Local Tokenizer
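Because AutoTokenizer accepts a local directory containing tokenizer files, a filesystem path should also work as the identifier (hedged sketch):

```python
writer = MegatronTokenizerWriter(
    path="tokenized/",
    model_identifier="/path/to/local/tokenizer",  # directory with tokenizer.json / vocab files
)
```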
Complete Pipeline Example
This example reads the TinyStories dataset from Parquet files and tokenizes it for Megatron-LM:
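A hedged sketch of such a pipeline follows; the reader and writer import paths and the ParquetReader and path argument names are assumptions, so prefer the runnable tutorial version mentioned below.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import MegatronTokenizerWriter

pipeline = Pipeline(name="tinystories_tokenization")
pipeline.add_stage(ParquetReader(file_paths="tinystories/*.parquet"))
pipeline.add_stage(
    MegatronTokenizerWriter(
        path="tokenized/tinystories",
        model_identifier="openai-community/gpt2",
        append_eod=True,  # append EOS at each document boundary
    )
)
pipeline.run()
```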
A runnable version of this example is available in the tutorials directory.
For more information on using tokenized data with Megatron-LM, see the Related Tools page.