Pretrain Megatron Core Datasets with NeMo AutoModel

Introduction

Pretraining builds a base large language model (LLM) by training a randomly initialized model to predict the next token across massive, unlabeled datasets.

Robust pretraining establishes a foundation of linguistic competence and world knowledge that scales with data, parameters, and compute. This base model then serves as the necessary starting point for later fine-tuning or domain-specific adaptation.

NeMo AutoModel provides an end-to-end recipe for LLM pretraining with Hugging Face–native models and Megatron Core-style datasets.

Model and Dataset Context

In this guide, we pretrain OpenAI’s GPT2-124M model on a FineWeb-Edu subset of 10 billion tokens.

About the FineWeb-Edu Dataset

FineWeb-Edu consists of 1.3T tokens of educational web pages selected from the larger FineWeb dataset by a fine-tuned BERT-like classifier. The FineWeb-Edu dataset documentation describes the filtering process in more detail.

Here’s a glimpse of what the data looks like:

{
    "id": "<urn:uuid:673b1bf6-2c30-40ae-992b-c387d00a836a>",
    "dump": "CC-MAIN-2013-20",
    "text": "No. 24; Updated March 2011
    Click here to download and print a PDF version of this document.
    Parents are usually the first to recognize that their child has a problem with emotions or behavior. Still, the decision to seek professional help can be difficult and painful for a parent. The first step is to gently try to talk to the child. An honest open talk about feelings can often help. Parents may choose to consult with the child's physicians, teachers, members of the clergy, or other adults who know the child well. These steps may resolve the problems for the child and family.
    Following are a few signs which may indicate that a child and adolescent psychiatric evaluation will be useful ...",
    "url": "https://www.aacap.org/AACAP/Families_and_Youth/Facts_for_Families/FFF-Guide/When-to-Seek-Help-for-Your-Child-024.aspx",
    "date": null,
    "file_path": "s3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc/CC-MAIN-20130516092621-00000-ip-10-60-113-184.ec2.internal.warc.gz",
    "language": "en",
    "language_score": 0.927742,
    "token_count": 755,
    "score": 3.375,
    "int_score": 3
}
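
To sanity-check records like the one above once the data is on disk, a short Python sketch can stream a JSONL file and summarize its text field. The helper name summarize_jsonl and the statistics it reports are illustrative assumptions. Only the text key is taken from this guide, and the downloaded files may lay out other fields differently.

```python
import json
from pathlib import Path


def summarize_jsonl(path, limit=1000):
    """Read up to `limit` JSONL records and summarize their `text` field."""
    docs = 0
    chars = 0
    longest = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if docs >= limit:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            text = record.get("text", "")
            docs += 1
            chars += len(text)
            longest = max(longest, len(text))
    return {"docs": docs, "chars": chars, "longest": longest}
```

For example, summarize_jsonl("fineweb_edu/fineweb_edu_10bt_shuffled/fineweb_edu_10bt.val.jsonl") inspects the first 1000 validation records without loading the whole file into memory.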

Download the FineWeb-Edu Dataset

For this guide, we use the FineWeb-Edu 10BT sample — a collection of approximately 10 billion tokens randomly drawn from the full FineWeb-Edu dataset. To prepare the data, run the following commands:

$ # Run these commands from inside the AutoModel directory.
$ git clone https://github.com/facebookresearch/lingua.git
$ cd lingua
$ pip install -r requirements.txt
$ python setup/download_prepare_hf_data.py fineweb_edu_10bt <MEMORY> --data_dir <DATA_DIR> --seed 42 --nchunks 1
$ cd ..
$ mv lingua/fineweb_edu .

Replace <MEMORY> with the amount of system memory (in GB) to allocate to terashuf, the tool used to shuffle the samples, and set <DATA_DIR> to the root directory where the data will be stored. The final mv command above assumes <DATA_DIR> is ./fineweb_edu, as in the following example, which is run from inside the lingua directory:

$ python setup/download_prepare_hf_data.py fineweb_edu_10bt 16 --data_dir ./fineweb_edu --seed 42 --nchunks 1

The expected directory structure is as follows:

$ tree fineweb_edu/
fineweb_edu/
├── fineweb_edu_10bt
│   ├── datatrove
│   │   ├── completions
│   │   │   ├── 00000
│   │   │   ├── 00001
│   │   │   ├── 00002
│   │   │   ├── 00003
│   │   │   ├── 00004
│   │   │   ├── 00005
│   │   │   │   ...
│   │   │   └── 00063
│   │   ├── executor.json
│   │   ├── logs
│   │   │   ├── task_00000.log
│   │   │   ├── task_00001.log
│   │   │   ├── task_00002.log
│   │   │   ├── task_00003.log
│   │   │   ├── task_00004.log
│   │   │   ├── task_00005.log
│   │   │   │   ...
│   │   │   └── task_00063.log
│   │   ├── stats
│   │   │   ├── 00000.json
│   │   │   ├── 00001.json
│   │   │   ├── 00002.json
│   │   │   ├── 00003.json
│   │   │   ├── 00004.json
│   │   │   ├── 00005.json
│   │   │   │   ...
│   │   │   └── 00063.json
│   │   └── stats.json
│   ├── fineweb_edu_10bt.chunk.00000.jsonl
│   │   ...
│   ├── fineweb_edu_10bt.chunk.00013.jsonl
│   ├── sample
│   │   └── 10BT
│   │       ├── 000_00000.parquet
│   │       │   ...
│   │       └── 013_00000.parquet
│   └── terashuf
│       ├── LICENSE
│       ├── Makefile
│       ├── README.md
│       ├── terashuf
│       └── terashuf.cc
└── fineweb_edu_10bt_shuffled
    ├── fineweb_edu_10bt.chunk.00.jsonl
    └── fineweb_edu_10bt.val.jsonl

Preprocess to a Megatron Core Dataset

NeMo AutoModel provides a script that tokenizes the JSONL files and saves the result in the Megatron Core dataset format. You can use it as follows:

$ uv run tools/preprocess_megatron_dataset.py --input "fineweb_edu/fineweb_edu_10bt/fineweb_edu_10bt.chunk.*.jsonl" --json-keys text --output-prefix processed_data --output-path fineweb_edu/megatron_gpt2/ --workers 8 --pretrained-model-name-or-path openai-community/gpt2 --append-eod

The directory should look like this:

$ tree fineweb_edu/megatron_gpt2/
fineweb_edu/megatron_gpt2/
├── processed_data_0_text_document.bin
├── processed_data_0_text_document.idx
├── processed_data_10_text_document.bin
├── processed_data_10_text_document.idx
├── processed_data_11_text_document.bin
├── processed_data_11_text_document.idx
├── processed_data_12_text_document.bin
├── processed_data_12_text_document.idx
├── processed_data_13_text_document.bin
├── processed_data_13_text_document.idx
├── processed_data_1_text_document.bin
├── processed_data_1_text_document.idx
├── processed_data_2_text_document.bin
├── processed_data_2_text_document.idx
├── processed_data_3_text_document.bin
├── processed_data_3_text_document.idx
├── processed_data_4_text_document.bin
├── processed_data_4_text_document.idx
├── processed_data_5_text_document.bin
├── processed_data_5_text_document.idx
├── processed_data_6_text_document.bin
├── processed_data_6_text_document.idx
├── processed_data_7_text_document.bin
├── processed_data_7_text_document.idx
├── processed_data_8_text_document.bin
├── processed_data_8_text_document.idx
├── processed_data_9_text_document.bin
└── processed_data_9_text_document.idx

1 directory, 28 files

Set --workers to the number of CPU cores you'd like to use for parallel tokenization.
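
Each processed dataset in the listing above is a pair of files sharing one prefix, a .bin file and an .idx file. If preprocessing is interrupted, a quick check for incomplete pairs can save a failed training launch later. The following is a minimal sketch using only the standard library; unpaired_prefixes is a hypothetical helper written for this guide, not part of NeMo AutoModel.

```python
from pathlib import Path


def unpaired_prefixes(directory):
    """Return prefixes that have a .bin or an .idx file, but not both."""
    root = Path(directory)
    bins = {p.stem for p in root.glob("*.bin")}
    idxs = {p.stem for p in root.glob("*.idx")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(bins ^ idxs)
```

Calling unpaired_prefixes("fineweb_edu/megatron_gpt2") should return an empty list when every dataset has both of its files.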

Use a Recipe for Pretraining

This example pretrains a large language model with NVIDIA’s NeMo AutoModel library. We use the LLM training recipe, TrainFinetuneRecipeForNextTokenPrediction, which orchestrates the whole pretraining process: model loading, dataset preparation, optimizer setup, distributed training, checkpointing, and logging.

What is a Recipe?

A recipe in NeMo AutoModel is a self-contained orchestration module that wires together all components needed to perform a specific task (e.g., pretraining). Think of it as the equivalent of a Trainer class, but highly modular, stateful, and reproducible.

The TrainFinetuneRecipeForNextTokenPrediction class is one such recipe. It inherits from BaseRecipe and implements:

setup(): builds all training components from the config
run_train_validation_loop(): executes training + validation steps
Additional responsibilities: checkpoint handling, logging, and RNG setup
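
The setup() step builds components from the config. In the config used in this guide, each component is named by a _target_ import path, and the remaining keys are passed to it as arguments. The sketch below illustrates that pattern with the standard library only. resolve_target and instantiate are hypothetical helpers written for this guide, not NeMo AutoModel functions, and the library's own instantiation logic may differ.

```python
import importlib


def resolve_target(path):
    """Import the object named by a dotted path, e.g. 'datetime.timedelta'."""
    parts = path.split(".")
    # Try the longest importable module prefix, then walk the remaining attributes.
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ModuleNotFoundError:
            continue
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(node):
    """Recursively build objects from dicts that carry a `_target_` key."""
    if not isinstance(node, dict):
        return node
    kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
    if "_target_" not in node:
        return kwargs
    return resolve_target(node["_target_"])(**kwargs)


# Toy config: a nested component is built first, then passed to its parent.
cfg = {
    "_target_": "types.SimpleNamespace",
    "timeout": {"_target_": "datetime.timedelta", "minutes": 1},
}
component = instantiate(cfg)
```

Because the import path is plain data, swapping a component (for example, a different optimizer class) only requires editing the config, not the recipe code.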

Recipe Config Example

Below is the configuration from examples/llm_pretrain/megatron_pretrain_gpt2.yaml:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# To run this recipe, please use the following command:
# torchrun --nproc-per-node=8 examples/llm_pretrain/pretrain.py --config examples/llm_pretrain/megatron_pretrain_gpt2.yaml
# Adjust --nproc-per-node to the number of GPUs available on your host machine.

# The model section is responsible for configuring the model we want to finetune.
# Since we want to use the GPT2-124M model, we pass `openai-community/gpt2` to the
# `pretrained_model_name_or_path` option.
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_config
  config:
    _target_: transformers.AutoConfig.from_pretrained
    pretrained_model_name_or_path: openai-community/gpt2

# As mentioned earlier, we are using the FineWeb-Edu dataset. NeMo AutoModel provides the MegatronPretraining
# class which prepares the dataset by loading, packing, and shuffling. We use the "train" split for
# training.
dataset:
  _target_: nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining
  paths: fineweb_edu/megatron_gpt2/processed_data_*_text_document*  # REPLACE THIS
  index_mapping_dir: fineweb_edu/megatron_gpt2/mapping_dir  # REPLACE THIS
  tokenizer:
    _target_: nemo_automodel._transformers.auto_tokenizer.NeMoAutoTokenizer.from_pretrained
    pretrained_model_name_or_path: openai-community/gpt2
  seq_length: 1024
  split: "0.99, 0.01, 0.00"  # train, validation, test
  splits_to_build: "train"  # has to be one of train, validation, test

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: torch.utils.data.default_collate
  dataloader_type: "single"  # or "cyclic"

# Similarly, for validation we use the "validation" split
validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining
  paths: fineweb_edu/megatron_gpt2/processed_data_*_text_document*  # REPLACE THIS
  index_mapping_dir: fineweb_edu/megatron_gpt2/mapping_dir  # REPLACE THIS
  tokenizer:
    _target_: nemo_automodel._transformers.auto_tokenizer.NeMoAutoTokenizer.from_pretrained
    pretrained_model_name_or_path: openai-community/gpt2
  seq_length: 1024
  split: "0.99, 0.01, 0.00"  # train, validation, test
  splits_to_build: "validation"  # has to be one of train, validation, test
  num_val_samples: 1024

step_scheduler:
  global_batch_size: 512
  local_batch_size: 32
  ckpt_every_steps: 1000 # checkpoints state every 1000 steps
  val_every_steps: 250  # validates every 250 steps
  num_epochs: 1
  max_steps: 18500

dist_env:
  backend: nccl
  timeout_minutes: 1

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

checkpoint:
  enabled: true
  checkpoint_dir: checkpoints/
  model_save_format: torch_save # torch_save or safetensors
  save_consolidated: false # Sharded save; run <checkpoint>/model/consolidate.sh after training to export HF weights.

# For distributed processing, we use FSDP2.
distributed:
  strategy: fsdp2
  dp_size: null
  dp_replicate_size: null  # dp_shard_size = dp_size / dp_replicate_size when set. For DDP use strategy: ddp.
  tp_size: 1
  cp_size: 1
  sequence_parallel: false

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: torch.utils.data.default_collate

validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn: torch.utils.data.default_collate

# We will use the standard AdamW optimizer, but you can specify any optimizer you want, by changing
# the import path using the _target_ option.
optimizer:
  _target_: torch.optim.AdamW
  betas: [0.9, 0.95]
  lr: 0.0006
  weight_decay: 0.1

# We will use a cosine LR schedule with 700 warm-up steps.
# This means the LR will linearly increase to a maximum of 6e-4, after which
# it will decay to 0 over the course of training.
lr_scheduler:
  lr_decay_style: cosine
  lr_warmup_steps: 700
  min_lr: 0.0

# Uncomment and configure for W&B logging
# wandb:
#   project: <your_wandb_project>
#   entity: <your_wandb_entity>
#   name: <your_wandb_exp_name>
#   save_dir: <your_wandb_save_dir>
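
The step_scheduler and lr_scheduler values in this config imply a few numbers worth working out by hand. The sketch below does that arithmetic in plain Python. It assumes pure data parallelism (so each optimizer step needs global_batch_size / (local_batch_size × number of GPUs) accumulated micro-batches) and a cosine decay that ends at max_steps. Both are assumptions about how the settings combine, and the helper names are hypothetical rather than library functions.

```python
import math

GLOBAL_BATCH_SIZE = 512
LOCAL_BATCH_SIZE = 32
SEQ_LENGTH = 1024
MAX_STEPS = 18500
WARMUP_STEPS = 700
MAX_LR = 6e-4
MIN_LR = 0.0


def grad_accum_steps(num_gpus):
    """Micro-batches each GPU accumulates per optimizer step (data parallel only)."""
    per_pass = LOCAL_BATCH_SIZE * num_gpus
    if GLOBAL_BATCH_SIZE % per_pass:
        raise ValueError("global batch must be divisible by local batch * GPUs")
    return GLOBAL_BATCH_SIZE // per_pass


def lr_at(step):
    """Linear warm-up to MAX_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


samples = GLOBAL_BATCH_SIZE * MAX_STEPS  # sequences consumed over the run
tokens = samples * SEQ_LENGTH            # tokens consumed over the run
print(grad_accum_steps(2), samples, tokens)  # prints: 8 9472000 9699328000
```

The 9,472,000 training samples match the train split size reported in the sample output later in this guide, and the roughly 9.7 billion tokens correspond to about one pass over the 10BT sample.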

To weight the datasets in a blend, pass paths as a list that alternates each weight with the dataset prefix (the file path without the .bin or .idx extension) it applies to. For example, paths: ["30", "fineweb_edu/megatron_gpt2/processed_data_0_text_document", "70", "fineweb_edu/megatron_gpt2/processed_data_1_text_document"].
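
As an illustration of how such a list can be read, the following sketch splits the alternating entries into prefixes and weights and normalizes the weights into sampling proportions. It assumes the weights are relative (so 30 and 70 behave like 0.3 and 0.7). parse_blend is a hypothetical helper, not the parser NeMo AutoModel uses.

```python
def parse_blend(paths):
    """Split ['w1', 'p1', 'w2', 'p2', ...] into prefixes and normalized weights."""
    if len(paths) % 2:
        raise ValueError("expected alternating weight, prefix entries")
    weights = [float(w) for w in paths[0::2]]
    prefixes = list(paths[1::2])
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return prefixes, [w / total for w in weights]


prefixes, weights = parse_blend([
    "30", "fineweb_edu/megatron_gpt2/processed_data_0_text_document",
    "70", "fineweb_edu/megatron_gpt2/processed_data_1_text_document",
])
```

Under that assumption, about 30% of samples would come from the first dataset and about 70% from the second.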

Load Large Models

In distributed training, the typical model-loading pipeline has each GPU load the entire model and then retain only the shard it needs. This approach becomes problematic when the model size exceeds the memory capacity of a single GPU. For instance, a 70B-parameter model requires about 140GB of memory for its parameters when using the BF16 data type (2 bytes per parameter). Since most widely used GPUs are limited to 80GB, the full model cannot be directly loaded onto a single device.
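
The memory estimate above is simple arithmetic, sketched here so it can be repeated for other model sizes. The figures cover parameters only; gradients, optimizer state, and activations add more on top. The helper names and the 80GB default are illustrative choices for this guide.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def param_memory_gb(num_params, dtype="bf16"):
    """Memory needed to hold the parameters alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_params(num_params, gpu_memory_gb=80, dtype="bf16"):
    """Smallest number of GPUs whose combined memory fits the parameters."""
    return math.ceil(param_memory_gb(num_params, dtype) / gpu_memory_gb)


print(param_memory_gb(70_000_000_000))      # 140.0
print(min_gpus_for_params(70_000_000_000))  # 2
```

By comparison, the GPT2-124M model used in this guide needs only about 0.25GB for its BF16 parameters, so the meta-device path matters mainly for much larger models.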

In these scenarios, you can pass is_meta_device: true in the model config. The model is then instantiated on PyTorch’s meta device, which loads no weight data but keeps the parameter metadata needed to shard the model. Once the model is sharded, each shard is populated by loading only the weights it requires.

Run the Pretraining Recipe

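To launch pretraining with the provided config at examples/llm_pretrain/megatron_pretrain_gpt2.yaml (or your own saved copy), run the following command, setting --nproc-per-node to the number of GPUs on your machine: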

$ uv run torchrun --nproc-per-node=2 examples/llm_pretrain/pretrain.py --config examples/llm_pretrain/megatron_pretrain_gpt2.yaml

Sample Output

You should see step‑wise logs reporting loss, memory usage, and tokens per second. Checkpoints will be saved under the checkpoints/ directory as configured.

$ uv run torchrun --nproc-per-node=2 examples/llm_pretrain/pretrain.py --config examples/llm_pretrain/megatron_pretrain_gpt2.yaml
$ cfg-path: examples/llm_pretrain/megatron_pretrain_gpt2.yaml
$ cfg-path: examples/llm_pretrain/megatron_pretrain_gpt2.yaml
$ > initializing torch distributed with 2 workers.
$ 2025-09-01 07:13:17 | INFO | nemo_automodel.components.loggers.log_utils | Setting logging level to 20
$ 2025-09-01 07:13:17 | INFO | root | Experiment_details:
$ 2025-09-01 07:13:17 | INFO | root | Timestamp: '2025-09-01T07:13:17'
$ 2025-09-01 07:13:17 | INFO | root | User: root
$ 2025-09-01 07:13:17 | INFO | root | Host: 9126f6644eca
$ 2025-09-01 07:13:17 | INFO | root | World size: 2
$ 2025-09-01 07:13:17 | INFO | root | Backend: nccl
$ 2025-09-01 07:13:17 | INFO | root | Recipe: TrainFinetuneRecipeForNextTokenPrediction
$ 2025-09-01 07:13:17 | INFO | root | Model name: null
$ 2025-09-01 07:13:17 | INFO | root | Recipe config:
$ 2025-09-01 07:13:17 | INFO | root |   step_scheduler:
$ 2025-09-01 07:13:17 | INFO | root |     global_batch_size: 512
$ 2025-09-01 07:13:17 | INFO | root |     local_batch_size: 32
$ 2025-09-01 07:13:17 | INFO | root |     ckpt_every_steps: 1000
$ 2025-09-01 07:13:17 | INFO | root |     val_every_steps: 250
$ 2025-09-01 07:13:17 | INFO | root |     num_epochs: 1
$ 2025-09-01 07:13:17 | INFO | root |     max_steps: 18500
$ 2025-09-01 07:13:17 | INFO | root |   dist_env:
$ 2025-09-01 07:13:17 | INFO | root |     backend: nccl
$ 2025-09-01 07:13:17 | INFO | root |     timeout_minutes: 1
$ 2025-09-01 07:13:17 | INFO | root |   rng:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'nemo_automodel.components.training.rng.StatefulRNG'>
$ 2025-09-01 07:13:17 | INFO | root |     seed: 1111
$ 2025-09-01 07:13:17 | INFO | root |     ranked: True
$ 2025-09-01 07:13:17 | INFO | root |   model:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <bound method _BaseNeMoAutoModelClass.from_config of <class 'nemo_automodel._transformers.auto_model.NeMoAutoModelForCausalLM'>>
$ 2025-09-01 07:13:17 | INFO | root |     config:
$ 2025-09-01 07:13:17 | INFO | root |       _target_: <bound method AutoConfig.from_pretrained of <class 'transformers.models.auto.configuration_auto.AutoConfig'>>
$ 2025-09-01 07:13:17 | INFO | root |       pretrained_model_name_or_path: openai-community/gpt2
$ 2025-09-01 07:13:17 | INFO | root |   checkpoint:
$ 2025-09-01 07:13:17 | INFO | root |     enabled: True
$ 2025-09-01 07:13:17 | INFO | root |     checkpoint_dir: checkpoints/
$ 2025-09-01 07:13:17 | INFO | root |     model_save_format: torch_save
$ 2025-09-01 07:13:17 | INFO | root |     save_consolidated: False
$ 2025-09-01 07:13:17 | INFO | root |   distributed:
$ 2025-09-01 07:13:17 | INFO | root |     strategy: fsdp2
$ 2025-09-01 07:13:17 | INFO | root |     dp_size: None
$ 2025-09-01 07:13:17 | INFO | root |     dp_replicate_size: None
$ 2025-09-01 07:13:17 | INFO | root |     tp_size: 1
$ 2025-09-01 07:13:17 | INFO | root |     cp_size: 1
$ 2025-09-01 07:13:17 | INFO | root |     sequence_parallel: False
$ 2025-09-01 07:13:17 | INFO | root |   loss_fn:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy'>
$ 2025-09-01 07:13:17 | INFO | root |   dataset:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining'>
$ 2025-09-01 07:13:17 | INFO | root |     paths: fineweb_edu/megatron_gpt2/processed_data_*_text_document*
$ 2025-09-01 07:13:17 | INFO | root |     index_mapping_dir: fineweb_edu/megatron_gpt2/mapping_dir
$ 2025-09-01 07:13:17 | INFO | root |     tokenizer:
$ 2025-09-01 07:13:17 | INFO | root |       _target_: <bound method AutoTokenizer.from_pretrained of <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>>
$ 2025-09-01 07:13:17 | INFO | root |       pretrained_model_name_or_path: openai-community/gpt2
$ 2025-09-01 07:13:17 | INFO | root |     seq_length: 1024
$ 2025-09-01 07:13:17 | INFO | root |     split: (0.99, 0.01, 0.0)
$ 2025-09-01 07:13:17 | INFO | root |     splits_to_build: train
$ 2025-09-01 07:13:17 | INFO | root |   dataloader:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'torchdata.stateful_dataloader.stateful_dataloader.StatefulDataLoader'>
$ 2025-09-01 07:13:17 | INFO | root |     collate_fn: <function default_collate at 0x76c3155f8720>
$ 2025-09-01 07:13:17 | INFO | root |   validation_dataset:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining'>
$ 2025-09-01 07:13:17 | INFO | root |     paths: fineweb_edu/megatron_gpt2/processed_data_*_text_document*
$ 2025-09-01 07:13:17 | INFO | root |     index_mapping_dir: fineweb_edu/megatron_gpt2/mapping_dir
$ 2025-09-01 07:13:17 | INFO | root |     tokenizer:
$ 2025-09-01 07:13:17 | INFO | root |       _target_: <bound method AutoTokenizer.from_pretrained of <class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>>
$ 2025-09-01 07:13:17 | INFO | root |       pretrained_model_name_or_path: openai-community/gpt2
$ 2025-09-01 07:13:17 | INFO | root |     seq_length: 1024
$ 2025-09-01 07:13:17 | INFO | root |     split: (0.99, 0.01, 0.0)
$ 2025-09-01 07:13:17 | INFO | root |     splits_to_build: validation
$ 2025-09-01 07:13:17 | INFO | root |     num_val_samples: 1024
$ 2025-09-01 07:13:17 | INFO | root |   validation_dataloader:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'torchdata.stateful_dataloader.stateful_dataloader.StatefulDataLoader'>
$ 2025-09-01 07:13:17 | INFO | root |     collate_fn: <function default_collate at 0x76c3155f8720>
$ 2025-09-01 07:13:17 | INFO | root |   optimizer:
$ 2025-09-01 07:13:17 | INFO | root |     _target_: <class 'torch.optim.adamw.AdamW'>
$ 2025-09-01 07:13:17 | INFO | root |     betas: [0.9, 0.95]
$ 2025-09-01 07:13:17 | INFO | root |     lr: 0.0006
$ 2025-09-01 07:13:17 | INFO | root |     weight_decay: 0.1
$ 2025-09-01 07:13:17 | INFO | root |   lr_scheduler:
$ 2025-09-01 07:13:17 | INFO | root |     lr_decay_style: cosine
$ 2025-09-01 07:13:17 | INFO | root |     lr_warmup_steps: 700
$ 2025-09-01 07:13:17 | INFO | root |     min_lr: 0.0
$ 2025-09-01 07:13:17 | INFO | root | Library versions:
$ 2025-09-01 07:13:17 | INFO | root | - nemo_automodel: 0.2.0rc0 (/opt/Automodel/nemo_automodel/__init__.py)
$ 2025-09-01 07:13:17 | INFO | root | - transformers: 4.55.4 (/opt/venv/lib/python3.12/site-packages/transformers/__init__.py)
$ 2025-09-01 07:13:17 | INFO | root | - torch: 2.8.0+cu128 CUDA 12.8
$ 2025-09-01 07:13:27 | INFO | root | Patched model with SDPA method= [<SDPBackend.CUDNN_ATTENTION: 3>, <SDPBackend.FLASH_ATTENTION: 1>, <SDPBackend.EFFICIENT_ATTENTION: 2>, <SDPBackend.MATH: 0>]
$ 2025-09-01 07:13:27 | INFO | root | Model summary:
$ 2025-09-01 07:13:27 | INFO | root | --------------------------------
$ 2025-09-01 07:13:27 | INFO | root | Trainable parameters: 124,439,808
$ 2025-09-01 07:13:27 | INFO | root | Total parameters: 124,439,808
$ 2025-09-01 07:13:27 | INFO | root | Trainable parameters percentage: 100.00%
$ 2025-09-01 07:13:27 | INFO | root | Param L2 norm: 234.2000
$ 2025-09-01 07:13:27 | INFO | root | --------------------------------
$ /opt/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. 
$   warnings.warn(  # warn only once
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Let split_matrix = [(0, 0.99), (0.99, 1.0), None]
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.builder | Building GPTDataset splits with sizes=[9472000, 37888, None] and config=[random_seed: 1234, sequence_length: 1024, blend: [['fineweb_edu/megatron_gpt2/processed_data_0_text_document', 'fineweb_edu/megatron_gpt2/processed_data_10_text_document', 'fineweb_edu/megatron_gpt2/processed_data_11_text_document', 'fineweb_edu/megatron_gpt2/processed_data_12_text_document', 'fineweb_edu/megatron_gpt2/processed_data_13_text_document', 'fineweb_edu/megatron_gpt2/processed_data_1_text_document', 'fineweb_edu/megatron_gpt2/processed_data_2_text_document', 'fineweb_edu/megatron_gpt2/processed_data_3_text_document', 'fineweb_edu/megatron_gpt2/processed_data_4_text_document', 'fineweb_edu/megatron_gpt2/processed_data_5_text_document', 'fineweb_edu/megatron_gpt2/processed_data_6_text_document', 'fineweb_edu/megatron_gpt2/processed_data_7_text_document', 'fineweb_edu/megatron_gpt2/processed_data_8_text_document', 'fineweb_edu/megatron_gpt2/processed_data_9_text_document'], None], blend_per_split: None, split: 0.99, 0.01, 0.0, num_dataset_builder_threads: 1, path_to_cache: fineweb_edu/megatron_gpt2/mapping_dir, mmap_bin_files: True, tokenizer: openai-community/gpt2, mid_level_dataset_surplus: 0.005, reset_position_ids: False, reset_attention_mask: False, eod_mask_loss: False, create_attention_mask: False, drop_last_partial_validation_sequence: True, add_extra_token_to_sequence: True, split_matrix: [(0, 0.99), (0.99, 1.0), None]]
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_0_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 713000 | Documents: 713000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 728328
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_10_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 734000 | Documents: 734000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 725047
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_11_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 724000 | Documents: 724000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726124
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_12_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 745000 | Documents: 745000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 723682
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_13_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 738000 | Documents: 738000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 725268
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_1_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 727000 | Documents: 727000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726263
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_2_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 730000 | Documents: 730000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726543
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_3_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 725000 | Documents: 725000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726632
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_4_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 732000 | Documents: 732000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726860
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_5_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 726000 | Documents: 726000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 727143
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_6_text_document.idx
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 735000 | Documents: 735000
> 2025-09-01 07:13:28 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 725603
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_7_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 732000 | Documents: 732000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726076
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_8_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 182101 | Documents: 182101
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 182792
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_9_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 729000 | Documents: 729000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset train indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 726153
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> /opt/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. 
>   warnings.warn(  # warn only once
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.builder | Build and save the BlendedDataset indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.builder |  Build and save the dataset and dataset sample indexes
> 2025-09-01 07:13:29 | INFO | root | Instantiating MegatronPretrainingSampler with total_samples: 9472000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Let split_matrix = [(0, 0.99), (0.99, 1.0), None]
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.builder | Building GPTDataset splits with sizes=[9472000, 1024, None] and config=[random_seed: 1234, sequence_length: 1024, blend: [['fineweb_edu/megatron_gpt2/processed_data_0_text_document', 'fineweb_edu/megatron_gpt2/processed_data_10_text_document', 'fineweb_edu/megatron_gpt2/processed_data_11_text_document', 'fineweb_edu/megatron_gpt2/processed_data_12_text_document', 'fineweb_edu/megatron_gpt2/processed_data_13_text_document', 'fineweb_edu/megatron_gpt2/processed_data_1_text_document', 'fineweb_edu/megatron_gpt2/processed_data_2_text_document', 'fineweb_edu/megatron_gpt2/processed_data_3_text_document', 'fineweb_edu/megatron_gpt2/processed_data_4_text_document', 'fineweb_edu/megatron_gpt2/processed_data_5_text_document', 'fineweb_edu/megatron_gpt2/processed_data_6_text_document', 'fineweb_edu/megatron_gpt2/processed_data_7_text_document', 'fineweb_edu/megatron_gpt2/processed_data_8_text_document', 'fineweb_edu/megatron_gpt2/processed_data_9_text_document'], None], blend_per_split: None, split: 0.99, 0.01, 0.0, num_dataset_builder_threads: 1, path_to_cache: fineweb_edu/megatron_gpt2/mapping_dir, mmap_bin_files: True, tokenizer: openai-community/gpt2, mid_level_dataset_surplus: 0.005, reset_position_ids: False, reset_attention_mask: False, eod_mask_loss: False, create_attention_mask: False, drop_last_partial_validation_sequence: True, add_extra_token_to_sequence: True, split_matrix: [(0, 0.99), (0.99, 1.0), None]]
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_0_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 713000 | Documents: 713000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7221
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_10_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 734000 | Documents: 734000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7215
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_11_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 724000 | Documents: 724000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7502
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_12_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 745000 | Documents: 745000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7209
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_13_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 738000 | Documents: 738000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7453
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_1_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 727000 | Documents: 727000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7492
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_2_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 730000 | Documents: 730000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7464
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_3_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 725000 | Documents: 725000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7362
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_4_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 732000 | Documents: 732000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7520
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_5_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 726000 | Documents: 726000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7326
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_6_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 735000 | Documents: 735000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7498
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_7_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 732000 | Documents: 732000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7531
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_8_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 182101 | Documents: 182101
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 1912
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Loading index file fineweb_edu/megatron_gpt2/processed_data_9_text_document.idx
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence lengths
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting sequence pointers
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Extracting document indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.indexed_dataset | Sequences: 729000 | Documents: 729000
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | Build and save the GPTDataset valid indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of samples: 7462
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.gpt_dataset | > total number of epochs: 1
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.builder | Build and save the BlendedDataset indices
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.datasets.llm.megatron.builder |  Build and save the dataset and dataset sample indexes
> 2025-09-01 07:13:29 | INFO | root | Instantiating MegatronPretrainingSampler with total_samples: 1024
> 2025-09-01 07:13:29 | INFO | nemo_automodel.recipes.llm.train_ft | Building LR scheduler with total_steps=18500, warmup_steps=700, decay_style=cosine
> 2025-09-01 07:13:29 | INFO | nemo_automodel.components.optim.scheduler | learning rate decay style: cosine
> 2025-09-01 07:13:29 | INFO | root | Model Part 0:
> 2025-09-01 07:13:29 | INFO | root | FSDPGPT2LMHeadModel(
> 2025-09-01 07:13:29 | INFO | root |   (transformer): GPT2Model(
> 2025-09-01 07:13:29 | INFO | root |     (wte): Embedding(50257, 768)
> 2025-09-01 07:13:29 | INFO | root |     (wpe): Embedding(1024, 768)
> 2025-09-01 07:13:29 | INFO | root |     (drop): Dropout(p=0.1, inplace=False)
> 2025-09-01 07:13:29 | INFO | root |     (h): ModuleList(
> 2025-09-01 07:13:29 | INFO | root |       (0-11): 12 x FSDPGPT2Block(
> 2025-09-01 07:13:29 | INFO | root |         (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
> 2025-09-01 07:13:29 | INFO | root |         (attn): GPT2Attention(
> 2025-09-01 07:13:29 | INFO | root |           (c_attn): Conv1D(nf=2304, nx=768)
> 2025-09-01 07:13:29 | INFO | root |           (c_proj): Conv1D(nf=768, nx=768)
> 2025-09-01 07:13:29 | INFO | root |           (attn_dropout): Dropout(p=0.1, inplace=False)
> 2025-09-01 07:13:29 | INFO | root |           (resid_dropout): Dropout(p=0.1, inplace=False)
> 2025-09-01 07:13:29 | INFO | root |         )
> 2025-09-01 07:13:29 | INFO | root |         (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
> 2025-09-01 07:13:29 | INFO | root |         (mlp): GPT2MLP(
> 2025-09-01 07:13:29 | INFO | root |           (c_fc): Conv1D(nf=3072, nx=768)
> 2025-09-01 07:13:29 | INFO | root |           (c_proj): Conv1D(nf=768, nx=3072)
> 2025-09-01 07:13:29 | INFO | root |           (act): NewGELUActivation()
> 2025-09-01 07:13:29 | INFO | root |           (dropout): Dropout(p=0.1, inplace=False)
> 2025-09-01 07:13:29 | INFO | root |         )
> 2025-09-01 07:13:29 | INFO | root |       )
> 2025-09-01 07:13:29 | INFO | root |     )
> 2025-09-01 07:13:29 | INFO | root |     (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
> 2025-09-01 07:13:29 | INFO | root |   )
> 2025-09-01 07:13:29 | INFO | root |   (lm_head): Linear(in_features=768, out_features=50257, bias=False)
> 2025-09-01 07:13:29 | INFO | root | )
> 2025-09-01 07:13:29 | INFO | root | Optimizer:
> 2025-09-01 07:13:29 | INFO | root | AdamW (
> 2025-09-01 07:13:29 | INFO | root | Parameter Group 0
> 2025-09-01 07:13:29 | INFO | root |     amsgrad: False
> 2025-09-01 07:13:29 | INFO | root |     betas: [0.9, 0.95]
> 2025-09-01 07:13:29 | INFO | root |     capturable: False
> 2025-09-01 07:13:29 | INFO | root |     decoupled_weight_decay: True
> 2025-09-01 07:13:29 | INFO | root |     differentiable: False
> 2025-09-01 07:13:29 | INFO | root |     eps: 1e-08
> 2025-09-01 07:13:29 | INFO | root |     foreach: None
> 2025-09-01 07:13:29 | INFO | root |     fused: None
> 2025-09-01 07:13:29 | INFO | root |     lr: 5.9999999999999995e-05
> 2025-09-01 07:13:29 | INFO | root |     maximize: False
> 2025-09-01 07:13:29 | INFO | root |     weight_decay: 0.1
> 2025-09-01 07:13:29 | INFO | root | )
> 2025-09-01 07:13:29 | INFO | root | LR scheduler:
> 2025-09-01 07:13:29 | INFO | root | OptimizerParamScheduler(
> 2025-09-01 07:13:29 | INFO | root |     optimizer: AdamW
> 2025-09-01 07:13:29 | INFO | root |     learning_rate:
> 2025-09-01 07:13:29 | INFO | root |         init_lr: 5.9999999999999995e-05
> 2025-09-01 07:13:29 | INFO | root |         max_lr: 0.0006
> 2025-09-01 07:13:29 | INFO | root |         min_lr: 0.0
> 2025-09-01 07:13:29 | INFO | root |         warmup_steps: 700
> 2025-09-01 07:13:29 | INFO | root |         decay_steps: 18500
> 2025-09-01 07:13:29 | INFO | root |         decay_style: cosine
> 2025-09-01 07:13:29 | INFO | root |     weight_decay:
> 2025-09-01 07:13:29 | INFO | root |         start_wd: 0.1
> 2025-09-01 07:13:29 | INFO | root |         end_wd: 0.1
> 2025-09-01 07:13:29 | INFO | root |         incr_steps: 18500
> 2025-09-01 07:13:29 | INFO | root |         incr_style: constant
> 2025-09-01 07:13:29 | INFO | root |     current_step: 0
> 2025-09-01 07:13:29 | INFO | root | )
> 2025-09-01 07:13:29 | INFO | root | Step scheduler:
> 2025-09-01 07:13:29 | INFO | root | - Gradient accumulation steps: 8
> 2025-09-01 07:13:29 | INFO | root | - Checkpoint every steps: 1000
> 2025-09-01 07:13:29 | INFO | root | - Current Epoch: 0
> 2025-09-01 07:13:29 | INFO | root | - Number of epochs: 1
> 2025-09-01 07:13:29 | INFO | root | - Validation every steps: 250
> 2025-09-01 07:13:29 | INFO | root | - Max train steps: 18500
> 2025-09-01 07:13:33 | INFO | root | step 1 | epoch 0 | loss 10.9521 | grad_norm 12.9375 | lr 6.08e-05 | mem 38.39 GiB | tps 132005.57(66002.79/gpu) | num_label_tokens 524288
> 2025-09-01 07:13:37 | INFO | root | step 2 | epoch 0 | loss 10.1146 | grad_norm 6.0312 | lr 6.15e-05 | mem 38.63 GiB | tps 146246.38(73123.19/gpu) | num_label_tokens 524288
> 2025-09-01 07:13:41 | INFO | root | step 3 | epoch 0 | loss 9.7842 | grad_norm 3.0781 | lr 6.23e-05 | mem 38.63 GiB | tps 145236.76(72618.38/gpu) | num_label_tokens 524288
> 2025-09-01 07:13:44 | INFO | root | step 4 | epoch 0 | loss 9.6514 | grad_norm 2.2812 | lr 6.31e-05 | mem 38.63 GiB | tps 144882.21(72441.11/gpu) | num_label_tokens 524288
> 2025-09-01 07:13:48 | INFO | root | step 5 | epoch 0 | loss 9.5964 | grad_norm 2.2188 | lr 6.39e-05 | mem 38.63 GiB | tps 144711.55(72355.78/gpu) | num_label_tokens 524288

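For each training step, the pretraining recipe logs the current loss, gradient norm, and learning rate, along with peak memory usage, tokens per second (TPS, both in total and per GPU), and the number of label tokens processed in that step.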
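
The step lines in the log above can be turned into structured metrics for plotting or comparison. The following is a minimal sketch, not part of NeMo AutoModel: the regular expression is inferred from the sample output shown above, and `parse_step_line` is a hypothetical helper name. If the log format differs in your version, the pattern will need adjusting.

```python
import re

# Pattern for one step line as printed in the log above. The field layout is
# inferred from that sample output, not from a documented format.
STEP_RE = re.compile(
    r"step (?P<step>\d+) \| epoch (?P<epoch>\d+) \| loss (?P<loss>[\d.]+)"
    r" \| grad_norm (?P<grad_norm>[\d.]+) \| lr (?P<lr>[\d.eE+-]+)"
    r" \| mem (?P<mem_gib>[\d.]+) GiB"
    r" \| tps (?P<tps>[\d.]+)\((?P<tps_per_gpu>[\d.]+)/gpu\)"
    r" \| num_label_tokens (?P<num_label_tokens>\d+)"
)

INT_KEYS = {"step", "epoch", "num_label_tokens"}


def parse_step_line(line):
    """Return a dict of metrics for a step line, or None for any other line."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    return {
        key: int(value) if key in INT_KEYS else float(value)
        for key, value in match.groupdict().items()
    }


# First step line from the log above.
sample = (
    "2025-09-01 07:13:33 | INFO | root | step 1 | epoch 0 | loss 10.9521 | "
    "grad_norm 12.9375 | lr 6.08e-05 | mem 38.39 GiB | "
    "tps 132005.57(66002.79/gpu) | num_label_tokens 524288"
)
metrics = parse_step_line(sample)
print(metrics["step"], metrics["loss"], metrics["tps"], metrics["tps_per_gpu"])
```

Lines that are not step lines (dataset and model setup messages, warnings) return `None`, so the helper can be applied to every line of a log file and the `None` results filtered out.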

As training progresses, the loss should fall steadily and then flatten as the model converges; in the log above it already drops from 10.95 to 9.60 over the first five steps. To verify your results, compare your loss curve against the baseline results provided in the llm.c repository.
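
Before comparing curves, it can help to confirm that the numbers in the log are mutually consistent. The sketch below uses only values copied from the log above. The linear warmup formula is an assumption about how the scheduler interpolates; it is included because it reproduces the learning rates logged for steps 1 to 5 of this run.

```python
import math

# Values copied from the log above.
seq_len = 1024             # sequence_length in the dataset config
tokens_per_step = 524288   # num_label_tokens on each step line
max_steps = 18500          # "Max train steps"
train_samples = 9472000    # total_samples of the training sampler
vocab_size = 50257         # rows of the wte embedding
init_lr, max_lr, warmup_steps = 6e-5, 6e-4, 700

# Each optimizer step consumes tokens_per_step / seq_len sequences.
sequences_per_step = tokens_per_step // seq_len
total_tokens = tokens_per_step * max_steps

# A model predicting uniformly over the vocabulary has cross-entropy ln(V),
# which is the loss to expect from a randomly initialized model.
uniform_loss = math.log(vocab_size)


def warmup_lr(step):
    """Linear warmup from init_lr to max_lr (assumed form, see lead-in)."""
    return init_lr + (max_lr - init_lr) * step / warmup_steps


print(sequences_per_step)                    # 512
print(sequences_per_step * max_steps)        # 9472000, equals train_samples
print(f"{total_tokens / 1e9:.2f}B tokens")   # 9.70B tokens
print(f"{uniform_loss:.2f}")                 # 10.82
print([f"{warmup_lr(s):.2e}" for s in range(1, 6)])
```

The training sampler size matches 18,500 steps of 512 sequences, the total token budget is roughly the 10 billion tokens of the dataset subset, and the first logged loss (10.95) sits close to the uniform-prediction value of about 10.82, as expected at the start of pretraining from random weights.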

Example of GPT-2 training convergence on FineWeb-Edu-10B.