# Adding the OAS Dataset: Modifying the Dataset Class
This tutorial is the second part of a series focused on adding a new dataset to BioNeMo using the Observed Antibody Space (OAS) database. There are three steps to this task:

1. Preprocessing, which includes downloading the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. The preprocessing step can make use of two BioNeMo base classes, `RemoteResource` and `ResourcePreprocessor`, from `bionemo.utils.remote` and `bionemo.data.preprocess.dna.preprocess`, respectively. Their use is optional, but they provide some basic functionality that can accelerate development. This step was accomplished in the previous tutorial, Downloading and Preprocessing.
2. Development of the new dataset class. Here, the NeMo dataset class `CSVMemMapDataset` will be used. This step will be completed during the current tutorial.
3. Modification of the dataloader classes. This task will be covered by the third tutorial, Adding a Custom Dataloader.
This tutorial assumes the first step has been completed successfully.
## Setup and Assumptions
This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at `/workspace/bionemo`, as described in the Code Development section of the Quickstart Guide. This path will be referred to with the variable `BIONEMO_WORKSPACE` in the tutorial.
All commands should be executed inside the BioNeMo docker container.
## Configuring the CSV Memory Mapped Dataset
### Custom YAML Config
BioNeMo uses memory mapping to enable the flexibility of text-based data formats, such as CSV, while also minimizing memory usage. The key elements of the `CSVMemMapDataset` dataset that must be changed in the `model.data` section of the YAML configuration file are:

- `dataset_path`: the parent directory for the data, which contains the `train`, `val`, and `test` folders. For the OAS heavy chain data, the path is `/data/OASpaired/processed/heavy`.
- `data_col`: the zero-based integer index of the column containing the pretraining data. This will be set to `1` to select the column `sequence_heavy`; a quick way to confirm this index is sketched after this list.
- `data_sep`: the delimiter for the CSV dataset, which defaults to `','`. This will not need to be changed.
- `header_lines`: the number of header lines in the data files, which defaults to `1`. This will not need to be changed.
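For example, the correct value for `data_col` can be confirmed by inspecting the header of one of the processed shards. The following is a minimal sketch, assuming the shard path and header layout produced by the preprocessing tutorial:

```python
import csv

# Minimal sketch, assuming the shard path and header layout produced by the
# preprocessing tutorial; adjust the path if your data lives elsewhere.
with open("/data/OASpaired/processed/heavy/train/x000.csv") as handle:
    header = next(csv.reader(handle))

# The 0-based position of 'sequence_heavy' is the value to use for data_col.
print(header.index("sequence_heavy"))  # expected: 1
```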
The range of existing data files must also be updated to reflect that there are six files (named `x000.csv` through `x005.csv`) for training and two files (`x000.csv` and `x001.csv`) each for validation and test. A quick listing to confirm this layout on disk is sketched below. `do_training` will also be set to `True`, since a pretraining run is required to test the dataset class.
The YAML configuration file below demonstrates these changes. Config files are located in `{BIONEMO_WORKSPACE}/examples/protein/esm1nv/conf/`:
```yaml
defaults:
  - base_config

name: esm1nv-oas
do_training: True ### Set to True to run training

###### Begin OAS Related Additions ######
trainer:
  devices: 1
  max_steps: 200 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  val_check_interval: 100
  limit_val_batches: 10 # number of batches in validation step, use fraction for fraction of data, 0 to disable
###### End OAS Related Additions ######

exp_manager:
  create_wandb_logger: False

restore_from_path: null # used when starting from a .nemo file

model:
  tokenizer:
    library: 'sentencepiece'
    type: null
    model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
    vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
  data:
    ###### Begin OAS Related Additions ######
    dataset_path: /data/OASpaired/processed/heavy # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
    dataset: # inclusive range of data files to load x[000..049] or can be a single file, e.g. x000
      train: x[000..005]
      test: x[000..001]
      val: x[000..001]
    data_impl_kwargs:
      csv_mmap:
        data_col: 1 # 0-based
    ###### End OAS Related Additions ######
    micro_batch_size: ${model.micro_batch_size}
    num_workers: 10
    modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
    perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.
```
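The `dataset` values use an inclusive range syntax: `x[000..005]` expands to the file stems `x000` through `x005`, and a single stem such as `x000` may be given instead. The sketch below is an illustrative re-implementation of that expansion, not the BioNeMo library code:

```python
import re

def expand_dataset_range(spec):
    """Expand 'x[000..005]'-style inclusive ranges; pass single stems through."""
    match = re.fullmatch(r"x\[(\d+)\.\.(\d+)\]", spec)
    if match is None:
        return [spec]  # already a single file stem, e.g. 'x000'
    lo, hi = match.groups()
    width = len(lo)  # preserve the zero-padding of the lower bound
    return [f"x{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]

print(expand_dataset_range("x[000..005]"))
# ['x000', 'x001', 'x002', 'x003', 'x004', 'x005']
```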
## Testing
No additional changes to other files should be needed before testing.
As before, execute the pretraining script:

```bash
cd examples/protein/esm1nv
python pretrain_oas.py
```
The entire log is shown this time for completeness, but the sections associated with loading data can be found by searching for the text "Loading data from".

```
[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'pretrain_oas': Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[NeMo I 2023-08-17 16:22:30 pretrain_oas:14]
************** Experiment configuration ***********
[NeMo I 2023-08-17 16:22:30 pretrain_oas:15]
name: esm1nv-oas
do_training: true
do_testing: false
restore_from_path: null
trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: 16
      logger: false
      enable_checkpointing: false
      replace_sampler_ddp: false
      max_epochs: null
      max_steps: 200
      log_every_n_steps: 10
      val_check_interval: 100
      limit_val_batches: 10
      limit_test_batches: 500
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      benchmark: false
    exp_manager:
      name: ${name}
      exp_dir: /result/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
      explicit_log_dir: ${.exp_dir}
      create_wandb_logger: false
      create_tensorboard_logger: true
      wandb_logger_kwargs:
        project: ${name}_pretraining
        name: ${name}_pretraining
        group: ${name}
        job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
        notes: 'date: ${now:%y%m%d-%H%M%S}'
        tags:
        - ${name}
        offline: false
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 10
        mode: min
        always_save_nemo: false
        filename: megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}
        model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    model:
      micro_batch_size: 8
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      seq_length: 512
      max_position_embeddings: ${.seq_length}
      encoder_seq_length: ${.seq_length}
      num_layers: 6
      hidden_size: 768
      ffn_hidden_size: 3072
      num_attention_heads: 12
      init_method_std: 0.02
      hidden_dropout: 0.1
      kv_channels: null
      apply_query_key_layer_scaling: true
      layernorm_epsilon: 1.0e-05
      make_vocab_size_divisible_by: 128
      pre_process: true
      post_process: true
      bert_binary_head: false
      resume_from_checkpoint: null
      masked_softmax_fusion: true
      tokenizer:
        library: sentencepiece
        type: null
        model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
        vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
        merge_file: null
      native_amp_init_scale: 4294967296
      native_amp_growth_interval: 1000
      fp32_residual_connection: false
      fp16_lm_cross_entropy: false
      seed: 1234
      use_cpu_initialization: false
      onnx_safe: false
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: 1
      data:
        ngc_registry_target: uniref50_2022_05
        ngc_registry_version: v23.06
        data_prefix: ''
        num_workers: 10
        dataloader_type: single
        reset_position_ids: false
        reset_attention_mask: false
        eod_mask_loss: false
        masked_lm_prob: 0.15
        short_seq_prob: 0.1
        skip_lines: 0
        drop_last: false
        pin_memory: false
        data_impl: csv_mmap
        data_impl_kwargs:
          csv_mmap:
            header_lines: 1
            newline_int: 10
            workers: ${model.data.num_workers}
            sort_dataset_paths: true
            data_sep: ','
            data_col: 1
        use_upsampling: true
        seed: ${model.seed}
        max_seq_length: ${model.seq_length}
        dataset_path: /data/OASpaired/processed/heavy
        dataset:
          train: x[000..005]
          test: x[000..001]
          val: x[000..001]
        micro_batch_size: ${model.micro_batch_size}
        modify_percent: 0.1
        perturb_percent: 0.5
      optim:
        name: fused_adam
        lr: 0.0002
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 500
          constant_steps: 50000
          min_lr: 2.0e-05
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/precision/native_amp.py:131: LightningDeprecationWarning: The `NativeMixedPrecisionPlugin` class has been renamed in v1.9.0 and will be removed in v2.0.0. Please use `pytorch_lightning.plugins.MixedPrecisionPlugin` instead.
rank_zero_deprecation(
[NeMo I 2023-08-17 16:22:30 utils:168] Selected Callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo E 2023-08-17 16:22:30 exp_manager:646] exp_manager received explicit_log_dir: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining and at least one of exp_dir: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2023-08-17 16:22:30 exp_manager:568] There was no checkpoint folder at checkpoint_dir :/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints. Training from scratch.
[NeMo I 2023-08-17 16:22:30 exp_manager:374] Experiments will be logged at /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining
[NeMo I 2023-08-17 16:22:30 exp_manager:797] TensorboardLogger has been set up
[NeMo W 2023-08-17 16:22:30 exp_manager:893] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 200. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2023-08-17 16:22:30 utils:191] Resuming training from checkpoint: None
[NeMo I 2023-08-17 16:22:30 utils:234]
************** Trainer configuration ***********
[NeMo I 2023-08-17 16:22:30 utils:235]
name: esm1nv-oas
do_training: true
do_testing: false
restore_from_path: null
trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: 16
      logger: false
      enable_checkpointing: false
      replace_sampler_ddp: false
      max_epochs: null
      max_steps: 200
      log_every_n_steps: 10
      val_check_interval: 100
      limit_val_batches: 10
      limit_test_batches: 500
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      benchmark: false
    exp_manager:
      name: ${name}
      exp_dir: /result/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
      explicit_log_dir: ${.exp_dir}
      create_wandb_logger: false
      create_tensorboard_logger: true
      wandb_logger_kwargs:
        project: ${name}_pretraining
        name: ${name}_pretraining
        group: ${name}
        job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
        notes: 'date: ${now:%y%m%d-%H%M%S}'
        tags:
        - ${name}
        offline: false
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 10
        mode: min
        always_save_nemo: false
        filename: megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}
        model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    model:
      micro_batch_size: 8
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      seq_length: 512
      max_position_embeddings: ${.seq_length}
      encoder_seq_length: ${.seq_length}
      num_layers: 6
      hidden_size: 768
      ffn_hidden_size: 3072
      num_attention_heads: 12
      init_method_std: 0.02
      hidden_dropout: 0.1
      kv_channels: null
      apply_query_key_layer_scaling: true
      layernorm_epsilon: 1.0e-05
      make_vocab_size_divisible_by: 128
      pre_process: true
      post_process: true
      bert_binary_head: false
      resume_from_checkpoint: null
      masked_softmax_fusion: true
      tokenizer:
        library: sentencepiece
        type: null
        model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
        vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
        merge_file: null
      native_amp_init_scale: 4294967296
      native_amp_growth_interval: 1000
      fp32_residual_connection: false
      fp16_lm_cross_entropy: false
      seed: 1234
      use_cpu_initialization: false
      onnx_safe: false
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: 1
      data:
        ngc_registry_target: uniref50_2022_05
        ngc_registry_version: v23.06
        data_prefix: ''
        num_workers: 10
        dataloader_type: single
        reset_position_ids: false
        reset_attention_mask: false
        eod_mask_loss: false
        masked_lm_prob: 0.15
        short_seq_prob: 0.1
        skip_lines: 0
        drop_last: false
        pin_memory: false
        data_impl: csv_mmap
        data_impl_kwargs:
          csv_mmap:
            header_lines: 1
            newline_int: 10
            workers: ${model.data.num_workers}
            sort_dataset_paths: true
            data_sep: ','
            data_col: 1
        use_upsampling: true
        seed: ${model.seed}
        max_seq_length: ${model.seq_length}
        dataset_path: /data/OASpaired/processed/heavy
        dataset:
          train: x[000..005]
          test: x[000..001]
          val: x[000..001]
        micro_batch_size: ${model.micro_batch_size}
        modify_percent: 0.1
        perturb_percent: 0.5
      optim:
        name: fused_adam
        lr: 0.0002
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 500
          constant_steps: 50000
          min_lr: 2.0e-05
      global_batch_size: 8
      precision: 16
[NeMo I 2023-08-17 16:22:30 pretrain_oas:21] ************** Starting Training ***********
[NeMo I 2023-08-17 16:22:30 megatron_init:231] Rank 0 has data parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:234] All data parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:235] Ranks 0 has data parallel rank: 0
[NeMo I 2023-08-17 16:22:30 megatron_init:243] Rank 0 has model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:244] All model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:254] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:258] All tensor model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:259] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-08-17 16:22:30 megatron_init:273] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:285] Rank 0 has embedding group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:291] All pipeline model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:292] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-08-17 16:22:30 megatron_init:293] All embedding group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:294] Rank 0 has embedding rank: 0
23-08-17 16:22:30 - PID:335 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo I 2023-08-17 16:22:30 tokenizer_utils:191] Getting SentencePiece with model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
[NeMo I 2023-08-17 16:22:30 megatron_base_model:229] Padded vocab_size: 128, original vocab_size: 30, dummy tokens: 98.
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/configuration_validator.py:175: UserWarning: The `batch_idx` argument in `ESM1nvModel.on_train_batch_start` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
rank_zero_warn(
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/configuration_validator.py:175: UserWarning: The `batch_idx` argument in `ESM1nvModel.on_train_batch_end` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
rank_zero_warn(
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Added key: store_based_barrier_key:2 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
Added key: store_based_barrier_key:3 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
Added key: store_based_barrier_key:4 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
Added key: store_based_barrier_key:5 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 1 nodes.
Added key: store_based_barrier_key:6 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 1 nodes.
Added key: store_based_barrier_key:7 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:7 with 1 nodes.
[NeMo I 2023-08-17 16:22:31 megatron_bert_model:563] Pipeline model parallel rank: 0, Tensor model parallel rank: 0, Number of model parameters on device: 4.36e+07. Total number of model parameters: 4.36e+07.
[NeMo I 2023-08-17 16:22:31 esm1nv_model:96] Building Bert datasets.
train:1600
Loading data from /data/OASpaired/processed/heavy/train/x000.csv, /data/OASpaired/processed/heavy/train/x001.csv, /data/OASpaired/processed/heavy/train/x002.csv, /data/OASpaired/processed/heavy/train/x003.csv, /data/OASpaired/processed/heavy/train/x004.csv, /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 6 data files using 10 workers
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:349] Time building 0 / 6 mem-mapped files: 0:00:00.148196
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x000.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x002.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x003.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x004.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:117] Time loading 6 mem-mapped files: 0:00:00.003227
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:31 dataset_utils:1341] > loading indexed mapping from /data/OASpaired/processed/heavy/train/__indexmap_1600mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:31 dataset_utils:1344] loaded indexed file in 0.001 seconds
[NeMo I 2023-08-17 16:22:31 dataset_utils:1345] total number of samples: 21110
val:240
Loading data from /data/OASpaired/processed/heavy/val/x000.csv, /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 2 data files using 10 workers
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:349] Time building 0 / 2 mem-mapped files: 0:00:00.137560
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/val/x000.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:117] Time loading 2 mem-mapped files: 0:00:00.001477
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:31 dataset_utils:1341] > loading indexed mapping from /data/OASpaired/processed/heavy/val/__indexmap_240mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:31 dataset_utils:1344] loaded indexed file in 0.000 seconds
[NeMo I 2023-08-17 16:22:31 dataset_utils:1345] total number of samples: 3754
test:4000
Loading data from /data/OASpaired/processed/heavy/test/x000.csv, /data/OASpaired/processed/heavy/test/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 2 data files using 10 workers
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:349] Time building 0 / 2 mem-mapped files: 0:00:00.165497
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/test/x000.csv
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/test/x001.csv
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:117] Time loading 2 mem-mapped files: 0:00:00.001370
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:32 dataset_utils:1341] > loading indexed mapping from /data/OASpaired/processed/heavy/test/__indexmap_4000mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:32 dataset_utils:1344] loaded indexed file in 0.000 seconds
[NeMo I 2023-08-17 16:22:32 dataset_utils:1345] total number of samples: 5822
[NeMo I 2023-08-17 16:22:32 esm1nv_model:114] Length of train dataset: 1600
[NeMo I 2023-08-17 16:22:32 esm1nv_model:115] Length of val dataset: 240
[NeMo I 2023-08-17 16:22:32 esm1nv_model:116] Length of test dataset: 4000
[NeMo I 2023-08-17 16:22:32 esm1nv_model:117] Finished building Bert datasets.
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:662] Setting up train dataloader with len(len(self._train_ds)): 1600 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 1600 and consumed_samples: 0
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:670] Setting up validation dataloader with len(len(self._validation_ds)): 240 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 240 and consumed_samples: 0
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:678] Setting up test dataloader with len(len(self._test_ds)): 4000 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 4000 and consumed_samples: 0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo I 2023-08-17 16:22:32 nlp_overrides:124] Configuring DDP for model parallelism.
[NeMo I 2023-08-17 16:22:32 modelPT:722] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0002
        weight_decay: 0.01
    Parameter Group 1
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0002
        weight_decay: 0.0
)
[NeMo I 2023-08-17 16:22:32 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fc96434f790>"
will be used during training (effective maximum steps = 200) -
Parameters :
(warmup_steps: 500
constant_steps: 50000
min_lr: 2.0e-05
max_steps: 200
)
| Name | Type | Params
----------------------------------------------------------------------------
0 | model | BertModel | 43.6 M
1 | model.language_model | TransformerLanguageModel | 43.0 M
2 | model.language_model.embedding | Embedding | 491 K
3 | model.language_model.encoder | ParallelTransformer | 42.5 M
4 | model.lm_head | BertLMHead | 592 K
5 | model.lm_head.dense | Linear | 590 K
6 | model.lm_head.layernorm | MixedFusedLayerNorm | 1.5 K
----------------------------------------------------------------------------
43.6 M Trainable params
0 Non-trainable params
43.6 M Total params
87.225 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s][NeMo W 2023-08-17 16:22:32 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py:401: UserWarning: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
rank_zero_warn(
Sanity Checking: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 50%|█████ | 1/2 [00:01<00:01, 1.44s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:01<00:00, 1.36it/s][NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('consumed_samples', ...)` in your `validation_epoch_end` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss_ECE', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('consumed_samples', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py:344: UserWarning: Found `dataloader_iter` argument in the `training_step`. Note that the support for this signature is experimental and the behavior is subject to change.
rank_zero_warn(
Training: 0it [00:00, ?it/s]
Training: 0%| | 0/220 [00:00<?, ?it/s]
Epoch 0: 0%| | 0/220 [00:00<?, ?it/s] [NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('consumed_samples', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
[NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0: 0%| | 1/220 [00:01<06:47, 1.86s/it]
Epoch 0: 0%| | 1/220 [00:01<06:47, 1.86s/it, loss=4.58, v_num=, reduced_train_loss=4.580, global_step=0.000, consumed_samples=0.000]
Epoch 0: 1%| | 2/220 [00:01<03:29, 1.04it/s, loss=4.58, v_num=, reduced_train_loss=4.580, global_step=0.000, consumed_samples=0.000]
Epoch 0: 1%| | 2/220 [00:01<03:29, 1.04it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=1.000, consumed_samples=8.000]
Epoch 0: 1%|▏ | 3/220 [00:01<02:22, 1.53it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=1.000, consumed_samples=8.000]
Epoch 0: 1%|▏ | 3/220 [00:01<02:22, 1.53it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=2.000, consumed_samples=16.00]
Epoch 0: 2%|▏ | 4/220 [00:02<01:48, 1.99it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=2.000, consumed_samples=16.00]
Epoch 0: 2%|▏ | 4/220 [00:02<01:48, 1.99it/s, loss=4.6, v_num=, reduced_train_loss=4.620, global_step=3.000, consumed_samples=24.00]
Epoch 0: 2%|▏ | 5/220 [00:02<01:28, 2.44it/s, loss=4.6, v_num=, reduced_train_loss=4.620, global_step=3.000, consumed_samples=24.00]
Epoch 0: 2%|▏ | 5/220 [00:02<01:28, 2.44it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=4.000, consumed_samples=32.00]
Epoch 0: 3%|▎ | 6/220 [00:02<01:14, 2.87it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=4.000, consumed_samples=32.00]
Epoch 0: 3%|▎ | 6/220 [00:02<01:14, 2.87it/s, loss=4.59, v_num=, reduced_train_loss=4.560, global_step=5.000, consumed_samples=40.00]
Epoch 0: 3%|▎ | 7/220 [00:02<01:05, 3.27it/s, loss=4.59, v_num=, reduced_train_loss=4.560, global_step=5.000, consumed_samples=40.00]
Epoch 0: 3%|▎ | 7/220 [00:02<01:05, 3.27it/s, loss=4.58, v_num=, reduced_train_loss=4.550, global_step=6.000, consumed_samples=48.00]
Epoch 0: 4%|▎ | 8/220 [00:02<00:58, 3.64it/s, loss=4.58, v_num=, reduced_train_loss=4.550, global_step=6.000, consumed_samples=48.00]
Epoch 0: 4%|▎ | 8/220 [00:02<00:58, 3.64it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=7.000, consumed_samples=56.00]
Epoch 0: 4%|▍ | 9/220 [00:02<00:52, 4.00it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=7.000, consumed_samples=56.00]
Epoch 0: 4%|▍ | 9/220 [00:02<00:52, 4.00it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=8.000, consumed_samples=64.00]
Epoch 0: 5%|▍ | 10/220 [00:02<00:48, 4.35it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=8.000, consumed_samples=64.00]
Epoch 0: 5%|▍ | 10/220 [00:02<00:48, 4.35it/s, loss=4.59, v_num=, reduced_train_loss=4.620, global_step=9.000, consumed_samples=72.00]
Epoch 0: 5%|▌ | 11/220 [00:02<00:44, 4.68it/s, loss=4.59, v_num=, reduced_train_loss=4.620, global_step=9.000, consumed_samples=72.00]
Epoch 0: 5%|▌ | 11/220 [00:02<00:44, 4.67it/s, loss=4.59, v_num=, reduced_train_loss=4.570, global_step=10.00, consumed_samples=80.00]
Epoch 0: 5%|▌ | 12/220 [00:02<00:41, 4.98it/s, loss=4.59, v_num=, reduced_train_loss=4.570, global_step=10.00, consumed_samples=80.00]
Epoch 0: 5%|▌ | 12/220 [00:02<00:41, 4.98it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=11.00, consumed_samples=88.00]
Epoch 0: 6%|▌ | 13/220 [00:02<00:39, 5.29it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=11.00, consumed_samples=88.00]
Epoch 0: 6%|▌ | 13/220 [00:02<00:39, 5.29it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=12.00, consumed_samples=96.00]
Epoch 0: 6%|▋ | 14/220 [00:02<00:36, 5.60it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=12.00, consumed_samples=96.00]
Epoch 0: 6%|▋ | 14/220 [00:02<00:36, 5.60it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=13.00, consumed_samples=104.0]
Epoch 0: 7%|▋ | 15/220 [00:02<00:34, 5.90it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=13.00, consumed_samples=104.0]
Epoch 0: 7%|▋ | 15/220 [00:02<00:34, 5.90it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=14.00, consumed_samples=112.0]
Epoch 0: 7%|▋ | 16/220 [00:02<00:32, 6.19it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=14.00, consumed_samples=112.0]
Epoch 0: 7%|▋ | 16/220 [00:02<00:32, 6.19it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=15.00, consumed_samples=120.0]
Epoch 0: 8%|▊ | 17/220 [00:02<00:31, 6.48it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=15.00, consumed_samples=120.0]
Epoch 0: 8%|▊ | 17/220 [00:02<00:31, 6.48it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=16.00, consumed_samples=128.0]
Epoch 0: 8%|▊ | 18/220 [00:02<00:29, 6.75it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=16.00, consumed_samples=128.0]
Epoch 0: 8%|▊ | 18/220 [00:02<00:29, 6.75it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=17.00, consumed_samples=136.0]
Epoch 0: 9%|▊ | 19/220 [00:02<00:28, 7.01it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=17.00, consumed_samples=136.0]
Epoch 0: 9%|▊ | 19/220 [00:02<00:28, 7.01it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=18.00, consumed_samples=144.0]
Epoch 0: 9%|▉ | 20/220 [00:02<00:27, 7.26it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=18.00, consumed_samples=144.0]
Epoch 0: 9%|▉ | 20/220 [00:02<00:27, 7.26it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=19.00, consumed_samples=152.0]
Epoch 0: 10%|▉ | 21/220 [00:02<00:26, 7.50it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=19.00, consumed_samples=152.0]
Epoch 0: 10%|▉ | 21/220 [00:02<00:26, 7.50it/s, loss=4.59, v_num=, reduced_train_loss=4.550, global_step=20.00, consumed_samples=160.0]
Epoch 0: 10%|█ | 22/220 [00:02<00:25, 7.74it/s, loss=4.59, v_num=, reduced_train_loss=4.550, global_step=20.00, consumed_samples=160.0]
Epoch 0: 10%|█ | 22/220 [00:02<00:25, 7.74it/s, loss=4.58, v_num=, reduced_train_loss=4.390, global_step=21.00, consumed_samples=168.0]
Epoch 0: 10%|█ | 23/220 [00:02<00:24, 7.94it/s, loss=4.58, v_num=, reduced_train_loss=4.390, global_step=21.00, consumed_samples=168.0]
Epoch 0: 10%|█ | 23/220 [00:02<00:24, 7.94it/s, loss=4.56, v_num=, reduced_train_loss=4.200, global_step=22.00, consumed_samples=176.0]
Epoch 0: 11%|█ | 24/220 [00:02<00:24, 8.14it/s, loss=4.56, v_num=, reduced_train_loss=4.200, global_step=22.00, consumed_samples=176.0]
Epoch 0: 11%|█ | 24/220 [00:02<00:24, 8.14it/s, loss=4.53, v_num=, reduced_train_loss=3.920, global_step=23.00, consumed_samples=184.0]
Epoch 0: 11%|█▏ | 25/220 [00:02<00:23, 8.34it/s, loss=4.53, v_num=, reduced_train_loss=3.920, global_step=23.00, consumed_samples=184.0]
Epoch 0: 11%|█▏ | 25/220 [00:02<00:23, 8.34it/s, loss=4.47, v_num=, reduced_train_loss=3.550, global_step=24.00, consumed_samples=192.0]
Epoch 0: 12%|█▏ | 26/220 [00:03<00:22, 8.55it/s, loss=4.47, v_num=, reduced_train_loss=3.550, global_step=24.00, consumed_samples=192.0]
Epoch 0: 12%|█▏ | 26/220 [00:03<00:22, 8.55it/s, loss=4.41, v_num=, reduced_train_loss=3.270, global_step=25.00, consumed_samples=200.0]
Epoch 0: 12%|█▏ | 27/220 [00:03<00:22, 8.74it/s, loss=4.41, v_num=, reduced_train_loss=3.270, global_step=25.00, consumed_samples=200.0]
Epoch 0: 12%|█▏ | 27/220 [00:03<00:22, 8.74it/s, loss=4.33, v_num=, reduced_train_loss=2.980, global_step=26.00, consumed_samples=208.0]
Epoch 0: 13%|█▎ | 28/220 [00:03<00:21, 8.94it/s, loss=4.33, v_num=, reduced_train_loss=2.980, global_step=26.00, consumed_samples=208.0]
Epoch 0: 13%|█▎ | 28/220 [00:03<00:21, 8.94it/s, loss=4.24, v_num=, reduced_train_loss=2.740, global_step=27.00, consumed_samples=216.0]
Epoch 0: 13%|█▎ | 29/220 [00:03<00:20, 9.10it/s, loss=4.24, v_num=, reduced_train_loss=2.740, global_step=27.00, consumed_samples=216.0]
Epoch 0: 13%|█▎ | 29/220 [00:03<00:20, 9.10it/s, loss=4.13, v_num=, reduced_train_loss=2.460, global_step=28.00, consumed_samples=224.0]
Epoch 0: 14%|█▎ | 30/220 [00:03<00:20, 9.27it/s, loss=4.13, v_num=, reduced_train_loss=2.460, global_step=28.00, consumed_samples=224.0]
Epoch 0: 14%|█▎ | 30/220 [00:03<00:20, 9.27it/s, loss=4.01, v_num=, reduced_train_loss=2.240, global_step=29.00, consumed_samples=232.0]
Epoch 0: 14%|█▍ | 31/220 [00:03<00:20, 9.41it/s, loss=4.01, v_num=, reduced_train_loss=2.240, global_step=29.00, consumed_samples=232.0]
Epoch 0: 14%|█▍ | 31/220 [00:03<00:20, 9.41it/s, loss=3.89, v_num=, reduced_train_loss=2.060, global_step=30.00, consumed_samples=240.0]
Epoch 0: 15%|█▍ | 32/220 [00:03<00:19, 9.58it/s, loss=3.89, v_num=, reduced_train_loss=2.060, global_step=30.00, consumed_samples=240.0]
Epoch 0: 15%|█▍ | 32/220 [00:03<00:19, 9.58it/s, loss=3.76, v_num=, reduced_train_loss=1.970, global_step=31.00, consumed_samples=248.0]
Epoch 0: 15%|█▌ | 33/220 [00:03<00:19, 9.75it/s, loss=3.76, v_num=, reduced_train_loss=1.970, global_step=31.00, consumed_samples=248.0]
Epoch 0: 15%|█▌ | 33/220 [00:03<00:19, 9.75it/s, loss=3.62, v_num=, reduced_train_loss=1.900, global_step=32.00, consumed_samples=256.0]
Epoch 0: 15%|█▌ | 34/220 [00:03<00:18, 9.91it/s, loss=3.62, v_num=, reduced_train_loss=1.900, global_step=32.00, consumed_samples=256.0]
Epoch 0: 15%|█▌ | 34/220 [00:03<00:18, 9.91it/s, loss=3.48, v_num=, reduced_train_loss=1.800, global_step=33.00, consumed_samples=264.0]
Epoch 0: 16%|█▌ | 35/220 [00:03<00:18, 10.07it/s, loss=3.48, v_num=, reduced_train_loss=1.800, global_step=33.00, consumed_samples=264.0]
Epoch 0: 16%|█▌ | 35/220 [00:03<00:18, 10.07it/s, loss=3.34, v_num=, reduced_train_loss=1.750, global_step=34.00, consumed_samples=272.0]
Epoch 0: 16%|█▋ | 36/220 [00:03<00:18, 10.22it/s, loss=3.34, v_num=, reduced_train_loss=1.750, global_step=34.00, consumed_samples=272.0]
Epoch 0: 16%|█▋ | 36/220 [00:03<00:18, 10.22it/s, loss=3.19, v_num=, reduced_train_loss=1.680, global_step=35.00, consumed_samples=280.0]
Epoch 0: 17%|█▋ | 37/220 [00:03<00:17, 10.37it/s, loss=3.19, v_num=, reduced_train_loss=1.680, global_step=35.00, consumed_samples=280.0]
Epoch 0: 17%|█▋ | 37/220 [00:03<00:17, 10.37it/s, loss=3.05, v_num=, reduced_train_loss=1.660, global_step=36.00, consumed_samples=288.0]
Epoch 0: 17%|█▋ | 38/220 [00:03<00:17, 10.52it/s, loss=3.05, v_num=, reduced_train_loss=1.660, global_step=36.00, consumed_samples=288.0]
Epoch 0: 17%|█▋ | 38/220 [00:03<00:17, 10.52it/s, loss=2.9, v_num=, reduced_train_loss=1.610, global_step=37.00, consumed_samples=296.0]
Epoch 0: 18%|█▊ | 39/220 [00:03<00:16, 10.65it/s, loss=2.9, v_num=, reduced_train_loss=1.610, global_step=37.00, consumed_samples=296.0]
Epoch 0: 18%|█▊ | 39/220 [00:03<00:16, 10.65it/s, loss=2.75, v_num=, reduced_train_loss=1.590, global_step=38.00, consumed_samples=304.0]
Epoch 0: 18%|█▊ | 40/220 [00:03<00:16, 10.76it/s, loss=2.75, v_num=, reduced_train_loss=1.590, global_step=38.00, consumed_samples=304.0]
Epoch 0: 18%|█▊ | 40/220 [00:03<00:16, 10.76it/s, loss=2.59, v_num=, reduced_train_loss=1.570, global_step=39.00, consumed_samples=312.0]
Epoch 0: 19%|█▊ | 41/220 [00:03<00:16, 10.89it/s, loss=2.59, v_num=, reduced_train_loss=1.570, global_step=39.00, consumed_samples=312.0]
Epoch 0: 19%|█▊ | 41/220 [00:03<00:16, 10.89it/s, loss=2.44, v_num=, reduced_train_loss=1.550, global_step=40.00, consumed_samples=320.0]
Epoch 0: 19%|█▉ | 42/220 [00:03<00:16, 11.02it/s, loss=2.44, v_num=, reduced_train_loss=1.550, global_step=40.00, consumed_samples=320.0]
Epoch 0: 19%|█▉ | 42/220 [00:03<00:16, 11.02it/s, loss=2.3, v_num=, reduced_train_loss=1.530, global_step=41.00, consumed_samples=328.0]
Epoch 0: 20%|█▉ | 43/220 [00:03<00:15, 11.15it/s, loss=2.3, v_num=, reduced_train_loss=1.530, global_step=41.00, consumed_samples=328.0]
Epoch 0: 20%|█▉ | 43/220 [00:03<00:15, 11.15it/s, loss=2.17, v_num=, reduced_train_loss=1.510, global_step=42.00, consumed_samples=336.0]
Epoch 0: 20%|██ | 44/220 [00:03<00:15, 11.27it/s, loss=2.17, v_num=, reduced_train_loss=1.510, global_step=42.00, consumed_samples=336.0]
Epoch 0: 20%|██ | 44/220 [00:03<00:15, 11.27it/s, loss=2.05, v_num=, reduced_train_loss=1.520, global_step=43.00, consumed_samples=344.0]
Epoch 0: 20%|██ | 45/220 [00:03<00:15, 11.39it/s, loss=2.05, v_num=, reduced_train_loss=1.520, global_step=43.00, consumed_samples=344.0]
Epoch 0: 20%|██ | 45/220 [00:03<00:15, 11.39it/s, loss=1.94, v_num=, reduced_train_loss=1.500, global_step=44.00, consumed_samples=352.0]
Epoch 0: 21%|██ | 46/220 [00:03<00:15, 11.50it/s, loss=1.94, v_num=, reduced_train_loss=1.500, global_step=44.00, consumed_samples=352.0]
Epoch 0: 21%|██ | 46/220 [00:03<00:15, 11.50it/s, loss=1.86, v_num=, reduced_train_loss=1.500, global_step=45.00, consumed_samples=360.0]
Epoch 0: 21%|██▏ | 47/220 [00:04<00:14, 11.62it/s, loss=1.86, v_num=, reduced_train_loss=1.500, global_step=45.00, consumed_samples=360.0]
Epoch 0: 21%|██▏ | 47/220 [00:04<00:14, 11.62it/s, loss=1.78, v_num=, reduced_train_loss=1.490, global_step=46.00, consumed_samples=368.0]
Epoch 0: 22%|██▏ | 48/220 [00:04<00:14, 11.73it/s, loss=1.78, v_num=, reduced_train_loss=1.490, global_step=46.00, consumed_samples=368.0]
Epoch 0: 22%|██▏ | 48/220 [00:04<00:14, 11.73it/s, loss=1.72, v_num=, reduced_train_loss=1.470, global_step=47.00, consumed_samples=376.0]
Epoch 0: 22%|██▏ | 49/220 [00:04<00:14, 11.85it/s, loss=1.72, v_num=, reduced_train_loss=1.470, global_step=47.00, consumed_samples=376.0]
Epoch 0: 22%|██▏ | 49/220 [00:04<00:14, 11.85it/s, loss=1.67, v_num=, reduced_train_loss=1.480, global_step=48.00, consumed_samples=384.0]
Epoch 0: 23%|██▎ | 50/220 [00:04<00:14, 11.96it/s, loss=1.67, v_num=, reduced_train_loss=1.480, global_step=48.00, consumed_samples=384.0]
Epoch 0: 23%|██▎ | 50/220 [00:04<00:14, 11.96it/s, loss=1.63, v_num=, reduced_train_loss=1.490, global_step=49.00, consumed_samples=392.0]
Epoch 0: 23%|██▎ | 51/220 [00:04<00:14, 12.06it/s, loss=1.63, v_num=, reduced_train_loss=1.490, global_step=49.00, consumed_samples=392.0]
Epoch 0: 23%|██▎ | 51/220 [00:04<00:14, 12.06it/s, loss=1.6, v_num=, reduced_train_loss=1.490, global_step=50.00, consumed_samples=400.0]
Epoch 0: 24%|██▎ | 52/220 [00:04<00:13, 12.16it/s, loss=1.6, v_num=, reduced_train_loss=1.490, global_step=50.00, consumed_samples=400.0]
Epoch 0: 24%|██▎ | 52/220 [00:04<00:13, 12.16it/s, loss=1.58, v_num=, reduced_train_loss=1.460, global_step=51.00, consumed_samples=408.0]
Epoch 0: 24%|██▍ | 53/220 [00:04<00:13, 12.25it/s, loss=1.58, v_num=, reduced_train_loss=1.460, global_step=51.00, consumed_samples=408.0]
Epoch 0: 24%|██▍ | 53/220 [00:04<00:13, 12.25it/s, loss=1.56, v_num=, reduced_train_loss=1.490, global_step=52.00, consumed_samples=416.0]
Epoch 0: 25%|██▍ | 54/220 [00:04<00:13, 12.34it/s, loss=1.56, v_num=, reduced_train_loss=1.490, global_step=52.00, consumed_samples=416.0]
Epoch 0: 25%|██▍ | 54/220 [00:04<00:13, 12.34it/s, loss=1.54, v_num=, reduced_train_loss=1.460, global_step=53.00, consumed_samples=424.0]
Epoch 0: 25%|██▌ | 55/220 [00:04<00:13, 12.44it/s, loss=1.54, v_num=, reduced_train_loss=1.460, global_step=53.00, consumed_samples=424.0]
Epoch 0: 25%|██▌ | 55/220 [00:04<00:13, 12.44it/s, loss=1.52, v_num=, reduced_train_loss=1.440, global_step=54.00, consumed_samples=432.0]
Epoch 0: 25%|██▌ | 56/220 [00:04<00:13, 12.54it/s, loss=1.52, v_num=, reduced_train_loss=1.440, global_step=54.00, consumed_samples=432.0]
Epoch 0: 25%|██▌ | 56/220 [00:04<00:13, 12.54it/s, loss=1.51, v_num=, reduced_train_loss=1.480, global_step=55.00, consumed_samples=440.0]
Epoch 0: 26%|██▌ | 57/220 [00:04<00:12, 12.63it/s, loss=1.51, v_num=, reduced_train_loss=1.480, global_step=55.00, consumed_samples=440.0]
Epoch 0: 26%|██▌ | 57/220 [00:04<00:12, 12.63it/s, loss=1.5, v_num=, reduced_train_loss=1.450, global_step=56.00, consumed_samples=448.0]
Epoch 0: 26%|██▋ | 58/220 [00:04<00:12, 12.71it/s, loss=1.5, v_num=, reduced_train_loss=1.450, global_step=56.00, consumed_samples=448.0]
Epoch 0: 26%|██▋ | 58/220 [00:04<00:12, 12.71it/s, loss=1.5, v_num=, reduced_train_loss=1.460, global_step=57.00, consumed_samples=456.0]
Epoch 0: 27%|██▋ | 59/220 [00:04<00:12, 12.80it/s, loss=1.5, v_num=, reduced_train_loss=1.460, global_step=57.00, consumed_samples=456.0]
Epoch 0: 27%|██▋ | 59/220 [00:04<00:12, 12.80it/s, loss=1.49, v_num=, reduced_train_loss=1.460, global_step=58.00, consumed_samples=464.0]
Epoch 0: 27%|██▋ | 60/220 [00:04<00:12, 12.89it/s, loss=1.49, v_num=, reduced_train_loss=1.460, global_step=58.00, consumed_samples=464.0]
Epoch 0: 27%|██▋ | 60/220 [00:04<00:12, 12.89it/s, loss=1.48, v_num=, reduced_train_loss=1.460, global_step=59.00, consumed_samples=472.0]
Epoch 0: 28%|██▊ | 61/220 [00:04<00:12, 12.97it/s, loss=1.48, v_num=, reduced_train_loss=1.460, global_step=59.00, consumed_samples=472.0]
Epoch 0: 28%|██▊ | 61/220 [00:04<00:12, 12.97it/s, loss=1.48, v_num=, reduced_train_loss=1.440, global_step=60.00, consumed_samples=480.0]
Epoch 0: 28%|██▊ | 62/220 [00:04<00:12, 13.06it/s, loss=1.48, v_num=, reduced_train_loss=1.440, global_step=60.00, consumed_samples=480.0]
Epoch 0: 28%|██▊ | 62/220 [00:04<00:12, 13.06it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=61.00, consumed_samples=488.0]
Epoch 0: 29%|██▊ | 63/220 [00:04<00:11, 13.15it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=61.00, consumed_samples=488.0]
Epoch 0: 29%|██▊ | 63/220 [00:04<00:11, 13.15it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=62.00, consumed_samples=496.0]
Epoch 0: 29%|██▉ | 64/220 [00:04<00:11, 13.23it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=62.00, consumed_samples=496.0]
Epoch 0: 29%|██▉ | 64/220 [00:04<00:11, 13.23it/s, loss=1.47, v_num=, reduced_train_loss=1.430, global_step=63.00, consumed_samples=504.0]
Epoch 0: 30%|██▉ | 65/220 [00:04<00:11, 13.31it/s, loss=1.47, v_num=, reduced_train_loss=1.430, global_step=63.00, consumed_samples=504.0]
Epoch 0: 30%|██▉ | 65/220 [00:04<00:11, 13.30it/s, loss=1.47, v_num=, reduced_train_loss=1.450, global_step=64.00, consumed_samples=512.0]
Epoch 0: 30%|███ | 66/220 [00:04<00:11, 13.38it/s, loss=1.47, v_num=, reduced_train_loss=1.450, global_step=64.00, consumed_samples=512.0]
Epoch 0: 30%|███ | 66/220 [00:04<00:11, 13.38it/s, loss=1.47, v_num=, reduced_train_loss=1.460, global_step=65.00, consumed_samples=520.0]
Epoch 0: 30%|███ | 67/220 [00:04<00:11, 13.45it/s, loss=1.47, v_num=, reduced_train_loss=1.460, global_step=65.00, consumed_samples=520.0]
Epoch 0: 30%|███ | 67/220 [00:04<00:11, 13.45it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=66.00, consumed_samples=528.0]
Epoch 0: 31%|███ | 68/220 [00:05<00:11, 13.53it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=66.00, consumed_samples=528.0]
Epoch 0: 31%|███ | 68/220 [00:05<00:11, 13.53it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=67.00, consumed_samples=536.0]
Epoch 0: 31%|███▏ | 69/220 [00:05<00:11, 13.60it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=67.00, consumed_samples=536.0]
Epoch 0: 31%|███▏ | 69/220 [00:05<00:11, 13.60it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=68.00, consumed_samples=544.0]
Epoch 0: 32%|███▏ | 70/220 [00:05<00:10, 13.67it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=68.00, consumed_samples=544.0]
Epoch 0: 32%|███▏ | 70/220 [00:05<00:10, 13.67it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=69.00, consumed_samples=552.0]
Epoch 0: 32%|███▏ | 71/220 [00:05<00:10, 13.68it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=69.00, consumed_samples=552.0]
Epoch 0: 32%|███▏ | 71/220 [00:05<00:10, 13.68it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=70.00, consumed_samples=560.0]
Epoch 0: 33%|███▎ | 72/220 [00:05<00:10, 13.72it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=70.00, consumed_samples=560.0]
Epoch 0: 33%|███▎ | 72/220 [00:05<00:10, 13.72it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=71.00, consumed_samples=568.0]
Epoch 0: 33%|███▎ | 73/220 [00:05<00:10, 13.78it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=71.00, consumed_samples=568.0]
Epoch 0: 33%|███▎ | 73/220 [00:05<00:10, 13.78it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=72.00, consumed_samples=576.0]
Epoch 0: 34%|███▎ | 74/220 [00:05<00:10, 13.85it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=72.00, consumed_samples=576.0]
Epoch 0: 34%|███▎ | 74/220 [00:05<00:10, 13.85it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=73.00, consumed_samples=584.0]
Epoch 0: 34%|███▍ | 75/220 [00:05<00:10, 13.90it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=73.00, consumed_samples=584.0]
Epoch 0: 34%|███▍ | 75/220 [00:05<00:10, 13.90it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=74.00, consumed_samples=592.0]
Epoch 0: 35%|███▍ | 76/220 [00:05<00:10, 13.96it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=74.00, consumed_samples=592.0]
Epoch 0: 35%|███▍ | 76/220 [00:05<00:10, 13.96it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=75.00, consumed_samples=600.0]
Epoch 0: 35%|███▌ | 77/220 [00:05<00:10, 14.00it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=75.00, consumed_samples=600.0]
Epoch 0: 35%|███▌ | 77/220 [00:05<00:10, 14.00it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=76.00, consumed_samples=608.0]
Epoch 0: 35%|███▌ | 78/220 [00:05<00:10, 14.05it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=76.00, consumed_samples=608.0]
Epoch 0: 35%|███▌ | 78/220 [00:05<00:10, 14.05it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=77.00, consumed_samples=616.0]
Epoch 0: 36%|███▌ | 79/220 [00:05<00:09, 14.12it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=77.00, consumed_samples=616.0]
Epoch 0: 36%|███▌ | 79/220 [00:05<00:09, 14.12it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=78.00, consumed_samples=624.0]
Epoch 0: 36%|███▋ | 80/220 [00:05<00:09, 14.18it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=78.00, consumed_samples=624.0]
Epoch 0: 36%|███▋ | 80/220 [00:05<00:09, 14.17it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=79.00, consumed_samples=632.0]
Epoch 0: 37%|███▋ | 81/220 [00:05<00:09, 14.23it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=79.00, consumed_samples=632.0]
Epoch 0: 37%|███▋ | 81/220 [00:05<00:09, 14.23it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=80.00, consumed_samples=640.0]
Epoch 0: 37%|███▋ | 82/220 [00:05<00:09, 14.28it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=80.00, consumed_samples=640.0]
Epoch 0: 37%|███▋ | 82/220 [00:05<00:09, 14.28it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=81.00, consumed_samples=648.0]
Epoch 0: 38%|███▊ | 83/220 [00:05<00:09, 14.34it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=81.00, consumed_samples=648.0]
Epoch 0: 38%|███▊ | 83/220 [00:05<00:09, 14.34it/s, loss=1.46, v_num=, reduced_train_loss=1.510, global_step=82.00, consumed_samples=656.0]
Epoch 0: 38%|███▊ | 84/220 [00:05<00:09, 14.40it/s, loss=1.46, v_num=, reduced_train_loss=1.510, global_step=82.00, consumed_samples=656.0]
Epoch 0: 38%|███▊ | 84/220 [00:05<00:09, 14.39it/s, loss=1.46, v_num=, reduced_train_loss=1.500, global_step=83.00, consumed_samples=664.0]
Epoch 0: 39%|███▊ | 85/220 [00:05<00:09, 14.44it/s, loss=1.46, v_num=, reduced_train_loss=1.500, global_step=83.00, consumed_samples=664.0]
Epoch 0: 39%|███▊ | 85/220 [00:05<00:09, 14.44it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=84.00, consumed_samples=672.0]
Epoch 0: 39%|███▉ | 86/220 [00:05<00:09, 14.48it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=84.00, consumed_samples=672.0]
Epoch 0: 39%|███▉ | 86/220 [00:05<00:09, 14.48it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=85.00, consumed_samples=680.0]
Epoch 0: 40%|███▉ | 87/220 [00:05<00:09, 14.53it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=85.00, consumed_samples=680.0]
Epoch 0: 40%|███▉ | 87/220 [00:05<00:09, 14.53it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=86.00, consumed_samples=688.0]
Epoch 0: 40%|████ | 88/220 [00:06<00:09, 14.59it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=86.00, consumed_samples=688.0]
Epoch 0: 40%|████ | 88/220 [00:06<00:09, 14.59it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=87.00, consumed_samples=696.0]
Epoch 0: 40%|████ | 89/220 [00:06<00:08, 14.65it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=87.00, consumed_samples=696.0]
Epoch 0: 40%|████ | 89/220 [00:06<00:08, 14.64it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=88.00, consumed_samples=704.0]
Epoch 0: 41%|████ | 90/220 [00:06<00:08, 14.69it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=88.00, consumed_samples=704.0]
Epoch 0: 41%|████ | 90/220 [00:06<00:08, 14.69it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=89.00, consumed_samples=712.0]
Epoch 0: 41%|████▏ | 91/220 [00:06<00:08, 14.72it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=89.00, consumed_samples=712.0]
...
Epoch 0: 45%|████▌ | 100/220 [00:06<00:07, 15.14it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]
Validation DataLoader 0: 100%|██████████| 10/10 [00:00<00:00, 36.39it/s]
Epoch 0: 50%|█████ | 110/220 [00:06<00:06, 15.93it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0, val_loss=1.420]
Epoch 0, global step 100: 'val_loss' reached 1.42065 (best 1.42065), saving model to '/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.42-step=100-consumed_samples=800.0.ckpt' as top 10
Epoch 0: 50%|█████ | 111/220 [00:08<00:08, 13.43it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=100.0, consumed_samples=800.0, val_loss=1.420]
...
Epoch 0: 95%|█████████▌| 210/220 [00:13<00:00, 15.90it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]
Validation DataLoader 0: 100%|██████████| 10/10 [00:00<00:00, 34.09it/s]
Epoch 0: 100%|██████████| 220/220 [00:13<00:00, 16.25it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.380]
Epoch 0, global step 200: 'val_loss' reached 1.38422 (best 1.38422), saving model to '/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.38-step=200-consumed_samples=1600.0.ckpt' as top 10
[NeMo I 2023-08-17 16:22:48 nlp_overrides:226] Removing checkpoint: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.42-step=100-consumed_samples=800.0-last.ckpt
Epoch 0: 100%|██████████| 220/220 [00:14<00:00, 14.79it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.380]
`Trainer.fit` stopped: `max_steps=200` reached.
[NeMo I 2023-08-17 16:22:49 pretrain_oas:24] ************** Finished Training ***********
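As a quick sanity check, the consumed_samples values in the log follow the formula noted in the trainer configuration. The sketch below is a minimal illustration; the micro batch size of 8 is an assumption inferred from the log, where consumed_samples grows by 8 per global step on a single device.

# Sanity-check consumed_samples from the log above:
# consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
global_step = 200            # max_steps for this run
micro_batch_size = 8         # assumption, inferred from the log (8 samples per step)
data_parallel_size = 1       # trainer.devices: 1
accumulate_grad_batches = 1  # assumption: no gradient accumulation

consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
print(consumed_samples)      # 1600, matching the final checkpoint name (consumed_samples=1600.0)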
Results#
The training run will create a directory called esm1nv-oas_pretraining in /result/nemo_experiments/esm1nv-oas containing the files (logs, checkpoints, etc.) for the training run:
checkpoints
cmd-args.log
events.out.tfevents.1692289352.drugdiscovery3-dt.335.0
git-info.log
hparams.yaml
lightning_logs.txt
nemo_error_log.txt
nemo_log_globalrank-0_localrank-0.txt
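To confirm that the run produced the expected artifacts, the directory contents can be checked programmatically. The snippet below is a minimal sketch, assuming the default result path used in this tutorial; the tfevents file is matched by pattern because its name includes a timestamp and hostname specific to each run.

from pathlib import Path

# Assumed default result path for this tutorial's experiment name
result_dir = Path("/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining")

expected = [
    "checkpoints",
    "cmd-args.log",
    "git-info.log",
    "hparams.yaml",
    "lightning_logs.txt",
    "nemo_error_log.txt",
    "nemo_log_globalrank-0_localrank-0.txt",
]

for name in expected:
    status = "found" if (result_dir / name).exists() else "MISSING"
    print(f"{status:7s} {name}")

# The TensorBoard event file name includes a timestamp, so match it by pattern
tfevents = list(result_dir.glob("events.out.tfevents.*"))
print(f"{'found' if tfevents else 'MISSING':7s} events.out.tfevents.*")

The event file can also be inspected visually with TensorBoard, for example by running tensorboard --logdir /result/nemo_experiments/esm1nv-oas and opening the reported URL in a browser.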