Adding the OAS Dataset: Modifying the Dataset Class#

This tutorial is the second part of a series focused on adding a new dataset to BioNeMo using the Observed Antibody Space (OAS) database. There are three steps to this task:

  1. Preprocessing includes downloading the raw data and any additional preparation steps, such as extracting the files. It also includes dividing the data into train, validation, and test splits. The preprocessing step can make use of two BioNeMo base classes, RemoteResource and ResourcePreprocessor, from bionemo.utils.remote and bionemo.data.preprocess.dna.preprocess, respectively. Their use is optional, but they provide basic functionality that can accelerate development. This step was completed in the previous tutorial, Downloading and Preprocessing.

  2. Development of the new dataset class. Here, the NeMo dataset class CSVMemMapDataset will be used. This step will be completed during the current tutorial.

  3. Modification of the dataloader classes. This task will be covered by the third tutorial, Adding a Custom Dataloader.

This tutorial assumes the first step has been completed successfully.

Setup and Assumptions#

This tutorial assumes that a copy of the BioNeMo framework repo exists on a workstation or server and has been mounted inside the container at /workspace/bionemo, as described in the Code Development section of the Quickstart Guide. This path will be referred to with the variable BIONEMO_WORKSPACE in the tutorial.

All commands should be executed inside the BioNeMo docker container.

Configuring the CSV Memory Mapped Dataset#

Custom YAML Config#

BioNeMo uses memory mapping to retain the flexibility of text-based data formats, such as CSV, while minimizing memory usage. The key CSVMemMapDataset settings that must be changed in the model.data section of the YAML configuration file are:

  • dataset_path: the parent directory of the data, which contains the train, val, and test folders. The list of data files for each split is built from this directory. For the OAS heavy chain data, the path is /data/OASpaired/processed/heavy.

  • data_col: the zero-based integer number of the column containing the pretraining data. This will be set to 1 to select the column sequence_heavy.

  • data_sep: the delimiter for the CSV dataset, defaults to ‘,’. This will not need to be changed.

  • header_lines: the number of header lines in the data files, defaults to 1. This will not need to be changed.
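
To make the effect of data_col, data_sep, and header_lines concrete, the following minimal sketch selects one column from a small in-memory CSV. It uses only the Python standard library and is not how CSVMemMapDataset is implemented. The sample rows and every column name other than sequence_heavy are hypothetical.

```python
import csv
import io

# Hypothetical sample that mimics a processed OAS file. Only the position of
# sequence_heavy (column 1) comes from the tutorial; the rest is made up.
SAMPLE_CSV = (
    "sequence_id,sequence_heavy,sequence_light\n"
    "seq0,EVQLVESGG,DIQMTQSPS\n"
    "seq1,QVQLQESGP,EIVLTQSPG\n"
)


def select_column(text, data_col=1, data_sep=",", header_lines=1):
    """Return the values of column `data_col`, skipping `header_lines` rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=data_sep))
    return [row[data_col] for row in rows[header_lines:]]


sequences = select_column(SAMPLE_CSV)
print(sequences)
```

With data_col left at 0 the same call would return the identifier column instead, which is why the value has to be changed for this dataset.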

The range of existing data files must also be updated to reflect that there are six files (named x000.csv through x005.csv) for training and two files (x000.csv and x001.csv) each for validation and test. do_training will also be set to True, since a pretraining run is required to test the dataset class.
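
The file ranges are written in the configuration with an inclusive bracket notation such as x[000..005]. The sketch below shows one way such a pattern could be expanded into individual file names. expand_range is a hypothetical helper written for illustration, not the function BioNeMo uses internally.

```python
import re


def expand_range(pattern):
    """Expand 'x[000..005]' into ['x000', ..., 'x005'] (inclusive range).

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"(.*)\[(\d+)\.\.(\d+)\](.*)", pattern)
    if match is None:
        # No bracketed range: treat the pattern as a single file name.
        return [pattern]
    prefix, start, stop, suffix = match.groups()
    width = len(start)  # preserve the zero padding of the lower bound
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(start), int(stop) + 1)
    ]


train_files = expand_range("x[000..005]")
val_files = expand_range("x[000..001]")
print(train_files)
print(len(train_files), len(val_files))
```

Because both bounds are included, x[000..005] covers six files and x[000..001] covers two, matching the file counts described above.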

The YAML configuration file below demonstrates these changes. Config files are located in {BIONEMO_WORKSPACE}/examples/protein/esm1nv/conf/

defaults:
  - base_config

name: esm1nv-oas
do_training: True ### Set to True to run training

###### Begin OAS Related Additions ######

trainer:
  devices: 1 
  max_steps: 200 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  val_check_interval: 100
  limit_val_batches: 10 # number of batches in validation step, use fraction for fraction of data, 0 to disable

###### End OAS Related Additions ######

exp_manager: 
  create_wandb_logger: False

restore_from_path: null # used when starting from a .nemo file

model:
  tokenizer:
    library: 'sentencepiece'
    type: null
    model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
    vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
  data:
  
    ###### Begin OAS Related Additions ######
  
    dataset_path: /data/OASpaired/processed/heavy # parent directory for data, contains train / val / test folders. Needs to be writeable for index creation.
    dataset: # inclusive range of data files to load, e.g. x[000..049], or a single file, e.g. x000
      train: x[000..005]
      test: x[000..001]
      val: x[000..001]
    data_impl_kwargs:
      csv_mmap:
        data_col: 1 # 0-based
  
    ###### End OAS Related Additions ######
  
    micro_batch_size: ${model.micro_batch_size}
    num_workers: 10
    modify_percent: 0.1 # Percentage of characters in a protein sequence to modify. (Modification means replacing with another amino acid or with a mask token)
    perturb_percent: 0.5 # Of the modify_percent, what percentage of characters are to be replaced with another amino acid.
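
The comment on max_steps in the configuration above gives a formula for the number of consumed samples. As a small worked example, the sketch below evaluates it with the values that appear in the resolved configuration printed in the training log (micro_batch_size of 8, a single device, accumulate_grad_batches of 1). Taking data_parallel_size as 1 is an assumption based on the single-device setup.

```python
def consumed_samples(global_step, micro_batch_size,
                     data_parallel_size, accumulate_grad_batches):
    """Formula quoted in the max_steps comment of the YAML config."""
    return (global_step * micro_batch_size
            * data_parallel_size * accumulate_grad_batches)


# max_steps=200 from the config; the other values are read from the resolved
# configuration in the log. data_parallel_size=1 is assumed for one device.
train_samples = consumed_samples(
    global_step=200,
    micro_batch_size=8,
    data_parallel_size=1,
    accumulate_grad_batches=1,
)
print(train_samples)
```

The result, 1600, agrees with the train:1600 line and the reported training dataset length in the log shown in the Testing section.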

Testing#

No additional changes should need to be made to other files before testing.

As before, execute the pretrain script:

cd examples/protein/esm1nv
python pretrain_oas.py

The entire log is shown this time for completeness, but the sections associated with loading data can be found by searching for the text “Loading data from”.

[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-17 16:22:29 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'pretrain_oas': Defaults list is missing `_self_`. See https://hydra.cc/docs/upgrades/1.0_to_1.1/default_composition_order for more information
      warnings.warn(msg, UserWarning)
    
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    
[NeMo I 2023-08-17 16:22:30 pretrain_oas:14] 
    
    ************** Experiment configuration ***********
[NeMo I 2023-08-17 16:22:30 pretrain_oas:15] 
    name: esm1nv-oas
    do_training: true
    do_testing: false
    restore_from_path: null
    trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: 16
      logger: false
      enable_checkpointing: false
      replace_sampler_ddp: false
      max_epochs: null
      max_steps: 200
      log_every_n_steps: 10
      val_check_interval: 100
      limit_val_batches: 10
      limit_test_batches: 500
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      benchmark: false
    exp_manager:
      name: ${name}
      exp_dir: /result/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
      explicit_log_dir: ${.exp_dir}
      create_wandb_logger: false
      create_tensorboard_logger: true
      wandb_logger_kwargs:
        project: ${name}_pretraining
        name: ${name}_pretraining
        group: ${name}
        job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
        notes: 'date: ${now:%y%m%d-%H%M%S}'
        tags:
        - ${name}
        offline: false
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 10
        mode: min
        always_save_nemo: false
        filename: megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}
        model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    model:
      micro_batch_size: 8
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      seq_length: 512
      max_position_embeddings: ${.seq_length}
      encoder_seq_length: ${.seq_length}
      num_layers: 6
      hidden_size: 768
      ffn_hidden_size: 3072
      num_attention_heads: 12
      init_method_std: 0.02
      hidden_dropout: 0.1
      kv_channels: null
      apply_query_key_layer_scaling: true
      layernorm_epsilon: 1.0e-05
      make_vocab_size_divisible_by: 128
      pre_process: true
      post_process: true
      bert_binary_head: false
      resume_from_checkpoint: null
      masked_softmax_fusion: true
      tokenizer:
        library: sentencepiece
        type: null
        model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
        vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
        merge_file: null
      native_amp_init_scale: 4294967296
      native_amp_growth_interval: 1000
      fp32_residual_connection: false
      fp16_lm_cross_entropy: false
      seed: 1234
      use_cpu_initialization: false
      onnx_safe: false
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: 1
      data:
        ngc_registry_target: uniref50_2022_05
        ngc_registry_version: v23.06
        data_prefix: ''
        num_workers: 10
        dataloader_type: single
        reset_position_ids: false
        reset_attention_mask: false
        eod_mask_loss: false
        masked_lm_prob: 0.15
        short_seq_prob: 0.1
        skip_lines: 0
        drop_last: false
        pin_memory: false
        data_impl: csv_mmap
        data_impl_kwargs:
          csv_mmap:
            header_lines: 1
            newline_int: 10
            workers: ${model.data.num_workers}
            sort_dataset_paths: true
            data_sep: ','
            data_col: 1
        use_upsampling: true
        seed: ${model.seed}
        max_seq_length: ${model.seq_length}
        dataset_path: /data/OASpaired/processed/heavy
        dataset:
          train: x[000..005]
          test: x[000..001]
          val: x[000..001]
        micro_batch_size: ${model.micro_batch_size}
        modify_percent: 0.1
        perturb_percent: 0.5
      optim:
        name: fused_adam
        lr: 0.0002
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 500
          constant_steps: 50000
          min_lr: 2.0e-05
    
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/precision/native_amp.py:131: LightningDeprecationWarning: The `NativeMixedPrecisionPlugin` class has been renamed in v1.9.0 and will be removed in v2.0.0. Please use `pytorch_lightning.plugins.MixedPrecisionPlugin` instead.
      rank_zero_deprecation(
    
[NeMo I 2023-08-17 16:22:30 utils:168] Selected Callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]
Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo E 2023-08-17 16:22:30 exp_manager:646] exp_manager received explicit_log_dir: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining and at least one of exp_dir: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2023-08-17 16:22:30 exp_manager:568] There was no checkpoint folder at checkpoint_dir :/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints. Training from scratch.
[NeMo I 2023-08-17 16:22:30 exp_manager:374] Experiments will be logged at /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining
[NeMo I 2023-08-17 16:22:30 exp_manager:797] TensorboardLogger has been set up
[NeMo W 2023-08-17 16:22:30 exp_manager:893] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 200. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo I 2023-08-17 16:22:30 utils:191] Resuming training from checkpoint: None
[NeMo I 2023-08-17 16:22:30 utils:234] 
    
    ************** Trainer configuration ***********
[NeMo I 2023-08-17 16:22:30 utils:235] 
    name: esm1nv-oas
    do_training: true
    do_testing: false
    restore_from_path: null
    trainer:
      devices: 1
      num_nodes: 1
      accelerator: gpu
      precision: 16
      logger: false
      enable_checkpointing: false
      replace_sampler_ddp: false
      max_epochs: null
      max_steps: 200
      log_every_n_steps: 10
      val_check_interval: 100
      limit_val_batches: 10
      limit_test_batches: 500
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
      benchmark: false
    exp_manager:
      name: ${name}
      exp_dir: /result/nemo_experiments/${.name}/${.wandb_logger_kwargs.name}
      explicit_log_dir: ${.exp_dir}
      create_wandb_logger: false
      create_tensorboard_logger: true
      wandb_logger_kwargs:
        project: ${name}_pretraining
        name: ${name}_pretraining
        group: ${name}
        job_type: Localhost_nodes_${trainer.num_nodes}_gpus_${trainer.devices}
        notes: 'date: ${now:%y%m%d-%H%M%S}'
        tags:
        - ${name}
        offline: false
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: val_loss
        save_top_k: 10
        mode: min
        always_save_nemo: false
        filename: megatron_bert--{val_loss:.2f}-{step}-{consumed_samples}
        model_parallel_size: ${multiply:${model.tensor_model_parallel_size}, ${model.pipeline_model_parallel_size}}
    model:
      micro_batch_size: 8
      tensor_model_parallel_size: 1
      pipeline_model_parallel_size: 1
      seq_length: 512
      max_position_embeddings: ${.seq_length}
      encoder_seq_length: ${.seq_length}
      num_layers: 6
      hidden_size: 768
      ffn_hidden_size: 3072
      num_attention_heads: 12
      init_method_std: 0.02
      hidden_dropout: 0.1
      kv_channels: null
      apply_query_key_layer_scaling: true
      layernorm_epsilon: 1.0e-05
      make_vocab_size_divisible_by: 128
      pre_process: true
      post_process: true
      bert_binary_head: false
      resume_from_checkpoint: null
      masked_softmax_fusion: true
      tokenizer:
        library: sentencepiece
        type: null
        model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
        vocab_file: /tokenizers/vocab/protein_sequence_sentencepiece.vocab
        merge_file: null
      native_amp_init_scale: 4294967296
      native_amp_growth_interval: 1000
      fp32_residual_connection: false
      fp16_lm_cross_entropy: false
      seed: 1234
      use_cpu_initialization: false
      onnx_safe: false
      activations_checkpoint_method: null
      activations_checkpoint_num_layers: 1
      data:
        ngc_registry_target: uniref50_2022_05
        ngc_registry_version: v23.06
        data_prefix: ''
        num_workers: 10
        dataloader_type: single
        reset_position_ids: false
        reset_attention_mask: false
        eod_mask_loss: false
        masked_lm_prob: 0.15
        short_seq_prob: 0.1
        skip_lines: 0
        drop_last: false
        pin_memory: false
        data_impl: csv_mmap
        data_impl_kwargs:
          csv_mmap:
            header_lines: 1
            newline_int: 10
            workers: ${model.data.num_workers}
            sort_dataset_paths: true
            data_sep: ','
            data_col: 1
        use_upsampling: true
        seed: ${model.seed}
        max_seq_length: ${model.seq_length}
        dataset_path: /data/OASpaired/processed/heavy
        dataset:
          train: x[000..005]
          test: x[000..001]
          val: x[000..001]
        micro_batch_size: ${model.micro_batch_size}
        modify_percent: 0.1
        perturb_percent: 0.5
      optim:
        name: fused_adam
        lr: 0.0002
        weight_decay: 0.01
        betas:
        - 0.9
        - 0.98
        sched:
          name: CosineAnnealing
          warmup_steps: 500
          constant_steps: 50000
          min_lr: 2.0e-05
      global_batch_size: 8
      precision: 16
    
[NeMo I 2023-08-17 16:22:30 pretrain_oas:21] ************** Starting Training ***********
[NeMo I 2023-08-17 16:22:30 megatron_init:231] Rank 0 has data parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:234] All data parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:235] Ranks 0 has data parallel rank: 0
[NeMo I 2023-08-17 16:22:30 megatron_init:243] Rank 0 has model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:244] All model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:254] Rank 0 has tensor model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:258] All tensor model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:259] Rank 0 has tensor model parallel rank: 0
[NeMo I 2023-08-17 16:22:30 megatron_init:273] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:285] Rank 0 has embedding group: [0]
[NeMo I 2023-08-17 16:22:30 megatron_init:291] All pipeline model parallel group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:292] Rank 0 has pipeline model parallel rank 0
[NeMo I 2023-08-17 16:22:30 megatron_init:293] All embedding group ranks: [[0]]
[NeMo I 2023-08-17 16:22:30 megatron_init:294] Rank 0 has embedding rank: 0
23-08-17 16:22:30 - PID:335 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo I 2023-08-17 16:22:30 tokenizer_utils:191] Getting SentencePiece with model: /tokenizers/protein/esm1nv/vocab/protein_sequence_sentencepiece.model
[NeMo I 2023-08-17 16:22:30 megatron_base_model:229] Padded vocab_size: 128, original vocab_size: 30, dummy tokens: 98.
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/configuration_validator.py:175: UserWarning: The `batch_idx` argument in `ESM1nvModel.on_train_batch_start` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
      rank_zero_warn(
    
[NeMo W 2023-08-17 16:22:30 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/configuration_validator.py:175: UserWarning: The `batch_idx` argument in `ESM1nvModel.on_train_batch_end` hook may not match with the actual batch index when using a `dataloader_iter` argument in your `training_step`.
      rank_zero_warn(
    
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Added key: store_based_barrier_key:2 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
Added key: store_based_barrier_key:3 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 1 nodes.
Added key: store_based_barrier_key:4 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 1 nodes.
Added key: store_based_barrier_key:5 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 1 nodes.
Added key: store_based_barrier_key:6 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 1 nodes.
Added key: store_based_barrier_key:7 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:7 with 1 nodes.
[NeMo I 2023-08-17 16:22:31 megatron_bert_model:563] Pipeline model parallel rank: 0, Tensor model parallel rank: 0, Number of model parameters on device: 4.36e+07. Total number of model parameters: 4.36e+07.
[NeMo I 2023-08-17 16:22:31 esm1nv_model:96] Building Bert datasets.
train:1600
Loading data from /data/OASpaired/processed/heavy/train/x000.csv, /data/OASpaired/processed/heavy/train/x001.csv, /data/OASpaired/processed/heavy/train/x002.csv, /data/OASpaired/processed/heavy/train/x003.csv, /data/OASpaired/processed/heavy/train/x004.csv, /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 6 data files using 10 workers
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:349] Time building 0 / 6 mem-mapped files: 0:00:00.148196
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x000.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x002.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x003.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x004.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/train/x005.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:117] Time loading 6 mem-mapped files: 0:00:00.003227
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:31 dataset_utils:1341]  > loading indexed mapping from /data/OASpaired/processed/heavy/train/__indexmap_1600mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:31 dataset_utils:1344]     loaded indexed file in 0.001 seconds
[NeMo I 2023-08-17 16:22:31 dataset_utils:1345]     total number of samples: 21110
val:240
Loading data from /data/OASpaired/processed/heavy/val/x000.csv, /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 2 data files using 10 workers
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:349] Time building 0 / 2 mem-mapped files: 0:00:00.137560
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/val/x000.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/val/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:117] Time loading 2 mem-mapped files: 0:00:00.001477
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:31 dataset_utils:1341]  > loading indexed mapping from /data/OASpaired/processed/heavy/val/__indexmap_240mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:31 dataset_utils:1344]     loaded indexed file in 0.000 seconds
[NeMo I 2023-08-17 16:22:31 dataset_utils:1345]     total number of samples: 3754
test:4000
Loading data from /data/OASpaired/processed/heavy/test/x000.csv, /data/OASpaired/processed/heavy/test/x001.csv
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:104] Building data files
[NeMo I 2023-08-17 16:22:31 text_memmap_dataset:343] Processing 2 data files using 10 workers
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:349] Time building 0 / 2 mem-mapped files: 0:00:00.165497
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:114] Loading data files
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/test/x000.csv
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:205] Loading /data/OASpaired/processed/heavy/test/x001.csv
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:117] Time loading 2 mem-mapped files: 0:00:00.001370
[NeMo I 2023-08-17 16:22:32 text_memmap_dataset:121] Computing global indices
[NeMo I 2023-08-17 16:22:32 dataset_utils:1341]  > loading indexed mapping from /data/OASpaired/processed/heavy/test/__indexmap_4000mns_512msl_0.00ssp_1234s.npy
[NeMo I 2023-08-17 16:22:32 dataset_utils:1344]     loaded indexed file in 0.000 seconds
[NeMo I 2023-08-17 16:22:32 dataset_utils:1345]     total number of samples: 5822
[NeMo I 2023-08-17 16:22:32 esm1nv_model:114] Length of train dataset: 1600
[NeMo I 2023-08-17 16:22:32 esm1nv_model:115] Length of val dataset: 240
[NeMo I 2023-08-17 16:22:32 esm1nv_model:116] Length of test dataset: 4000
[NeMo I 2023-08-17 16:22:32 esm1nv_model:117] Finished building Bert datasets.
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:662] Setting up train dataloader with len(len(self._train_ds)): 1600 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 1600 and consumed_samples: 0
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:670] Setting up validation dataloader with len(len(self._validation_ds)): 240 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 240 and consumed_samples: 0
[NeMo I 2023-08-17 16:22:32 megatron_bert_model:678] Setting up test dataloader with len(len(self._test_ds)): 4000 and consumed samples: 0
[NeMo I 2023-08-17 16:22:32 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 4000 and consumed_samples: 0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo I 2023-08-17 16:22:32 nlp_overrides:124] Configuring DDP for model parallelism.
[NeMo I 2023-08-17 16:22:32 modelPT:722] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0002
        weight_decay: 0.01
    
    Parameter Group 1
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 0.0002
        weight_decay: 0.0
    )
[NeMo I 2023-08-17 16:22:32 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fc96434f790>" 
    will be used during training (effective maximum steps = 200) - 
    Parameters : 
    (warmup_steps: 500
    constant_steps: 50000
    min_lr: 2.0e-05
    max_steps: 200
    )

  | Name                           | Type                     | Params
----------------------------------------------------------------------------
0 | model                          | BertModel                | 43.6 M
1 | model.language_model           | TransformerLanguageModel | 43.0 M
2 | model.language_model.embedding | Embedding                | 491 K 
3 | model.language_model.encoder   | ParallelTransformer      | 42.5 M
4 | model.lm_head                  | BertLMHead               | 592 K 
5 | model.lm_head.dense            | Linear                   | 590 K 
6 | model.lm_head.layernorm        | MixedFusedLayerNorm      | 1.5 K 
----------------------------------------------------------------------------
43.6 M    Trainable params
0         Non-trainable params
43.6 M    Total params
87.225    Total estimated model params size (MB)

Sanity Checking: 0it [00:00, ?it/s][NeMo W 2023-08-17 16:22:32 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py:401: UserWarning: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
      rank_zero_warn(
    

Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:  50%|█████     | 1/2 [00:01<00:01,  1.44s/it]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:01<00:00,  1.36it/s][NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('consumed_samples', ...)` in your `validation_epoch_end` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
    
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
      warning_cache.warn(
    
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('val_loss_ECE', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
      warning_cache.warn(
    
[NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:536: PossibleUserWarning: It is recommended to use `self.log('consumed_samples', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
      warning_cache.warn(
    

                                                                           [NeMo W 2023-08-17 16:22:33 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/loops/fit_loop.py:344: UserWarning: Found `dataloader_iter` argument in the `training_step`. Note that the support for this signature is experimental and the behavior is subject to change.
      rank_zero_warn(
    


Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/220 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/220 [00:00<?, ?it/s] [NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
    
[NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: UserWarning: You called `self.log('consumed_samples', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
      warning_cache.warn(
    
[NeMo W 2023-08-17 16:22:35 nemo_logging:349] /usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
    

Epoch 0:   0%|          | 1/220 [00:01<06:47,  1.86s/it]
Epoch 0:   0%|          | 1/220 [00:01<06:47,  1.86s/it, loss=4.58, v_num=, reduced_train_loss=4.580, global_step=0.000, consumed_samples=0.000]
Epoch 0:   1%|          | 2/220 [00:01<03:29,  1.04it/s, loss=4.58, v_num=, reduced_train_loss=4.580, global_step=0.000, consumed_samples=0.000]
Epoch 0:   1%|          | 2/220 [00:01<03:29,  1.04it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=1.000, consumed_samples=8.000]
Epoch 0:   1%|▏         | 3/220 [00:01<02:22,  1.53it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=1.000, consumed_samples=8.000]
Epoch 0:   1%|▏         | 3/220 [00:01<02:22,  1.53it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=2.000, consumed_samples=16.00]
Epoch 0:   2%|▏         | 4/220 [00:02<01:48,  1.99it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=2.000, consumed_samples=16.00]
Epoch 0:   2%|▏         | 4/220 [00:02<01:48,  1.99it/s, loss=4.6, v_num=, reduced_train_loss=4.620, global_step=3.000, consumed_samples=24.00] 
Epoch 0:   2%|▏         | 5/220 [00:02<01:28,  2.44it/s, loss=4.6, v_num=, reduced_train_loss=4.620, global_step=3.000, consumed_samples=24.00]
Epoch 0:   2%|▏         | 5/220 [00:02<01:28,  2.44it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=4.000, consumed_samples=32.00]
Epoch 0:   3%|▎         | 6/220 [00:02<01:14,  2.87it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=4.000, consumed_samples=32.00]
Epoch 0:   3%|▎         | 6/220 [00:02<01:14,  2.87it/s, loss=4.59, v_num=, reduced_train_loss=4.560, global_step=5.000, consumed_samples=40.00]
Epoch 0:   3%|▎         | 7/220 [00:02<01:05,  3.27it/s, loss=4.59, v_num=, reduced_train_loss=4.560, global_step=5.000, consumed_samples=40.00]
Epoch 0:   3%|▎         | 7/220 [00:02<01:05,  3.27it/s, loss=4.58, v_num=, reduced_train_loss=4.550, global_step=6.000, consumed_samples=48.00]
Epoch 0:   4%|▎         | 8/220 [00:02<00:58,  3.64it/s, loss=4.58, v_num=, reduced_train_loss=4.550, global_step=6.000, consumed_samples=48.00]
Epoch 0:   4%|▎         | 8/220 [00:02<00:58,  3.64it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=7.000, consumed_samples=56.00]
Epoch 0:   4%|▍         | 9/220 [00:02<00:52,  4.00it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=7.000, consumed_samples=56.00]
Epoch 0:   4%|▍         | 9/220 [00:02<00:52,  4.00it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=8.000, consumed_samples=64.00]
Epoch 0:   5%|▍         | 10/220 [00:02<00:48,  4.35it/s, loss=4.59, v_num=, reduced_train_loss=4.580, global_step=8.000, consumed_samples=64.00]
Epoch 0:   5%|▍         | 10/220 [00:02<00:48,  4.35it/s, loss=4.59, v_num=, reduced_train_loss=4.620, global_step=9.000, consumed_samples=72.00]
Epoch 0:   5%|▌         | 11/220 [00:02<00:44,  4.68it/s, loss=4.59, v_num=, reduced_train_loss=4.620, global_step=9.000, consumed_samples=72.00]
Epoch 0:   5%|▌         | 11/220 [00:02<00:44,  4.67it/s, loss=4.59, v_num=, reduced_train_loss=4.570, global_step=10.00, consumed_samples=80.00]
Epoch 0:   5%|▌         | 12/220 [00:02<00:41,  4.98it/s, loss=4.59, v_num=, reduced_train_loss=4.570, global_step=10.00, consumed_samples=80.00]
Epoch 0:   5%|▌         | 12/220 [00:02<00:41,  4.98it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=11.00, consumed_samples=88.00]
Epoch 0:   6%|▌         | 13/220 [00:02<00:39,  5.29it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=11.00, consumed_samples=88.00]
Epoch 0:   6%|▌         | 13/220 [00:02<00:39,  5.29it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=12.00, consumed_samples=96.00]
Epoch 0:   6%|▋         | 14/220 [00:02<00:36,  5.60it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=12.00, consumed_samples=96.00]
Epoch 0:   6%|▋         | 14/220 [00:02<00:36,  5.60it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=13.00, consumed_samples=104.0]
Epoch 0:   7%|▋         | 15/220 [00:02<00:34,  5.90it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=13.00, consumed_samples=104.0]
Epoch 0:   7%|▋         | 15/220 [00:02<00:34,  5.90it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=14.00, consumed_samples=112.0]
Epoch 0:   7%|▋         | 16/220 [00:02<00:32,  6.19it/s, loss=4.59, v_num=, reduced_train_loss=4.610, global_step=14.00, consumed_samples=112.0]
Epoch 0:   7%|▋         | 16/220 [00:02<00:32,  6.19it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=15.00, consumed_samples=120.0]
Epoch 0:   8%|▊         | 17/220 [00:02<00:31,  6.48it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=15.00, consumed_samples=120.0]
Epoch 0:   8%|▊         | 17/220 [00:02<00:31,  6.48it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=16.00, consumed_samples=128.0]
Epoch 0:   8%|▊         | 18/220 [00:02<00:29,  6.75it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=16.00, consumed_samples=128.0]
Epoch 0:   8%|▊         | 18/220 [00:02<00:29,  6.75it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=17.00, consumed_samples=136.0]
Epoch 0:   9%|▊         | 19/220 [00:02<00:28,  7.01it/s, loss=4.59, v_num=, reduced_train_loss=4.600, global_step=17.00, consumed_samples=136.0]
Epoch 0:   9%|▊         | 19/220 [00:02<00:28,  7.01it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=18.00, consumed_samples=144.0]
Epoch 0:   9%|▉         | 20/220 [00:02<00:27,  7.26it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=18.00, consumed_samples=144.0]
Epoch 0:   9%|▉         | 20/220 [00:02<00:27,  7.26it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=19.00, consumed_samples=152.0]
Epoch 0:  10%|▉         | 21/220 [00:02<00:26,  7.50it/s, loss=4.59, v_num=, reduced_train_loss=4.590, global_step=19.00, consumed_samples=152.0]
Epoch 0:  10%|▉         | 21/220 [00:02<00:26,  7.50it/s, loss=4.59, v_num=, reduced_train_loss=4.550, global_step=20.00, consumed_samples=160.0]
Epoch 0:  10%|█         | 22/220 [00:02<00:25,  7.74it/s, loss=4.59, v_num=, reduced_train_loss=4.550, global_step=20.00, consumed_samples=160.0]
Epoch 0:  10%|█         | 22/220 [00:02<00:25,  7.74it/s, loss=4.58, v_num=, reduced_train_loss=4.390, global_step=21.00, consumed_samples=168.0]
Epoch 0:  10%|█         | 23/220 [00:02<00:24,  7.94it/s, loss=4.58, v_num=, reduced_train_loss=4.390, global_step=21.00, consumed_samples=168.0]
Epoch 0:  10%|█         | 23/220 [00:02<00:24,  7.94it/s, loss=4.56, v_num=, reduced_train_loss=4.200, global_step=22.00, consumed_samples=176.0]
Epoch 0:  11%|█         | 24/220 [00:02<00:24,  8.14it/s, loss=4.56, v_num=, reduced_train_loss=4.200, global_step=22.00, consumed_samples=176.0]
Epoch 0:  11%|█         | 24/220 [00:02<00:24,  8.14it/s, loss=4.53, v_num=, reduced_train_loss=3.920, global_step=23.00, consumed_samples=184.0]
Epoch 0:  11%|█▏        | 25/220 [00:02<00:23,  8.34it/s, loss=4.53, v_num=, reduced_train_loss=3.920, global_step=23.00, consumed_samples=184.0]
Epoch 0:  11%|█▏        | 25/220 [00:02<00:23,  8.34it/s, loss=4.47, v_num=, reduced_train_loss=3.550, global_step=24.00, consumed_samples=192.0]
Epoch 0:  12%|█▏        | 26/220 [00:03<00:22,  8.55it/s, loss=4.47, v_num=, reduced_train_loss=3.550, global_step=24.00, consumed_samples=192.0]
Epoch 0:  12%|█▏        | 26/220 [00:03<00:22,  8.55it/s, loss=4.41, v_num=, reduced_train_loss=3.270, global_step=25.00, consumed_samples=200.0]
Epoch 0:  12%|█▏        | 27/220 [00:03<00:22,  8.74it/s, loss=4.41, v_num=, reduced_train_loss=3.270, global_step=25.00, consumed_samples=200.0]
Epoch 0:  12%|█▏        | 27/220 [00:03<00:22,  8.74it/s, loss=4.33, v_num=, reduced_train_loss=2.980, global_step=26.00, consumed_samples=208.0]
Epoch 0:  13%|█▎        | 28/220 [00:03<00:21,  8.94it/s, loss=4.33, v_num=, reduced_train_loss=2.980, global_step=26.00, consumed_samples=208.0]
Epoch 0:  13%|█▎        | 28/220 [00:03<00:21,  8.94it/s, loss=4.24, v_num=, reduced_train_loss=2.740, global_step=27.00, consumed_samples=216.0]
Epoch 0:  13%|█▎        | 29/220 [00:03<00:20,  9.10it/s, loss=4.24, v_num=, reduced_train_loss=2.740, global_step=27.00, consumed_samples=216.0]
Epoch 0:  13%|█▎        | 29/220 [00:03<00:20,  9.10it/s, loss=4.13, v_num=, reduced_train_loss=2.460, global_step=28.00, consumed_samples=224.0]
Epoch 0:  14%|█▎        | 30/220 [00:03<00:20,  9.27it/s, loss=4.13, v_num=, reduced_train_loss=2.460, global_step=28.00, consumed_samples=224.0]
Epoch 0:  14%|█▎        | 30/220 [00:03<00:20,  9.27it/s, loss=4.01, v_num=, reduced_train_loss=2.240, global_step=29.00, consumed_samples=232.0]
Epoch 0:  14%|█▍        | 31/220 [00:03<00:20,  9.41it/s, loss=4.01, v_num=, reduced_train_loss=2.240, global_step=29.00, consumed_samples=232.0]
Epoch 0:  14%|█▍        | 31/220 [00:03<00:20,  9.41it/s, loss=3.89, v_num=, reduced_train_loss=2.060, global_step=30.00, consumed_samples=240.0]
Epoch 0:  15%|█▍        | 32/220 [00:03<00:19,  9.58it/s, loss=3.89, v_num=, reduced_train_loss=2.060, global_step=30.00, consumed_samples=240.0]
Epoch 0:  15%|█▍        | 32/220 [00:03<00:19,  9.58it/s, loss=3.76, v_num=, reduced_train_loss=1.970, global_step=31.00, consumed_samples=248.0]
Epoch 0:  15%|█▌        | 33/220 [00:03<00:19,  9.75it/s, loss=3.76, v_num=, reduced_train_loss=1.970, global_step=31.00, consumed_samples=248.0]
Epoch 0:  15%|█▌        | 33/220 [00:03<00:19,  9.75it/s, loss=3.62, v_num=, reduced_train_loss=1.900, global_step=32.00, consumed_samples=256.0]
Epoch 0:  15%|█▌        | 34/220 [00:03<00:18,  9.91it/s, loss=3.62, v_num=, reduced_train_loss=1.900, global_step=32.00, consumed_samples=256.0]
Epoch 0:  15%|█▌        | 34/220 [00:03<00:18,  9.91it/s, loss=3.48, v_num=, reduced_train_loss=1.800, global_step=33.00, consumed_samples=264.0]
Epoch 0:  16%|█▌        | 35/220 [00:03<00:18, 10.07it/s, loss=3.48, v_num=, reduced_train_loss=1.800, global_step=33.00, consumed_samples=264.0]
Epoch 0:  16%|█▌        | 35/220 [00:03<00:18, 10.07it/s, loss=3.34, v_num=, reduced_train_loss=1.750, global_step=34.00, consumed_samples=272.0]
Epoch 0:  16%|█▋        | 36/220 [00:03<00:18, 10.22it/s, loss=3.34, v_num=, reduced_train_loss=1.750, global_step=34.00, consumed_samples=272.0]
Epoch 0:  16%|█▋        | 36/220 [00:03<00:18, 10.22it/s, loss=3.19, v_num=, reduced_train_loss=1.680, global_step=35.00, consumed_samples=280.0]
Epoch 0:  17%|█▋        | 37/220 [00:03<00:17, 10.37it/s, loss=3.19, v_num=, reduced_train_loss=1.680, global_step=35.00, consumed_samples=280.0]
Epoch 0:  17%|█▋        | 37/220 [00:03<00:17, 10.37it/s, loss=3.05, v_num=, reduced_train_loss=1.660, global_step=36.00, consumed_samples=288.0]
Epoch 0:  17%|█▋        | 38/220 [00:03<00:17, 10.52it/s, loss=3.05, v_num=, reduced_train_loss=1.660, global_step=36.00, consumed_samples=288.0]
Epoch 0:  17%|█▋        | 38/220 [00:03<00:17, 10.52it/s, loss=2.9, v_num=, reduced_train_loss=1.610, global_step=37.00, consumed_samples=296.0] 
Epoch 0:  18%|█▊        | 39/220 [00:03<00:16, 10.65it/s, loss=2.9, v_num=, reduced_train_loss=1.610, global_step=37.00, consumed_samples=296.0]
Epoch 0:  18%|█▊        | 39/220 [00:03<00:16, 10.65it/s, loss=2.75, v_num=, reduced_train_loss=1.590, global_step=38.00, consumed_samples=304.0]
Epoch 0:  18%|█▊        | 40/220 [00:03<00:16, 10.76it/s, loss=2.75, v_num=, reduced_train_loss=1.590, global_step=38.00, consumed_samples=304.0]
Epoch 0:  18%|█▊        | 40/220 [00:03<00:16, 10.76it/s, loss=2.59, v_num=, reduced_train_loss=1.570, global_step=39.00, consumed_samples=312.0]
Epoch 0:  19%|█▊        | 41/220 [00:03<00:16, 10.89it/s, loss=2.59, v_num=, reduced_train_loss=1.570, global_step=39.00, consumed_samples=312.0]
Epoch 0:  19%|█▊        | 41/220 [00:03<00:16, 10.89it/s, loss=2.44, v_num=, reduced_train_loss=1.550, global_step=40.00, consumed_samples=320.0]
Epoch 0:  19%|█▉        | 42/220 [00:03<00:16, 11.02it/s, loss=2.44, v_num=, reduced_train_loss=1.550, global_step=40.00, consumed_samples=320.0]
Epoch 0:  19%|█▉        | 42/220 [00:03<00:16, 11.02it/s, loss=2.3, v_num=, reduced_train_loss=1.530, global_step=41.00, consumed_samples=328.0] 
Epoch 0:  20%|█▉        | 43/220 [00:03<00:15, 11.15it/s, loss=2.3, v_num=, reduced_train_loss=1.530, global_step=41.00, consumed_samples=328.0]
Epoch 0:  20%|█▉        | 43/220 [00:03<00:15, 11.15it/s, loss=2.17, v_num=, reduced_train_loss=1.510, global_step=42.00, consumed_samples=336.0]
Epoch 0:  20%|██        | 44/220 [00:03<00:15, 11.27it/s, loss=2.17, v_num=, reduced_train_loss=1.510, global_step=42.00, consumed_samples=336.0]
Epoch 0:  20%|██        | 44/220 [00:03<00:15, 11.27it/s, loss=2.05, v_num=, reduced_train_loss=1.520, global_step=43.00, consumed_samples=344.0]
Epoch 0:  20%|██        | 45/220 [00:03<00:15, 11.39it/s, loss=2.05, v_num=, reduced_train_loss=1.520, global_step=43.00, consumed_samples=344.0]
Epoch 0:  20%|██        | 45/220 [00:03<00:15, 11.39it/s, loss=1.94, v_num=, reduced_train_loss=1.500, global_step=44.00, consumed_samples=352.0]
Epoch 0:  21%|██        | 46/220 [00:03<00:15, 11.50it/s, loss=1.94, v_num=, reduced_train_loss=1.500, global_step=44.00, consumed_samples=352.0]
Epoch 0:  21%|██        | 46/220 [00:03<00:15, 11.50it/s, loss=1.86, v_num=, reduced_train_loss=1.500, global_step=45.00, consumed_samples=360.0]
Epoch 0:  21%|██▏       | 47/220 [00:04<00:14, 11.62it/s, loss=1.86, v_num=, reduced_train_loss=1.500, global_step=45.00, consumed_samples=360.0]
Epoch 0:  21%|██▏       | 47/220 [00:04<00:14, 11.62it/s, loss=1.78, v_num=, reduced_train_loss=1.490, global_step=46.00, consumed_samples=368.0]
Epoch 0:  22%|██▏       | 48/220 [00:04<00:14, 11.73it/s, loss=1.78, v_num=, reduced_train_loss=1.490, global_step=46.00, consumed_samples=368.0]
Epoch 0:  22%|██▏       | 48/220 [00:04<00:14, 11.73it/s, loss=1.72, v_num=, reduced_train_loss=1.470, global_step=47.00, consumed_samples=376.0]
Epoch 0:  22%|██▏       | 49/220 [00:04<00:14, 11.85it/s, loss=1.72, v_num=, reduced_train_loss=1.470, global_step=47.00, consumed_samples=376.0]
Epoch 0:  22%|██▏       | 49/220 [00:04<00:14, 11.85it/s, loss=1.67, v_num=, reduced_train_loss=1.480, global_step=48.00, consumed_samples=384.0]
Epoch 0:  23%|██▎       | 50/220 [00:04<00:14, 11.96it/s, loss=1.67, v_num=, reduced_train_loss=1.480, global_step=48.00, consumed_samples=384.0]
Epoch 0:  23%|██▎       | 50/220 [00:04<00:14, 11.96it/s, loss=1.63, v_num=, reduced_train_loss=1.490, global_step=49.00, consumed_samples=392.0]
Epoch 0:  23%|██▎       | 51/220 [00:04<00:14, 12.06it/s, loss=1.63, v_num=, reduced_train_loss=1.490, global_step=49.00, consumed_samples=392.0]
Epoch 0:  23%|██▎       | 51/220 [00:04<00:14, 12.06it/s, loss=1.6, v_num=, reduced_train_loss=1.490, global_step=50.00, consumed_samples=400.0] 
Epoch 0:  24%|██▎       | 52/220 [00:04<00:13, 12.16it/s, loss=1.6, v_num=, reduced_train_loss=1.490, global_step=50.00, consumed_samples=400.0]
Epoch 0:  24%|██▎       | 52/220 [00:04<00:13, 12.16it/s, loss=1.58, v_num=, reduced_train_loss=1.460, global_step=51.00, consumed_samples=408.0]
Epoch 0:  24%|██▍       | 53/220 [00:04<00:13, 12.25it/s, loss=1.58, v_num=, reduced_train_loss=1.460, global_step=51.00, consumed_samples=408.0]
Epoch 0:  24%|██▍       | 53/220 [00:04<00:13, 12.25it/s, loss=1.56, v_num=, reduced_train_loss=1.490, global_step=52.00, consumed_samples=416.0]
Epoch 0:  25%|██▍       | 54/220 [00:04<00:13, 12.34it/s, loss=1.56, v_num=, reduced_train_loss=1.490, global_step=52.00, consumed_samples=416.0]
Epoch 0:  25%|██▍       | 54/220 [00:04<00:13, 12.34it/s, loss=1.54, v_num=, reduced_train_loss=1.460, global_step=53.00, consumed_samples=424.0]
Epoch 0:  25%|██▌       | 55/220 [00:04<00:13, 12.44it/s, loss=1.54, v_num=, reduced_train_loss=1.460, global_step=53.00, consumed_samples=424.0]
Epoch 0:  25%|██▌       | 55/220 [00:04<00:13, 12.44it/s, loss=1.52, v_num=, reduced_train_loss=1.440, global_step=54.00, consumed_samples=432.0]
Epoch 0:  25%|██▌       | 56/220 [00:04<00:13, 12.54it/s, loss=1.52, v_num=, reduced_train_loss=1.440, global_step=54.00, consumed_samples=432.0]
Epoch 0:  25%|██▌       | 56/220 [00:04<00:13, 12.54it/s, loss=1.51, v_num=, reduced_train_loss=1.480, global_step=55.00, consumed_samples=440.0]
Epoch 0:  26%|██▌       | 57/220 [00:04<00:12, 12.63it/s, loss=1.51, v_num=, reduced_train_loss=1.480, global_step=55.00, consumed_samples=440.0]
Epoch 0:  26%|██▌       | 57/220 [00:04<00:12, 12.63it/s, loss=1.5, v_num=, reduced_train_loss=1.450, global_step=56.00, consumed_samples=448.0] 
Epoch 0:  26%|██▋       | 58/220 [00:04<00:12, 12.71it/s, loss=1.5, v_num=, reduced_train_loss=1.450, global_step=56.00, consumed_samples=448.0]
Epoch 0:  26%|██▋       | 58/220 [00:04<00:12, 12.71it/s, loss=1.5, v_num=, reduced_train_loss=1.460, global_step=57.00, consumed_samples=456.0]
Epoch 0:  27%|██▋       | 59/220 [00:04<00:12, 12.80it/s, loss=1.5, v_num=, reduced_train_loss=1.460, global_step=57.00, consumed_samples=456.0]
Epoch 0:  27%|██▋       | 59/220 [00:04<00:12, 12.80it/s, loss=1.49, v_num=, reduced_train_loss=1.460, global_step=58.00, consumed_samples=464.0]
Epoch 0:  27%|██▋       | 60/220 [00:04<00:12, 12.89it/s, loss=1.49, v_num=, reduced_train_loss=1.460, global_step=58.00, consumed_samples=464.0]
Epoch 0:  27%|██▋       | 60/220 [00:04<00:12, 12.89it/s, loss=1.48, v_num=, reduced_train_loss=1.460, global_step=59.00, consumed_samples=472.0]
Epoch 0:  28%|██▊       | 61/220 [00:04<00:12, 12.97it/s, loss=1.48, v_num=, reduced_train_loss=1.460, global_step=59.00, consumed_samples=472.0]
Epoch 0:  28%|██▊       | 61/220 [00:04<00:12, 12.97it/s, loss=1.48, v_num=, reduced_train_loss=1.440, global_step=60.00, consumed_samples=480.0]
Epoch 0:  28%|██▊       | 62/220 [00:04<00:12, 13.06it/s, loss=1.48, v_num=, reduced_train_loss=1.440, global_step=60.00, consumed_samples=480.0]
Epoch 0:  28%|██▊       | 62/220 [00:04<00:12, 13.06it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=61.00, consumed_samples=488.0]
Epoch 0:  29%|██▊       | 63/220 [00:04<00:11, 13.15it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=61.00, consumed_samples=488.0]
Epoch 0:  29%|██▊       | 63/220 [00:04<00:11, 13.15it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=62.00, consumed_samples=496.0]
Epoch 0:  29%|██▉       | 64/220 [00:04<00:11, 13.23it/s, loss=1.48, v_num=, reduced_train_loss=1.490, global_step=62.00, consumed_samples=496.0]
Epoch 0:  29%|██▉       | 64/220 [00:04<00:11, 13.23it/s, loss=1.47, v_num=, reduced_train_loss=1.430, global_step=63.00, consumed_samples=504.0]
Epoch 0:  30%|██▉       | 65/220 [00:04<00:11, 13.31it/s, loss=1.47, v_num=, reduced_train_loss=1.430, global_step=63.00, consumed_samples=504.0]
Epoch 0:  30%|██▉       | 65/220 [00:04<00:11, 13.30it/s, loss=1.47, v_num=, reduced_train_loss=1.450, global_step=64.00, consumed_samples=512.0]
Epoch 0:  30%|███       | 66/220 [00:04<00:11, 13.38it/s, loss=1.47, v_num=, reduced_train_loss=1.450, global_step=64.00, consumed_samples=512.0]
Epoch 0:  30%|███       | 66/220 [00:04<00:11, 13.38it/s, loss=1.47, v_num=, reduced_train_loss=1.460, global_step=65.00, consumed_samples=520.0]
Epoch 0:  30%|███       | 67/220 [00:04<00:11, 13.45it/s, loss=1.47, v_num=, reduced_train_loss=1.460, global_step=65.00, consumed_samples=520.0]
Epoch 0:  30%|███       | 67/220 [00:04<00:11, 13.45it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=66.00, consumed_samples=528.0]
Epoch 0:  31%|███       | 68/220 [00:05<00:11, 13.53it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=66.00, consumed_samples=528.0]
Epoch 0:  31%|███       | 68/220 [00:05<00:11, 13.53it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=67.00, consumed_samples=536.0]
Epoch 0:  31%|███▏      | 69/220 [00:05<00:11, 13.60it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=67.00, consumed_samples=536.0]
Epoch 0:  31%|███▏      | 69/220 [00:05<00:11, 13.60it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=68.00, consumed_samples=544.0]
Epoch 0:  32%|███▏      | 70/220 [00:05<00:10, 13.67it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=68.00, consumed_samples=544.0]
Epoch 0:  32%|███▏      | 70/220 [00:05<00:10, 13.67it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=69.00, consumed_samples=552.0]
Epoch 0:  32%|███▏      | 71/220 [00:05<00:10, 13.68it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=69.00, consumed_samples=552.0]
Epoch 0:  32%|███▏      | 71/220 [00:05<00:10, 13.68it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=70.00, consumed_samples=560.0]
Epoch 0:  33%|███▎      | 72/220 [00:05<00:10, 13.72it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=70.00, consumed_samples=560.0]
Epoch 0:  33%|███▎      | 72/220 [00:05<00:10, 13.72it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=71.00, consumed_samples=568.0]
Epoch 0:  33%|███▎      | 73/220 [00:05<00:10, 13.78it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=71.00, consumed_samples=568.0]
Epoch 0:  33%|███▎      | 73/220 [00:05<00:10, 13.78it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=72.00, consumed_samples=576.0]
Epoch 0:  34%|███▎      | 74/220 [00:05<00:10, 13.85it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=72.00, consumed_samples=576.0]
Epoch 0:  34%|███▎      | 74/220 [00:05<00:10, 13.85it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=73.00, consumed_samples=584.0]
Epoch 0:  34%|███▍      | 75/220 [00:05<00:10, 13.90it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=73.00, consumed_samples=584.0]
Epoch 0:  34%|███▍      | 75/220 [00:05<00:10, 13.90it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=74.00, consumed_samples=592.0]
Epoch 0:  35%|███▍      | 76/220 [00:05<00:10, 13.96it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=74.00, consumed_samples=592.0]
Epoch 0:  35%|███▍      | 76/220 [00:05<00:10, 13.96it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=75.00, consumed_samples=600.0]
Epoch 0:  35%|███▌      | 77/220 [00:05<00:10, 14.00it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=75.00, consumed_samples=600.0]
Epoch 0:  35%|███▌      | 77/220 [00:05<00:10, 14.00it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=76.00, consumed_samples=608.0]
Epoch 0:  35%|███▌      | 78/220 [00:05<00:10, 14.05it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=76.00, consumed_samples=608.0]
Epoch 0:  35%|███▌      | 78/220 [00:05<00:10, 14.05it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=77.00, consumed_samples=616.0]
Epoch 0:  36%|███▌      | 79/220 [00:05<00:09, 14.12it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=77.00, consumed_samples=616.0]
Epoch 0:  36%|███▌      | 79/220 [00:05<00:09, 14.12it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=78.00, consumed_samples=624.0]
Epoch 0:  36%|███▋      | 80/220 [00:05<00:09, 14.18it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=78.00, consumed_samples=624.0]
Epoch 0:  36%|███▋      | 80/220 [00:05<00:09, 14.17it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=79.00, consumed_samples=632.0]
Epoch 0:  37%|███▋      | 81/220 [00:05<00:09, 14.23it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=79.00, consumed_samples=632.0]
Epoch 0:  37%|███▋      | 81/220 [00:05<00:09, 14.23it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=80.00, consumed_samples=640.0]
Epoch 0:  37%|███▋      | 82/220 [00:05<00:09, 14.28it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=80.00, consumed_samples=640.0]
Epoch 0:  37%|███▋      | 82/220 [00:05<00:09, 14.28it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=81.00, consumed_samples=648.0]
Epoch 0:  38%|███▊      | 83/220 [00:05<00:09, 14.34it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=81.00, consumed_samples=648.0]
Epoch 0:  38%|███▊      | 83/220 [00:05<00:09, 14.34it/s, loss=1.46, v_num=, reduced_train_loss=1.510, global_step=82.00, consumed_samples=656.0]
Epoch 0:  38%|███▊      | 84/220 [00:05<00:09, 14.40it/s, loss=1.46, v_num=, reduced_train_loss=1.510, global_step=82.00, consumed_samples=656.0]
Epoch 0:  38%|███▊      | 84/220 [00:05<00:09, 14.39it/s, loss=1.46, v_num=, reduced_train_loss=1.500, global_step=83.00, consumed_samples=664.0]
Epoch 0:  39%|███▊      | 85/220 [00:05<00:09, 14.44it/s, loss=1.46, v_num=, reduced_train_loss=1.500, global_step=83.00, consumed_samples=664.0]
Epoch 0:  39%|███▊      | 85/220 [00:05<00:09, 14.44it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=84.00, consumed_samples=672.0]
Epoch 0:  39%|███▉      | 86/220 [00:05<00:09, 14.48it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=84.00, consumed_samples=672.0]
Epoch 0:  39%|███▉      | 86/220 [00:05<00:09, 14.48it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=85.00, consumed_samples=680.0]
Epoch 0:  40%|███▉      | 87/220 [00:05<00:09, 14.53it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=85.00, consumed_samples=680.0]
Epoch 0:  40%|███▉      | 87/220 [00:05<00:09, 14.53it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=86.00, consumed_samples=688.0]
Epoch 0:  40%|████      | 88/220 [00:06<00:09, 14.59it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=86.00, consumed_samples=688.0]
Epoch 0:  40%|████      | 88/220 [00:06<00:09, 14.59it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=87.00, consumed_samples=696.0]
Epoch 0:  40%|████      | 89/220 [00:06<00:08, 14.65it/s, loss=1.46, v_num=, reduced_train_loss=1.490, global_step=87.00, consumed_samples=696.0]
Epoch 0:  40%|████      | 89/220 [00:06<00:08, 14.64it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=88.00, consumed_samples=704.0]
Epoch 0:  41%|████      | 90/220 [00:06<00:08, 14.69it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=88.00, consumed_samples=704.0]
Epoch 0:  41%|████      | 90/220 [00:06<00:08, 14.69it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=89.00, consumed_samples=712.0]
Epoch 0:  41%|████▏     | 91/220 [00:06<00:08, 14.72it/s, loss=1.46, v_num=, reduced_train_loss=1.460, global_step=89.00, consumed_samples=712.0]
Epoch 0:  41%|████▏     | 91/220 [00:06<00:08, 14.72it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=90.00, consumed_samples=720.0]
Epoch 0:  42%|████▏     | 92/220 [00:06<00:08, 14.77it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=90.00, consumed_samples=720.0]
Epoch 0:  42%|████▏     | 92/220 [00:06<00:08, 14.77it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=91.00, consumed_samples=728.0]
Epoch 0:  42%|████▏     | 93/220 [00:06<00:08, 14.83it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=91.00, consumed_samples=728.0]
Epoch 0:  42%|████▏     | 93/220 [00:06<00:08, 14.83it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=92.00, consumed_samples=736.0]
Epoch 0:  43%|████▎     | 94/220 [00:06<00:08, 14.88it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=92.00, consumed_samples=736.0]
Epoch 0:  43%|████▎     | 94/220 [00:06<00:08, 14.88it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=93.00, consumed_samples=744.0]
Epoch 0:  43%|████▎     | 95/220 [00:06<00:08, 14.93it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=93.00, consumed_samples=744.0]
Epoch 0:  43%|████▎     | 95/220 [00:06<00:08, 14.93it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=94.00, consumed_samples=752.0]
Epoch 0:  44%|████▎     | 96/220 [00:06<00:08, 14.98it/s, loss=1.46, v_num=, reduced_train_loss=1.440, global_step=94.00, consumed_samples=752.0]
Epoch 0:  44%|████▎     | 96/220 [00:06<00:08, 14.98it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=95.00, consumed_samples=760.0]
Epoch 0:  44%|████▍     | 97/220 [00:06<00:08, 15.02it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=95.00, consumed_samples=760.0]
Epoch 0:  44%|████▍     | 97/220 [00:06<00:08, 15.02it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=96.00, consumed_samples=768.0]
Epoch 0:  45%|████▍     | 98/220 [00:06<00:08, 15.07it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=96.00, consumed_samples=768.0]
Epoch 0:  45%|████▍     | 98/220 [00:06<00:08, 15.07it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=97.00, consumed_samples=776.0]
Epoch 0:  45%|████▌     | 99/220 [00:06<00:08, 15.11it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=97.00, consumed_samples=776.0]
Epoch 0:  45%|████▌     | 99/220 [00:06<00:08, 15.10it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=98.00, consumed_samples=784.0]
Epoch 0:  45%|████▌     | 100/220 [00:06<00:07, 15.14it/s, loss=1.46, v_num=, reduced_train_loss=1.470, global_step=98.00, consumed_samples=784.0]
Epoch 0:  45%|████▌     | 100/220 [00:06<00:07, 15.14it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation: 0it [00:00, ?it/s]

Validation:   0%|          | 0/10 [00:00<?, ?it/s]

Validation DataLoader 0:   0%|          | 0/10 [00:00<?, ?it/s]

Validation DataLoader 0:  10%|█         | 1/10 [00:00<00:01,  8.65it/s]
Epoch 0:  46%|████▌     | 101/220 [00:06<00:07, 14.98it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  20%|██        | 2/10 [00:00<00:00, 15.52it/s]
Epoch 0:  46%|████▋     | 102/220 [00:06<00:07, 15.10it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  30%|███       | 3/10 [00:00<00:00, 17.33it/s]
Epoch 0:  47%|████▋     | 103/220 [00:06<00:07, 15.15it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  40%|████      | 4/10 [00:00<00:00, 20.76it/s]
Epoch 0:  47%|████▋     | 104/220 [00:06<00:07, 15.25it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  50%|█████     | 5/10 [00:00<00:00, 23.95it/s]
Epoch 0:  48%|████▊     | 105/220 [00:06<00:07, 15.36it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  60%|██████    | 6/10 [00:00<00:00, 26.99it/s]
Epoch 0:  48%|████▊     | 106/220 [00:06<00:07, 15.48it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  70%|███████   | 7/10 [00:00<00:00, 29.75it/s]
Epoch 0:  49%|████▊     | 107/220 [00:06<00:07, 15.59it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  80%|████████  | 8/10 [00:00<00:00, 32.18it/s]
Epoch 0:  49%|████▉     | 108/220 [00:06<00:07, 15.71it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0:  90%|█████████ | 9/10 [00:00<00:00, 34.39it/s]
Epoch 0:  50%|████▉     | 109/220 [00:06<00:07, 15.82it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]

Validation DataLoader 0: 100%|██████████| 10/10 [00:00<00:00, 36.39it/s]
Epoch 0:  50%|█████     | 110/220 [00:06<00:06, 15.94it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0]
Epoch 0:  50%|█████     | 110/220 [00:06<00:06, 15.93it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0, val_loss=1.420]

Epoch 0, global step 100: 'val_loss' reached 1.42065 (best 1.42065), saving model to '/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.42-step=100-consumed_samples=800.0.ckpt' as top 10

Epoch 0:  50%|█████     | 111/220 [00:08<00:08, 13.43it/s, loss=1.46, v_num=, reduced_train_loss=1.480, global_step=99.00, consumed_samples=792.0, val_loss=1.420]
Epoch 0:  50%|█████     | 111/220 [00:08<00:08, 13.43it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=100.0, consumed_samples=800.0, val_loss=1.420]
Epoch 0:  51%|█████     | 112/220 [00:08<00:08, 13.46it/s, loss=1.46, v_num=, reduced_train_loss=1.430, global_step=100.0, consumed_samples=800.0, val_loss=1.420]
Epoch 0:  51%|█████     | 112/220 [00:08<00:08, 13.46it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=101.0, consumed_samples=808.0, val_loss=1.420]
Epoch 0:  51%|█████▏    | 113/220 [00:08<00:07, 13.50it/s, loss=1.46, v_num=, reduced_train_loss=1.450, global_step=101.0, consumed_samples=808.0, val_loss=1.420]
Epoch 0:  51%|█████▏    | 113/220 [00:08<00:07, 13.50it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=102.0, consumed_samples=816.0, val_loss=1.420]
Epoch 0:  52%|█████▏    | 114/220 [00:08<00:07, 13.54it/s, loss=1.46, v_num=, reduced_train_loss=1.420, global_step=102.0, consumed_samples=816.0, val_loss=1.420]
Epoch 0:  52%|█████▏    | 114/220 [00:08<00:07, 13.54it/s, loss=1.45, v_num=, reduced_train_loss=1.420, global_step=103.0, consumed_samples=824.0, val_loss=1.420]
Epoch 0:  52%|█████▏    | 115/220 [00:08<00:07, 13.58it/s, loss=1.45, v_num=, reduced_train_loss=1.420, global_step=103.0, consumed_samples=824.0, val_loss=1.420]
Epoch 0:  52%|█████▏    | 115/220 [00:08<00:07, 13.58it/s, loss=1.45, v_num=, reduced_train_loss=1.420, global_step=104.0, consumed_samples=832.0, val_loss=1.420]
Epoch 0:  53%|█████▎    | 116/220 [00:08<00:07, 13.62it/s, loss=1.45, v_num=, reduced_train_loss=1.420, global_step=104.0, consumed_samples=832.0, val_loss=1.420]
Epoch 0:  53%|█████▎    | 116/220 [00:08<00:07, 13.62it/s, loss=1.45, v_num=, reduced_train_loss=1.410, global_step=105.0, consumed_samples=840.0, val_loss=1.420]
Epoch 0:  53%|█████▎    | 117/220 [00:08<00:07, 13.66it/s, loss=1.45, v_num=, reduced_train_loss=1.410, global_step=105.0, consumed_samples=840.0, val_loss=1.420]
Epoch 0:  53%|█████▎    | 117/220 [00:08<00:07, 13.66it/s, loss=1.45, v_num=, reduced_train_loss=1.410, global_step=106.0, consumed_samples=848.0, val_loss=1.420]
Epoch 0:  54%|█████▎    | 118/220 [00:08<00:07, 13.70it/s, loss=1.45, v_num=, reduced_train_loss=1.410, global_step=106.0, consumed_samples=848.0, val_loss=1.420]
Epoch 0:  54%|█████▎    | 118/220 [00:08<00:07, 13.70it/s, loss=1.44, v_num=, reduced_train_loss=1.410, global_step=107.0, consumed_samples=856.0, val_loss=1.420]
Epoch 0:  54%|█████▍    | 119/220 [00:08<00:07, 13.73it/s, loss=1.44, v_num=, reduced_train_loss=1.410, global_step=107.0, consumed_samples=856.0, val_loss=1.420]
Epoch 0:  54%|█████▍    | 119/220 [00:08<00:07, 13.73it/s, loss=1.44, v_num=, reduced_train_loss=1.430, global_step=108.0, consumed_samples=864.0, val_loss=1.420]
Epoch 0:  55%|█████▍    | 120/220 [00:08<00:07, 13.77it/s, loss=1.44, v_num=, reduced_train_loss=1.430, global_step=108.0, consumed_samples=864.0, val_loss=1.420]
Epoch 0:  55%|█████▍    | 120/220 [00:08<00:07, 13.77it/s, loss=1.44, v_num=, reduced_train_loss=1.480, global_step=109.0, consumed_samples=872.0, val_loss=1.420]
Epoch 0:  55%|█████▌    | 121/220 [00:08<00:07, 13.81it/s, loss=1.44, v_num=, reduced_train_loss=1.480, global_step=109.0, consumed_samples=872.0, val_loss=1.420]
Epoch 0:  55%|█████▌    | 121/220 [00:08<00:07, 13.81it/s, loss=1.44, v_num=, reduced_train_loss=1.410, global_step=110.0, consumed_samples=880.0, val_loss=1.420]
Epoch 0:  55%|█████▌    | 122/220 [00:08<00:07, 13.86it/s, loss=1.44, v_num=, reduced_train_loss=1.410, global_step=110.0, consumed_samples=880.0, val_loss=1.420]
Epoch 0:  55%|█████▌    | 122/220 [00:08<00:07, 13.86it/s, loss=1.44, v_num=, reduced_train_loss=1.430, global_step=111.0, consumed_samples=888.0, val_loss=1.420]
Epoch 0:  56%|█████▌    | 123/220 [00:08<00:06, 13.90it/s, loss=1.44, v_num=, reduced_train_loss=1.430, global_step=111.0, consumed_samples=888.0, val_loss=1.420]
Epoch 0:  56%|█████▌    | 123/220 [00:08<00:06, 13.90it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=112.0, consumed_samples=896.0, val_loss=1.420]
Epoch 0:  56%|█████▋    | 124/220 [00:08<00:06, 13.94it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=112.0, consumed_samples=896.0, val_loss=1.420]
Epoch 0:  56%|█████▋    | 124/220 [00:08<00:06, 13.94it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=113.0, consumed_samples=904.0, val_loss=1.420]
Epoch 0:  57%|█████▋    | 125/220 [00:08<00:06, 13.99it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=113.0, consumed_samples=904.0, val_loss=1.420]
Epoch 0:  57%|█████▋    | 125/220 [00:08<00:06, 13.99it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=114.0, consumed_samples=912.0, val_loss=1.420]
Epoch 0:  57%|█████▋    | 126/220 [00:08<00:06, 14.03it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=114.0, consumed_samples=912.0, val_loss=1.420]
Epoch 0:  57%|█████▋    | 126/220 [00:08<00:06, 14.02it/s, loss=1.43, v_num=, reduced_train_loss=1.400, global_step=115.0, consumed_samples=920.0, val_loss=1.420]
Epoch 0:  58%|█████▊    | 127/220 [00:09<00:06, 14.07it/s, loss=1.43, v_num=, reduced_train_loss=1.400, global_step=115.0, consumed_samples=920.0, val_loss=1.420]
Epoch 0:  58%|█████▊    | 127/220 [00:09<00:06, 14.07it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=116.0, consumed_samples=928.0, val_loss=1.420]
Epoch 0:  58%|█████▊    | 128/220 [00:09<00:06, 14.11it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=116.0, consumed_samples=928.0, val_loss=1.420]
Epoch 0:  58%|█████▊    | 128/220 [00:09<00:06, 14.11it/s, loss=1.43, v_num=, reduced_train_loss=1.390, global_step=117.0, consumed_samples=936.0, val_loss=1.420]
Epoch 0:  59%|█████▊    | 129/220 [00:09<00:06, 14.14it/s, loss=1.43, v_num=, reduced_train_loss=1.390, global_step=117.0, consumed_samples=936.0, val_loss=1.420]
Epoch 0:  59%|█████▊    | 129/220 [00:09<00:06, 14.14it/s, loss=1.42, v_num=, reduced_train_loss=1.440, global_step=118.0, consumed_samples=944.0, val_loss=1.420]
Epoch 0:  59%|█████▉    | 130/220 [00:09<00:06, 14.17it/s, loss=1.42, v_num=, reduced_train_loss=1.440, global_step=118.0, consumed_samples=944.0, val_loss=1.420]
Epoch 0:  59%|█████▉    | 130/220 [00:09<00:06, 14.17it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=119.0, consumed_samples=952.0, val_loss=1.420]
Epoch 0:  60%|█████▉    | 131/220 [00:09<00:06, 14.21it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=119.0, consumed_samples=952.0, val_loss=1.420]
Epoch 0:  60%|█████▉    | 131/220 [00:09<00:06, 14.21it/s, loss=1.42, v_num=, reduced_train_loss=1.420, global_step=120.0, consumed_samples=960.0, val_loss=1.420]
Epoch 0:  60%|██████    | 132/220 [00:09<00:06, 14.25it/s, loss=1.42, v_num=, reduced_train_loss=1.420, global_step=120.0, consumed_samples=960.0, val_loss=1.420]
Epoch 0:  60%|██████    | 132/220 [00:09<00:06, 14.25it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=121.0, consumed_samples=968.0, val_loss=1.420]
Epoch 0:  60%|██████    | 133/220 [00:09<00:06, 14.28it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=121.0, consumed_samples=968.0, val_loss=1.420]
Epoch 0:  60%|██████    | 133/220 [00:09<00:06, 14.28it/s, loss=1.42, v_num=, reduced_train_loss=1.470, global_step=122.0, consumed_samples=976.0, val_loss=1.420]
Epoch 0:  61%|██████    | 134/220 [00:09<00:06, 14.31it/s, loss=1.42, v_num=, reduced_train_loss=1.470, global_step=122.0, consumed_samples=976.0, val_loss=1.420]
Epoch 0:  61%|██████    | 134/220 [00:09<00:06, 14.31it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=123.0, consumed_samples=984.0, val_loss=1.420]
Epoch 0:  61%|██████▏   | 135/220 [00:09<00:05, 14.34it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=123.0, consumed_samples=984.0, val_loss=1.420]
Epoch 0:  61%|██████▏   | 135/220 [00:09<00:05, 14.34it/s, loss=1.42, v_num=, reduced_train_loss=1.470, global_step=124.0, consumed_samples=992.0, val_loss=1.420]
Epoch 0:  62%|██████▏   | 136/220 [00:09<00:05, 14.37it/s, loss=1.42, v_num=, reduced_train_loss=1.470, global_step=124.0, consumed_samples=992.0, val_loss=1.420]
Epoch 0:  62%|██████▏   | 136/220 [00:09<00:05, 14.37it/s, loss=1.42, v_num=, reduced_train_loss=1.430, global_step=125.0, consumed_samples=1e+3, val_loss=1.420]
Epoch 0:  62%|██████▏   | 137/220 [00:09<00:05, 14.39it/s, loss=1.42, v_num=, reduced_train_loss=1.430, global_step=125.0, consumed_samples=1e+3, val_loss=1.420]
Epoch 0:  62%|██████▏   | 137/220 [00:09<00:05, 14.39it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=126.0, consumed_samples=1008.0, val_loss=1.420]
Epoch 0:  63%|██████▎   | 138/220 [00:09<00:05, 14.43it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=126.0, consumed_samples=1008.0, val_loss=1.420]
Epoch 0:  63%|██████▎   | 138/220 [00:09<00:05, 14.43it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=127.0, consumed_samples=1016.0, val_loss=1.420]
Epoch 0:  63%|██████▎   | 139/220 [00:09<00:05, 14.45it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=127.0, consumed_samples=1016.0, val_loss=1.420]
Epoch 0:  63%|██████▎   | 139/220 [00:09<00:05, 14.45it/s, loss=1.42, v_num=, reduced_train_loss=1.460, global_step=128.0, consumed_samples=1024.0, val_loss=1.420]
Epoch 0:  64%|██████▎   | 140/220 [00:09<00:05, 14.49it/s, loss=1.42, v_num=, reduced_train_loss=1.460, global_step=128.0, consumed_samples=1024.0, val_loss=1.420]
Epoch 0:  64%|██████▎   | 140/220 [00:09<00:05, 14.49it/s, loss=1.42, v_num=, reduced_train_loss=1.420, global_step=129.0, consumed_samples=1032.0, val_loss=1.420]
Epoch 0:  64%|██████▍   | 141/220 [00:09<00:05, 14.52it/s, loss=1.42, v_num=, reduced_train_loss=1.420, global_step=129.0, consumed_samples=1032.0, val_loss=1.420]
Epoch 0:  64%|██████▍   | 141/220 [00:09<00:05, 14.52it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=130.0, consumed_samples=1040.0, val_loss=1.420]
Epoch 0:  65%|██████▍   | 142/220 [00:09<00:05, 14.55it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=130.0, consumed_samples=1040.0, val_loss=1.420]
Epoch 0:  65%|██████▍   | 142/220 [00:09<00:05, 14.55it/s, loss=1.42, v_num=, reduced_train_loss=1.450, global_step=131.0, consumed_samples=1048.0, val_loss=1.420]
Epoch 0:  65%|██████▌   | 143/220 [00:09<00:05, 14.59it/s, loss=1.42, v_num=, reduced_train_loss=1.450, global_step=131.0, consumed_samples=1048.0, val_loss=1.420]
Epoch 0:  65%|██████▌   | 143/220 [00:09<00:05, 14.59it/s, loss=1.43, v_num=, reduced_train_loss=1.460, global_step=132.0, consumed_samples=1056.0, val_loss=1.420]
Epoch 0:  65%|██████▌   | 144/220 [00:09<00:05, 14.62it/s, loss=1.43, v_num=, reduced_train_loss=1.460, global_step=132.0, consumed_samples=1056.0, val_loss=1.420]
Epoch 0:  65%|██████▌   | 144/220 [00:09<00:05, 14.62it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=133.0, consumed_samples=1064.0, val_loss=1.420]
Epoch 0:  66%|██████▌   | 145/220 [00:09<00:05, 14.63it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=133.0, consumed_samples=1064.0, val_loss=1.420]
Epoch 0:  66%|██████▌   | 145/220 [00:09<00:05, 14.63it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=134.0, consumed_samples=1072.0, val_loss=1.420]
Epoch 0:  66%|██████▋   | 146/220 [00:09<00:05, 14.67it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=134.0, consumed_samples=1072.0, val_loss=1.420]
Epoch 0:  66%|██████▋   | 146/220 [00:09<00:05, 14.67it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=135.0, consumed_samples=1080.0, val_loss=1.420]
Epoch 0:  67%|██████▋   | 147/220 [00:10<00:04, 14.70it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=135.0, consumed_samples=1080.0, val_loss=1.420]
Epoch 0:  67%|██████▋   | 147/220 [00:10<00:04, 14.70it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=136.0, consumed_samples=1088.0, val_loss=1.420]
Epoch 0:  67%|██████▋   | 148/220 [00:10<00:04, 14.73it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=136.0, consumed_samples=1088.0, val_loss=1.420]
Epoch 0:  67%|██████▋   | 148/220 [00:10<00:04, 14.72it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=137.0, consumed_samples=1096.0, val_loss=1.420]
Epoch 0:  68%|██████▊   | 149/220 [00:10<00:04, 14.75it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=137.0, consumed_samples=1096.0, val_loss=1.420]
Epoch 0:  68%|██████▊   | 149/220 [00:10<00:04, 14.75it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=138.0, consumed_samples=1104.0, val_loss=1.420]
Epoch 0:  68%|██████▊   | 150/220 [00:10<00:04, 14.78it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=138.0, consumed_samples=1104.0, val_loss=1.420]
Epoch 0:  68%|██████▊   | 150/220 [00:10<00:04, 14.78it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=139.0, consumed_samples=1112.0, val_loss=1.420]
Epoch 0:  69%|██████▊   | 151/220 [00:10<00:04, 14.81it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=139.0, consumed_samples=1112.0, val_loss=1.420]
Epoch 0:  69%|██████▊   | 151/220 [00:10<00:04, 14.81it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=140.0, consumed_samples=1120.0, val_loss=1.420]
Epoch 0:  69%|██████▉   | 152/220 [00:10<00:04, 14.84it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=140.0, consumed_samples=1120.0, val_loss=1.420]
Epoch 0:  69%|██████▉   | 152/220 [00:10<00:04, 14.83it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=141.0, consumed_samples=1128.0, val_loss=1.420]
Epoch 0:  70%|██████▉   | 153/220 [00:10<00:04, 14.86it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=141.0, consumed_samples=1128.0, val_loss=1.420]
Epoch 0:  70%|██████▉   | 153/220 [00:10<00:04, 14.86it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=142.0, consumed_samples=1136.0, val_loss=1.420]
Epoch 0:  70%|███████   | 154/220 [00:10<00:04, 14.88it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=142.0, consumed_samples=1136.0, val_loss=1.420]
Epoch 0:  70%|███████   | 154/220 [00:10<00:04, 14.88it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=143.0, consumed_samples=1144.0, val_loss=1.420]
Epoch 0:  70%|███████   | 155/220 [00:10<00:04, 14.90it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=143.0, consumed_samples=1144.0, val_loss=1.420]
Epoch 0:  70%|███████   | 155/220 [00:10<00:04, 14.90it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=144.0, consumed_samples=1152.0, val_loss=1.420]
Epoch 0:  71%|███████   | 156/220 [00:10<00:04, 14.93it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=144.0, consumed_samples=1152.0, val_loss=1.420]
Epoch 0:  71%|███████   | 156/220 [00:10<00:04, 14.93it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=145.0, consumed_samples=1160.0, val_loss=1.420]
Epoch 0:  71%|███████▏  | 157/220 [00:10<00:04, 14.95it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=145.0, consumed_samples=1160.0, val_loss=1.420]
Epoch 0:  71%|███████▏  | 157/220 [00:10<00:04, 14.95it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=146.0, consumed_samples=1168.0, val_loss=1.420]
Epoch 0:  72%|███████▏  | 158/220 [00:10<00:04, 14.97it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=146.0, consumed_samples=1168.0, val_loss=1.420]
Epoch 0:  72%|███████▏  | 158/220 [00:10<00:04, 14.97it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=147.0, consumed_samples=1176.0, val_loss=1.420]
Epoch 0:  72%|███████▏  | 159/220 [00:10<00:04, 14.99it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=147.0, consumed_samples=1176.0, val_loss=1.420]
Epoch 0:  72%|███████▏  | 159/220 [00:10<00:04, 14.99it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=148.0, consumed_samples=1184.0, val_loss=1.420]
Epoch 0:  73%|███████▎  | 160/220 [00:10<00:03, 15.02it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=148.0, consumed_samples=1184.0, val_loss=1.420]
Epoch 0:  73%|███████▎  | 160/220 [00:10<00:03, 15.02it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=149.0, consumed_samples=1192.0, val_loss=1.420]
Epoch 0:  73%|███████▎  | 161/220 [00:10<00:03, 15.04it/s, loss=1.43, v_num=, reduced_train_loss=1.440, global_step=149.0, consumed_samples=1192.0, val_loss=1.420]
Epoch 0:  73%|███████▎  | 161/220 [00:10<00:03, 15.04it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=150.0, consumed_samples=1200.0, val_loss=1.420]
Epoch 0:  74%|███████▎  | 162/220 [00:10<00:03, 15.07it/s, loss=1.43, v_num=, reduced_train_loss=1.420, global_step=150.0, consumed_samples=1200.0, val_loss=1.420]
Epoch 0:  74%|███████▎  | 162/220 [00:10<00:03, 15.07it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=151.0, consumed_samples=1208.0, val_loss=1.420]
Epoch 0:  74%|███████▍  | 163/220 [00:10<00:03, 15.10it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=151.0, consumed_samples=1208.0, val_loss=1.420]
Epoch 0:  74%|███████▍  | 163/220 [00:10<00:03, 15.09it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=152.0, consumed_samples=1216.0, val_loss=1.420]
Epoch 0:  75%|███████▍  | 164/220 [00:10<00:03, 15.12it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=152.0, consumed_samples=1216.0, val_loss=1.420]
Epoch 0:  75%|███████▍  | 164/220 [00:10<00:03, 15.12it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=153.0, consumed_samples=1224.0, val_loss=1.420]
Epoch 0:  75%|███████▌  | 165/220 [00:10<00:03, 15.15it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=153.0, consumed_samples=1224.0, val_loss=1.420]
Epoch 0:  75%|███████▌  | 165/220 [00:10<00:03, 15.15it/s, loss=1.43, v_num=, reduced_train_loss=1.470, global_step=154.0, consumed_samples=1232.0, val_loss=1.420]
Epoch 0:  75%|███████▌  | 166/220 [00:10<00:03, 15.18it/s, loss=1.43, v_num=, reduced_train_loss=1.470, global_step=154.0, consumed_samples=1232.0, val_loss=1.420]
Epoch 0:  75%|███████▌  | 166/220 [00:10<00:03, 15.18it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=155.0, consumed_samples=1240.0, val_loss=1.420]
Epoch 0:  76%|███████▌  | 167/220 [00:10<00:03, 15.21it/s, loss=1.43, v_num=, reduced_train_loss=1.450, global_step=155.0, consumed_samples=1240.0, val_loss=1.420]
Epoch 0:  76%|███████▌  | 167/220 [00:10<00:03, 15.21it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=156.0, consumed_samples=1248.0, val_loss=1.420]
Epoch 0:  76%|███████▋  | 168/220 [00:11<00:03, 15.23it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=156.0, consumed_samples=1248.0, val_loss=1.420]
Epoch 0:  76%|███████▋  | 168/220 [00:11<00:03, 15.23it/s, loss=1.43, v_num=, reduced_train_loss=1.390, global_step=157.0, consumed_samples=1256.0, val_loss=1.420]
Epoch 0:  77%|███████▋  | 169/220 [00:11<00:03, 15.26it/s, loss=1.43, v_num=, reduced_train_loss=1.390, global_step=157.0, consumed_samples=1256.0, val_loss=1.420]
Epoch 0:  77%|███████▋  | 169/220 [00:11<00:03, 15.26it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=158.0, consumed_samples=1264.0, val_loss=1.420]
Epoch 0:  77%|███████▋  | 170/220 [00:11<00:03, 15.28it/s, loss=1.43, v_num=, reduced_train_loss=1.430, global_step=158.0, consumed_samples=1264.0, val_loss=1.420]
Epoch 0:  77%|███████▋  | 170/220 [00:11<00:03, 15.28it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=159.0, consumed_samples=1272.0, val_loss=1.420]
Epoch 0:  78%|███████▊  | 171/220 [00:11<00:03, 15.30it/s, loss=1.43, v_num=, reduced_train_loss=1.410, global_step=159.0, consumed_samples=1272.0, val_loss=1.420]
Epoch 0:  78%|███████▊  | 171/220 [00:11<00:03, 15.30it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=160.0, consumed_samples=1280.0, val_loss=1.420]
Epoch 0:  78%|███████▊  | 172/220 [00:11<00:03, 15.32it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=160.0, consumed_samples=1280.0, val_loss=1.420]
Epoch 0:  78%|███████▊  | 172/220 [00:11<00:03, 15.32it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=161.0, consumed_samples=1288.0, val_loss=1.420]
Epoch 0:  79%|███████▊  | 173/220 [00:11<00:03, 15.34it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=161.0, consumed_samples=1288.0, val_loss=1.420]
Epoch 0:  79%|███████▊  | 173/220 [00:11<00:03, 15.33it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=162.0, consumed_samples=1296.0, val_loss=1.420]
Epoch 0:  79%|███████▉  | 174/220 [00:11<00:02, 15.36it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=162.0, consumed_samples=1296.0, val_loss=1.420]
Epoch 0:  79%|███████▉  | 174/220 [00:11<00:02, 15.36it/s, loss=1.42, v_num=, reduced_train_loss=1.430, global_step=163.0, consumed_samples=1304.0, val_loss=1.420]
Epoch 0:  80%|███████▉  | 175/220 [00:11<00:02, 15.38it/s, loss=1.42, v_num=, reduced_train_loss=1.430, global_step=163.0, consumed_samples=1304.0, val_loss=1.420]
Epoch 0:  80%|███████▉  | 175/220 [00:11<00:02, 15.38it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=164.0, consumed_samples=1312.0, val_loss=1.420]
Epoch 0:  80%|████████  | 176/220 [00:11<00:02, 15.40it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=164.0, consumed_samples=1312.0, val_loss=1.420]
Epoch 0:  80%|████████  | 176/220 [00:11<00:02, 15.40it/s, loss=1.42, v_num=, reduced_train_loss=1.390, global_step=165.0, consumed_samples=1320.0, val_loss=1.420]
Epoch 0:  80%|████████  | 177/220 [00:11<00:02, 15.40it/s, loss=1.42, v_num=, reduced_train_loss=1.390, global_step=165.0, consumed_samples=1320.0, val_loss=1.420]
Epoch 0:  80%|████████  | 177/220 [00:11<00:02, 15.40it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=166.0, consumed_samples=1328.0, val_loss=1.420]
Epoch 0:  81%|████████  | 178/220 [00:11<00:02, 15.42it/s, loss=1.42, v_num=, reduced_train_loss=1.400, global_step=166.0, consumed_samples=1328.0, val_loss=1.420]
Epoch 0:  81%|████████  | 178/220 [00:11<00:02, 15.42it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=167.0, consumed_samples=1336.0, val_loss=1.420]
Epoch 0:  81%|████████▏ | 179/220 [00:11<00:02, 15.44it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=167.0, consumed_samples=1336.0, val_loss=1.420]
Epoch 0:  81%|████████▏ | 179/220 [00:11<00:02, 15.44it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=168.0, consumed_samples=1344.0, val_loss=1.420]
Epoch 0:  82%|████████▏ | 180/220 [00:11<00:02, 15.46it/s, loss=1.42, v_num=, reduced_train_loss=1.410, global_step=168.0, consumed_samples=1344.0, val_loss=1.420]
Epoch 0:  82%|████████▏ | 180/220 [00:11<00:02, 15.46it/s, loss=1.41, v_num=, reduced_train_loss=1.380, global_step=169.0, consumed_samples=1352.0, val_loss=1.420]
Epoch 0:  82%|████████▏ | 181/220 [00:11<00:02, 15.48it/s, loss=1.41, v_num=, reduced_train_loss=1.380, global_step=169.0, consumed_samples=1352.0, val_loss=1.420]
Epoch 0:  82%|████████▏ | 181/220 [00:11<00:02, 15.48it/s, loss=1.41, v_num=, reduced_train_loss=1.420, global_step=170.0, consumed_samples=1360.0, val_loss=1.420]
Epoch 0:  83%|████████▎ | 182/220 [00:11<00:02, 15.49it/s, loss=1.41, v_num=, reduced_train_loss=1.420, global_step=170.0, consumed_samples=1360.0, val_loss=1.420]
Epoch 0:  83%|████████▎ | 182/220 [00:11<00:02, 15.49it/s, loss=1.41, v_num=, reduced_train_loss=1.450, global_step=171.0, consumed_samples=1368.0, val_loss=1.420]
Epoch 0:  83%|████████▎ | 183/220 [00:11<00:02, 15.50it/s, loss=1.41, v_num=, reduced_train_loss=1.450, global_step=171.0, consumed_samples=1368.0, val_loss=1.420]
Epoch 0:  83%|████████▎ | 183/220 [00:11<00:02, 15.50it/s, loss=1.41, v_num=, reduced_train_loss=1.430, global_step=172.0, consumed_samples=1376.0, val_loss=1.420]
Epoch 0:  84%|████████▎ | 184/220 [00:11<00:02, 15.52it/s, loss=1.41, v_num=, reduced_train_loss=1.430, global_step=172.0, consumed_samples=1376.0, val_loss=1.420]
Epoch 0:  84%|████████▎ | 184/220 [00:11<00:02, 15.52it/s, loss=1.41, v_num=, reduced_train_loss=1.390, global_step=173.0, consumed_samples=1384.0, val_loss=1.420]
Epoch 0:  84%|████████▍ | 185/220 [00:11<00:02, 15.54it/s, loss=1.41, v_num=, reduced_train_loss=1.390, global_step=173.0, consumed_samples=1384.0, val_loss=1.420]
Epoch 0:  84%|████████▍ | 185/220 [00:11<00:02, 15.54it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=174.0, consumed_samples=1392.0, val_loss=1.420]
Epoch 0:  85%|████████▍ | 186/220 [00:11<00:02, 15.56it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=174.0, consumed_samples=1392.0, val_loss=1.420]
Epoch 0:  85%|████████▍ | 186/220 [00:11<00:02, 15.56it/s, loss=1.41, v_num=, reduced_train_loss=1.410, global_step=175.0, consumed_samples=1400.0, val_loss=1.420]
Epoch 0:  85%|████████▌ | 187/220 [00:12<00:02, 15.58it/s, loss=1.41, v_num=, reduced_train_loss=1.410, global_step=175.0, consumed_samples=1400.0, val_loss=1.420]
Epoch 0:  85%|████████▌ | 187/220 [00:12<00:02, 15.58it/s, loss=1.41, v_num=, reduced_train_loss=1.410, global_step=176.0, consumed_samples=1408.0, val_loss=1.420]
Epoch 0:  85%|████████▌ | 188/220 [00:12<00:02, 15.60it/s, loss=1.41, v_num=, reduced_train_loss=1.410, global_step=176.0, consumed_samples=1408.0, val_loss=1.420]
Epoch 0:  85%|████████▌ | 188/220 [00:12<00:02, 15.60it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=177.0, consumed_samples=1416.0, val_loss=1.420]
Epoch 0:  86%|████████▌ | 189/220 [00:12<00:01, 15.61it/s, loss=1.41, v_num=, reduced_train_loss=1.400, global_step=177.0, consumed_samples=1416.0, val_loss=1.420]
Epoch 0:  86%|████████▌ | 189/220 [00:12<00:01, 15.61it/s, loss=1.41, v_num=, reduced_train_loss=1.380, global_step=178.0, consumed_samples=1424.0, val_loss=1.420]
Epoch 0:  86%|████████▋ | 190/220 [00:12<00:01, 15.63it/s, loss=1.41, v_num=, reduced_train_loss=1.380, global_step=178.0, consumed_samples=1424.0, val_loss=1.420]
Epoch 0:  86%|████████▋ | 190/220 [00:12<00:01, 15.63it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=179.0, consumed_samples=1432.0, val_loss=1.420]
Epoch 0:  87%|████████▋ | 191/220 [00:12<00:01, 15.65it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=179.0, consumed_samples=1432.0, val_loss=1.420]
Epoch 0:  87%|████████▋ | 191/220 [00:12<00:01, 15.65it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=180.0, consumed_samples=1440.0, val_loss=1.420]
Epoch 0:  87%|████████▋ | 192/220 [00:12<00:01, 15.66it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=180.0, consumed_samples=1440.0, val_loss=1.420]
Epoch 0:  87%|████████▋ | 192/220 [00:12<00:01, 15.66it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=181.0, consumed_samples=1448.0, val_loss=1.420]
Epoch 0:  88%|████████▊ | 193/220 [00:12<00:01, 15.68it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=181.0, consumed_samples=1448.0, val_loss=1.420]
Epoch 0:  88%|████████▊ | 193/220 [00:12<00:01, 15.68it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=182.0, consumed_samples=1456.0, val_loss=1.420]
Epoch 0:  88%|████████▊ | 194/220 [00:12<00:01, 15.70it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=182.0, consumed_samples=1456.0, val_loss=1.420]
Epoch 0:  88%|████████▊ | 194/220 [00:12<00:01, 15.70it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=183.0, consumed_samples=1464.0, val_loss=1.420]
Epoch 0:  89%|████████▊ | 195/220 [00:12<00:01, 15.72it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=183.0, consumed_samples=1464.0, val_loss=1.420]
Epoch 0:  89%|████████▊ | 195/220 [00:12<00:01, 15.71it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=184.0, consumed_samples=1472.0, val_loss=1.420]
Epoch 0:  89%|████████▉ | 196/220 [00:12<00:01, 15.73it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=184.0, consumed_samples=1472.0, val_loss=1.420]
Epoch 0:  89%|████████▉ | 196/220 [00:12<00:01, 15.73it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=185.0, consumed_samples=1480.0, val_loss=1.420]
Epoch 0:  90%|████████▉ | 197/220 [00:12<00:01, 15.75it/s, loss=1.4, v_num=, reduced_train_loss=1.410, global_step=185.0, consumed_samples=1480.0, val_loss=1.420]
Epoch 0:  90%|████████▉ | 197/220 [00:12<00:01, 15.75it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=186.0, consumed_samples=1488.0, val_loss=1.420]
Epoch 0:  90%|█████████ | 198/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.390, global_step=186.0, consumed_samples=1488.0, val_loss=1.420]
Epoch 0:  90%|█████████ | 198/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=187.0, consumed_samples=1496.0, val_loss=1.420]
Epoch 0:  90%|█████████ | 199/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=187.0, consumed_samples=1496.0, val_loss=1.420]
Epoch 0:  90%|█████████ | 199/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.370, global_step=188.0, consumed_samples=1504.0, val_loss=1.420]
Epoch 0:  91%|█████████ | 200/220 [00:12<00:01, 15.75it/s, loss=1.4, v_num=, reduced_train_loss=1.370, global_step=188.0, consumed_samples=1504.0, val_loss=1.420]
Epoch 0:  91%|█████████ | 200/220 [00:12<00:01, 15.75it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=189.0, consumed_samples=1512.0, val_loss=1.420]
Epoch 0:  91%|█████████▏| 201/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.380, global_step=189.0, consumed_samples=1512.0, val_loss=1.420]
Epoch 0:  91%|█████████▏| 201/220 [00:12<00:01, 15.76it/s, loss=1.4, v_num=, reduced_train_loss=1.420, global_step=190.0, consumed_samples=1520.0, val_loss=1.420]
Epoch 0:  92%|█████████▏| 202/220 [00:12<00:01, 15.78it/s, loss=1.4, v_num=, reduced_train_loss=1.420, global_step=190.0, consumed_samples=1520.0, val_loss=1.420]
Epoch 0:  92%|█████████▏| 202/220 [00:12<00:01, 15.78it/s, loss=1.4, v_num=, reduced_train_loss=1.400, global_step=191.0, consumed_samples=1528.0, val_loss=1.420]
Epoch 0:  92%|█████████▏| 203/220 [00:12<00:01, 15.79it/s, loss=1.4, v_num=, reduced_train_loss=1.400, global_step=191.0, consumed_samples=1528.0, val_loss=1.420]
Epoch 0:  92%|█████████▏| 203/220 [00:12<00:01, 15.79it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=192.0, consumed_samples=1536.0, val_loss=1.420]
Epoch 0:  93%|█████████▎| 204/220 [00:12<00:01, 15.81it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=192.0, consumed_samples=1536.0, val_loss=1.420]
Epoch 0:  93%|█████████▎| 204/220 [00:12<00:01, 15.81it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=193.0, consumed_samples=1544.0, val_loss=1.420]
Epoch 0:  93%|█████████▎| 205/220 [00:12<00:00, 15.83it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=193.0, consumed_samples=1544.0, val_loss=1.420]
Epoch 0:  93%|█████████▎| 205/220 [00:12<00:00, 15.83it/s, loss=1.39, v_num=, reduced_train_loss=1.410, global_step=194.0, consumed_samples=1552.0, val_loss=1.420]
Epoch 0:  94%|█████████▎| 206/220 [00:13<00:00, 15.84it/s, loss=1.39, v_num=, reduced_train_loss=1.410, global_step=194.0, consumed_samples=1552.0, val_loss=1.420]
Epoch 0:  94%|█████████▎| 206/220 [00:13<00:00, 15.84it/s, loss=1.39, v_num=, reduced_train_loss=1.380, global_step=195.0, consumed_samples=1560.0, val_loss=1.420]
Epoch 0:  94%|█████████▍| 207/220 [00:13<00:00, 15.86it/s, loss=1.39, v_num=, reduced_train_loss=1.380, global_step=195.0, consumed_samples=1560.0, val_loss=1.420]
Epoch 0:  94%|█████████▍| 207/220 [00:13<00:00, 15.86it/s, loss=1.39, v_num=, reduced_train_loss=1.410, global_step=196.0, consumed_samples=1568.0, val_loss=1.420]
Epoch 0:  95%|█████████▍| 208/220 [00:13<00:00, 15.87it/s, loss=1.39, v_num=, reduced_train_loss=1.410, global_step=196.0, consumed_samples=1568.0, val_loss=1.420]
Epoch 0:  95%|█████████▍| 208/220 [00:13<00:00, 15.87it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=197.0, consumed_samples=1576.0, val_loss=1.420]
Epoch 0:  95%|█████████▌| 209/220 [00:13<00:00, 15.89it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=197.0, consumed_samples=1576.0, val_loss=1.420]
Epoch 0:  95%|█████████▌| 209/220 [00:13<00:00, 15.89it/s, loss=1.39, v_num=, reduced_train_loss=1.400, global_step=198.0, consumed_samples=1584.0, val_loss=1.420]
Epoch 0:  95%|█████████▌| 210/220 [00:13<00:00, 15.90it/s, loss=1.39, v_num=, reduced_train_loss=1.400, global_step=198.0, consumed_samples=1584.0, val_loss=1.420]
Epoch 0:  95%|█████████▌| 210/220 [00:13<00:00, 15.90it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation: 0it [00:00, ?it/s]

Validation:   0%|          | 0/10 [00:00<?, ?it/s]

Validation DataLoader 0:   0%|          | 0/10 [00:00<?, ?it/s]

Validation DataLoader 0:  10%|█         | 1/10 [00:00<00:00, 12.19it/s]
Epoch 0:  96%|█████████▌| 211/220 [00:13<00:00, 15.84it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  20%|██        | 2/10 [00:00<00:00, 16.00it/s]
Epoch 0:  96%|█████████▋| 212/220 [00:13<00:00, 15.86it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  30%|███       | 3/10 [00:00<00:00, 16.09it/s]
Epoch 0:  97%|█████████▋| 213/220 [00:13<00:00, 15.86it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  40%|████      | 4/10 [00:00<00:00, 19.97it/s]
Epoch 0:  97%|█████████▋| 214/220 [00:13<00:00, 15.92it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  50%|█████     | 5/10 [00:00<00:00, 22.20it/s]
Epoch 0:  98%|█████████▊| 215/220 [00:13<00:00, 15.97it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  60%|██████    | 6/10 [00:00<00:00, 25.03it/s]
Epoch 0:  98%|█████████▊| 216/220 [00:13<00:00, 16.02it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  70%|███████   | 7/10 [00:00<00:00, 27.69it/s]
Epoch 0:  99%|█████████▊| 217/220 [00:13<00:00, 16.08it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  80%|████████  | 8/10 [00:00<00:00, 29.94it/s]
Epoch 0:  99%|█████████▉| 218/220 [00:13<00:00, 16.14it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0:  90%|█████████ | 9/10 [00:00<00:00, 32.10it/s]
Epoch 0: 100%|█████████▉| 219/220 [00:13<00:00, 16.20it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]

Validation DataLoader 0: 100%|██████████| 10/10 [00:00<00:00, 34.09it/s]
Epoch 0: 100%|██████████| 220/220 [00:13<00:00, 16.25it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.420]
Epoch 0: 100%|██████████| 220/220 [00:13<00:00, 16.25it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.380]

Epoch 0, global step 200: 'val_loss' reached 1.38422 (best 1.38422), saving model to '/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.38-step=200-consumed_samples=1600.0.ckpt' as top 10
[NeMo I 2023-08-17 16:22:48 nlp_overrides:226] Removing checkpoint: /result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints/megatron_bert--val_loss=1.42-step=100-consumed_samples=800.0-last.ckpt

Epoch 0: 100%|██████████| 220/220 [00:14<00:00, 14.79it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.380]
`Trainer.fit` stopped: `max_steps=200` reached.

Epoch 0: 100%|██████████| 220/220 [00:14<00:00, 14.79it/s, loss=1.39, v_num=, reduced_train_loss=1.390, global_step=199.0, consumed_samples=1592.0, val_loss=1.380]
[NeMo I 2023-08-17 16:22:49 pretrain_oas:24] ************** Finished Training ***********
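
The checkpoint file names in the log above encode the validation loss, the step, and the number of consumed samples. The following is a minimal sketch of how those values could be read back from the file names, for example to pick the checkpoint with the lowest validation loss. The regular expression is inferred only from the names shown in this log, and `parse_checkpoint_name` and `best_checkpoint` are hypothetical helpers written for illustration; they are not part of NeMo or BioNeMo.

```python
import re
from pathlib import Path

# Pattern inferred from the checkpoint names in the training log; this is an
# assumption about the naming scheme, not a documented NeMo interface.
CKPT_PATTERN = re.compile(
    r"val_loss=(?P<val_loss>\d+(?:\.\d+)?)"
    r"-step=(?P<step>\d+)"
    r"-consumed_samples=(?P<consumed_samples>\d+(?:\.\d+)?)"
)


def parse_checkpoint_name(path):
    """Return the metrics encoded in a checkpoint file name, or None if the
    name does not follow the assumed pattern."""
    name = Path(path).name
    match = CKPT_PATTERN.search(name)
    if match is None:
        return None
    return {
        "path": str(path),
        "val_loss": float(match["val_loss"]),
        "step": int(match["step"]),
        "consumed_samples": float(match["consumed_samples"]),
        # Names ending in "-last.ckpt" mark the most recent checkpoint.
        "is_last": name.endswith("-last.ckpt"),
    }


def best_checkpoint(paths):
    """Return the parsed entry with the lowest val_loss, or None if no
    path matches the assumed pattern."""
    parsed = [info for info in map(parse_checkpoint_name, paths) if info]
    return min(parsed, key=lambda info: info["val_loss"]) if parsed else None


if __name__ == "__main__":
    # The two checkpoint paths that appear in the log above.
    ckpt_dir = "/result/nemo_experiments/esm1nv-oas/esm1nv-oas_pretraining/checkpoints"
    names = [
        f"{ckpt_dir}/megatron_bert--val_loss=1.38-step=200-consumed_samples=1600.0.ckpt",
        f"{ckpt_dir}/megatron_bert--val_loss=1.42-step=100-consumed_samples=800.0-last.ckpt",
    ]
    best = best_checkpoint(names)
    print(best["step"], best["val_loss"])
    # Samples consumed per optimizer step, as implied by the file name.
    print(best["consumed_samples"] / best["step"])
```

In this log, dividing `consumed_samples` by `step` gives 8 for both checkpoints (1600 / 200 and 800 / 100), which matches the increase of 8 in `consumed_samples` per `global_step` in the progress lines. To apply the sketch to a real run, the list of names could be built with `Path(ckpt_dir).glob("*.ckpt")`.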

Results#

The training run creates a directory called esm1nv-oas_pretraining in /result/nemo_experiments/esm1nv-oas. It contains the output files of the run (logs, checkpoints, and so on):

checkpoints
cmd-args.log
events.out.tfevents.1692289352.drugdiscovery3-dt.335.0
git-info.log
hparams.yaml
lightning_logs.txt
nemo_error_log.txt
nemo_log_globalrank-0_localrank-0.txt