Training and Scaling#
This page provides detailed information on training speechlm2 models, including setup requirements, running experiments at scale, debugging, and parallelism strategies.
Running Experiments#
The speechlm2 collection includes several scripts to facilitate running experiments, especially on SLURM-based clusters.
SLURM Job Submission#
For training on SLURM clusters, use the following workflow:
# Submit 8 consecutive jobs with random seeds
scripts/speechlm2/auto_launcher_with_seed.sh -n8 s2s_tinyllama_repro.sub
The auto_launcher_with_seed.sh script (its submission pattern is sketched after this list):
Generates a random seed for each submitted job
Leverages shard_seed="randomized" in Lhotse to ensure each data parallel rank is seeded differently
Ensures each tensor parallel rank is seeded identically
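In rough outline, the submission pattern looks like the following. This is a hypothetical Python sketch, not the actual launcher: it assumes sbatch is on PATH and that the .sub script takes the seed base as its first positional argument (as in the example script below); the dependency handling that makes the jobs run consecutively is omitted:
import random
import subprocess

def submit_with_random_seeds(sub_script: str, num_jobs: int) -> None:
    """Submit num_jobs copies of a SLURM batch script, each with its own random seed base."""
    for _ in range(num_jobs):
        seed = random.randint(0, 2**31 - 1)
        # The .sub script forwards this value to data.train_ds.seed and
        # data.validation_ds.seed (see the submission script below).
        subprocess.run(["sbatch", sub_script, str(seed)], check=True)

submit_with_random_seeds("s2s_tinyllama_repro.sub", num_jobs=8)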
SLURM Submission Script#
Example s2s_tinyllama_repro.sub script:
#!/bin/bash
#SBATCH --job-name=s2s_training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --output=s2s_tinyllama_repro_%j.out
# Check that the global random seed base is provided
if [ -z "$1" ]; then
echo "Usage: $0 <global_random_seed_base>"
exit 1
fi
SEED=${1}
EXP_NAME="s2s_training"
RESULTS_DIR="results/${EXP_NAME}"
srun --ntasks=${SLURM_NTASKS} --ntasks-per-node=${SLURM_NTASKS_PER_NODE} \
python -u examples/speechlm2/s2s_duplex_train.py \
--config-path=/path/to/config/dir \
--config-name=s2s_training.yaml \
exp_manager.name=${EXP_NAME} \
exp_manager.wandb_logger_kwargs.name=${EXP_NAME} \
trainer.num_nodes=$SLURM_JOB_NUM_NODES \
exp_manager.explicit_log_dir=${RESULTS_DIR} \
data.train_ds.seed=$SEED \
data.validation_ds.seed=$SEED
Configuration Files#
The main configuration file (s2s_training.yaml) contains all model, training, and data parameters. See Configuration Files for more details. It’s recommended to copy and modify this file rather than overriding options in the SLURM script to maintain versioning and configuration clarity.
Debugging#
Running Locally with torchrun#
For local debugging and profiling, use torchrun:
# Run with 4 GPUs locally
torchrun --nproc_per_node=4 examples/speechlm2/s2s_duplex_train.py \
--config-path=/path/to/config/dir \
--config-name=s2s_training.yaml
Scaling Strategies#
The speechlm2 collection supports model parallelism for scaling training of large models across multiple GPUs.
Model Parallel Strategies#
The collection supports multiple parallelism strategies:
Fully Sharded Data Parallel (FSDP2): Distributes model parameters across GPUs
Tensor Parallelism (TP): Splits individual tensors across GPUs
Sequence Parallelism (SP): Splits sequence processing across GPUs
2D Parallelism: Combination of FSDP2 with TP/SP (see the device-mesh sketch below)
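These strategies are composed over a PyTorch device mesh: one mesh axis for data-parallel sharding (FSDP2) and one for tensor/sequence parallelism. The following minimal sketch uses PyTorch’s public device-mesh API to show how such a 2D layout is expressed; the sizes and dimension names are illustrative and not the collection’s exact internals. It is meant to run under torchrun with world size equal to data_parallel × tensor_parallel:
from torch.distributed.device_mesh import init_device_mesh

data_parallel = 2    # mirrors trainer.strategy.data_parallel
tensor_parallel = 4  # mirrors trainer.strategy.tensor_parallel

mesh_2d = init_device_mesh(
    "cuda",
    (data_parallel, tensor_parallel),
    mesh_dim_names=("data_parallel", "tensor_parallel"),
)
dp_mesh = mesh_2d["data_parallel"]    # FSDP2 shards parameters along this axis
tp_mesh = mesh_2d["tensor_parallel"]  # TP/SP split tensors and sequences along this axis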
Configuration#
To configure parallelism, modify the trainer.strategy section in your YAML config:
trainer:
  strategy:
    _target_: nemo.core.ModelParallelStrategy
    find_unused_parameters: False
    data_parallel: 1    # World size for data parallelism (FSDP2)
    tensor_parallel: 8  # World size for tensor parallelism
  devices: 8
  num_nodes: 1
  accelerator: gpu
  precision: bf16-true
The model’s configure_model method automatically sets up the appropriate parallelization based on this configuration.
FSDP2 Configuration#
For Fully Sharded Data Parallel training:
Set data_parallel to the number of GPUs you want to use for data parallelism
Set tensor_parallel to 1 (disabled)
FSDP2 shards the model parameters across GPUs, all-gathers them for the forward and backward passes, and frees them again after computation. This allows training larger models within limited GPU memory. See the PyTorch FSDP2 documentation (https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html) for more details.
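As a concrete illustration of this mechanism (not the collection’s actual code), the sketch below applies PyTorch’s fully_shard to each block of a toy transformer and then to the root module, which is the usual FSDP2 pattern. It assumes a recent PyTorch where fully_shard is exposed under torch.distributed.fsdp, and is meant to run under torchrun with 8 processes:
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

dp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("data_parallel",))

# Stand-in model, not a speechlm2 model.
model = nn.Sequential(
    *[nn.TransformerEncoderLayer(d_model=1024, nhead=16, device="cuda") for _ in range(12)]
)

# Shard each block individually so parameters are all-gathered one block at a
# time during forward/backward, then freed again after the computation.
for block in model:
    fully_shard(block, mesh=dp_mesh)
# Finally shard the root module to cover any remaining parameters.
fully_shard(model, mesh=dp_mesh)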
Tensor Parallelism Configuration#
For Tensor Parallelism:
Set tensor_parallel to the number of GPUs you want to use for tensor parallelism
Set data_parallel to 1 (or higher for 2D parallelism)
The parallelize_module function applies a parallelization plan to specific model components, such as splitting attention heads or embedding dimensions across GPUs. See the PyTorch tensor parallelism documentation (https://pytorch.org/docs/stable/distributed.tensor.parallel.html) for more details.
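To make the idea concrete, here is a minimal, self-contained sketch of a tensor-parallel plan applied with PyTorch’s parallelize_module to a toy MLP. The module names up_proj and down_proj are illustrative, not the collection’s; the script is meant to run under torchrun with 8 GPUs:
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

tp_mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("tensor_parallel",))

class MLP(nn.Module):
    def __init__(self, hidden: int = 1024, ffn: int = 4096):
        super().__init__()
        self.up_proj = nn.Linear(hidden, ffn)
        self.down_proj = nn.Linear(ffn, hidden)

    def forward(self, x):
        return self.down_proj(self.up_proj(x).relu())

mlp = parallelize_module(
    MLP().cuda(),
    device_mesh=tp_mesh,
    parallelize_plan={
        "up_proj": ColwiseParallel(),    # shard output features across GPUs
        "down_proj": RowwiseParallel(),  # shard input features; all-reduce the output
    },
)
The first linear layer is split column-wise and the second row-wise, so the intermediate activation stays sharded across the tensor parallel group and only the final output is all-reduced.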
Implementation Details#
The core implementation of model parallelism is in the configure_model method of the model classes. Key aspects include (a combined sketch follows this list):
Module Sharding: Calling fully_shard on modules to distribute parameters across data parallel ranks
Parallelization Plans: Creating and applying plans that specify how different layers should be parallelized
Model-Specific Adaptations: Handling architectural differences between different LLMs
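Putting these pieces together, a configure_model-style hook typically builds the device mesh from the configured data_parallel and tensor_parallel sizes, applies a tensor-parallel plan to each transformer block, and then shards along the data-parallel axis. The following is a simplified, hypothetical sketch of that pattern; names such as model.layers and mlp.up_proj are placeholders, not the collection’s actual attributes:
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def configure_model_sketch(model: nn.Module, data_parallel: int, tensor_parallel: int) -> nn.Module:
    # 2D mesh: FSDP2 shards along "data_parallel", TP splits along "tensor_parallel".
    mesh = init_device_mesh(
        "cuda",
        (data_parallel, tensor_parallel),
        mesh_dim_names=("data_parallel", "tensor_parallel"),
    )
    dp_mesh, tp_mesh = mesh["data_parallel"], mesh["tensor_parallel"]

    for block in model.layers:  # hypothetical: iterate over the model's transformer blocks
        if tensor_parallel > 1:
            parallelize_module(
                block,
                device_mesh=tp_mesh,
                parallelize_plan={
                    # Hypothetical submodule names; a real plan targets the
                    # backbone's own attention and MLP projections.
                    "mlp.up_proj": ColwiseParallel(),
                    "mlp.down_proj": RowwiseParallel(),
                },
            )
        fully_shard(block, mesh=dp_mesh)  # shard each block's parameters
    fully_shard(model, mesh=dp_mesh)      # catch any remaining parameters at the root
    return model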
Advanced Usage#
Script Customization#
When customizing the training scripts, keep these points in mind:
Path Overrides: Override paths in the YAML configuration files with your own, as needed
W&B Keys: Update Weights & Biases API keys in configuration files
Batch Size Tuning: Adjust batch size based on your GPU memory and model size