Model Training on Databricks#

Databricks is a widely used platform for managing data, models, applications, and compute on the cloud. This guide shows how to use Automodel for scalable, performant model training on Databricks.

The example here fine-tunes a Llama-3.2-1B model on the SQuAD dataset from Hugging Face, but any Automodel functionality (for example, model pre-training, VLMs, or other supported models) can also be run on Databricks.

Compute#

Let’s start by provisioning a Databricks classic compute cluster with the following setup:

  • Databricks runtime: 17.3 LTS (Machine Learning version)

  • Worker instance type: g6e.12xlarge on AWS (4x L40S GPUs per node)

  • Number of workers: 2

  • Global environment variable: GLOO_SOCKET_IFNAME=eth0 (sets the network interface that Gloo uses for communication between nodes)

  • Cluster-scoped init script:

#!/bin/bash

# Install Automodel + upgrade transformers to newer, compatible version
/databricks/python3/bin/pip install --upgrade \
    transformers \
    git+https://github.com/NVIDIA-NeMo/Automodel

This will provision three compute nodes – one driver node we’ll attach a notebook to, and two worker nodes we’ll use for multi-node training.

Note that we’ve selected a small number of instances for demo purposes, but you can adjust the specific instance type and number of workers for your actual use case.
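Once the cluster is running, a quick check from the attached notebook can confirm that the init script installed what we expect. The sketch below is illustrative only: missing_modules is a hypothetical helper, and it assumes the packages import as transformers and nemo_automodel (the import name that appears in the recipe's _target_ paths later in this guide). It only checks the node where the notebook runs, that is, the driver.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed by the init script.
required = ["transformers", "nemo_automodel"]
missing = missing_modules(required)
if missing:
    print("Not importable on this node:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, the cluster's init script logs are a reasonable first place to look.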

Training#

With the above compute resources provisioned, we’re ready to fine-tune a model using Automodel.

Automodel uses YAML recipe files to configure the training process (for example, the model, dataset, loss function, and optimizer). Here we’ll use a preconfigured recipe for fine-tuning a Llama-3.2-1B model on the SQuAD dataset from Hugging Face. In a notebook attached to the cluster, download the training script and configuration file with these curl commands:

# Download training script
!curl -O https://raw.githubusercontent.com/NVIDIA-NeMo/Automodel/refs/heads/main/examples/llm_finetune/finetune.py
# Download configuration file
!curl -O https://raw.githubusercontent.com/NVIDIA-NeMo/Automodel/refs/heads/main/examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml

Here’s what the model, dataset, and optimizer portions of the config file look like:

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

dataset:
  _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset
  dataset_name: rajpurkar/squad
  split: train

optimizer:
  _target_: torch.optim.Adam
  betas: [0.9, 0.999]
  eps: 1e-8
  lr: 1.0e-5
  weight_decay: 0

...

See the full file for complete details (!cat llama3_2_1b_squad.yaml).
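Each section of the recipe names a Python callable in its _target_ key, and the remaining keys read as keyword arguments to it. The sketch below shows one way such a dotted path could be resolved and called. It is not Automodel's actual loader: resolve_target and instantiate are hypothetical helpers, and a standard-library class stands in for the real targets so the example runs without any extra packages.

```python
import importlib
from typing import Any, Callable


def resolve_target(dotted_path: str) -> Callable[..., Any]:
    """Import the longest importable module prefix, then walk the remaining attributes."""
    parts = dotted_path.split(".")
    for split in range(len(parts) - 1, 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            continue  # Prefix is not a module; try a shorter one.
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"Could not resolve {dotted_path!r}")


def instantiate(node: dict) -> Any:
    """Call the callable named by `_target_`, passing the other keys as keyword arguments."""
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    return resolve_target(node["_target_"])(**kwargs)


# Stand-in for a recipe section, using a standard-library target.
example = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(instantiate(example))  # prints 3/4
```

Read this way, the optimizer section above corresponds to calling torch.optim.Adam with the listed betas, eps, lr, and weight_decay, plus whatever arguments the training script supplies itself (such as the model parameters).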

Single-node#

To run fine-tuning, we’ll use the finetune.py script from the Automodel repository and our config file.

To run training on a single GPU, use this command:

!python finetune.py \
    --config llama3_2_1b_squad.yaml \
    --step_scheduler.max_steps 20 \
    --checkpoint.checkpoint_dir /Volumes/<catalog_name>/<schema_name>/<volume_name>/checkpoints_single/

The --step_scheduler.max_steps 20 option limits the number of training steps (again, this is for example purposes; adapt it for your actual use case), and the --checkpoint.checkpoint_dir option tells Automodel where to save model checkpoints during training. We recommend saving model checkpoints to a Databricks Unity Catalog volume.
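The dotted option names mirror the nesting of the YAML file: --step_scheduler.max_steps addresses the max_steps key inside the step_scheduler section. As a rough sketch of that convention (not Automodel's actual argument parser), the hypothetical apply_override helper below writes a value into a nested dictionary by dotted key. The starting value of 1000 is made up for illustration.

```python
from typing import Any


def apply_override(config: dict, dotted_key: str, value: Any) -> dict:
    """Set `value` at the nested location named by `dotted_key`, creating sections as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Hypothetical starting config; the real recipe contains many more keys.
config = {"step_scheduler": {"max_steps": 1000}, "checkpoint": {}}
apply_override(config, "step_scheduler.max_steps", 20)
apply_override(
    config,
    "checkpoint.checkpoint_dir",
    "/Volumes/<catalog_name>/<schema_name>/<volume_name>/checkpoints_single/",
)
print(config["step_scheduler"])  # prints {'max_steps': 20}
```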

Looking at GPU metrics in Databricks, we see our single GPU is being well utilized (~95% utilization).

Figure: Single GPU utilization of ~95% during model training.

To utilize all four GPUs available on this g6e.12xlarge instance, use torchrun --nproc-per-node=4 with our same training script and config file:

!torchrun --nproc-per-node=4 finetune.py \
    --config llama3_2_1b_squad.yaml \
    --step_scheduler.max_steps 20 \
    --checkpoint.checkpoint_dir /Volumes/<catalog_name>/<schema_name>/<volume_name>/checkpoints_multi/ \
    --checkpoint.is_async True

This uses PyTorch’s Elastic Launch functionality to spawn and coordinate multiple training processes on the node. Each training process runs on a separate GPU, and we can now see that all four GPUs are being used (~95% utilization for each GPU). We also enable asynchronous checkpointing so that saving checkpoints can overlap with training rather than pausing it.
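Each process that torchrun launches learns its place in the job from environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. The training script handles this for us, but the minimal sketch below shows how a process could read them. describe_process is a hypothetical helper, and the defaults are an assumption that lets the same code run outside torchrun as a single process.

```python
import os
from typing import Mapping


def describe_process(env: Mapping[str, str] = os.environ) -> str:
    """Summarize this process's position in a torchrun job from its environment."""
    rank = int(env.get("RANK", 0))              # Global index across all processes
    local_rank = int(env.get("LOCAL_RANK", 0))  # Index on this node; usually selects the GPU
    world_size = int(env.get("WORLD_SIZE", 1))  # Total number of processes in the job
    return f"global rank {rank} of {world_size}, local GPU index {local_rank}"


print(describe_process())
```

With --nproc-per-node=4 on a single node, WORLD_SIZE is 4, and RANK and LOCAL_RANK both run from 0 to 3.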

Figure: Multi-GPU, single-node utilization of ~95% during model training.

Multi-node#

To scale further to multi-node training, we need to submit training jobs to the instances in our Databricks cluster. We can use PySpark’s TorchDistributor to run the same training job across multiple instances like this:

from pyspark.ml.torch.distributor import TorchDistributor

num_executor = 2            # Number of workers in cluster
num_gpus_per_executor = 4   # Number of GPUs per worker
distributor = TorchDistributor(
    num_processes=num_executor * num_gpus_per_executor,
    local_mode=False,
    use_gpu=True,
)

train_file = "finetune.py"
args = [
    "--config", "llama3_2_1b_squad.yaml",
    "--step_scheduler.max_steps", "20",
    "--checkpoint.checkpoint_dir", "/Volumes/<catalog_name>/<schema_name>/<volume_name>/checkpoints_dist/",
    "--checkpoint.is_async", "True",
]
distributor.run(train_file, *args)

TorchDistributor uses torchrun internally and also handles constructing and submitting training jobs to the cluster.
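Because a multi-node launch takes longer to fail than a single-process run, it can help to validate the arguments before handing them to the distributor. The sketch below is one possible pre-flight check, not part of Automodel or PySpark: build_finetune_args is a hypothetical helper that refuses to proceed while <catalog_name>-style placeholders remain in the checkpoint path, and the /Volumes/main/default/demo/ path is a made-up example.

```python
import re

# Matches unfilled template segments such as <catalog_name>.
PLACEHOLDER = re.compile(r"<[^<>/]+>")


def build_finetune_args(config_path, checkpoint_dir, max_steps):
    """Return the CLI argument list for finetune.py, rejecting unfilled placeholders."""
    unfilled = PLACEHOLDER.findall(checkpoint_dir)
    if unfilled:
        raise ValueError(f"Replace placeholders before launching: {unfilled}")
    return [
        "--config", config_path,
        "--step_scheduler.max_steps", str(max_steps),
        "--checkpoint.checkpoint_dir", checkpoint_dir,
        "--checkpoint.is_async", "True",
    ]


# Total processes = workers x GPUs per worker, matching the cluster described earlier.
num_workers, gpus_per_worker = 2, 4
print("processes to launch:", num_workers * gpus_per_worker)  # prints 8

# Hypothetical Unity Catalog volume path, for illustration only.
args = build_finetune_args(
    "llama3_2_1b_squad.yaml",
    "/Volumes/main/default/demo/checkpoints_dist/",
    20,
)
```

The resulting list can be passed to the distributor in the same way as the hand-written args list above.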

We now see GPU utilization is ~95% for all GPUs on all worker nodes during training (8 GPUs in this particular case).

Figure: Multi-GPU, multi-node utilization of ~95% during model training.

Conclusion#

This guide showed how to use Automodel for model training on Databricks-managed compute. It’s relatively straightforward to scale from single-GPU to multi-GPU to multi-node training to best suit your needs.

While the example here fine-tunes a Llama-3.2-1B model using the SQuAD dataset, any supported Automodel functionality (like model pre-training, VLMs, etc.) can also run, and scale, on Databricks. Check out additional recipes and end-to-end examples to learn more.