Gemma 3#

Google’s Gemma 3 is a family of lightweight, state-of-the-art open models built on the same research and technology used to create Gemini models. The Gemma 3 architecture builds on the transformer decoder framework with enhancements including pre-normalization with RMSNorm, GeGLU activations, Rotary Positional Embeddings (RoPE), and hybrid attention patterns (sliding window and global attention).

Gemma 3 models are designed for a wide range of text generation tasks and are available in multiple sizes to suit different computational budgets.

Gemma family models are supported via the Bridge system with auto-detected configuration and weight mapping.

Available Models#

Text-Only Models#

  • Gemma 3 1B (google/gemma-3-1b-it): Compact 1B-parameter model optimized for efficiency

    • 26 layers, 1152 hidden size

    • 8 attention heads, 2 query groups (GQA)

    • Sequence length: 131,072 tokens

    • Ideal for single-GPU deployment

All models support a sequence length of 131,072 tokens and use hybrid attention patterns (sliding window + global).
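
To verify these numbers against the upstream checkpoint, you can inspect the Hugging Face config directly. A minimal sketch, assuming the standard transformers config attribute names and that your HF token has access to the gated Gemma repository:

from transformers import AutoConfig

# Attribute names follow common Hugging Face conventions and may differ
# slightly between transformers releases.
cfg = AutoConfig.from_pretrained("google/gemma-3-1b-it")
print(cfg.num_hidden_layers)        # transformer layers
print(cfg.hidden_size)              # hidden size
print(cfg.num_attention_heads)      # attention heads
print(cfg.num_key_value_heads)      # query groups (GQA)
print(cfg.max_position_embeddings)  # maximum sequence length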

Model Architecture Features#

Gemma 3 introduces several architectural innovations (see the sketch after this list):

  • Hybrid Attention Pattern: Alternates between global and local sliding window attention for efficient long-context processing

  • GeGLU Activation: Uses gated linear units with GELU activation for improved performance

  • RMSNorm: Layer normalization without mean centering for faster computation

  • Rotary Embeddings: Separate RoPE configurations for local and global attention layers

    • Local attention: Uses sliding window with rotary base 10,000

    • Global attention: Extended rotary base for better long-range dependencies
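
As a minimal illustration of the RMSNorm and GeGLU blocks described above, here is a self-contained PyTorch sketch. It is not the Megatron or Hugging Face implementation, and the intermediate size of 6912 is only an assumption used for the shape example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scales activations by their root mean square; no mean centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class GeGLUMLP(nn.Module):
    """Gated MLP: GELU(gate_proj(x)) * up_proj(x), projected back to hidden size."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        gated = F.gelu(self.gate_proj(x), approximate="tanh") * self.up_proj(x)
        return self.down_proj(gated)

x = torch.randn(2, 16, 1152)                  # (batch, seq, hidden) with the 1B hidden size
y = GeGLUMLP(1152, 6912)(RMSNorm(1152)(x))    # assumed intermediate size, for illustration only
print(y.shape)                                # torch.Size([2, 16, 1152])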

Conversion with 🤗 Hugging Face#

Load HF → Megatron#

from megatron.bridge import AutoBridge

# Example: Gemma 3 1B
bridge = AutoBridge.from_hf_pretrained("google/gemma-3-1b-it")
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

Import HF → Megatron#

To import the HF model and save the converted checkpoint to a Megatron path of your choice:

python examples/conversion/convert_checkpoints.py import \
  --hf-model google/gemma-3-1b-it \
  --megatron-path /models/gemma-3-1b-it
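
The resulting /models/gemma-3-1b-it directory can then be reused wherever a local checkpoint is expected, for example as --megatron_model_path in the inference command below or as pretrained_checkpoint in the finetuning recipes.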

Export Megatron → HF#

python examples/conversion/convert_checkpoints.py export \
  --hf-model google/gemma-3-1b-it \
  --megatron-path /results/gemma3_1b/checkpoints/iter_00001000 \
  --hf-path ./gemma3-hf-export

Run Inference on Converted Checkpoint#

python examples/conversion/hf_to_megatron_generate_text.py \
  --hf_model_path google/gemma-3-1b-it \
  --megatron_model_path /models/gemma-3-1b-it \
  --prompt "What is artificial intelligence?" \
  --max_new_tokens 100

Note:

  • --megatron_model_path is optional. If it is not specified, the script converts the Hugging Face model on the fly and then runs the forward pass for generation.

Recipes#

See: bridge.recipes.gemma

Available Recipes#

  • Pretrain recipes:

    • gemma3_1b_pretrain_config: Pre-training for Gemma 3 1B

  • Finetune recipes:

    • gemma3_1b_finetune_config: Finetuning for Gemma 3 1B with PEFT support (LoRA, DoRA)

Before training, ensure the following environment variables are set (an example of setting them programmatically follows the list):

  • SAVE_DIR: checkpoint and log saving directory

  • HF_TOKEN: to download models from HF Hub (if required)

  • HF_HOME: (optional) to avoid re-downloading models and datasets

  • WANDB_API_KEY: (optional) to enable WandB logging
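
If you drive recipes from a Python entry point, the same variables can also be set programmatically before any Hugging Face download or recipe call. A minimal sketch with illustrative placeholder values (adjust paths and keys for your environment, or simply export them in your shell or job script):

import os

# Placeholder values for illustration only.
os.environ.setdefault("SAVE_DIR", "/results/gemma3_1b")   # checkpoint and log directory
os.environ.setdefault("HF_HOME", "/cache/huggingface")    # optional: shared HF cache
# os.environ["HF_TOKEN"] = "<your-hf-token>"              # only if the checkpoint is gated
# os.environ["WANDB_API_KEY"] = "<your-wandb-key>"        # optional: enable WandB logging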

Parallelism Configurations#

| Model      | Mode      | TP | PP | Total GPUs | Use Case                      |
|------------|-----------|----|----|------------|-------------------------------|
| Gemma 3 1B | Pretrain  | 1  | 1  | 8          | Pre-training (single node)    |
| Gemma 3 1B | Full SFT  | 1  | 1  | 8          | Full supervised finetuning    |
| Gemma 3 1B | LoRA/DoRA | 1  | 1  | 8          | PEFT finetuning (single node) |
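
With TP=1 and PP=1, all eight GPUs in each configuration are used for data parallelism. A quick sanity check of the arithmetic (context and expert parallelism omitted, since they are not used here):

# World size = TP x PP x DP for these configurations.
total_gpus = 8               # single node, per the table above
tp, pp = 1, 1                # Gemma 3 1B settings
dp = total_gpus // (tp * pp)
print(dp)                    # 8 -> each GPU holds a full replica and trains on its own data shard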

Pre-training Example#

from megatron.bridge.recipes.gemma import gemma3_1b_pretrain_config

config = gemma3_1b_pretrain_config(
    name="gemma3_1b_pretrain",
    data_paths=["path/to/data"],
    train_iters=100000,
    global_batch_size=256,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

Finetuning Examples#

Full Finetuning#

from megatron.bridge.recipes.gemma import gemma3_1b_finetune_config

config = gemma3_1b_finetune_config(
    name="gemma3_1b_full_finetune",
    pretrained_checkpoint="/models/gemma-3-1b-it",
    peft=None,
    train_iters=1000,
    global_batch_size=64,
    finetune_lr=5e-6,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

LoRA Finetuning#

from megatron.bridge.recipes.gemma import gemma3_1b_finetune_config

config = gemma3_1b_finetune_config(
    name="gemma3_1b_lora_finetune",
    pretrained_checkpoint="/models/gemma-3-1b-it",
    peft="lora",  # or "dora"
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

Command-Line Training#

Full Finetuning:

torchrun --nproc-per-node=8 run/run_recipe.py \
  --pretrained-checkpoint /models/gemma-3-1b-it \
  --recipe gemma3_1b_finetune_config \
  train.global_batch_size=64 \
  train.train_iters=1000 \
  checkpoint.save=$SAVE_DIR/gemma3_1b_finetune

LoRA Finetuning:

torchrun --nproc-per-node=8 run/run_recipe.py \
  --pretrained-checkpoint /models/gemma-3-1b-it \
  --recipe gemma3_1b_finetune_config \
  --peft_scheme lora \
  train.global_batch_size=128 \
  checkpoint.save=$SAVE_DIR/gemma3_1b_lora

Examples#

Hugging Face Model Cards#

  • Gemma 3 1B: https://huggingface.co/google/gemma-3-1b-it