Gemma 3#

Google’s Gemma 3 is a family of lightweight, state-of-the-art open models built on the same research and technology used to create Gemini models. The Gemma 3 architecture builds on the transformer decoder framework with enhancements including pre-normalization with RMSNorm, GeGLU activations, Rotary Positional Embeddings (RoPE), and hybrid attention patterns (sliding window and global attention).

Gemma 3 models are designed for a wide range of text generation tasks and are available in multiple sizes to suit different computational budgets.

Gemma family models are supported via the Bridge system with auto-detected configuration and weight mapping.

Available Models#

Text-Only Models#

Gemma 3 1B (google/gemma-3-1b-it): Compact 1B parameter model optimized for efficiency
- 26 layers, 1152 hidden size
- 8 attention heads, 2 query groups (GQA)
- Sequence length: 131,072 tokens
- Ideal for single-GPU deployment

All models support a sequence length of 131,072 tokens and use hybrid attention patterns (sliding window + global).

Model Architecture Features#

Gemma 3 introduces several architectural innovations:

Hybrid Attention Pattern: Alternates between global and local sliding window attention for efficient long-context processing
GeGLU Activation: Uses gated linear units with GELU activation for improved performance
RMSNorm: Layer normalization without mean centering for faster computation
Rotary Embeddings: Separate RoPE configurations for local and global attention layers
- Local attention: Uses sliding window with rotary base 10,000
- Global attention: Extended rotary base for better long-range dependencies

Conversion with 🤗 Hugging Face#

Load HF → Megatron#

from megatron.bridge import AutoBridge

# Example: Gemma 3 1B
bridge = AutoBridge.from_hf_pretrained("google/gemma-3-1b-it")
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

Import HF → Megatron#

To import the HF model to your desired Megatron path:

python examples/conversion/convert_checkpoints.py import \
--hf-model google/gemma-3-1b-it \
--megatron-path /models/gemma-3-1b-it

Export Megatron → HF#

python examples/conversion/convert_checkpoints.py export \
--hf-model google/gemma-3-1b-it \
--megatron-path /results/gemma3_1b/checkpoints/iter_00001000 \
--hf-path ./gemma3-hf-export

Run Inference on Converted Checkpoint#

python examples/conversion/hf_to_megatron_generate_text.py \
--hf_model_path google/gemma-3-1b-it \
--megatron_model_path /models/gemma-3-1b-it \
--prompt "What is artificial intelligence?" \
--max_new_tokens 100

Note:

--megatron_model_path is optional. If not specified, the script will convert the model and then run forward.

Recipes#

See: bridge.recipes.gemma

Available Recipes#

Pretrain recipes:
- gemma3_1b_pretrain_config: Pre-training for Gemma 3 1B
SFT recipes:
- gemma3_1b_sft_config: Full SFT for Gemma 3 1B
PEFT recipes (LoRA, DoRA):
- gemma3_1b_peft_config: PEFT for Gemma 3 1B

Before training, ensure these environment variables are set:

SAVE_DIR: checkpoint and log saving directory
HF_TOKEN: to download models from HF Hub (if required)
HF_HOME: (optional) to avoid re-downloading models and datasets
WANDB_API_KEY: (optional) to enable WandB logging

Parallelism Configurations#

Model	Mode	TP	PP	Total GPUs	Use Case
Gemma 3 1B	Pretrain	1	1	8	Pre-training (single node)
Gemma 3 1B	Full SFT	1	1	8	Full supervised finetuning
Gemma 3 1B	LoRA/DoRA	1	1	8	PEFT finetuning (single node)

Pre-training Example#

from megatron.bridge.recipes.gemma import gemma3_1b_pretrain_config

config = gemma3_1b_pretrain_config(
    name="gemma3_1b_pretrain",
    data_paths=["path/to/data"],
    train_iters=100000,
    global_batch_size=256,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

Finetuning Examples#

Full Finetuning#

from megatron.bridge.recipes.gemma import gemma3_1b_sft_config

config = gemma3_1b_sft_config(
    name="gemma3_1b_full_finetune",
    pretrained_checkpoint="/models/gemma-3-1b-it",
    train_iters=1000,
    global_batch_size=64,
    finetune_lr=5e-6,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

LoRA Finetuning#

from megatron.bridge.recipes.gemma import gemma3_1b_peft_config

config = gemma3_1b_peft_config(
    name="gemma3_1b_lora_finetune",
    pretrained_checkpoint="/models/gemma-3-1b-it",
    peft_scheme="lora",  # or "dora"
    train_iters=1000,
    global_batch_size=128,
    finetune_lr=1e-4,
    # Uses TP=1, PP=1 (8 GPUs) automatically
)

Command-Line Training#

Full Finetuning:

torchrun --nproc-per-node=8 run/run_recipe.py \
  --pretrained-checkpoint /models/gemma-3-1b-it \
  --recipe gemma3_1b_sft_config \
  train.global_batch_size=64 \
  train.train_iters=1000 \
  checkpoint.save=$SAVE_DIR/gemma3_1b_finetune

LoRA Finetuning:

torchrun --nproc-per-node=8 run/run_recipe.py \
  --pretrained-checkpoint /models/gemma-3-1b-it \
  --recipe gemma3_1b_peft_config \
  --peft_scheme lora \
  train.global_batch_size=128 \
  checkpoint.save=$SAVE_DIR/gemma3_1b_lora

Examples#

Checkpoint import/export: examples/conversion/convert_checkpoints.py
Generate text (HF→Megatron): examples/conversion/hf_to_megatron_generate_text.py

Hugging Face Model Cards#

Gemma 3 1B: https://huggingface.co/google/gemma-3-1b-it