Gemma 2#

Google’s Gemma 2 is a family of lightweight, open models built on the same research and technology used to create Gemini models. The Gemma 2 architecture builds on the transformer decoder framework with enhancements including pre-normalization with RMSNorm, GeGLU activations, Rotary Positional Embeddings (RoPE), attention logit softcapping, and sliding window attention.

Gemma 2 models are designed for a wide range of text generation tasks and are available in multiple sizes to suit different computational budgets.

Gemma family models are supported via the Bridge system with auto-detected configuration and weight mapping.

Available Models#

Text-Only Models#

  • Gemma 2 2B (google/gemma-2-2b): Compact 2B parameter model optimized for efficiency

    • 26 layers, 2304 hidden size

    • 8 attention heads, 4 query groups (GQA)

    • Sequence length: 8,192 tokens

    • Ideal for single-GPU deployment

  • Gemma 2 9B (google/gemma-2-9b): Medium-scale 9B parameter model balancing performance and efficiency

    • 42 layers, 3584 hidden size

    • 16 attention heads, 8 query groups (GQA)

    • Sequence length: 8,192 tokens

    • Recommended: 4-8 GPUs with tensor parallelism

  • Gemma 2 27B (google/gemma-2-27b): Large-scale 27B parameter model for maximum performance

    • 46 layers, 4608 hidden size

    • 32 attention heads, 16 query groups (GQA)

    • Sequence length: 8,192 tokens

    • Recommended: 8-16 GPUs with tensor and pipeline parallelism

All models support a sequence length of 8,192 tokens and use sliding window attention for efficient context processing.

Model Architecture Features#

Gemma 2 introduces several architectural innovations:

  • Sliding Window Attention: Local attention with a 4,096-token window, alternated with global attention layers, for efficient long-context processing

  • Attention Logit Softcapping: Caps attention logits at 50.0 to prevent extreme values

  • Final Logit Softcapping: Caps final output logits at 30.0 for stable generation

  • GeGLU Activation: Uses gated linear units with GELU activation for improved performance

  • RMSNorm: Layer normalization without mean centering for faster computation

  • Rotary Embeddings: RoPE with base frequency 10,000

  • Query Pre-Attention Scaling: Scales queries by 1/√query_pre_attn_scalar instead of 1/√head_dim before attention (e.g., 224 for the 9B model); see the sketch after this list
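Softcapping and query scaling are both simple elementwise operations. The following is a minimal, illustrative PyTorch sketch of the two techniques, not the Bridge implementation itself:

import torch

def softcap(x: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly bound values to (-cap, cap) via cap * tanh(x / cap).
    return cap * torch.tanh(x / cap)

# Attention logits are capped at 50.0; final output logits at 30.0.
attn_logits = softcap(torch.randn(2, 16, 128, 128), cap=50.0)

# Query pre-attention scaling: queries are multiplied by
# query_pre_attn_scalar ** -0.5 instead of head_dim ** -0.5
# (query_pre_attn_scalar is 224 for the 9B model).
q = torch.randn(2, 16, 128, 256)  # [batch, heads, seq, head_dim]
q = q * (224 ** -0.5)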

Conversion with 🤗 Hugging Face#

Load HF → Megatron#

from megatron.bridge import AutoBridge

# Example: Gemma 2 9B
bridge = AutoBridge.from_hf_pretrained("google/gemma-2-9b")
provider = bridge.to_megatron_provider()

# Configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 8
provider.pipeline_model_parallel_size = 1

provider.finalize()
model = provider.provide_distributed_model(wrap_with_ddp=False)

Import HF → Megatron#

To import the HF model to your desired Megatron path:

# Gemma 2 2B
python examples/conversion/convert_checkpoints.py import \
    --hf-model google/gemma-2-2b \
    --megatron-path /models/gemma-2-2b

# Gemma 2 9B
python examples/conversion/convert_checkpoints.py import \
    --hf-model google/gemma-2-9b \
    --megatron-path /models/gemma-2-9b

# Gemma 2 27B
python examples/conversion/convert_checkpoints.py import \
    --hf-model google/gemma-2-27b \
    --megatron-path /models/gemma-2-27b

Export Megatron → HF#

# Gemma 2 9B example
python examples/conversion/convert_checkpoints.py export \
    --hf-model google/gemma-2-9b \
    --megatron-path /results/gemma2_9b/checkpoints/iter_00001000 \
    --hf-path ./gemma2-9b-hf-export
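The export can also be done programmatically from an in-memory model. A minimal sketch, assuming the save_hf_pretrained method on AutoBridge shown in the Bridge examples (verify against your installed version):

from megatron.bridge import AutoBridge

bridge = AutoBridge.from_hf_pretrained("google/gemma-2-9b")
provider = bridge.to_megatron_provider()
model = provider.provide_distributed_model(wrap_with_ddp=False)

# Write the Megatron weights back out in Hugging Face format.
# NOTE: method name per the Bridge examples; confirm it exists in your version.
bridge.save_hf_pretrained(model, "./gemma2-9b-hf-export")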

Run Inference on Converted Checkpoint#

python examples/conversion/hf_to_megatron_generate_text.py \
    --hf_model_path google/gemma-2-9b \
    --megatron_model_path /models/gemma-2-9b \
    --prompt "What is artificial intelligence?" \
    --max_new_tokens 100

Note:

  • --megatron_model_path is optional. If it is not specified, the script converts the model on the fly and then runs the forward pass.

Pretrain and Finetune Recipes#

  • See: bridge.recipes.gemma

  • Available recipes:

    • Pretraining:

      • gemma2_2b_pretrain_config: Pre-training configuration for Gemma 2 2B

      • gemma2_9b_pretrain_config: Pre-training configuration for Gemma 2 9B

      • gemma2_27b_pretrain_config: Pre-training configuration for Gemma 2 27B

    • Finetuning:

      • gemma2_2b_finetune_config: Finetuning configuration for Gemma 2 2B with PEFT support (LoRA, DoRA)

      • gemma2_9b_finetune_config: Finetuning configuration for Gemma 2 9B with PEFT support (LoRA, DoRA)

      • gemma2_27b_finetune_config: Finetuning configuration for Gemma 2 27B with PEFT support (LoRA, DoRA)

Before training, ensure the following environment variables are set:

  1. SAVE_DIR: checkpoint and log saving directory

  2. HF_TOKEN: to download models from HF Hub (if required)

  3. HF_HOME: (optional) to avoid re-downloading models and datasets

  4. WANDB_API_KEY: (optional) to enable WandB logging
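A short preflight check can catch missing variables before a long job launches. This is an illustrative snippet, not part of the recipes:

import os

required = ["SAVE_DIR", "HF_TOKEN"]
optional = ["HF_HOME", "WANDB_API_KEY"]

# Fail fast if a required variable is unset.
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Set these environment variables before training: {missing}")

for name in optional:
    if not os.environ.get(name):
        print(f"Note: optional variable {name} is unset.")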

Pretraining#

Gemma 2 2B#

from megatron.bridge.recipes.gemma import gemma2_2b_pretrain_config

# Create a pretraining configuration
config = gemma2_2b_pretrain_config(
    name="my_gemma2_2b_pretrain",
    data_paths=["path/to/data"],
    train_iters=100000,
    global_batch_size=32,
)
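To launch training, pass the config to the training entry point together with a forward-step function. A minimal sketch, assuming the pretrain entry point and GPT forward step live at the paths used in the Bridge examples (verify against your installed version); the same pattern applies to the 9B and 27B configs below:

# ASSUMPTION: module paths per the Bridge examples; confirm in your install.
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

pretrain(config, forward_step)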

Gemma 2 9B#

from megatron.bridge.recipes.gemma import gemma2_9b_pretrain_config

config = gemma2_9b_pretrain_config(
    name="my_gemma2_9b_pretrain",
    data_paths=["path/to/data"],
    train_iters=100000,
    global_batch_size=32,
)

Gemma 2 27B#

from megatron.bridge.recipes.gemma import gemma2_27b_pretrain_config

config = gemma2_27b_pretrain_config(
    name="my_gemma2_27b_pretrain",
    data_paths=["path/to/data"],
    train_iters=100000,
    global_batch_size=32,
)

Full Finetuning#

Gemma 2 2B#

torchrun --nproc-per-node=8 run/run_recipe.py \
    --pretrained-checkpoint /models/gemma-2-2b \
    --recipe gemma2_2b_finetune_config \
    --peft none \
    train.global_batch_size=64 \
    train.train_iters=1000 \
    checkpoint.save=$SAVE_DIR/gemma2_2b_finetune

Or programmatically:

from megatron.bridge.recipes.gemma import gemma2_2b_finetune_config

config = gemma2_2b_finetune_config(
    name="gemma2_2b_full_finetune",
    pretrained_checkpoint="/models/gemma-2-2b",
    peft="none",
    train_iters=1000,
    global_batch_size=64,
)
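If your installation exposes a finetune entry point alongside pretrain (an assumption; check megatron.bridge.training in your installed version), the programmatic launch mirrors the pretraining flow:

# ASSUMPTION: a finetune entry point analogous to pretrain; verify the
# module path and function name against your installed version.
from megatron.bridge.training.finetune import finetune
from megatron.bridge.training.gpt_step import forward_step

finetune(config, forward_step)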

Gemma 2 9B#

torchrun --nproc-per-node=8 run/run_recipe.py \
    --pretrained-checkpoint /models/gemma-2-9b \
    --recipe gemma2_9b_finetune_config \
    --peft none \
    train.global_batch_size=64 \
    train.train_iters=1000 \
    checkpoint.save=$SAVE_DIR/gemma2_9b_finetune

Gemma 2 27B#

torchrun --nproc-per-node=16 run/run_recipe.py \
    --pretrained-checkpoint /models/gemma-2-27b \
    --recipe gemma2_27b_finetune_config \
    --peft none \
    train.global_batch_size=64 \
    train.train_iters=1000 \
    checkpoint.save=$SAVE_DIR/gemma2_27b_finetune

Parameter-Efficient Finetuning (PEFT) with LoRA#

Gemma 2 2B#

torchrun --nproc-per-node=8 run/run_recipe.py \
    --pretrained-checkpoint /models/gemma-2-2b \
    --recipe gemma2_2b_finetune_config \
    --peft lora \
    train.global_batch_size=128 \
    checkpoint.save=$SAVE_DIR/gemma2_2b_lora

PEFT options:

  • --peft: Set to lora for LoRA or dora for DoRA. Set to none for full finetuning.

Or programmatically:

from megatron.bridge.recipes.gemma import gemma2_2b_finetune_config

# LoRA finetuning
config = gemma2_2b_finetune_config(
    name="gemma2_2b_lora_finetune",
    pretrained_checkpoint="/models/gemma-2-2b",
    peft="lora",  # or "dora"
    train_iters=1000,
    global_batch_size=128,
)

Gemma 2 9B LoRA#

from megatron.bridge.recipes.gemma import gemma2_9b_finetune_config

config = gemma2_9b_finetune_config(
    name="gemma2_9b_lora_finetune",
    pretrained_checkpoint="/models/gemma-2-9b",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
)

Gemma 2 27B LoRA#

from megatron.bridge.recipes.gemma import gemma2_27b_finetune_config

config = gemma2_27b_finetune_config(
    name="gemma2_27b_lora_finetune",
    pretrained_checkpoint="/models/gemma-2-27b",
    peft="lora",
    train_iters=1000,
    global_batch_size=128,
)

Hugging Face Model Cards#

  • Gemma 2 2B: https://huggingface.co/google/gemma-2-2b

  • Gemma 2 9B: https://huggingface.co/google/gemma-2-9b

  • Gemma 2 27B: https://huggingface.co/google/gemma-2-27b

  • Gemma 2 Collection: https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315