Griffin (Recurrent Gemma)

User Guide (Latest Version)

Released in April 2024, Google's Recurrent Gemma is an open model built on the Griffin architecture, introduced in the paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models". Griffin is a hybrid architecture that combines gated linear recurrences with local sliding-window attention, and it is designed for a variety of text generation tasks, including question answering, summarization, and reasoning. The architecture offers several advantages over its predecessor, Gemma: reduced memory usage allows longer samples to be generated on memory-constrained devices such as single GPUs or CPUs, and higher throughput enables inference at significantly larger batch sizes, generating more tokens per second, especially on long sequences. Recurrent Gemma is currently offered as a 2.7B-parameter checkpoint, underscoring the efficiency and scalability of the Griffin architecture.
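To give an intuition for the recurrent half of the hybrid, the sketch below shows a gated linear recurrence in the spirit of Griffin's real-gated linear recurrent unit. This is a minimal, illustrative NumPy implementation, not the model's actual code: the gate projections `W_a` and `W_i` are random stand-ins for learned weights, and the width and sequence length are arbitrary. The key property it demonstrates is that the state update is linear in the hidden state, so memory use during generation is constant in sequence length, unlike the growing key-value cache of full attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # hidden width (illustrative)
T = 16  # sequence length (illustrative)

# Hypothetical gate projections; the real model learns these.
W_a = rng.normal(scale=0.1, size=(d, d))
W_i = rng.normal(scale=0.1, size=(d, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_linear_recurrence(xs):
    """Run a gated linear recurrence over a (T, d) input sequence.

    At each step the recurrence gate a_t decides how much of the previous
    state to keep, and the input gate i_t decides how much of the new
    input to admit. The sqrt(1 - a_t^2) factor keeps the state bounded.
    """
    h = np.zeros(d)
    outs = []
    for x_t in xs:
        a_t = sigmoid(x_t @ W_a)  # recurrence gate, values in (0, 1)
        i_t = sigmoid(x_t @ W_i)  # input gate, values in (0, 1)
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (i_t * x_t)
        outs.append(h)
    return np.stack(outs)

ys = gated_linear_recurrence(rng.normal(size=(T, d)))
```

Because the per-step state is a single fixed-size vector per channel, generating a longer sample does not increase the recurrence's memory footprint, which is the source of the memory advantage described above.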

Feature                                         Status
Data parallelism
Tensor parallelism
Pipeline parallelism
Interleaved Pipeline Parallelism Scheduler      N/A
Sequence parallelism
Selective activation checkpointing
Gradient checkpointing
Partial gradient checkpointing
FP32/TF32
AMP/FP16
BF16
TransformerEngine/FP8
Multi-GPU
Slurm
Base Command Manager
Base Command Platform
Distributed data preprocessing
NVfuser
P-Tuning and Prompt Tuning
Adapter learning
Distributed Optimizer
Distributed Checkpoint
Last updated on Jun 19, 2024.