Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Griffin (Recurrent Gemma)

Released in April 2024, Google's Recurrent Gemma is an open model based on the Griffin architecture, introduced in the paper "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models". Griffin is a hybrid architecture that combines gated linear recurrences with local sliding-window attention, and it is designed for a variety of text generation tasks, including question answering, summarization, and reasoning. Its architecture offers several advantages over its predecessor, Gemma, including reduced memory usage, which allows longer samples to be generated on devices with limited memory, such as single GPUs or CPUs. In addition, the Griffin architecture achieves higher throughput, enabling inference at significantly larger batch sizes and generating more tokens per second, especially for long sequences. Recurrent Gemma is currently offered as a 2.7B checkpoint, underscoring the efficiency and scalability of the Griffin architecture.
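To make the hybrid design concrete, the sketch below illustrates the two building blocks the paragraph mentions: a gated linear recurrence and a causal sliding-window attention mask. This is a heavily simplified, hypothetical NumPy illustration (the function names, the scalar-gate recurrence `h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * x_t`, and the window convention are assumptions for exposition), not NeMo's or Griffin's actual implementation:

```python
import numpy as np

def gated_linear_recurrence(x, gate_logits):
    """Simplified gated linear recurrence (loosely inspired by Griffin's
    recurrent layer): h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * x_t,
    where the per-step decay a_t in (0, 1) comes from a sigmoid gate.
    x, gate_logits: arrays of shape (seq_len, dim)."""
    a = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid gate
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Linear (non-saturating) state update: old state decays by a_t,
        # new input is scaled so the state's magnitude stays bounded.
        h = a[t] * h + np.sqrt(1.0 - a[t] ** 2) * x[t]
        out[t] = h
    return out

def local_attention_mask(seq_len, window):
    """Boolean mask for causal sliding-window attention:
    position i may attend to positions (i - window, i], i.e. itself
    and at most window - 1 preceding tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

The recurrence carries information across the whole sequence in constant memory per step, while the mask restricts attention to a fixed local window; this combination is what lets Griffin-style models keep memory use flat as sequence length grows.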

Feature                                     Status
Data parallelism
Tensor parallelism
Pipeline parallelism
Interleaved Pipeline Parallelism Sched      N/A
Sequence parallelism
Selective activation checkpointing
Gradient checkpointing
Partial gradient checkpointing
FP32/TF32
AMP/FP16
BF16
TransformerEngine/FP8
Multi-GPU
Slurm
Base Command Manager
Base Command Platform
Distributed data preprocessing
NVfuser
P-Tuning and Prompt Tuning
Adapter learning
Distributed Optimizer
Distributed Checkpoint