Fine-Tune DeepSeek V4 Flash#
Introduction#
`deepseek-ai/DeepSeek-V4-Flash` is the latest fine-grained Mixture-of-Experts language model from DeepSeek. It uses a 43-layer all-MoE backbone (no dense MLP layers) with 256 routed experts plus one shared expert per block and top-6 routing. The architecture introduces a hybrid per-layer attention zoo — Sliding-Window Attention (SWA), Compressed Sparse Attention (CSA, Compressor + Indexer), and Hierarchical Compressed Attention (HCA, Compressor only) — selectable per layer through `compress_ratios`. The first `num_hash_layers` blocks use a hash-clustering gate (`DeepseekV4HashGate`) for token-to-expert routing, and every block maintains `hc_mult=4` Hyper-Connection streams mixed via a learned col-norm-first Sinkhorn router.
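As a rough illustration of the per-layer selection, the sketch below maps the three `compress_ratios` values documented in this guide to their attention variants. The lookup table and helper are illustrative assumptions, not the model's actual config API:

```python
# Hypothetical helper: map a layer's compress_ratio to its attention variant.
# Only the three values documented for DSV4-Flash are covered.
ATTENTION_KIND = {0: "SWA", 4: "CSA", 128: "HCA"}

def layer_attention_kinds(compress_ratios: list[int]) -> list[str]:
    return [ATTENTION_KIND[r] for r in compress_ratios]

# e.g. the 4-layer parity harness used later in this guide
print(layer_attention_kinds([0, 0, 4, 128]))  # ['SWA', 'SWA', 'CSA', 'HCA']
```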
This guide walks you through fine-tuning DeepSeek V4 Flash on HellaSwag using NVIDIA NeMo Automodel. You will learn how to configure the recipe, launch training, and inspect the results.
To set up your environment to run NeMo Automodel, follow the installation guide.
Data#
HellaSwag#
We use HellaSwag, a commonsense natural-language-inference dataset consisting of a context plus four candidate continuations. The version used here is the standard `rowan/hellaswag` HuggingFace split, formatted for next-token-prediction fine-tuning:

- Train / validation splits: taken directly from the HuggingFace dataset.
- Tokenizer: shared with the base model (`AutoTokenizer.from_pretrained` on the DeepSeek V4 Flash checkpoint).
- Padding: `pad_seq_len_divisible=64` via the default collator.

For the full HellaSwag dataset wrapper used in NeMo Automodel, see `nemo_automodel.components.datasets.llm.hellaswag.HellaSwag`.
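For reference, here is a minimal sketch of turning one HellaSwag row into a next-token-prediction example. It uses the public `rowan/hellaswag` fields (`ctx`, `endings`, `label`) and standard `datasets`/`transformers` APIs; the checkpoint path is a placeholder, and this is not the exact NeMo Automodel wrapper:

```python
# Minimal sketch: format one rowan/hellaswag row for next-token prediction.
# Illustrative only -- the actual wrapper lives in
# nemo_automodel.components.datasets.llm.hellaswag.HellaSwag.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("rowan/hellaswag", split="train")
# The tokenizer is shared with the base model; the path is a placeholder.
tok = AutoTokenizer.from_pretrained("/your/local/dsv4-flash")

def format_row(row):
    # Concatenate the context with the gold continuation.
    text = row["ctx"] + " " + row["endings"][int(row["label"])]
    return tok(text)

example = format_row(ds[0])
print(example["input_ids"][:16])
```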
Architecture Notes#
DeepSeek V4 Flash differs from V3 / V3.2 in several load-bearing ways. The state-dict adapter and pipeline-parallel forward in NeMo Automodel handle each of these transparently:
- Attention: GQA with a single KV head broadcast to all 64 attention heads, Q-LoRA (`q_lora_rank=1024`), and grouped O-LoRA (`o_lora_rank=1024`, `o_groups=8`) — not MLA. A per-head, non-learnable rsqrt scaling on Q after `wq_b` matches the inference reference.
- Hybrid attention via `compress_ratios` (see the mask sketch after this list):
  - `compress_ratio=0` → pure SWA with a learned per-head attention sink.
  - `compress_ratio=4` → CSA: Compressor (overlap mode, pools `2 * ratio` raw tokens per compressed token) plus Indexer (selects top-k compressed positions per query). An explicit additive `[B, 1, S, P_total]` mask enforces per-query causal correctness.
  - `compress_ratio=128` → HCA: Compressor only (non-overlap pooling), deterministic `p < (q + 1) // ratio` causal mask.
- Dual RoPE bases: `theta=10000` for `compress_ratio==0` layers; `theta=160000` (with YaRN scaling) for `compress_ratio>0` layers, applied to both the main attention Q/KV and the Compressor sub-module on those layers. RoPE is encoded as INTERLEAVED pairs (`view_as_complex` style) to match the released checkpoint.
- Hash-routing first layers: the first `num_hash_layers` (default 3) blocks use a `DeepseekV4HashGate` with a `tid2eid` lookup table. `input_ids` is threaded through the model and the V4-aware pipeline forward; under pipeline parallelism, hash layers live on stage 0, where `input_ids` is available.
- Hyper-Connections (HC): every block maintains `hc_mult=4` streams of the hidden state. The mixer follows the released `hc_split_sinkhorn` formulas: `pre = sigmoid + eps`, `post = 2 * sigmoid` (no `+ eps`), and `comb = softmax(dim=-1) + eps` followed by a col-norm-first Sinkhorn (`iters - 1` alternating row/col passes), producing a doubly stochastic mixing matrix per block (see the Sinkhorn sketch after this list).
- MoE routing: `sqrt(softplus)` scoring with the `noaux_tc` top-k method and clamped SwiGLU on routed experts (`swiglu_limit=10.0`).
- Optional MTP layers via `num_nextn_predict_layers`.
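To make the attention-zoo masks concrete, here is a hedged sketch of the deterministic HCA causal rule `p < (q + 1) // ratio`. The helper name, shapes, and return convention are illustrative assumptions, not the NeMo Automodel implementation:

```python
import torch

def hca_causal_mask(seq_len: int, ratio: int) -> torch.Tensor:
    """Boolean [S, P] mask: query q may attend compressed position p
    iff p < (q + 1) // ratio (non-overlap pooling: each compressed
    token summarizes `ratio` raw tokens)."""
    num_compressed = seq_len // ratio
    q = torch.arange(seq_len).unsqueeze(-1)        # [S, 1]
    p = torch.arange(num_compressed).unsqueeze(0)  # [1, P]
    return p < (q + 1) // ratio

# With ratio=4 and 12 tokens, query 7 has seen raw tokens 0..7,
# i.e. exactly the first two fully-complete compressed positions:
print(hca_causal_mask(12, ratio=4)[7])  # tensor([ True,  True, False])
```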
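Likewise, a minimal sketch of the col-norm-first Sinkhorn behind the HC mixer, assuming the `comb = softmax(dim=-1) + eps` parametrization quoted above; the function name and `iters` default are assumptions:

```python
import torch

def sinkhorn_col_first(logits: torch.Tensor, iters: int = 4, eps: float = 1e-6) -> torch.Tensor:
    # comb = softmax(dim=-1) + eps, so rows start out ~normalized
    m = torch.softmax(logits, dim=-1) + eps
    # iters - 1 alternating normalization passes, columns first
    for i in range(iters - 1):
        dim = -2 if i % 2 == 0 else -1  # even passes normalize columns
        m = m / m.sum(dim=dim, keepdim=True)
    return m  # approximately doubly stochastic

# hc_mult=4 streams -> a 4x4 mixing matrix per block
mix = sinkhorn_col_first(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))  # columns and rows both sum to ~1
```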
Checkpoint format#
The released DSV4-Flash safetensors mix several quantization formats. The state-dict adapter handles all of them transparently:
- Routed experts: FP4 `e2m1fn`, packed two values per int8 byte, with per-row 32-column FP8 `e8m0fnu` scales — unpacked on load and re-emitted in matching packed placeholders on `to_hf` so DCP shape/dtype validation lines up with the on-disk layout (see the dequantization sketch below).
- Shared experts + non-expert weights: standard FP8 `e4m3fn` with 128×128 block scales.
- Hash layers' gate has no bias on disk: the adapter reads `num_hash_layers` from the checkpoint's `config.json` and drops the corresponding bias keys before DCP load.
- Indexer / Compressor key flattening: on disk the Indexer sits as a sibling of the Compressor with its own nested compressor (`indexer.compressor.{ape,norm,wgate,wkv}` + `indexer.{wq_b,weights_proj}`); the adapter renames these to land at the flat `compressor.indexer.*` layout.
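For intuition, here is a hedged sketch of the routed-expert dequantization: two FP4 `e2m1` codes per int8 byte, scaled by one `e8m0` exponent per 32 columns. The nibble order and the helper itself are assumptions for illustration, not the adapter's actual code:

```python
import torch

# The eight non-negative e2m1 magnitudes (1 sign, 2 exponent, 1 mantissa bit).
E2M1_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequant_fp4_e2m1(packed: torch.Tensor, scales_e8m0: torch.Tensor) -> torch.Tensor:
    """packed: int8 [R, C//2], two 4-bit codes per byte (low nibble first -- assumed).
    scales_e8m0: uint8 [R, C//32], exponent-only scales with value 2**(x - 127)."""
    b = packed.view(torch.uint8)                                 # reinterpret bits
    codes = torch.stack([b & 0x0F, b >> 4], dim=-1).flatten(-2)  # [R, C]
    sign = torch.where((codes & 0x8) != 0, -1.0, 1.0)            # high bit of nibble
    vals = sign * E2M1_LUT[(codes & 0x7).long()]
    scale = torch.pow(2.0, scales_e8m0.float() - 127.0)          # e8m0 -> float32
    return vals * scale.repeat_interleave(32, dim=-1)            # 32-col groups
```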
A new in-tree `HuggingFaceStorageReader` recognizes the `F8_E8M0` and `F8_E5M2` dtypes (the upstream reader silently dropped them), restoring DCP metadata on every rank for these checkpoints.
Launch Training#
A ready-to-use recipe ships at `examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml`. The YAML header documents how to scale `num_hidden_layers` and `ep_size` for the full 43-layer multi-node run.
NeMo Automodel supports several ways to launch training — via the Automodel CLI with Slurm, interactive sessions, torchrun, and more. For full details on all launch options (Slurm batch jobs, multi-node configuration, environment variables, etc.), see the Run on a Cluster guide.
Standalone Slurm Script#
Below is a standalone Slurm script example for the HellaSwag recipe. Before running it, ensure your cluster environment is configured following the Run on a Cluster guide. Then submit the job:
```bash
export TRANSFORMERS_OFFLINE=1
export HF_HOME=your/path/to/hf_cache
export HF_DATASETS_OFFLINE=1
export WANDB_API_KEY=your_wandb_key

srun --output=output.out \
  --error=output.err \
  --container-image /your/path/to/automodel.image.sqsh \
  --no-container-mount-home \
  bash -c "
    CUDA_DEVICE_MAX_CONNECTIONS=1 automodel \
      examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml \
      --nproc-per-node=8 \
      --model.config.pretrained_model_name_or_path=/your/local/dsv4-flash \
      --model.config.name_or_path=/your/local/dsv4-flash"
```
Before you start:
- Hugging Face applies rate limits on downloads. We recommend cloning the model repository to your local filesystem beforehand (a hedged example follows this list).
- Ensure your Hugging Face cache (`HF_HOME`) is configured and that the dataset is already cached locally.
- To enable Weights & Biases logging, set your `WANDB_API_KEY` and configure the `wandb` section in the YAML file.
- For the full 43-layer schedule, increase `ep_size` (and add `pp_size`) per the cluster you are running on; see the YAML header for guidance.
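One way to pre-stage the checkpoint locally, as a sketch using the standard `huggingface_hub` API (the local directory is a placeholder matching the Slurm script above):

```python
# Pre-download the model repo so training can run fully offline.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="/your/local/dsv4-flash",  # matches the paths in the Slurm script
)
```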
Layer-Parity Validation#
The bring-up was validated against the official DeepSeek inference reference (`dsv4flash/inference/model.py`) by per-tensor dump bisection. On the 4-layer parity harness (`compress_ratios=[0, 0, 4, 128]`, `num_hash_layers=2`, PP=1, EP=8):

- Final-logits cosine similarity: 0.998 vs. the reference; the top-1 token matches.
- Per-block cosine similarity: ≥ 0.987 for every block.
Training Results#
The training loss curve below is from a 43-layer full fine-tune run on HellaSwag with the full attention zoo (SWA + CSA + HCA) enabled.