Frequently Asked Questions (FAQ)

FP8 checkpoint compatibility

Transformer Engine added support for FP8 attention in version 1.6. It stores the FP8 metadata, i.e. scaling factors and amax histories, under a ._extra_state key in the checkpoint. As FP8 attention support has expanded from one backend to multiple backends, the location of the ._extra_state key has also shifted.

Take the MultiheadAttention module as an example. In Transformer Engine 1.11, its FP8 attention metadata is stored as core_attention._extra_state, as shown below.

>>> import torch
>>> from transformer_engine.pytorch import MultiheadAttention, fp8_model_init
>>> with fp8_model_init(enabled=True):
...     mha = MultiheadAttention(
...         hidden_size=1024,
...         num_attention_heads=16,
...         bias=True,
...         params_dtype=torch.bfloat16,
...         input_layernorm=False,
...         fuse_qkv_params=True,
...         attention_type="self",
...         qkv_weight_interleaved=True,
...     ).to(dtype=torch.bfloat16, device="cuda")
...
>>> state_dict = mha.state_dict()
>>> print(state_dict.keys())
odict_keys(['qkv.weight', 'qkv.bias', 'qkv._extra_state', 'core_attention._extra_state', 'proj.weight', 'proj.bias', 'proj._extra_state'])
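
The resulting state_dict can be saved and restored with standard PyTorch checkpointing. The snippet below is a minimal sketch that continues the session above; the file name is illustrative.

>>> torch.save(mha.state_dict(), "mha_checkpoint.pt")  # illustrative path
>>> mha.load_state_dict(torch.load("mha_checkpoint.pt"))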

Here is the full list of checkpoint save/load behaviors across all Transformer Engine versions.

Version: <= 1.5

  • Saves no FP8 metadata since FP8 attention is not supported

  • Loading behavior for checkpoints created by the following versions:

    <= 1.5:

    Loads no FP8 metadata

    > 1.5:

    Error: unexpected key

Version: 1.6, 1.7

  • Saves FP8 metadata to core_attention.fused_attention._extra_state

  • Loading behavior for checkpoints created by the following versions:

    <= 1.5:

    Initializes FP8 metadata to the defaults, i.e. 1s for scaling factors and 0s for amaxes

    1.6, 1.7:

    Loads FP8 metadata from checkpoint

    >= 1.8:

    Error: unexpected key

Version: >= 1.8, <= 1.11

  • Saves FP8 metadata to core_attention._extra_state

  • Loading behavior for checkpoints created by the following versions:

    <= 1.5:

    Initializes FP8 metadata to the defaults, i.e. 1s for scaling factors and 0s for amaxes

    1.6, 1.7:

    This save/load combination requires users to map the 1.6/1.7 key to the 1.8-1.11 key; otherwise, FP8 metadata is initialized to the defaults, i.e. 1s for scaling factors and 0s for amaxes. In this MultiheadAttention example, the mapping can be done as follows (a sketch that generalizes the remapping to a full model state_dict is given after this list):

    >>> state_dict["core_attention._extra_state"] = \
    ...     state_dict["core_attention.fused_attention._extra_state"]
    >>> del state_dict["core_attention.fused_attention._extra_state"]
    
    >= 1.8:

    Loads FP8 metadata from checkpoint

Version: >= 1.12

  • Saves FP8 metadata to core_attention._extra_state

  • Loading behavior for checkpoints created by the following versions:

    <= 1.5:

    Initializes FP8 metadata to the defaults, i.e. 1s for scaling factors and 0s for amaxes

    >= 1.6:

    Loads FP8 metadata from checkpoint
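
For models with many attention layers, the per-key remapping shown above for 1.6/1.7 checkpoints can be applied to the entire state_dict. The loop below is a minimal sketch, assuming the key layouts listed above; it is illustrative code, not a Transformer Engine API.

>>> old_suffix = "core_attention.fused_attention._extra_state"  # 1.6/1.7 layout
>>> new_suffix = "core_attention._extra_state"                  # >= 1.8 layout
>>> for key in list(state_dict.keys()):
...     if key.endswith(old_suffix):
...         # Keep the module prefix, swap only the metadata key suffix
...         state_dict[key[: -len(old_suffix)] + new_suffix] = state_dict.pop(key)
...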