nemo_automodel.components.models.deepseek_v3.state_dict_adapter#

Module Contents#

Classes#

Functions#

should_quantize_key

Check if a key should be quantized based on its name.

create_scale_inv_for_weight

Create a scale_inv tensor for a weight.

_slice_scale_for_dtensor

Slice scale_inv tensor to match a DTensor weight’s local portion.

calculate_scale_shape

_dequantize_with_torch

_dequantize_with_triton

dequantize_from_fp8

Data#

API#

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.logger#

'getLogger(…)'

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.BLOCK_SIZE#

128

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.NON_QUANTIZED_KEY_PATTERNS#

['input_layernorm.weight', 'post_attention_layernorm.weight', 'norm.weight', 'lm_head.weight', 'embe…

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.should_quantize_key(key: str) → bool#

Check if a key should be quantized based on its name.
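As an illustration of the pattern-based gating, here is a minimal pure-Python sketch (not the module's actual code; the pattern list below is abbreviated to the entries visible in `NON_QUANTIZED_KEY_PATTERNS` above, and the real list contains more):

```python
# Hypothetical sketch of should_quantize_key: only weight tensors that
# match none of the exempt patterns (norms, lm_head, embeddings, ...)
# are candidates for FP8 block quantization.
NON_QUANTIZED_KEY_PATTERNS = [
    "input_layernorm.weight",
    "post_attention_layernorm.weight",
    "norm.weight",
    "lm_head.weight",
]

def should_quantize_key(key: str) -> bool:
    # Non-weight entries (biases, buffers) are never quantized.
    if not key.endswith(".weight"):
        return False
    return not any(pattern in key for pattern in NON_QUANTIZED_KEY_PATTERNS)

print(should_quantize_key("model.layers.0.mlp.gate_proj.weight"))       # True
print(should_quantize_key("model.layers.0.input_layernorm.weight"))     # False
```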

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.create_scale_inv_for_weight(
weight: torch.Tensor,
block_size: int = BLOCK_SIZE,
) → torch.Tensor#

Create a scale_inv tensor for a weight.

Note: scale_inv is always created as a regular tensor (not DTensor) because the scale_inv shape (based on 128x128 blocks) doesn’t align with DTensor sharding boundaries. During dequantization, _slice_scale_for_dtensor handles extracting the correct scale blocks for DTensor weights.

Parameters:
  • weight – The weight tensor (may be a DTensor)

  • block_size – The FP8 quantization block size

Returns:

scale_inv tensor with shape based on GLOBAL weight shape
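Because `scale_inv` carries one entry per `block_size × block_size` tile of the global weight, its shape is the per-dimension ceiling division of the weight shape. A small sketch of that shape computation (illustrative dimensions; assumed logic, not the module's implementation):

```python
import math

BLOCK_SIZE = 128  # FP8 block-quantization tile size used throughout this module

def scale_inv_shape(weight_shape, block_size=BLOCK_SIZE):
    # One scale entry per block_size x block_size tile,
    # rounding up so edge tiles smaller than a full block still get a scale.
    return tuple(math.ceil(dim / block_size) for dim in weight_shape)

print(scale_inv_shape((7168, 18432)))  # (56, 144)
print(scale_inv_shape((300, 129)))    # (3, 2) -- partial edge blocks round up
```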

class nemo_automodel.components.models.deepseek_v3.state_dict_adapter.DeepSeekV3StateDictAdapter(
config: transformers.DeepseekV3Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
dtype: torch.dtype = torch.float32,
)#

Bases: nemo_automodel.components.moe.state_dict_mixin.MoESplitExpertsStateDictMixin, nemo_automodel.components.checkpoint.state_dict_adapter.StateDictAdapter

_dequantize(
state_dict: dict[str, Any],
) → dict[str, Any]#
to_hf(
state_dict: dict[str, Any],
exclude_key_regex: Optional[str] = None,
quantization: bool = False,
**kwargs,
) → dict[str, Any]#

Convert from native model state dict to HuggingFace format. Automatically detects format based on backend.dispatcher configuration.

from_hf(
hf_state_dict: dict[str, Any],
device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
**kwargs,
) → dict[str, Any]#

Convert HF checkpoint to native format.

  • Dequantize FP8 tensors if scale_inv buffers are provided

  • Aggregate per-expert weights into grouped tensors

  • If device_mesh is provided, only load experts needed for the current rank
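The second step above, aggregating per-expert weights, can be sketched with a key-grouping pass over HF checkpoint names. This is a hypothetical illustration of the idea only; the key layout is DeepSeek-V3-like but the exact mapping is assumed, not taken from the module:

```python
import re

def group_expert_keys(hf_keys):
    # Collapse ".experts.<i>." keys into one grouped key per projection,
    # collecting the expert indices that contribute to it.
    grouped = {}
    for key in hf_keys:
        m = re.match(r"(.*)\.experts\.(\d+)\.(.*)", key)
        if m:
            prefix, idx, suffix = m.groups()
            grouped.setdefault(f"{prefix}.experts.{suffix}", []).append(int(idx))
        else:
            grouped[key] = []  # non-expert keys pass through ungrouped
    return grouped

keys = [
    "model.layers.3.mlp.experts.0.gate_proj.weight",
    "model.layers.3.mlp.experts.1.gate_proj.weight",
]
print(group_expert_keys(keys))
# {'model.layers.3.mlp.experts.gate_proj.weight': [0, 1]}
```

When `device_mesh` is provided, such a pass would additionally skip expert indices not owned by the current rank, so only the locally needed shards are ever materialized.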

convert_single_tensor_to_hf(
fqn: str,
tensor: Any,
**kwargs,
) → list[tuple[str, Any]]#

Convert a single tensor from native format to HuggingFace format.

Parameters:
  • fqn – Fully qualified name of the tensor in native format

  • tensor – The tensor to convert

  • **kwargs – Additional arguments for conversion

Returns:

List of (fqn, tensor) tuples in HuggingFace format
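The list return type reflects that one native tensor can fan out to several HF entries, e.g. a grouped expert tensor producing one `(fqn, tensor)` pair per expert. A hypothetical sketch of that fan-out (Python strings stand in for tensors; the key template is illustrative, not the module's real mapping):

```python
def split_grouped_experts(fqn, grouped):
    # fqn like "model.layers.3.mlp.experts.gate_proj.weight";
    # `grouped` is indexed by expert along its leading dimension.
    prefix, suffix = fqn.split(".experts.")
    return [
        (f"{prefix}.experts.{i}.{suffix}", tensor)
        for i, tensor in enumerate(grouped)
    ]

pairs = split_grouped_experts(
    "model.layers.3.mlp.experts.gate_proj.weight",
    ["w0", "w1"],  # placeholder per-expert tensors
)
print(pairs)
# [('model.layers.3.mlp.experts.0.gate_proj.weight', 'w0'),
#  ('model.layers.3.mlp.experts.1.gate_proj.weight', 'w1')]
```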

nemo_automodel.components.models.deepseek_v3.state_dict_adapter._slice_scale_for_dtensor(
scale_inv: torch.Tensor,
weight_dtensor: torch.Tensor,
weight_local: torch.Tensor,
block_size: int = BLOCK_SIZE,
) → torch.Tensor#

Slice scale_inv tensor to match a DTensor weight’s local portion.

When weight is sharded via DTensor but scale_inv is a regular tensor, we need to extract only the scale blocks that correspond to the local portion of the weight.

Parameters:
  • scale_inv – The full (global) scale_inv tensor

  • weight_dtensor – The DTensor weight (has device_mesh and placements)

  • weight_local – The local portion of the weight

  • block_size – The FP8 quantization block size (default 128)

Returns:

The sliced scale_inv tensor matching the local weight’s blocks
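The index arithmetic behind this slicing can be sketched for the row-sharded case: a shard's row range maps onto a (possibly partial) range of `scale_inv` rows. This is assumed logic for illustration; the real function derives the shard offsets from the DTensor's `device_mesh` and placements:

```python
import math

BLOCK_SIZE = 128

def scale_rows_for_shard(row_start, row_end, block_size=BLOCK_SIZE):
    # Weight rows [row_start, row_end) touch scale_inv rows
    # [row_start // block_size, ceil(row_end / block_size)).
    return row_start // block_size, math.ceil(row_end / block_size)

# Rank 1 of a 2-way row shard of a 7168-row weight owns rows 3584..7168,
# which correspond to scale_inv rows 28..56.
print(scale_rows_for_shard(3584, 7168))  # (28, 56)
```

Because shard boundaries need not align with 128-row blocks, the two ranges can overlap between ranks, which is why `scale_inv` is kept as a regular (replicated) tensor rather than a DTensor.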

nemo_automodel.components.models.deepseek_v3.state_dict_adapter.calculate_scale_shape(
weight: torch.Tensor,
BLOCK_SIZE: int = BLOCK_SIZE,
) → torch.Size#
nemo_automodel.components.models.deepseek_v3.state_dict_adapter._dequantize_with_torch(
weight: torch.Tensor,
scale_inv: torch.Tensor,
dtype: torch.dtype,
block_size: int,
) → torch.Tensor#
nemo_automodel.components.models.deepseek_v3.state_dict_adapter._dequantize_with_triton(
weight: torch.Tensor,
scale_inv: torch.Tensor,
dtype: torch.dtype,
block_size: int,
) → torch.Tensor#
nemo_automodel.components.models.deepseek_v3.state_dict_adapter.dequantize_from_fp8(
weight: torch.Tensor,
scale_inv: torch.Tensor,
dtype=torch.bfloat16,
BLOCK_SIZE: int = BLOCK_SIZE,
name: str = '',
) → torch.Tensor#
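The core math shared by the torch and Triton paths is element-wise rescaling by the block's inverse scale. A minimal pure-Python sketch of block-wise dequantization (tiny `block_size=2` for readability; the module uses 128 and operates on torch tensors):

```python
def dequantize_blockwise(weight, scale_inv, block_size=2):
    # Each element is multiplied by the scale_inv entry of the
    # block_size x block_size tile it belongs to.
    rows, cols = len(weight), len(weight[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = weight[i][j] * scale_inv[i // block_size][j // block_size]
    return out

w = [[1, 2], [3, 4]]
s = [[0.5]]  # one scale for the single 2x2 block
print(dequantize_blockwise(w, s))  # [[0.5, 1.0], [1.5, 2.0]]
```

The torch variant vectorizes this over whole tiles, while the Triton variant fuses the load, rescale, and cast into one GPU kernel; both produce the same values up to the output `dtype`'s rounding.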