bridge.recipes.deepseek.deepseek_v3#
Module Contents#
Functions#

| Function | Description |
|---|---|
| model_config | Configure the DeepSeek-V3 (671B) model. |
| pretrain_config | Create a pre-training configuration for the DeepSeek-V3 (671B) model. |
| pretrain_config_32nodes | Create a pre-training configuration for the DeepSeek-V3 (671B) model with a minimal number of nodes (32). |
Data#
API#
- bridge.recipes.deepseek.deepseek_v3.logger#
'getLogger(…)'
- bridge.recipes.deepseek.deepseek_v3.model_config(
- tensor_parallelism: int = 2,
- pipeline_parallelism: int = 16,
- pipeline_parallelism_dtype: Optional[torch.dtype] = None,
- virtual_pipeline_parallelism: Optional[int] = None,
- context_parallelism: int = 1,
- expert_parallelism: int = 64,
- sequence_parallelism: bool = True,
- mtp_num_layers: Optional[int] = 1,
- mtp_loss_scaling_factor: Optional[float] = 0.1,
- recompute_granularity: str = 'selective',
- recompute_modules: Optional[List[str]] = None,
- recompute_method: Optional[str] = None,
- recompute_num_layers: Optional[int] = None,
- enable_deepep: bool = False,
- apply_rope_fusion: bool = True,
- layout: Optional[List[List[str]]] = None,
- )#
Configure the DeepSeek-V3 (671B) model.
- Parameters:
tensor_parallelism – Degree of tensor model parallelism.
pipeline_parallelism – Degree of pipeline model parallelism.
pipeline_parallelism_dtype – Data type for pipeline parallelism.
virtual_pipeline_parallelism – Size of virtual pipeline parallelism.
context_parallelism – Degree of context parallelism.
expert_parallelism – Degree of expert model parallelism.
sequence_parallelism – Whether to use sequence parallelism.
mtp_num_layers – Number of MTP layers.
mtp_loss_scaling_factor – Loss scaling factor for MTP.
recompute_granularity – Recomputation granularity. For V3 we recommend “selective”.
recompute_modules – Modules to selectively recompute when granularity is “selective”.
recompute_method – Method for activation recomputation.
recompute_num_layers – Number of layers to recompute.
enable_deepep – Whether to enable DeepEP, DeepSeek's communication library for MoE expert parallelism.
apply_rope_fusion – Whether to apply MLA YaRN RoPE fusion.
layout – Optional custom pipeline parallelism layout for the model's layers.
- Returns:
Configuration for the DeepSeek-V3 model.
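A minimal usage sketch for model_config, assuming the package is importable under megatron.bridge (as the type hints above indicate); the overridden values in the second call are illustrative, not a validated configuration:

```python
import torch

from megatron.bridge.recipes.deepseek.deepseek_v3 import model_config

# Default large-scale layout: TP=2, PP=16, EP=64, selective recomputation.
provider = model_config()

# Hypothetical smaller layout for experimentation; these parallelism values
# are illustrative and must be sized to your cluster.
small_provider = model_config(
    tensor_parallelism=1,
    pipeline_parallelism=4,
    expert_parallelism=8,
    pipeline_parallelism_dtype=torch.bfloat16,
)
```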
- bridge.recipes.deepseek.deepseek_v3.pretrain_config(
- dir: Optional[str] = None,
- name: str = 'default',
- data_paths: Optional[List[str]] = None,
- data_args_path: Optional[str] = None,
- train_data_path: Optional[List[str]] = None,
- valid_data_path: Optional[List[str]] = None,
- test_data_path: Optional[List[str]] = None,
- per_split_data_args_path: Optional[str] = None,
- mock: bool = False,
- tensor_parallelism: int = 2,
- pipeline_parallelism: int = 16,
- pipeline_parallelism_dtype: Optional[torch.dtype] = torch.bfloat16,
- virtual_pipeline_parallelism: Optional[int] = None,
- context_parallelism: int = 1,
- expert_parallelism: int = 64,
- sequence_parallelism: bool = True,
- use_megatron_fsdp: bool = False,
- mtp_num_layers: Optional[int] = 1,
- mtp_loss_scaling_factor: Optional[float] = 0.1,
- train_iters: int = 1000000,
- global_batch_size: int = 4096,
- micro_batch_size: int = 1,
- seq_length: int = 4096,
- lr: float = 0.0003,
- min_lr: float = 3e-05,
- lr_warmup_iters: int = 2000,
- lr_decay_iters: Optional[int] = None,
- precision_config: Optional[Union[megatron.bridge.training.mixed_precision.MixedPrecisionConfig, str]] = None,
- comm_overlap_config: Optional[megatron.bridge.training.comm_overlap.CommOverlapConfig] = None,
- enable_deepep: bool = False,
- recompute_granularity: str = 'selective',
- recompute_modules: Optional[List[str]] = None,
- recompute_method: Optional[str] = None,
- recompute_num_layers: Optional[int] = None,
- apply_rope_fusion: bool = False,
- layout: Optional[List[List[str]]] = None,
- )#
Create a pre-training configuration for the DeepSeek-V3 (671B) model.
- Returns:
Configuration for pre-training.
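A usage sketch for pretrain_config; the directory and run name are placeholders, and mock=True is assumed (from the recipe's signature) to substitute synthetic data so that no data_paths are required:

```python
import torch

from megatron.bridge.recipes.deepseek.deepseek_v3 import pretrain_config

# Short mock-data run; mock=True is assumed to generate synthetic samples
# in place of real data_paths.
cfg = pretrain_config(
    dir="/tmp/deepseek_v3",  # hypothetical output directory
    name="v3_smoke_test",
    mock=True,
    train_iters=100,
    global_batch_size=4096,
    micro_batch_size=1,
    seq_length=4096,
    pipeline_parallelism_dtype=torch.bfloat16,
)
```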
- bridge.recipes.deepseek.deepseek_v3.pretrain_config_32nodes(**kwargs)#
Create a pre-training configuration for the DeepSeek-V3 (671B) model with a minimal number of nodes (32).
- Returns:
Configuration for pre-training.
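Since this wrapper accepts **kwargs, it presumably forwards them to pretrain_config, so individual settings can still be overridden on top of the 32-node defaults; a sketch under that assumption:

```python
from megatron.bridge.recipes.deepseek.deepseek_v3 import pretrain_config_32nodes

# 32-node defaults with a shorter schedule for a trial run; the overrides
# assume **kwargs are forwarded to pretrain_config.
cfg = pretrain_config_32nodes(
    name="v3_32nodes_trial",
    mock=True,
    train_iters=50,
)
```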