> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.models.deepseek_v4.cp

Context-parallel helpers for the DeepSeek V4 custom model.

This implements the Miles-style training path: each CP rank owns a contiguous
query shard, while K/V and compressed K/V are all-gathered with autograd-aware
collectives before DSV4 sparse attention consumes them.

## Module Contents

### Functions

| Name                                                                                                                                          | Description                                                                 |
| --------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`_lcm`](#nemo_automodel-components-models-deepseek_v4-cp-_lcm)                                                                               | -                                                                           |
| [`_pad_position_ids_seq_dim_`](#nemo_automodel-components-models-deepseek_v4-cp-_pad_position_ids_seq_dim_)                                   | -                                                                           |
| [`_pad_tensor_seq_dim_`](#nemo_automodel-components-models-deepseek_v4-cp-_pad_tensor_seq_dim_)                                               | -                                                                           |
| [`build_dsv4_cp_causal_padding_mask`](#nemo_automodel-components-models-deepseek_v4-cp-build_dsv4_cp_causal_padding_mask)                     | Build local-query/global-key additive mask for Miles-style DSV4 CP.         |
| [`dsv4_cp_all_gather`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_all_gather)                                                   | All-gather activation tensors across CP ranks and concatenate on `dim`.     |
| [`dsv4_cp_all_gather_metadata`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_all_gather_metadata)                                 | All-gather non-differentiable metadata such as padding masks.               |
| [`dsv4_cp_enabled`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_enabled)                                                         | Return whether a real CP process group is active.                           |
| [`dsv4_cp_local_seq_multiple`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_local_seq_multiple)                                   | Required per-CP-rank sequence-length multiple for DSV4 Miles-style CP.      |
| [`dsv4_cp_rank`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_rank)                                                               | Return this rank's index in the DSV4 CP group, or 0 without CP.             |
| [`dsv4_cp_size`](#nemo_automodel-components-models-deepseek_v4-cp-dsv4_cp_size)                                                               | Return the DSV4 CP group size, or 1 without CP.                             |
| [`make_dsv4_contiguous_shard_cp_batch_and_ctx`](#nemo_automodel-components-models-deepseek_v4-cp-make_dsv4_contiguous_shard_cp_batch_and_ctx) | Contiguously shard a batch for DeepSeek V4 Miles-style context parallelism. |

### API

```python
nemo_automodel.components.models.deepseek_v4.cp._lcm(
    a: int,
    b: int
) -> int
```

```python
nemo_automodel.components.models.deepseek_v4.cp._pad_position_ids_seq_dim_(
    position_ids: torch.Tensor,
    seq_dim: int,
    pad_len: int
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.cp._pad_tensor_seq_dim_(
    tensor: torch.Tensor,
    seq_dim: int,
    pad_len: int,
    value
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.cp.build_dsv4_cp_causal_padding_mask(
    position_ids: torch.Tensor,
    key_len: int,
    dtype: torch.dtype,
    device: torch.device,
    cp_group,
    padding_mask: torch.Tensor | None = None,
    sliding_window: int | None = None
) -> torch.Tensor
```

Build local-query/global-key additive mask for Miles-style DSV4 CP.

`position_ids` are the local query positions after contiguous CP slicing.
Keys are in global sequence order because DSV4 gathers K/V along sequence.
`padding_mask` follows the internal convention True=padding.

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_all_gather(
    tensor: torch.Tensor,
    dim: int,
    cp_group
) -> torch.Tensor
```

All-gather activation tensors across CP ranks and concatenate on `dim`.

The distributed.nn functional collective preserves autograd, so backward
routes gradients for gathered remote slices back to their owning ranks.

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_all_gather_metadata(
    tensor: torch.Tensor | None,
    dim: int,
    cp_group
) -> torch.Tensor | None
```

All-gather non-differentiable metadata such as padding masks.

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_enabled(
    cp_group
) -> bool
```

Return whether a real CP process group is active.

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_local_seq_multiple(
    model_or_config
) -> int
```

Required per-CP-rank sequence-length multiple for DSV4 Miles-style CP.

Compress-ratio layers constrain how the sequence may be split across CP ranks:
a ratio-R layer needs each local shard divisible by R, and ratio-4 layers use
cross-window overlap so they need `2*R`. The returned value is the LCM across
all configured `compress_ratios` (1 when none are configured).

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_rank(
    cp_group
) -> int
```

Return this rank's index in the DSV4 CP group, or 0 without CP.

```python
nemo_automodel.components.models.deepseek_v4.cp.dsv4_cp_size(
    cp_group
) -> int
```

Return the DSV4 CP group size, or 1 without CP.

```python
nemo_automodel.components.models.deepseek_v4.cp.make_dsv4_contiguous_shard_cp_batch_and_ctx(
    cp_mesh,
    tp_mesh,
    batch,
    loss_mask = None,
    padding_token_id: int = 0,
    pad_multiple: int | None = None
)
```

Contiguously shard a batch for DeepSeek V4 Miles-style context parallelism.

Attached to the batch as `_cp_make_batch_fn` (via `functools.partial` to bind
`pad_multiple`) and invoked by `cp_utils.make_cp_batch_and_ctx`. Each CP rank
keeps one `seq_start:seq_end` slice; DSV4 attention all-gathers K/V across CP
ranks during forward. No collective happens here -- this is the batch-side
counterpart of `cp.py`'s activation gathers. Returns `(nullcontext, batch)`.

`pad_multiple` is the required *per-CP-rank* shard multiple (from
`dsv4_cp_local_seq_multiple`); the global sequence is padded so it is divisible
by `cp_size` and each local shard is divisible by `pad_multiple` (>= 2).