`bridge.data.hf_datasets.text_collate`#

Generic text-only HF chat collator for the conversation dataset path.

Module Contents#

Functions#

`_normalize_text_conversation`
`_render_chat`
`_call_tokenizer`
`_tokenize_texts`
`_as_2d_long_tensor`
`_tensorize_batch`
`_ensure_attention_mask`
`_metadata_from_example`
`text_chat_collate_fn`	Collate text-only HF chat examples using the shared assistant-mask path.

Data#

_CONVERSATION_KEYS

API#

bridge.data.hf_datasets.text_collate._CONVERSATION_KEYS#: (‘conversation’, ‘messages’, ‘conversations’)

bridge.data.hf_datasets.text_collate._normalize_text_conversation( example: collections.abc.Mapping[str, Any], ) → list[dict[str, Any]]#

bridge.data.hf_datasets.text_collate._render_chat( conversation: list[dict[str, Any]], processor: Any, tokenizer: Any, ) → str#

bridge.data.hf_datasets.text_collate._call_tokenizer( tokenizer_or_processor: Any, texts: list[str], tokenizer_kwargs: dict[str, Any], ) → collections.abc.Mapping[str, Any]#

bridge.data.hf_datasets.text_collate._tokenize_texts( texts: list[str], processor: Any, tokenizer: Any, *, max_length: int | None, pad_to_max_length: bool, ) → dict[str, Any]#

bridge.data.hf_datasets.text_collate._as_2d_long_tensor(value: Any) → torch.Tensor#

bridge.data.hf_datasets.text_collate._tensorize_batch( batch: collections.abc.Mapping[str, Any], ) → dict[str, Any]#

bridge.data.hf_datasets.text_collate._ensure_attention_mask( batch: dict[str, Any], tokenizer: Any, ) → None#

bridge.data.hf_datasets.text_collate._metadata_from_example( example: collections.abc.Mapping[str, Any], ) → dict[str, Any]#

bridge.data.hf_datasets.text_collate.text_chat_collate_fn( examples: list[collections.abc.Mapping[str, Any]], processor: Any, *, max_length: int | None = None, pad_to_max_length: bool = False, warn_on_all_masked: bool = True, ignore_index: int = IGNORE_INDEX, pack_sequences: bool = False, pack_sequences_pad_to_multiple_of: int = 1, ) → dict[str, Any]#

Collate text-only HF chat examples using the shared assistant-mask path.

Parameters:

examples – HF-style chat rows containing messages, conversation, or legacy conversations.
processor – A HF tokenizer or processor. It must expose apply_chat_template directly or through processor.tokenizer.
max_length – Optional tokenizer truncation length.
pad_to_max_length – If set with max_length, pad every row to max_length instead of the longest row in the batch.
warn_on_all_masked – Forwarded to assistant-mask construction.
ignore_index – Label ignore value for masked targets.
pack_sequences – If True, flatten the padded microbatch and emit packed-sequence metadata for GPT-style training steps.
pack_sequences_pad_to_multiple_of – Optional per-sequence length multiple used when pack_sequences inserts padding for CP/SP constraints.

Returns:

Batch dictionary with VLM-style input_ids and GPT-style tokens aliases, shifted labels and loss_mask, position_ids, and optional tokenizer fields such as attention_mask.

bridge.data.hf_datasets.text_collate#

Module Contents#

Functions#

Data#

API#

`bridge.data.hf_datasets.text_collate`#