bridge.data.hf_datasets.text_collate#
Generic text-only HF chat collator for the conversation dataset path.
Module Contents#
Functions#
Collate text-only HF chat examples using the shared assistant-mask path. |
Data#
API#
- bridge.data.hf_datasets.text_collate._CONVERSATION_KEYS#
(‘conversation’, ‘messages’, ‘conversations’)
- bridge.data.hf_datasets.text_collate._normalize_text_conversation(
- example: collections.abc.Mapping[str, Any],
- bridge.data.hf_datasets.text_collate._render_chat(
- conversation: list[dict[str, Any]],
- processor: Any,
- tokenizer: Any,
- bridge.data.hf_datasets.text_collate._call_tokenizer(
- tokenizer_or_processor: Any,
- texts: list[str],
- tokenizer_kwargs: dict[str, Any],
- bridge.data.hf_datasets.text_collate._tokenize_texts(
- texts: list[str],
- processor: Any,
- tokenizer: Any,
- *,
- max_length: int | None,
- pad_to_max_length: bool,
- bridge.data.hf_datasets.text_collate._as_2d_long_tensor(value: Any) torch.Tensor#
- bridge.data.hf_datasets.text_collate._tensorize_batch(
- batch: collections.abc.Mapping[str, Any],
- bridge.data.hf_datasets.text_collate._ensure_attention_mask(
- batch: dict[str, Any],
- tokenizer: Any,
- bridge.data.hf_datasets.text_collate._metadata_from_example(
- example: collections.abc.Mapping[str, Any],
- bridge.data.hf_datasets.text_collate.text_chat_collate_fn(
- examples: list[collections.abc.Mapping[str, Any]],
- processor: Any,
- *,
- max_length: int | None = None,
- pad_to_max_length: bool = False,
- warn_on_all_masked: bool = True,
- ignore_index: int = IGNORE_INDEX,
- pack_sequences: bool = False,
- pack_sequences_pad_to_multiple_of: int = 1,
Collate text-only HF chat examples using the shared assistant-mask path.
- Parameters:
examples – HF-style chat rows containing
messages,conversation, or legacyconversations.processor – A HF tokenizer or processor. It must expose
apply_chat_templatedirectly or throughprocessor.tokenizer.max_length – Optional tokenizer truncation length.
pad_to_max_length – If set with
max_length, pad every row tomax_lengthinstead of the longest row in the batch.warn_on_all_masked – Forwarded to assistant-mask construction.
ignore_index – Label ignore value for masked targets.
pack_sequences – If True, flatten the padded microbatch and emit packed-sequence metadata for GPT-style training steps.
pack_sequences_pad_to_multiple_of – Optional per-sequence length multiple used when
pack_sequencesinserts padding for CP/SP constraints.
- Returns:
Batch dictionary with VLM-style
input_idsand GPT-styletokensaliases, shiftedlabelsandloss_mask,position_ids, and optional tokenizer fields such asattention_mask.