bridge.data.hf_datasets.text_collate#

Generic text-only HF chat collator for the conversation dataset path.

Module Contents#

Functions#

Data#

API#

bridge.data.hf_datasets.text_collate._CONVERSATION_KEYS#

(‘conversation’, ‘messages’, ‘conversations’)

bridge.data.hf_datasets.text_collate._normalize_text_conversation(
example: collections.abc.Mapping[str, Any],
) list[dict[str, Any]]#
bridge.data.hf_datasets.text_collate._render_chat(
conversation: list[dict[str, Any]],
processor: Any,
tokenizer: Any,
) str#
bridge.data.hf_datasets.text_collate._call_tokenizer(
tokenizer_or_processor: Any,
texts: list[str],
tokenizer_kwargs: dict[str, Any],
) collections.abc.Mapping[str, Any]#
bridge.data.hf_datasets.text_collate._tokenize_texts(
texts: list[str],
processor: Any,
tokenizer: Any,
*,
max_length: int | None,
pad_to_max_length: bool,
) dict[str, Any]#
bridge.data.hf_datasets.text_collate._as_2d_long_tensor(value: Any) torch.Tensor#
bridge.data.hf_datasets.text_collate._tensorize_batch(
batch: collections.abc.Mapping[str, Any],
) dict[str, Any]#
bridge.data.hf_datasets.text_collate._ensure_attention_mask(
batch: dict[str, Any],
tokenizer: Any,
) None#
bridge.data.hf_datasets.text_collate._metadata_from_example(
example: collections.abc.Mapping[str, Any],
) dict[str, Any]#
bridge.data.hf_datasets.text_collate.text_chat_collate_fn(
examples: list[collections.abc.Mapping[str, Any]],
processor: Any,
*,
max_length: int | None = None,
pad_to_max_length: bool = False,
warn_on_all_masked: bool = True,
ignore_index: int = IGNORE_INDEX,
pack_sequences: bool = False,
pack_sequences_pad_to_multiple_of: int = 1,
) dict[str, Any]#

Collate text-only HF chat examples using the shared assistant-mask path.

Parameters:
  • examples – HF-style chat rows containing messages, conversation, or legacy conversations.

  • processor – A HF tokenizer or processor. It must expose apply_chat_template directly or through processor.tokenizer.

  • max_length – Optional tokenizer truncation length.

  • pad_to_max_length – If set with max_length, pad every row to max_length instead of the longest row in the batch.

  • warn_on_all_masked – Forwarded to assistant-mask construction.

  • ignore_index – Label ignore value for masked targets.

  • pack_sequences – If True, flatten the padded microbatch and emit packed-sequence metadata for GPT-style training steps.

  • pack_sequences_pad_to_multiple_of – Optional per-sequence length multiple used when pack_sequences inserts padding for CP/SP constraints.

Returns:

Batch dictionary with VLM-style input_ids and GPT-style tokens aliases, shifted labels and loss_mask, position_ids, and optional tokenizer fields such as attention_mask.