bridge.data.hf_processors.squad

Processing functions for the SQuAD dataset.

Module Contents

Functions

process_squad_example

Process a single SQuAD example into the required format.

API

bridge.data.hf_processors.squad.process_squad_example(
    example: dict[str, Any],
    tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) → megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput

Process a single SQuAD example into the required format.

This function transforms a raw SQuAD dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:
  • example – Raw SQuAD example containing the ‘context’, ‘question’, and ‘answers’ fields

  • tokenizer – Optional tokenizer (not used by this function)

Returns:

A ProcessExampleOutput containing the formatted input and output text and the original answers.
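To make the shape of the return value concrete, here is a minimal sketch of a dictionary-like record with the three pieces of information named above. The class name `ProcessExampleOutputSketch` is hypothetical, and the `output` and `original_answers` field names are assumptions; only the `input` key appears in this page's example. The real `ProcessExampleOutput` definition may differ.

```python
from typing import TypedDict


class ProcessExampleOutputSketch(TypedDict):
    """Hypothetical stand-in for ProcessExampleOutput; field names are assumed."""

    input: str  # prompt text presented to the model
    output: str  # target text the model is trained to produce
    original_answers: list[str]  # all reference answers from the raw example


# A record in the assumed shape, filled with placeholder prompt text.
sample: ProcessExampleOutputSketch = {
    "input": "Context: ... Question: ... Answer:",
    "output": "moist broadleaf forest",
    "original_answers": ["moist broadleaf forest", "broadleaf forest"],
}

print(sorted(sample))
```

Keeping every reference answer alongside the single training target is useful for evaluation, where a prediction is typically compared against all acceptable answers.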

Example:

    example = {
        "context": "The Amazon rainforest is a moist broadleaf forest.",
        "question": "What type of forest is the Amazon rainforest?",
        "answers": {
            "text": ["moist broadleaf forest", "broadleaf forest"],
            "answer_start": [25, 31],
        },
    }
    result = process_squad_example(example)
    print(result["input"])
    # Prints:
    # Context: The Amazon rainforest is a moist broadleaf forest. Question: What type of forest is the Amazon rainforest? Answer:
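The transformation shown in the example can be approximated without installing the library. The sketch below is a hypothetical stand-in named `process_squad_example_sketch`, not the library's implementation: the prompt template is inferred from the printed output above, while the `output` and `original_answers` keys and the choice of the first answer as the training target are assumptions. The `answer_start` offsets are derived from the context string rather than hard-coded.

```python
from typing import Any


def process_squad_example_sketch(example: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical re-implementation for illustration; not the library's code."""
    answers = example["answers"]["text"]
    # Prompt template inferred from the documented example output.
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    return {
        "input": prompt,
        "output": answers[0],  # assumption: the first reference answer is the target
        "original_answers": list(answers),  # assumption: all answers are kept
    }


context = "The Amazon rainforest is a moist broadleaf forest."
answer_texts = ["moist broadleaf forest", "broadleaf forest"]
example = {
    "context": context,
    "question": "What type of forest is the Amazon rainforest?",
    "answers": {
        "text": answer_texts,
        # Compute character offsets from the context instead of typing them by hand.
        "answer_start": [context.index(text) for text in answer_texts],
    },
}

result = process_squad_example_sketch(example)
print(result["input"])
print(result["output"])
```

The sketch takes no tokenizer argument because the transformation is plain string formatting, which matches the note that the tokenizer parameter is not used.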