bridge.data.hf_processors.gsm8k

Processing functions for GSM8K (Grade School Math 8K) dataset.

Dataset: https://huggingface.co/datasets/openai/gsm8k

GSM8K contains about 8.5K grade school math word problems. Each example has a ‘question’ field and an ‘answer’ field. The answer holds chain-of-thought reasoning, followed by the delimiter #### and the final numerical answer.
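To make the record layout concrete, here is a minimal sketch that splits an answer into its reasoning and its final result. The `GSM8KRecord` type is a hypothetical name introduced only for illustration, and the sample record reuses the example from the docstring further down this page rather than a real dataset row.

```python
from typing import TypedDict


class GSM8KRecord(TypedDict):
    """Hypothetical type describing one raw GSM8K row (not part of the library)."""

    question: str
    answer: str


record: GSM8KRecord = {
    "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
}

# The reasoning comes first; the text after "####" is the final answer.
reasoning, delimiter, final_answer = record["answer"].partition("####")
reasoning = reasoning.strip()
final_answer = final_answer.strip()

print(reasoning)
print(final_answer)
```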

Module Contents

Functions

_extract_final_answer

Extract the final numerical answer after the #### delimiter.

process_gsm8k_example

Process a single GSM8K example into the required format.

API

bridge.data.hf_processors.gsm8k._extract_final_answer(answer: str) → str

Extract the final numerical answer after the #### delimiter.
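The page does not show how this helper is implemented. The following is only a sketch of one way such an extraction could work, under two assumptions that the source does not confirm: the text after the last #### is returned with surrounding whitespace removed, and an empty string is returned when the delimiter is missing. The name `extract_final_answer_sketch` is hypothetical.

```python
import re

# Matches the delimiter and captures the rest of that line.
_FINAL_ANSWER_RE = re.compile(r"####\s*(.+)")


def extract_final_answer_sketch(answer: str) -> str:
    """Return the text after the last '####', or '' if absent (assumed fallback)."""
    matches = _FINAL_ANSWER_RE.findall(answer)
    if not matches:
        # Assumption: a missing delimiter yields an empty string.
        return ""
    return matches[-1].strip()


print(extract_final_answer_sketch("3 + 2 = <<3+2=5>>5.\n#### 5"))
```

A regular expression is used here so that any amount of whitespace after the delimiter is tolerated. A plain `str.rpartition("####")` would serve equally well for well-formed rows.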

bridge.data.hf_processors.gsm8k.process_gsm8k_example(
    example: dict[str, Any],
    _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) → megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput

Process a single GSM8K example into the required format.

Transforms a raw GSM8K dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:
  • example – Raw GSM8K example containing ‘question’ and ‘answer’

  • _tokenizer – Optional tokenizer (not used in this processor)

Returns:

ProcessExampleOutput with formatted input/output and original answers

Example:

>>> example = {
...     "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
...     "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
... }
>>> result = process_gsm8k_example(example)
>>> print(result["input"])
Question: Janet has 3 apples. She buys 2 more. How many does she have? Answer:
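For readers who want to experiment without installing the library, the transformation can be approximated as below. This is a sketch and not the library's code. `ExampleOutputSketch` is a hypothetical stand-in for ProcessExampleOutput, and `process_gsm8k_example_sketch` is likewise a made-up name. Only the "Question: … Answer:" input format is taken from the example above. The contents of `output` and `original_answers` are assumptions.

```python
from typing import Any, TypedDict


class ExampleOutputSketch(TypedDict):
    """Hypothetical stand-in for ProcessExampleOutput."""

    input: str
    output: str
    original_answers: list[str]


def process_gsm8k_example_sketch(example: dict[str, Any]) -> ExampleOutputSketch:
    question = example["question"].strip()
    answer = example["answer"].strip()
    # Text after the last "####"; empty when the delimiter is missing.
    _, found, final_answer = answer.rpartition("####")
    final_answer = final_answer.strip() if found else ""
    return {
        # Prompt format shown in the documented example.
        "input": f"Question: {question} Answer:",
        # Assumption: the full reasoning plus final answer is the target text.
        "output": answer,
        # Assumption: fall back to the whole answer if no final answer is found.
        "original_answers": [final_answer] if final_answer else [answer],
    }


example = {
    "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
}
result = process_gsm8k_example_sketch(example)
print(result["input"])
```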