`bridge.data.hf_processors.gsm8k`#

Processing functions for GSM8K (Grade School Math 8K) dataset.

Dataset: https://huggingface.co/datasets/openai/gsm8k

GSM8K contains about 8.5K grade school math word problems. Each example has a `question` field and an `answer` field; the answer holds chain-of-thought reasoning followed by the `####` delimiter and the final numerical answer.

Module Contents#

Functions#

`_extract_final_answer`	Extract the final numerical answer after the `####` delimiter.
`process_gsm8k_example`	Process a single GSM8K example into the required format.

API#

bridge.data.hf_processors.gsm8k._extract_final_answer(answer: str) → str#

Extract the final numerical answer after the `####` delimiter.
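The page does not show the body of `_extract_final_answer`, so the following is only a minimal sketch of one way to do the extraction it describes. The name `extract_final_answer_sketch` is hypothetical, and the fallback for answers that lack a `####` delimiter is an assumption, not documented behavior.

```python
def extract_final_answer_sketch(answer: str) -> str:
    """Return the text that follows the last '####' delimiter, stripped."""
    delimiter = "####"
    if delimiter not in answer:
        # Assumption: with no delimiter, fall back to the whole stripped string.
        return answer.strip()
    # Take everything after the last delimiter and drop surrounding whitespace.
    return answer.split(delimiter)[-1].strip()


raw = "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5"
print(extract_final_answer_sketch(raw))
```

Splitting on the last occurrence keeps the sketch robust if the reasoning text itself happens to contain `####`.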

bridge.data.hf_processors.gsm8k.process_gsm8k_example(example: dict[str, Any], _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None) → megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput#

Process a single GSM8K example into the required format.

Transforms a raw GSM8K dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:

example – Raw GSM8K example containing the `question` and `answer` fields
_tokenizer – Optional tokenizer (not used in this processor)

Returns:

ProcessExampleOutput with formatted input/output and original answers

Example:

    >>> example = {
    ...     "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    ...     "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
    ... }
    >>> result = process_gsm8k_example(example)
    >>> print(result["input"])
    Question: Janet has 3 apples. She buys 2 more. How many does she have? Answer:
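To make the transformation concrete, here is a self-contained sketch that reproduces the documented `input` string without importing Megatron Bridge. It is not the library's implementation: `ProcessExampleOutputSketch` and `process_gsm8k_example_sketch` are hypothetical stand-ins, only the `input` key is confirmed by the example above, and the `output` and `original_answers` keys and their contents are assumptions inferred from the "Returns" description.

```python
from typing import Any, Optional, TypedDict


class ProcessExampleOutputSketch(TypedDict):
    # Hypothetical stand-in for ProcessExampleOutput.
    input: str                   # confirmed by the documented example
    output: str                  # assumption: key name and content
    original_answers: list[str]  # assumption: key name and content


def process_gsm8k_example_sketch(
    example: dict[str, Any],
    _tokenizer: Optional[object] = None,  # accepted but unused, as documented
) -> ProcessExampleOutputSketch:
    question = example["question"]
    answer = example["answer"]
    # The final answer is the text after the last '####' delimiter.
    final_answer = answer.split("####")[-1].strip()
    return {
        "input": f"Question: {question} Answer:",
        "output": answer,  # assumption: full reasoning text is the target
        "original_answers": [final_answer],
    }


example = {
    "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
}
result = process_gsm8k_example_sketch(example)
print(result["input"])
```

Keeping the unused `_tokenizer` parameter mirrors the documented signature, which lets a processor like this share a call convention with processors that do need a tokenizer.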
