bridge.data.hf_processors.gsm8k
Processing functions for the GSM8K (Grade School Math 8K) dataset.
Dataset: https://huggingface.co/datasets/openai/gsm8k
GSM8K contains 8.5K grade school math word problems. Each example has a
question and an answer field where the answer contains chain-of-thought
reasoning followed by #### and the final numerical answer.
Module Contents
Functions
_extract_final_answer | Extract the final numerical answer after the #### delimiter.
process_gsm8k_example | Process a single GSM8K example into the required format.
API
- bridge.data.hf_processors.gsm8k._extract_final_answer(answer: str) -> str
Extract the final numerical answer after the #### delimiter.
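The helper's body is not shown on this page, so the following is only a minimal sketch of how such an extraction could work. The name `extract_final_answer_sketch` is hypothetical, and the fallback for a missing delimiter is an assumption rather than documented behavior.

```python
def extract_final_answer_sketch(answer: str) -> str:
    """Return the text after the last '####' delimiter, with whitespace stripped.

    Assumption: when the delimiter is absent, the whole stripped answer is
    returned. The real helper may behave differently in that case.
    """
    # rpartition splits on the LAST occurrence of the delimiter; when the
    # delimiter is not found, the middle element is an empty string.
    _reasoning, delimiter, final = answer.rpartition("####")
    if not delimiter:
        return answer.strip()
    return final.strip()


raw = "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5"
print(extract_final_answer_sketch(raw))  # prints: 5
```

Splitting on the last occurrence keeps the sketch robust if the reasoning text itself happens to contain the delimiter.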
- bridge.data.hf_processors.gsm8k.process_gsm8k_example(
    example: dict[str, Any],
    _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
  ) -> ProcessExampleOutput
Process a single GSM8K example into the required format.
Transforms a raw GSM8K dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.
- Parameters:
example – Raw GSM8K example containing "question" and "answer" fields
_tokenizer – Optional tokenizer (not used in this processor)
- Returns:
ProcessExampleOutput with formatted input/output and original answers
Example:
    >>> example = {
    ...     "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    ...     "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
    ... }
    >>> result = process_gsm8k_example(example)
    >>> print(result["input"])
    Question: Janet has 3 apples. She buys 2 more. How many does she have? Answer:
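To make the transformation concrete, here is a hedged, self-contained sketch of what a processor of this shape might do. It does not import the library. `ProcessExampleOutputSketch` is a hypothetical stand-in for `ProcessExampleOutput`, and only the "input" key and its prompt format come from the example above. The "output" and "original_answers" key names, and the choice to keep the full chain-of-thought as the training target, are assumptions.

```python
from typing import Any, Optional, TypedDict


class ProcessExampleOutputSketch(TypedDict):
    # Hypothetical stand-in for ProcessExampleOutput. Only "input" is
    # confirmed by the documented example; the other keys are assumed.
    input: str
    output: str
    original_answers: list[str]


def process_gsm8k_example_sketch(
    example: dict[str, Any],
    _tokenizer: Optional[Any] = None,  # accepted for interface parity, unused
) -> ProcessExampleOutputSketch:
    question = example["question"]
    answer = example["answer"]
    # Text after the last '####' is treated as the final numerical answer.
    final_answer = answer.rpartition("####")[2].strip()
    return {
        "input": f"Question: {question} Answer:",
        "output": answer,  # assumption: full reasoning kept as the target
        "original_answers": [final_answer],
    }


example = {
    "question": "Janet has 3 apples. She buys 2 more. How many does she have?",
    "answer": "Janet starts with 3 apples and buys 2 more. 3 + 2 = <<3+2=5>>5.\n#### 5",
}
result = process_gsm8k_example_sketch(example)
print(result["input"])
print(result["original_answers"])  # prints: ['5']
```

A function with this signature can typically be applied row by row to a Hugging Face dataset. The exact hook that HFDatasetBuilder uses for this is not described here.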