bridge.data.hf_processors.openmathinstruct2#
Processing functions for OpenMathInstruct-2 dataset.
Dataset: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2
OpenMathInstruct-2 contains math problems with generated solutions. Each example
has problem, generated_solution, and expected_answer fields.
Module Contents#
Functions#
Process a single OpenMathInstruct-2 example into the required format. |
|
Replace all \boxed{content} occurrences in text with just content. |
|
Process OpenMathInstruct-2 example into analysis+final channel format. |
API#
- bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_example(
- example: dict[str, Any],
- _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
Process a single OpenMathInstruct-2 example into the required format.
Transforms a raw OpenMathInstruct-2 dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.
- Parameters:
example â Raw example containing âproblemâ, âgenerated_solutionâ, and âexpected_answerâ
tokenizer â Optional tokenizer (not used in this processor)
- Returns:
ProcessExampleOutput with formatted input/output and original answers
.. rubric:: Example
example = { ⊠âproblemâ: âWhat is 2 + 3?â, ⊠âgenerated_solutionâ: âWe add 2 and 3 to get 5.â, ⊠âexpected_answerâ: â5â, ⊠} result = process_openmathinstruct2_example(example) print(result[âinputâ]) Problem: What is 2 + 3? Solution:
- bridge.data.hf_processors.openmathinstruct2._strip_intermediate_boxed(text: str) str#
Replace all \boxed{content} occurrences in text with just content.
Uses brace-depth counting to handle nested braces correctly (e.g. \boxed{\frac{1}{2}} â \frac{1}{2}).
- bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_thinking_packed_example(
- example: dict,
- _tokenizer=None,
Process OpenMathInstruct-2 example into analysis+final channel format.
Puts the CoT reasoning (generated_solution without the trailing \boxed{N}) into the âthinkingâ field (rendered as <|channel|>analysis by the GPT-OSS chat template) and the final answer as â#### Nâ in the âcontentâ field (rendered as <|channel|>final).
This separates the reasoning chain from the answer delivery, matching the intended GPT-OSS channel structure for math problem solving.