`bridge.data.hf_processors.openmathinstruct2`

Processing functions for OpenMathInstruct-2 dataset.

Dataset: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2

OpenMathInstruct-2 contains math problems with generated solutions. Each example has `problem`, `generated_solution`, and `expected_answer` fields.

Module Contents

Functions

`process_openmathinstruct2_example`	Process a single OpenMathInstruct-2 example into the required format.
`_strip_intermediate_boxed`	Replace all \boxed{content} occurrences in text with just content.
`process_openmathinstruct2_thinking_packed_example`	Process OpenMathInstruct-2 example into analysis+final channel format.

API

bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_example(example: dict[str, Any], _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None) → megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput

Process a single OpenMathInstruct-2 example into the required format.

Transforms a raw OpenMathInstruct-2 dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:

example – Raw example containing the `problem`, `generated_solution`, and `expected_answer` fields.
_tokenizer – Optional tokenizer (not used in this processor).

Returns:

ProcessExampleOutput with the formatted input and output and the original answers.

Example:

    >>> example = {
    ...     "problem": "What is 2 + 3?",
    ...     "generated_solution": "We add 2 and 3 to get 5.",
    ...     "expected_answer": "5",
    ... }
    >>> result = process_openmathinstruct2_example(example)
    >>> print(result["input"])
    Problem: What is 2 + 3? Solution:

bridge.data.hf_processors.openmathinstruct2._strip_intermediate_boxed(text: str) → str

Replace all \boxed{content} occurrences in text with just content.

Uses brace-depth counting to handle nested braces correctly (e.g. \boxed{\frac{1}{2}} → \frac{1}{2}).
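The brace-depth approach described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the name `strip_boxed` and the handling of unbalanced braces (leaving the remainder untouched) are assumptions made for illustration.

```python
def strip_boxed(text: str) -> str:
    """Hypothetical sketch: replace every \\boxed{content} with content."""
    marker = "\\boxed{"
    out = []
    i = 0
    while i < len(text):
        start = text.find(marker, i)
        if start == -1:
            # No further \boxed{ occurrences: keep the rest as-is.
            out.append(text[i:])
            break
        out.append(text[i:start])
        j = start + len(marker)
        depth = 1  # we are just inside the opening brace of \boxed{
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth != 0:
            # Assumption: on unbalanced braces, leave the remainder unchanged.
            out.append(text[start:])
            break
        inner = text[start + len(marker) : j - 1]
        # Recurse so a \boxed{...} nested inside the content is unwrapped too.
        out.append(strip_boxed(inner))
        i = j
    return "".join(out)


print(strip_boxed(r"So x = \boxed{\frac{1}{2}}."))
```

Counting depth rather than searching for the next `}` is what keeps nested groups such as `\frac{1}{2}` intact; a non-greedy regular expression would cut the content at the first closing brace.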

bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_thinking_packed_example(example: dict, _tokenizer=None) → dict

Process OpenMathInstruct-2 example into analysis+final channel format.

Puts the chain-of-thought (CoT) reasoning, that is, `generated_solution` without the trailing \boxed{N}, into the `thinking` field (rendered as <|channel|>analysis by the GPT-OSS chat template) and the final answer, formatted as `#### N`, into the `content` field (rendered as <|channel|>final).

This separates the reasoning chain from the answer delivery, matching the intended GPT-OSS channel structure for math problem solving.
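The split between reasoning and answer can be sketched as follows. This is a rough illustration under stated assumptions, not the function's real code: the helper name `split_reasoning_and_answer` is hypothetical, the answer is assumed to come from `expected_answer`, the reasoning is assumed to be everything before the last `\boxed{`, and the real function's full return structure is not shown in this page, so only the two fields named above are produced.

```python
from typing import Any


def split_reasoning_and_answer(example: dict[str, Any]) -> dict[str, str]:
    """Hypothetical sketch of the thinking/content split described above."""
    solution = example["generated_solution"]
    # Assumption: the trailing answer is the last \boxed{...} in the solution,
    # and everything before it is the reasoning chain.
    cut = solution.rfind("\\boxed{")
    reasoning = solution[:cut] if cut != -1 else solution
    return {
        "thinking": reasoning.rstrip(),
        # Assumption: N in "#### N" is taken from the expected_answer field.
        "content": f"#### {example['expected_answer']}",
    }


example = {
    "problem": "What is 2 + 3?",
    "generated_solution": "We add 2 and 3 to get 5.\n\\boxed{5}",
    "expected_answer": "5",
}
result = split_reasoning_and_answer(example)
print(result["thinking"])
print(result["content"])
```

Keeping the answer out of the `thinking` text means the final channel is the only place the result is delivered, which is the separation the description above aims for.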
