bridge.data.hf_processors.openmathinstruct2#

Processing functions for OpenMathInstruct-2 dataset.

Dataset: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2

OpenMathInstruct-2 contains math problems with generated solutions. Each example has problem, generated_solution, and expected_answer fields.

Module Contents#

Functions#

process_openmathinstruct2_example

Process a single OpenMathInstruct-2 example into the required format.

_strip_intermediate_boxed

Replace all \boxed{content} occurrences in text with just content.

process_openmathinstruct2_thinking_packed_example

Process OpenMathInstruct-2 example into analysis+final channel format.

API#

bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_example(
example: dict[str, Any],
_tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput#

Process a single OpenMathInstruct-2 example into the required format.

Transforms a raw OpenMathInstruct-2 dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:
  • example – Raw example containing ‘problem’, ‘generated_solution’, and ‘expected_answer’

  • tokenizer – Optional tokenizer (not used in this processor)

Returns:

ProcessExampleOutput with formatted input/output and original answers

.. rubric:: Example

example = { 
 “problem”: “What is 2 + 3?”, 
 “generated_solution”: “We add 2 and 3 to get 5.”, 
 “expected_answer”: “5”, 
 } result = process_openmathinstruct2_example(example) print(result[“input”]) Problem: What is 2 + 3? Solution:

bridge.data.hf_processors.openmathinstruct2._strip_intermediate_boxed(text: str) str#

Replace all \boxed{content} occurrences in text with just content.

Uses brace-depth counting to handle nested braces correctly (e.g. \boxed{\frac{1}{2}} → \frac{1}{2}).

bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_thinking_packed_example(
example: dict,
_tokenizer=None,
) dict#

Process OpenMathInstruct-2 example into analysis+final channel format.

Puts the CoT reasoning (generated_solution without the trailing \boxed{N}) into the ‘thinking’ field (rendered as <|channel|>analysis by the GPT-OSS chat template) and the final answer as ‘#### N’ in the ‘content’ field (rendered as <|channel|>final).

This separates the reasoning chain from the answer delivery, matching the intended GPT-OSS channel structure for math problem solving.