bridge.data.hf_processors.openmathinstruct2

Processing functions for the OpenMathInstruct-2 dataset.

Dataset: https://huggingface.co/datasets/nvidia/OpenMathInstruct-2

OpenMathInstruct-2 contains math problems with generated solutions. Each example has problem, generated_solution, and expected_answer fields.
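
The record layout described above can be sketched as a small type plus a check for the three fields. This is only an illustration and not part of the module: `OpenMathInstruct2Record`, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names, and typing every field as a string is an assumption.

```python
from typing import Any, TypedDict


class OpenMathInstruct2Record(TypedDict):
    """Hypothetical type for one raw example; string field types are assumed."""

    problem: str
    generated_solution: str
    expected_answer: str


# The three field names come from the dataset description above.
REQUIRED_FIELDS = ("problem", "generated_solution", "expected_answer")


def missing_fields(example: dict[str, Any]) -> list[str]:
    """Return the required field names that are absent from `example`."""
    return [name for name in REQUIRED_FIELDS if name not in example]
```

A check like this could run before processing, so that a malformed record fails with a clear message rather than a `KeyError` deeper in the pipeline.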

Module Contents

Functions

process_openmathinstruct2_example

Process a single OpenMathInstruct-2 example into the required format.

API

bridge.data.hf_processors.openmathinstruct2.process_openmathinstruct2_example(
    example: dict[str, Any],
    _tokenizer: Optional[megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer] = None,
) -> megatron.bridge.data.builders.hf_dataset.ProcessExampleOutput

Process a single OpenMathInstruct-2 example into the required format.

Transforms a raw OpenMathInstruct-2 dataset example into the standard format expected by the HFDatasetBuilder for fine-tuning.

Parameters:
  • example – Raw example containing the 'problem', 'generated_solution', and 'expected_answer' fields

  • _tokenizer – Optional tokenizer (not used by this processor)

Returns:

ProcessExampleOutput with formatted input/output and original answers

Example:

>>> example = {
...     "problem": "What is 2 + 3?",
...     "generated_solution": "We add 2 and 3 to get 5.",
...     "expected_answer": "5",
... }
>>> result = process_openmathinstruct2_example(example)
>>> print(result["input"])
Problem: What is 2 + 3? Solution:
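
As a rough sketch of the transformation this example illustrates, the function below builds the same prompt string from a raw record. It is not the library's implementation: `SketchOutput` and `sketch_process_example` are hypothetical stand-ins, and two mappings are assumptions inferred from the return description: `generated_solution` becomes the output, and `expected_answer` is wrapped in a one-element list. Only the `Problem: ... Solution:` prompt format is taken from the example output above.

```python
from typing import Any, Optional, TypedDict


class SketchOutput(TypedDict):
    """Hypothetical stand-in for ProcessExampleOutput."""

    input: str
    output: str
    original_answers: list[str]


def sketch_process_example(
    example: dict[str, Any],
    _tokenizer: Optional[Any] = None,  # accepted for interface parity, never used
) -> SketchOutput:
    # Prompt format taken from the documented example output.
    prompt = f"Problem: {example['problem']} Solution:"
    return SketchOutput(
        input=prompt,
        # Assumption: the generated solution is the fine-tuning target.
        output=example["generated_solution"],
        # Assumption: the expected answer is kept as a one-element list.
        original_answers=[example["expected_answer"]],
    )


record = {
    "problem": "What is 2 + 3?",
    "generated_solution": "We add 2 and 3 to get 5.",
    "expected_answer": "5",
}
result = sketch_process_example(record)
print(result["input"])
```

The leading underscore on `_tokenizer` mirrors the documented signature and marks the argument as accepted but unused.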