In a recent submission to the arXiv* server, researchers presented a method for fine-tuning open-source language models, enabling them to employ code for modeling and deriving mathematical equations, thereby enhancing their mathematical reasoning capabilities.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Proprietary large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) and Pathways Language Model 2 (PaLM-2), in conjunction with techniques such as Chain-of-Thought (CoT) prompting and Program-Aided Language (PAL) models, have recently demonstrated exceptional prowess in mathematical reasoning tasks.
Previous works have focused on instruction-following skills, whereas the current study emphasizes using high-quality, model-generated math problem solutions to improve math-solving abilities. Mathematical reasoning benchmarks are used to measure LLMs' math-solving abilities, and approaches such as CoT prompting and code generation enhance multistep reasoning and complex mathematical computation. The current study interleaves natural language and code seamlessly in the dataset to train models efficiently in math problem-solving.
MathCodeInstruct Dataset
Combining seed data with the data generated using the problem interpolation prompting (PIP) method yields the MathCodeInstruct dataset, which is used to fine-tune the base Llama-2 (Large Language Model Meta AI) and CodeLlama models, resulting in MathCoder-L and MathCoder-CL, respectively.
Seed Data: Solutions for the Grade School Math 8K (GSM8K) and MATH training sets are obtained from GPT-4 and expressed as question-solution pairs. Each solution interleaves natural language for reasoning (L), code for execution (C), and execution results (E). These elements are intricately interconnected within the solutions, resulting in a unified composition represented as (L, C, E, L, C, E, …). Researchers call these natural language, code, and execution (LCE) solutions. The seed data is filtered to ensure each solution matches the ground truth answer, and it is used to fine-tune CodeLlama-34B, resulting in MathCoder-Initial.
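To make the structure concrete, the snippet below sketches how a single LCE training record might be represented in Python. The field names, the example problem, and the short solution are illustrative assumptions rather than the paper's actual data schema.

```python
# Illustrative only: a hypothetical representation of one LCE training record.
# The paper defines the format conceptually as interleaved (L, C, E, L, C, E, ...) blocks.
lce_record = {
    "question": (
        "Natalia sold clips to 48 friends in April, and half as many in May. "
        "How many clips did she sell altogether?"
    ),
    "solution": [
        {"type": "text", "content": "First compute the number of clips sold in May."},  # L: reasoning
        {"type": "code", "content": "april = 48\nmay = april // 2\nprint(may)"},        # C: executable code
        {"type": "execution", "content": "24"},                                         # E: interpreter output
        {"type": "text", "content": "Add the April and May sales to get the total."},   # L
        {"type": "code", "content": "print(april + may)"},                              # C
        {"type": "execution", "content": "72"},                                         # E
        {"type": "text", "content": "Natalia sold 72 clips altogether."},               # final answer
    ],
}
```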
Generating Data Using the PIP Method: Using MathCoder-Initial, LCE solutions are generated for new problems. A novel prompting method bridges the gap in difficulty between GSM8K and MATH problems by pairing a simple problem from GSM8K with a challenging problem from MATH. This prompts the model to generate new problems with intermediate difficulty levels. GPT-4 evaluates these new problems, ensuring their appropriateness in difficulty.
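The following is a minimal sketch of how such an interpolation prompt could be assembled. The prompt wording and the two example problems are hypothetical; the paper's exact instructions to the model may differ.

```python
# A sketch of problem interpolation prompting (PIP): pair an easier GSM8K problem
# with a harder MATH problem and ask for a new problem of intermediate difficulty.
def build_interpolation_prompt(easy_problem: str, hard_problem: str) -> str:
    """Return a prompt asking the model for a brand-new problem whose difficulty
    lies between the two given problems."""
    return (
        "Below are two math problems.\n\n"
        f"Problem 1 (easier):\n{easy_problem}\n\n"
        f"Problem 2 (harder):\n{hard_problem}\n\n"
        "Write one new, self-contained math problem whose difficulty lies "
        "between Problem 1 and Problem 2. Do not copy either problem."
    )

# Usage: send the prompt to MathCoder-Initial to obtain a new problem,
# then generate LCE solutions for it in a separate call.
prompt = build_interpolation_prompt(
    easy_problem="A baker made 24 muffins and sold 9. How many are left?",
    hard_problem="Find the number of integer solutions of x^2 + y^2 = 2024.",
)
```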
Self-distillation: MathCoder-Initial generates three different LCE solutions for each new problem since ground truth answers are unavailable. Only the solutions whose final answers agree with one another are retained, ensuring dataset quality.
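A minimal sketch of this filtering step is shown below, assuming that agreement among the independently sampled answers serves as a proxy for correctness; the `extract_final_answer` helper is hypothetical.

```python
from collections import Counter

def filter_by_agreement(solutions, extract_final_answer, min_votes=2):
    """Keep only the sampled LCE solutions whose final answers match the
    majority answer; discard the problem if no consensus is reached."""
    answers = [extract_final_answer(s) for s in solutions]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    if votes < min_votes:
        return []  # no consensus: drop the problem entirely
    return [s for s, a in zip(solutions, answers) if a == majority_answer]
```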
Supervised fine-tuning and inference
Special tokens are used to identify reasoning language, math code, and execution results in LCE solutions, helping the model differentiate between these components. During supervised fine-tuning, cross-entropy loss is applied to the math code and reasoning language, while the loss on execution results is zeroed out. After supervised fine-tuning, the model can identify and generate code and natural language enclosed by the special tokens. At inference time, execution results are concatenated with the math code, and the model continues to generate reasoning language in an autoregressive manner, resembling the behavior of the GPT-4 Code Interpreter.
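The snippet below sketches the loss-masking idea, assuming a Hugging Face-style causal language model where a label of -100 is ignored by the cross-entropy loss; the special tokens shown are placeholders rather than the paper's exact vocabulary.

```python
EXEC_START, EXEC_END = "<|execution|>", "<|endofexecution|>"  # placeholder special tokens

def mask_execution_results(input_ids, labels, tokenizer):
    """Zero out the loss on execution-result tokens so that training supervises
    only the reasoning language and the math code."""
    exec_start_id = tokenizer.convert_tokens_to_ids(EXEC_START)
    exec_end_id = tokenizer.convert_tokens_to_ids(EXEC_END)
    inside_execution = False
    for i, token_id in enumerate(input_ids):
        if token_id == exec_start_id:
            inside_execution = True
        if inside_execution:
            labels[i] = -100  # ignore_index for torch.nn.CrossEntropyLoss
        if token_id == exec_end_id:
            inside_execution = False
    return labels
```

At inference time, generation would pause at the end of each code block, the code would be executed, and its output, wrapped in the execution tokens, would be appended to the context before generation resumes.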
The evaluation of MathCoder encompasses five datasets, comprising two in-domain datasets, MATH and GSM8K, and three out-of-domain datasets, Mathematics, Simple Variations on Arithmetic Math Word Problems (SVAMP), and SimulEq. GSM8K and MATH are considered in-domain due to their utilization in supervised fine-tuning, whereas Mathematics, SVAMP, and SimulEq remain out-of-domain as they are not used in fine-tuning. These datasets span various mathematical challenges from elementary to collegiate levels, covering subjects such as geometry, formal logic, and commonsense reasoning. The selection aims to comprehensively assess model generalization across diverse mathematical fields and unfamiliar scenarios.
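As an illustration of the zero-shot evaluation protocol, the sketch below computes per-dataset accuracy; `load_dataset`, `generate_lce_solution`, and `answers_match` are hypothetical helpers standing in for the benchmark loader, the model's LCE generation loop, and answer comparison.

```python
DATASETS = ["GSM8K", "MATH", "Mathematics", "SVAMP", "SimulEq"]

def evaluate(model, load_dataset, generate_lce_solution, answers_match):
    """Return zero-shot accuracy on each benchmark dataset."""
    results = {}
    for name in DATASETS:
        problems = load_dataset(name)  # list of (question, reference_answer) pairs
        correct = sum(
            answers_match(generate_lce_solution(model, question), reference)
            for question, reference in problems
        )
        results[name] = correct / len(problems)
    return results
```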
Unveiling MathCoder's performance
In the current study, the MathCoder models are compared with various competitive baselines, encompassing both closed-source and open-source models. To enhance performance, the baselines employ CoT prompting and few-shot in-context learning, while the MathCoder models are assessed in a zero-shot setting without additional prompts.
Comparing with State-of-the-Art Open-Source Models: The results reveal MathCoder's superiority over other open-source math-solving models, achieving state-of-the-art (SOTA) performance across all datasets. Nevertheless, a significant performance gap persists when compared to the SOTA closed-source technique, the GPT-4 Code Interpreter. Notable findings include MathCoder-L-7B outperforming WizardMath-70B on three of five datasets, highlighting the advantages of incorporating LCE blocks in solutions. Moreover, models based on CodeLlama-34B outperform models based on Llama-2-70B, in contrast to findings reported in concurrent work.
Comparison Between CodeLlama and Llama-2: Experimental results emphasize the substantial enhancements achieved by MathCoder-CL, which uses CodeLlama as the base model, over MathCoder-L, which uses Llama-2. MathCoder-CL-7B and MathCoder-CL-13B demonstrate accuracy improvements of 4.1 percent and 3.0 percent, respectively, over their MathCoder-L counterparts. CodeLlama's extended training on code data is credited for its improved coding and reasoning capabilities, particularly in coding-related tasks and advanced mathematical reasoning.
Comparison Between Subjects and Levels: A performance comparison across different subjects and difficulty levels in the MATH dataset is presented. MathCoder excels in algebra and pre-algebra problems but faces challenges in geometry problems, especially those with higher difficulty levels. This underscores the significant role of code in computationally intensive questions.
Conclusion
In summary, researchers introduced MathCoder, a family of open-source language models for mathematical reasoning. It leverages the GSM8K and MATH datasets with GPT-4 to generate problems and solutions that interleave reasoning, code generation, and execution results, and customized supervised fine-tuning focuses the loss on natural language and code. MathCoder achieves SOTA results among open-source models and surpasses closed-source models such as ChatGPT-3.5 and PaLM-2. However, it has limitations: it relies on GPT-4 for data generation and struggles with theorem-proving and complex geometry problems, which remain areas for future research.