Built with synthetic data and a refined curriculum, Phi-4 outshines larger models in STEM reasoning and coding benchmarks, setting a new standard for efficiency and accuracy.
Research: Phi-4 Technical Report.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers at Microsoft Research introduced Phi-4, a 14-billion (B) parameter language model whose training recipe centers on data quality. Unlike most language models, which are pretrained mainly on organic data such as web content and code, Phi-4 incorporated synthetic data throughout training and surpassed its teacher model, GPT-4o, on science, technology, engineering, and mathematics (STEM)-focused reasoning tasks. Synthetic data played a central role not only in pretraining but also in midtraining and post-training, a distinctive aspect of Phi-4's design.
Despite retaining Phi-3's architecture, improvements in data, training curriculum, and post-training techniques enabled Phi-4 to excel in reasoning benchmarks, demonstrating the effectiveness of its innovative training approach. Phi-4's development prioritized a systematic curriculum designed to maximize reasoning capabilities by combining synthetic data with curated organic datasets.
Background
Large language models (LLMs) have made significant strides in natural language processing, with advancements in data quality now rivaling improvements traditionally achieved through increased model size and computational resources.
The Phi family of models has been pivotal in demonstrating the impact of high-quality data curation and synthesis on model performance. However, existing models often face challenges such as overfitting, data contamination, and suboptimal reasoning capabilities, especially on complex benchmarks. To address these issues, Phi-4 implemented advanced decontamination techniques and original benchmarks, ensuring fair and comprehensive evaluations.
Previous models in the Phi family leveraged teacher-student distillation methods, primarily replicating GPT-4’s capabilities. While successful, they fell short of fully addressing reasoning-focused tasks. For example, Phi-3 demonstrated limitations in following detailed instructions or strict formatting requirements, a challenge partially carried forward to Phi-4.
Other contemporary models, like Llama and Qwen, have improved their scores on reasoning benchmarks, but at the cost of increased token usage and inference latency, which makes them less efficient. Phi-4 distinguishes itself by achieving strong reasoning performance while remaining efficient, outperforming models with significantly larger parameter counts.
To bridge these gaps, Phi-4 introduced innovations in synthetic data generation, training curriculum optimization, and post-training refinement. It employed diverse techniques like multi-agent prompting and self-revision to create high-quality datasets, coupled with improved decontamination processes to mitigate overfitting. Synthetic datasets generated using instruction reversal and iterative refinement workflows ensured high diversity and complexity.
By balancing synthetic and curated organic data, Phi-4 achieved superior reasoning performance, particularly on STEM-focused benchmarks, outperforming larger models while maintaining efficiency.
Average performance of different models on the November 2024 AMC-10 and AMC-12 tests. This is the average score (with maximum score 150) over the four tests on 100 runs with temperature t = 0.5. We chose t = 0.5 to follow simple-evals [Ope24b]. Error bars are 2σ of the estimate. On competition math, phi-4 scores well above its weight-class even compared to non–open-weight models.
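The error bars described in the caption can be illustrated with a short calculation. The sketch below is a minimal example, assuming that "2σ of the estimate" means two standard errors of the mean over repeated runs and that scores follow standard AMC scoring (6 points per correct answer, 1.5 per unanswered question, 25 questions, maximum 150). The per-run numbers are made up for illustration and are not results from the report.

```python
import math
import statistics


def amc_score(correct: int, blank: int) -> float:
    """Standard AMC 10/12 scoring: 6 per correct, 1.5 per blank, 0 per wrong."""
    return 6.0 * correct + 1.5 * blank


def mean_with_two_sigma(scores):
    """Return (mean, 2 * standard error of the mean) for a list of run scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, 2.0 * sem


# Hypothetical scores from four runs (illustrative only).
runs = [amc_score(15, 0), amc_score(14, 2), amc_score(16, 0), amc_score(15, 1)]
mean, err = mean_with_two_sigma(runs)
print(f"{mean:.2f} +/- {err:.2f}")
```

With 100 runs instead of four, the same function applies unchanged; the error bar shrinks roughly with the square root of the number of runs.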
Pre-training and Post-training Details
Phi-4 was initially pre-trained with a 4K-token context length, which was extended to 16K during mid-training. It uses the tiktoken tokenizer for better multilingual support and has a vocabulary size of 100,352.
Pretraining processed 10 trillion tokens with full attention over 4K contexts, using a peak learning rate of 0.0003, a weight decay of 0.1, and a global batch size of 5760. Compared to its predecessor, Phi-3-medium, Phi-4 showed significant improvements on benchmarks such as Massive Multitask Language Understanding (MMLU), the grade-school math benchmark GSM8K, and Mostly Basic Python Problems (MBPP).
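To give a sense of scale, the pretraining figures above can be turned into a rough step count. This is a back-of-the-envelope sketch under two assumptions that the article does not state: that "4K" means 4,096 tokens, and that the global batch size counts full-length sequences per optimizer step. The class name `PretrainConfig` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    total_tokens: int = 10 * 10**12   # 10 trillion tokens (from the article)
    context_length: int = 4096        # assumption: "4K" = 4,096 tokens
    global_batch_size: int = 5760     # assumption: sequences per optimizer step
    peak_learning_rate: float = 3e-4  # 0.0003 (from the article)
    weight_decay: float = 0.1         # from the article

    @property
    def tokens_per_step(self) -> int:
        """Tokens consumed by one optimizer step under the assumptions above."""
        return self.global_batch_size * self.context_length

    @property
    def approx_steps(self) -> int:
        """Approximate number of optimizer steps to consume all tokens."""
        return self.total_tokens // self.tokens_per_step


cfg = PretrainConfig()
print(cfg.tokens_per_step)  # 23592960
print(cfg.approx_steps)
```

Under these assumptions each step consumes about 23.6 million tokens, so 10 trillion tokens correspond to roughly 424,000 steps. If the batch size is defined differently, the step count changes accordingly.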
The pretraining dataset combined synthetic data (40%), web data (15%), web rewrites (15%), code (20%), and targeted academic and book sources (10%). Synthetic data incorporated chain-of-thought workflows, facilitating systematic reasoning through structured datasets.
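The mixture above can be expressed as a small configuration to see what each share means in tokens. The sketch below is illustrative only: the dictionary keys and function names are hypothetical, and it assumes the percentages apply to the full 10-trillion-token budget and that each training example is drawn from a source with probability proportional to its share.

```python
import random

# Shares in percent, as listed in the article (keys are hypothetical names).
MIXTURE_PERCENT = {
    "synthetic": 40,
    "web": 15,
    "web_rewrites": 15,
    "code": 20,
    "academic_and_books": 10,
}


def token_budgets(total_tokens: int, mixture_percent: dict) -> dict:
    """Split a total token budget across sources according to their shares."""
    if sum(mixture_percent.values()) != 100:
        raise ValueError("mixture shares must sum to 100 percent")
    return {name: total_tokens * pct // 100 for name, pct in mixture_percent.items()}


def sample_sources(mixture_percent: dict, k: int, seed: int = 0) -> list:
    """Draw k source names with probability proportional to each share."""
    rng = random.Random(seed)
    names = list(mixture_percent)
    weights = [mixture_percent[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


budgets = token_budgets(10 * 10**12, MIXTURE_PERCENT)
batch_sources = sample_sources(MIXTURE_PERCENT, k=8)
```

Here `budgets["synthetic"]` comes to 4 trillion tokens, and `batch_sources` shows which source each of eight hypothetical examples would be drawn from.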
Synthetic data contributed to strong performance on reasoning-heavy tasks but showed limitations on knowledge-based evaluations such as TriviaQA, highlighting the importance of a balanced dataset. During mid-training, the context length was extended to 16K by reweighting the data to prioritize longer sequences and by introducing new synthetic datasets. This phase used 250B tokens with reduced learning rates for stability.
Phi-4’s evaluation revealed strong recall and in-context learning results but mixed performance in long-context reasoning compared to models like GPT-4o. Synthetic data was beneficial for reasoning but required supplementation with web and curated datasets for better knowledge-based task performance.
In post-training, Phi-4 was aligned with human preferences using supervised fine-tuning (SFT) and direct preference optimization (DPO). A new technique called pivotal token search (PTS) identified the individual tokens that most strongly shifted the likelihood of a correct final answer, and these tokens were used to build targeted preference data for DPO. Together, these steps improved the model's reasoning, coding, and safety behavior, making its outputs more robust and better aligned with human preferences.
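The idea behind pivotal token search can be illustrated with a simplified sketch. This is not the report's algorithm: it is a plain left-to-right scan that assumes a hypothetical estimator, `success_prob`, which returns the estimated probability that a completion continued from a given token prefix ends in a correct answer (in practice such an estimate would come from sampling completions). Tokens whose arrival changes that estimate by at least a chosen threshold are flagged as pivotal. The threshold value and the toy numbers are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def find_pivotal_tokens(
    tokens: Sequence[str],
    success_prob: Callable[[Sequence[str]], float],
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    """Return (index, token, change) for tokens that shift the estimated
    success probability by at least `threshold` in either direction."""
    pivotal = []
    prev = success_prob(tokens[:0])  # estimate before any token is emitted
    for i, tok in enumerate(tokens):
        cur = success_prob(tokens[: i + 1])
        delta = cur - prev
        if abs(delta) >= threshold:
            pivotal.append((i, tok, delta))
        prev = cur
    return pivotal


# Toy estimator (hypothetical): success probability looked up by prefix length.
TOY_PROBS = {0: 0.50, 1: 0.50, 2: 0.55, 3: 0.10, 4: 0.10}


def toy_success_prob(prefix: Sequence[str]) -> float:
    return TOY_PROBS[len(prefix)]


result = find_pivotal_tokens(["x", "=", "-", "b"], toy_success_prob)
```

In this toy case only the third token is flagged, because it drops the estimated success probability from 0.55 to 0.10. A flagged position like this is where a preference pair (a better token versus a worse one) would be most informative.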
Challenges, Benchmarks, and Safety Considerations
Benchmarking LLMs faced challenges, such as data contamination, limited skill scope, and biases in evaluation methods. Many academic benchmarks overlapped with training data, risking data contamination despite efforts like deduplication. They often measured narrow skills, missing broader model capabilities. Evaluation biases in generation-based tasks could favor style over reasoning accuracy, and multiple-choice tests might enable pattern-matching guesses rather than true understanding.
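One common way to check for overlap between training data and a benchmark is word-level n-gram matching. The sketch below is a generic illustration, not the procedure used for Phi-4: the function names are hypothetical, and the default window of 13 words is an assumed value chosen for the example.

```python
import re


def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc: str, benchmark_item: str, n: int = 13) -> bool:
    """Flag a training document that shares any word n-gram with a benchmark item."""
    return not word_ngrams(train_doc, n).isdisjoint(word_ngrams(benchmark_item, n))


# Toy usage with a short window so the overlap is easy to see.
doc = "Forum post: what is the sum of the first ten primes? I got 129."
question = "What is the sum of the first ten primes?"
flagged = is_contaminated(doc, question, n=5)
clean = is_contaminated("An unrelated note about cooking rice.", question, n=5)
```

Exact n-gram matching catches verbatim copies but misses paraphrased or translated benchmark items, which is one reason the article notes that contamination risk remains despite such efforts.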
To address these issues, Phi-4’s performance was evaluated using PhiBench, an internal benchmark designed for originality and skill diversity. PhiBench assessed diverse tasks, from debugging code to identifying mathematical proof errors, with rigorous scoring to minimize stylistic biases. These internal benchmarks provided a high-signal framework for detecting strengths and weaknesses, ensuring robust evaluations.
Phi-4 outperformed comparable models on most benchmarks, excelling in STEM Q&A and coding tasks, where it even surpassed its teacher model, GPT-4o. However, it struggled with strict instruction-following, such as producing outputs in predefined formats or adhering to detailed stylistic constraints, and with some reasoning scenarios. This reflects its emphasis on Q&A and reasoning tasks over instruction-following, a gap that could be addressed in future iterations.
In terms of safety, Phi-4's development followed Microsoft's Responsible Artificial Intelligence (AI) principles, with rigorous testing and post-training alignment to minimize risks. Red-teaming and adversarial testing identified vulnerabilities that were then addressed, but challenges such as factual hallucinations and bias remained, and such risks cannot be entirely eliminated.
Further work is needed to mitigate these issues. Overall, Phi-4's development emphasized enhancing reasoning, coding, and user experience while acknowledging limitations in instruction-following and potential safety concerns.
Conclusion
The researchers introduced Phi-4, a 14-billion-parameter language model built around data quality and innovative training methods.
By incorporating synthetic data and optimizing its training curriculum, Phi-4 surpassed larger models, including GPT-4o, in STEM reasoning and coding tasks. Its development highlighted the effectiveness of balanced datasets, combining synthetic and curated data to enhance reasoning performance. The introduction of techniques like pivotal token search further enhanced its robustness, particularly in critical tasks like coding and reasoning.
The introduction of PhiBench allowed for more rigorous benchmarking with less stylistic bias, addressing several traditional evaluation challenges.
While Phi-4 excelled in reasoning and user experience, the researchers acknowledged its limitations in strict instruction-following and safety challenges such as factual hallucinations. Its development underscored the potential of quality-driven approaches in advancing AI capabilities.