By targeting the most reliable rewards during training, PF-PPO sets a new standard in enhancing large language models, driving breakthroughs in complex code generation.
Study: Policy Filtration in RLHF to Fine-Tune LLM for Code Generation.
In an article submitted to the arXiv* server, researchers introduced a novel method called policy filtration for proximal policy optimization (PF-PPO), a technique aimed at improving reinforcement learning from human feedback (RLHF) in large language models (LLMs).
By filtering out samples associated with unreliable reward signals during policy learning, PF-PPO enhanced performance, especially in code generation tasks requiring complex reasoning. This filtering approach was chosen based on the observed reliability of the reward model, particularly in regions where high rewards were given. Extensive experiments demonstrated that PF-PPO variants outperformed existing methods, achieving state-of-the-art results on benchmarks such as HumanEval, Mostly Basic Programming Problems (MBPP), and the LeetCode Contest benchmark.
Background
RLHF is crucial for aligning LLMs with human values and preferences, helping them generate helpful and safe responses, and it has proven effective in models such as generative pre-trained transformer (GPT)-4 and Claude. However, a significant challenge lies in the accuracy of the intermediate reward models, especially when they are applied to tasks involving long and complex reasoning, such as code generation. Prior work has focused on improving these reward models, yet they often remain misaligned with human preferences, leading to inaccurate reward signals on complex tasks like code generation.
The researchers observed that reward models tend to be more reliable in specific regions, particularly when they assign higher rewards. The paper addressed this gap by proposing a new technique, PF-PPO, which selectively filtered out unreliable samples during training, thereby improving the signal-to-noise ratio of the learning process. The method exploited the reward model's higher accuracy in certain regions, particularly where high rewards were given, and showed that filtering could significantly enhance performance on challenging tasks. Extensive experiments validated the effectiveness of PF-PPO, achieving state-of-the-art results in code generation across benchmarks such as HumanEval and LeetCode Contest.
PF Approach for Enhanced RLHF Optimization
The proposed PF-PPO method enhanced the reliability of the reward model used in RLHF by filtering responses during policy optimization. This filtering was driven by a detailed analysis showing that the reward model's accuracy was significantly higher for responses with extreme rewards, either high or low, compared to those in moderate reward regions. The method involved filtering responses generated by the unfiltered policy using ranking and weighting strategies, producing a filtered policy that prioritized more reliable responses.
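To make the filtering idea concrete, below is a minimal, simplified sketch of policy filtration during rollout collection. The `generate_fn` and `score_fn` callables stand in for the policy and reward model, and the top-ranked selection is only an illustration of prioritizing more reliable responses, not the authors' exact procedure.

```python
import random

def collect_filtered_rollouts(prompts, generate_fn, score_fn, n_samples=5, keep=2):
    """For each prompt, sample several candidate responses from the current
    policy, score them with the reward model, and keep only the top-ranked
    subset for the subsequent PPO update."""
    rollouts = []
    for prompt in prompts:
        responses = [generate_fn(prompt) for _ in range(n_samples)]
        scored = sorted(
            ((score_fn(prompt, r), r) for r in responses),
            key=lambda pair: pair[0],
            reverse=True,
        )
        rollouts.extend((prompt, response, reward) for reward, response in scored[:keep])
    return rollouts

# Toy usage with stand-in policy and reward model.
toy_generate = lambda prompt: f"{prompt} -> candidate {random.randint(0, 9)}"
toy_score = lambda prompt, response: random.random()
print(collect_filtered_rollouts(["write a sorting function"], toy_generate, toy_score))
```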
PF-PPO utilized several filtering techniques, including best-of-N (BoN), best-random (BR), and best-worst (BW) filters, to refine the reward signal. These filtering strategies were not chosen arbitrarily but were based on their ability to maximize the alignment (measured by the R-squared metric) between the reward model’s predictions and the actual performance of the responses. These filters selectively ranked and sampled responses based on reward model feedback, improving policy learning.
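The three filters can be illustrated with a short sketch operating on reward-scored responses from a single prompt. This is a simplified reading of the description above; the exact ranking and weighting scheme used in the paper may differ.

```python
import random

def filter_responses(scored, strategy="BR"):
    """Select a subset of (reward, response) pairs from one prompt's samples.

    BoN: keep only the single best-scored response.
    BR (best-random): keep the best response plus one drawn at random.
    BW (best-worst): keep the best and the worst responses, i.e. the two
    reward regions the analysis found most reliable.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    if strategy == "BoN":
        return [ranked[0]]
    if strategy == "BR":
        return [ranked[0]] + ([random.choice(ranked[1:])] if len(ranked) > 1 else [])
    if strategy == "BW":
        return [ranked[0]] + ([ranked[-1]] if len(ranked) > 1 else [])
    raise ValueError(f"unknown strategy: {strategy}")

samples = [(0.9, "response A"), (0.2, "response B"), (0.5, "response C")]
print(filter_responses(samples, strategy="BW"))  # [(0.9, 'response A'), (0.2, 'response B')]
```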
The R-squared (R²) metric was employed to measure the alignment between the reward model and actual scores, ensuring that PF-PPO optimized performance effectively. Notably, the BR and BW filters were found to be particularly effective, as they helped avoid the over-optimization issues that often arise when relying on moderate rewards, which tend to be less reliable. Experimental results showed that BR and BW filters enhanced reward model accuracy, while BoN did not significantly improve accuracy, though it performed well in supervised learning settings.
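As a hedged illustration of how such an alignment score might be computed, the sketch below fits a least-squares line from reward model scores to actual scores (for example, test pass rates) and reports R². The paper's exact procedure is not described in this article and may differ.

```python
import numpy as np

def r_squared(rewards, actual_scores):
    """Coefficient of determination for a least-squares linear fit of
    actual scores (e.g., test pass rates) onto reward model scores."""
    rewards = np.asarray(rewards, dtype=float)
    actual = np.asarray(actual_scores, dtype=float)
    slope, intercept = np.polyfit(rewards, actual, deg=1)
    predicted = slope * rewards + intercept
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A higher R^2 in, say, the high-reward region would suggest the reward
# model is more trustworthy there.
print(r_squared([0.1, 0.5, 0.9, 1.3], [0.0, 0.4, 0.8, 1.0]))
```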
Benchmark Evaluation and Experimentation
In the experiments, the effectiveness of the proposed method was evaluated on the code generation task, comparing different algorithms across three benchmarks: HumanEval, MBPP, and LeetCode Contest. HumanEval and MBPP were widely recognized benchmarks for evaluating LLMs on coding tasks.
The researchers carefully selected datasets and filtering strategies tailored to these benchmarks, highlighting the importance of context-specific strategy selection in optimizing RLHF performance. HumanEval comprised 164 hand-written Python problems validated using test cases, while MBPP consisted of 378 problems with problem descriptions and test cases. These benchmarks provided a robust measure of LLM performance in code generation. The LeetCode Contest benchmark, with 160 real-world, competition-level problems, was designed to assess models on more challenging programming tasks, emphasizing human-level problem-solving skills.
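Benchmarks such as HumanEval validate generated programs by running them against unit tests. The toy harness below illustrates that check in principle only; a real evaluation would sandbox execution and enforce timeouts.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a generated solution followed by its unit tests; any
    exception (including an assertion failure) counts as a failure."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the candidate solution
        exec(test_code, namespace)       # assert-style tests raise on failure
        return True
    except Exception:
        return False

# Toy example in the spirit of a HumanEval-style problem.
solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(solution, tests))  # True
```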
To train the model, the Magicoder-OSS-Instruct (open-source software) and evol-codealpaca-v1 datasets were used. From 130 thousand (k) training samples, 7k prompts were selected to create response pairs for training the reward model, while 3k prompts were used for policy optimization in the PPO phase. For LeetCode, self-generated correct answers formed the supervised fine-tuning (SFT) dataset.
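The article does not detail how the response pairs were constructed. One common approach, shown here purely as an assumption-laden sketch, is to sample several completions per prompt and pair a passing solution with a failing one; `sample_fn` and `is_correct_fn` are hypothetical helpers, not the authors' code.

```python
import random

def build_preference_pairs(prompts, sample_fn, is_correct_fn, n=8):
    """Pair one passing and one failing completion per prompt as a
    (chosen, rejected) training example for the reward model."""
    pairs = []
    for prompt in prompts:
        responses = [sample_fn(prompt) for _ in range(n)]
        correct = [r for r in responses if is_correct_fn(prompt, r)]
        incorrect = [r for r in responses if not is_correct_fn(prompt, r)]
        if correct and incorrect:
            pairs.append({"prompt": prompt, "chosen": correct[0], "rejected": incorrect[0]})
    return pairs

# Toy usage with stand-in sampling and checking functions.
toy_sample = lambda prompt: f"solution variant {random.randint(0, 3)}"
toy_check = lambda prompt, response: response.endswith("0")
print(build_preference_pairs(["reverse a string"], toy_sample, toy_check))
```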
Deepseek-6.7B was employed as the base model and trained in three phases: SFT, reward model training, and PPO. The model underwent five epochs in the SFT phase, and PPO utilized reward normalization and advantage normalization to ensure stable training.
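Reward and advantage normalization are standard stabilization tricks in PPO. The minimal sketch below shows the standardization step in its generic form; it is not the authors' exact implementation.

```python
import numpy as np

def normalize(values, eps=1e-8):
    """Standardize a batch of rewards or advantages to zero mean and
    unit variance, a common trick for stabilizing PPO training."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)

print(normalize([0.2, 1.5, -0.3, 0.9]))   # normalized rewards
print(normalize([0.05, -0.4, 0.7, 0.1]))  # normalized advantages
```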
Results on the benchmarks indicated that PF-PPO (BR) and PF-PPO (BW) achieved the highest scores, particularly excelling on the challenging LeetCode benchmark and demonstrating the superiority of the RL approach over SFT-based methods. These policy filtering strategies improved performance, especially on complex reasoning tasks, by concentrating learning on the more reliable reward regions and avoiding the reward model's inaccuracies in moderate-reward regions.
Conclusion
In conclusion, the introduction of PF-PPO marked a substantial improvement in RLHF for LLMs. This innovative approach enhanced model performance by filtering out unreliable reward signals during training, thus refining the quality of the feedback received. The selection of filtering strategies was guided by a careful analysis of the reward model's reliability, leading to substantial improvements in task performance. The PF-PPO method, utilizing BR and BW filters, demonstrated remarkable advancements in code generation tasks, achieving state-of-the-art results on benchmarks such as HumanEval, MBPP, and LeetCode Contest.
These filters not only improved reward model accuracy but also significantly enhanced the alignment of LLMs with human preferences. By addressing the challenges of reward noise and optimizing policy learning, PF-PPO set a new benchmark for RLHF techniques. The authors also noted that while PF-PPO performed exceptionally well in their experiments, the effectiveness of different policy filtering strategies may vary across different tasks and models, suggesting avenues for further research and optimization in other contexts. Its effectiveness in handling complex reasoning tasks highlighted its potential to drive future advancements in RL and model training methodologies.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Shen, W., & Zhang, C. (2024). Policy Filtration in RLHF to Fine-Tune LLM for Code Generation. arXiv. DOI: 10.48550/arXiv.2409.06957, https://arxiv.org/abs/2409.06957v1