Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance

Meta GenAI unveils CGPO, a post-training method that improves AI performance across multiple tasks by mitigating reward hacking, offering more accurate and scalable solutions for coding, STEM, and general chat tasks.

Research: The Perfect Blend: Redefining RLHF with Mixture of Judges. Anggalih Prasetya / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers at Meta GenAI introduced constrained generative policy optimization (CGPO), a novel post-training paradigm for reinforcement learning from human feedback (RLHF) in multi-task learning (MTL). CGPO addressed issues like reward hacking and conflicting objectives in multi-objective optimization by using a mixture of judges (MoJ) approach. Across a wide range of benchmarks covering general chat, science, technology, engineering and mathematics (STEM), and coding, it consistently outperformed popular RLHF methods such as proximal policy optimization (PPO) and direct preference optimization (DPO), improving performance by up to 12.5% in STEM tasks and 7.4% in general chat, while reducing the need for hyper-parameter tuning.

Background

Large language models (LLMs) have revolutionized natural language processing, performing exceptionally across various expert-level domains. MTL is a popular approach used in LLMs, enabling the model to handle several tasks simultaneously by sharing representations. Prior research has focused on MTL during pre-training and fine-tuning, but the application of RLHF in post-training for MTL remained underexplored.

Previous methods combined multiple reward models into a linear framework, leading to two issues: reward hacking, where models exploit flaws in reward systems, and conflicts between different task objectives. These gaps highlighted the challenges of optimizing across tasks without compromising results.
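To make the weakness of a linear reward combination concrete, the following minimal sketch computes a weighted sum of per-objective scores. The objective names, weights, and scores are hypothetical toy values chosen for illustration and do not come from the paper.

```python
from typing import Dict


def linear_combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective reward scores (the 'linear framework')."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights and scores, for illustration only.
weights = {"helpfulness": 0.5, "safety": 0.5}

# A well-behaved response scores reasonably on both objectives.
balanced = {"helpfulness": 0.8, "safety": 0.9}

# A response that exploits a flaw in the helpfulness reward model gets an
# inflated score there, which outweighs its poor safety score in the sum.
exploit = {"helpfulness": 2.0, "safety": 0.1}

balanced_total = linear_combined_reward(balanced, weights)
exploit_total = linear_combined_reward(exploit, weights)
```

In this toy setup the exploiting response receives the higher combined reward, because a single scalar sum cannot tell "good on both objectives" apart from "extremely high on one flawed objective".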

This paper introduced CGPO, a novel method designed to mitigate reward hacking and conflicting task objectives in MTL. CGPO combined a MoJ approach with constrained optimization strategies, including optimizers like the Calibrated-Regularized Policy Gradient (CRPG), to make more sophisticated use of constraints and reward calibration when fine-tuning models across various tasks. This framework outperformed state-of-the-art RLHF algorithms, demonstrating better task management, enhanced scalability, and superior empirical results in complex multi-task environments.

Preliminaries and Limitations

The RLHF fine-tuning phase is formulated as a Markov decision process (MDP) where each prompt represents the state, and the response is the action. Initially, a pre-trained LLM is fine-tuned using supervised learning to obtain a policy (πSFT) relevant to downstream tasks. Then, a reward model is trained using pairwise preferences, enabling exploration-based RL alignment.
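Training a reward model from pairwise preferences is commonly done with a Bradley-Terry style loss. The sketch below assumes that formulation, which the article does not name explicitly, and the function name is hypothetical. It computes the negative log-likelihood that the preferred response outscores the rejected one.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: the reward model emits one scalar score per (prompt, response).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is small when the chosen response is scored well above the rejected one and grows roughly linearly when the ranking is inverted.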

Techniques such as PPO are employed to optimize the LLM policy, though reward-free methods like DPO have gained traction. Despite the success of RLHF, challenges like reward hacking and contradictory optimization objectives remain. Additionally, reward models often fail to align with fine-grained criteria, especially in tasks requiring precise reasoning, such as mathematics and coding, and multi-task optimization strategies need more flexibility to account for varying task requirements.

CGPO in Single Task with Single Objective

In the CGPO framework, constraints were integrated into the model's reward system to prevent "reward hacking," which occurred when the model generated incorrect or undesirable outputs but still received high rewards. In single-task settings, CGPO optimized for both reward maximization and constraint adherence.

For instance, in mathematical reasoning tasks, the model was required to provide correct answers, and in chat tasks, it was required to respond to user queries consistently. These constraints ensured the model did not exhibit problematic behavior, such as refusing to respond or providing incorrect answers.
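The two constraints mentioned above can be pictured as simple judge functions whose verdicts are combined. This is a minimal sketch under stated assumptions: the judge names, the refusal markers, and the string-matching logic are all hypothetical, and the judges used in the paper may be considerably more sophisticated.

```python
from typing import Callable, Sequence

# A judge takes (prompt, response) and returns True if its constraint is met.
Judge = Callable[[str, str], bool]


def no_refusal_judge(prompt: str, response: str) -> bool:
    """Hypothetical rule-based judge: flags responses that open with a refusal."""
    refusal_markers = ("i cannot", "i can't", "i am unable")
    return not response.lower().startswith(refusal_markers)


def make_exact_answer_judge(expected: str) -> Judge:
    """Hypothetical math judge: passes only if the response ends with the expected answer."""
    def judge(prompt: str, response: str) -> bool:
        return response.strip().endswith(expected)
    return judge


def satisfies_all(prompt: str, response: str, judges: Sequence[Judge]) -> bool:
    """A response is constraint-compliant only if every judge passes it."""
    return all(judge(prompt, response) for judge in judges)


judges = [no_refusal_judge, make_exact_answer_judge("4")]
```

With these judges, "The answer is 4" passes for the prompt "What is 2+2?", while a refusal or a wrong final answer is marked as a violation.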

The framework evaluated model outputs through constraint judgments, separating them into "positive" and "negative" groups. The policy was then updated by maximizing the likelihood of constraint-compliant responses while minimizing non-compliant ones. CGPO differed from traditional approaches by focusing on primal-type policy optimization, avoiding the hyperparameter tuning complexities of primal-dual methods.
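The split into "positive" and "negative" groups can be sketched as a toy surrogate objective. This is a simplification written for illustration, not the update rule from the paper: it assumes each sampled response comes with a log-probability under the current policy and a boolean constraint verdict, and the function names are hypothetical.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    """Average of a sequence, or 0.0 when the group is empty."""
    return sum(values) / len(values) if values else 0.0


def constrained_surrogate(log_probs: List[float], compliant: List[bool]) -> float:
    """Toy objective to maximize: raise the likelihood of compliant samples
    and lower the likelihood of non-compliant ones."""
    positive = [lp for lp, ok in zip(log_probs, compliant) if ok]
    negative = [lp for lp, ok in zip(log_probs, compliant) if not ok]
    return _mean(positive) - _mean(negative)


# Three sampled responses; the second one violated a constraint.
objective = constrained_surrogate([-1.0, -3.0, -2.0], [True, False, True])
```

A gradient step on such an objective would push probability mass toward the compliant group without introducing a dual variable, which is the sense in which a primal-type method sidesteps the tuning burden of primal-dual schemes.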

Additionally, the Calibrated-Regularized Policy Gradient (CRPG) optimizer further refined CGPO by calibrating rewards: it compared the reward of the current output with that of a baseline response, keeping rewards consistent across various tasks. The goal was to enhance the model’s adherence to constraints while maximizing calibrated rewards.
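One way to picture reward calibration against a baseline response is to squash the reward difference through a sigmoid, so that every prompt yields a value between 0 and 1. The sketch below assumes that form; the article does not give the exact formula, and zeroing the reward for constraint violations is likewise an illustrative assumption. All function names are hypothetical.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def calibrated_reward(raw_reward: float, baseline_reward: float) -> float:
    """Assumed calibration: how much better the output is than the baseline, in (0, 1)."""
    return _sigmoid(raw_reward - baseline_reward)


def constrained_calibrated_reward(raw_reward: float, baseline_reward: float,
                                  compliant: bool) -> float:
    """Assumed gating: constraint-violating outputs earn no calibrated reward."""
    return calibrated_reward(raw_reward, baseline_reward) if compliant else 0.0
```

Because the calibrated value depends only on the gap to the baseline, prompts whose raw reward scales differ widely still produce comparable signals.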

Benchmarking and Safety

The evaluation benchmarks assessed LLMs across various tasks, including general chat, instruction following, math and coding reasoning, and world knowledge. For general chat, benchmarks like AlpacaEval-2 and Chat-Arena-Hard tested model performance in problem-solving, real-world applications, and response specificity. Instruction-following capability was evaluated using IFEval, which measured task accuracy at different prompt levels.

Math and coding reasoning were tested through benchmarks like MATH, GSM8K (grade school math), MBPP (mostly basic Python programming), and HumanEval, which assessed models' abilities in mathematics and programming tasks. In particular, CGPO achieved a 2% performance improvement on both the MATH and GSM8K benchmarks and a 5% improvement on HumanEval, surpassing the performance of PPO, which struggled with reward hacking issues in coding tasks. World knowledge was measured using the massive multitask language understanding (MMLU) benchmark, while TruthfulQA and the AI2 Reasoning Challenge (ARC)-Challenge set focused on testing factual accuracy.

Additionally, engagement and safety metrics evaluated model interaction quality and potential violations. CGPO also reduced safety violations, thanks to its integration of multiple types of judges to detect potentially harmful outputs. Overall, the CGPO framework outperformed baseline RLHF methods in multi-task fine-tuning across these benchmarks.

Conclusion

In conclusion, the CGPO framework introduced a novel solution for addressing challenges in MTL with RLHF. By incorporating constraints into policy optimization, CGPO effectively mitigated reward hacking and conflicting task objectives. Through the use of task-specific constraint judgments and the introduction of the CRPG optimizer, CGPO ensured more calibrated and consistent rewards across tasks. Using a MoJ approach, the framework ensured improved performance across various tasks, such as general chat, coding, and STEM.

CGPO surpassed traditional RLHF methods, like PPO and DPO, while minimizing hyper-parameter tuning requirements. The results showed that CGPO reduced the frequency of reward hacking and significantly improved coding benchmarks, making it a more robust solution for complex multi-task learning environments. This breakthrough in post-training optimization for LLMs offers enhanced scalability, safety, and task management, setting a new benchmark for RLHF in MTL environments.

Journal reference:
  • Preliminary scientific report. Xu, T., Helenowski, E., Sankararaman, K. A., Jin, D., Peng, K., Han, E., Nie, S., Zhu, C., Zhang, H., Zhou, W., Zeng, Z., He, Y., Mandyam, K., Talabzadeh, A., Khabsa, M., Cohen, G., Tian, Y., Ma, H., Wang, S., . . . Fang, H. (2024). The Perfect Blend: Redefining RLHF with Mixture of Judges. arXiv preprint arXiv:2409.20370. https://arxiv.org/abs/2409.20370

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Nandi, Soham. (2024, October 07). Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance. AZoAi. Retrieved on January 15, 2025 from https://www.azoai.com/news/20241007/Meta-GenAI-Boosts-AI-Learning-with-CGPO-Tackling-Reward-Hacking-and-Improving-Multi-Task-Performance.aspx.

  • MLA

    Nandi, Soham. "Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance". AZoAi. 15 January 2025. <https://www.azoai.com/news/20241007/Meta-GenAI-Boosts-AI-Learning-with-CGPO-Tackling-Reward-Hacking-and-Improving-Multi-Task-Performance.aspx>.

  • Chicago

    Nandi, Soham. "Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance". AZoAi. https://www.azoai.com/news/20241007/Meta-GenAI-Boosts-AI-Learning-with-CGPO-Tackling-Reward-Hacking-and-Improving-Multi-Task-Performance.aspx. (accessed January 15, 2025).

  • Harvard

    Nandi, Soham. 2024. Meta GenAI Boosts AI Learning with CGPO, Tackling Reward Hacking and Improving Multi-Task Performance. AZoAi, viewed 15 January 2025, https://www.azoai.com/news/20241007/Meta-GenAI-Boosts-AI-Learning-with-CGPO-Tackling-Reward-Hacking-and-Improving-Multi-Task-Performance.aspx.
