In an article recently submitted to the arXiv* server, researchers demonstrated that alignment training methods can match or exceed instruction fine-tuning at keeping conversational agents, commonly known as bots, within predefined guardrails. They compared traditional training methods like instruction fine-tuning with newer approaches such as identity preference optimization (IPO) and Kahneman-Tversky optimization (KTO).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The study emphasized how these alignment techniques, applied before and after instruction fine-tuning, effectively optimized conversational bots for domains like customer care that demand strict adherence to predefined rules.
Background
Past work has highlighted that 'instruction-tuning' procedures enable large language models (LLMs) to generalize beyond their training sets, enhancing overall usability. However, these approaches often lead to challenges such as catastrophic forgetting, overfitting, potential hallucination induction, and compromised model safety. Alignment training methods, including direct preference optimization, IPO, and KTO, have shown promise in improving model helpfulness and reducing harm by optimizing for human preferences without explicitly training a separate reward model.
Alignment Techniques Overview
In this section, various techniques for aligning LLMs are introduced. Alignment training aims to subtly adjust models to favor selected outputs over others, akin to contrastive learning for embeddings, which clusters similar examples closer together in embedding space. One approach discussed is reinforcement learning with human feedback using proximal policy optimization (RLHF with PPO), which relies on reinforcement learning because the reward is computed on sampled text and therefore cannot be backpropagated through directly. However, it is noted for its high computational expense and instability during training.
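To make this concrete, the PyTorch sketch below shows the kind of KL-penalized reward signal RLHF pipelines typically maximize; the function and argument names are illustrative assumptions rather than the paper's implementation, and the PPO policy update itself is omitted.

```python
import torch

def rlhf_reward(reward_model_score: torch.Tensor,
                policy_logps: torch.Tensor,
                ref_logps: torch.Tensor,
                kl_coef: float = 0.05) -> torch.Tensor:
    """KL-penalized reward typically maximized in RLHF with PPO (sketch).

    The scalar score from a learned reward model is discounted by how far the
    policy's log-probabilities drift from the frozen reference model. Because
    the scored completions are sampled from the model, this signal cannot be
    backpropagated through directly, which is why a reinforcement learning
    algorithm such as PPO is used to update the policy.
    """
    kl_penalty = kl_coef * (policy_logps - ref_logps)
    return reward_model_score - kl_penalty
```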
Another method explored is direct preference optimization (DPO), which circumvents the need for a separate reward model by directly optimizing a specific objective. This approach leverages human-labeled preferences to construct datasets and optimizes language models, minimizing divergence from a reference policy while maintaining generation diversity.
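The DPO objective is compact enough to sketch directly. The following PyTorch function computes the standard DPO loss from the summed log-probabilities of the chosen and rejected completions under the trained policy and the frozen reference model; the variable names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss (sketch).

    Inputs are per-completion log-probabilities; beta controls how strongly
    the policy is kept close to the reference model.
    """
    # Implicit rewards: how far the policy deviates from the reference
    # on each completion.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Encourage a positive margin between chosen and rejected implicit rewards.
    margin = chosen_logratios - rejected_logratios
    return -F.logsigmoid(beta * margin).mean()
```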
IPO introduces a non-linear function to optimize preference probabilities, balanced by a Kullback-Leibler (KL) regularization term. It aligns policies with a reference distribution while exploring alternative forms of the objective function to mitigate the overfitting observed in DPO. Additionally, KTO introduces a loss function that decouples preferred and rejected outputs, so they need not be paired for the same prompt, building on human-aware loss functions (HALOs). This method balances the model's response selection by considering batches of preferred and rejected examples during training.
By incorporating insights from behavioral economics, KTO aims to enhance the model's decision-making process, particularly in scenarios where ensuring user satisfaction and adherence to specified guidelines are critical. These alignment techniques collectively contribute to advancing the robustness and effectiveness of LLMs in various practical applications, such as customer service and interactive systems.
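Both objectives can be written in a few lines. The PyTorch sketch below shows the IPO loss and a simplified KTO loss under stated assumptions: inputs are per-completion log-probabilities under the policy and the frozen reference model, `kl_estimate` stands for KTO's detached batch-level KL reference point, and all names and default weights are illustrative rather than taken from the paper.

```python
import torch

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """IPO loss (sketch): regress the log-ratio margin toward 1/(2*beta).

    Replacing DPO's log-sigmoid with a squared term bounds the objective,
    which is what mitigates the overfitting observed with DPO.
    """
    margin = (policy_chosen_logps - ref_chosen_logps) - \
             (policy_rejected_logps - ref_rejected_logps)
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

def kto_loss(policy_logps, ref_logps, is_desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO loss (sketch).

    Each example carries a single completion labeled desirable or undesirable,
    so preferred and rejected outputs need not be paired for the same prompt.
    Losses are measured relative to `kl_estimate`, a detached reference point,
    echoing the loss-aversion idea from prospect theory.
    """
    log_ratio = policy_logps - ref_logps
    desirable = lambda_d * (1 - torch.sigmoid(beta * (log_ratio - kl_estimate)))
    undesirable = lambda_u * (1 - torch.sigmoid(beta * (kl_estimate - log_ratio)))
    return torch.where(is_desirable, desirable, undesirable).mean()
```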
Alignment Techniques Evaluation
The experiments section evaluates techniques for aligning LLMs using custom datasets of synthetic customer care conversations. These datasets were designed to enhance the models' adherence to specific guardrails during interactions between agents and users, with generative pre-trained transformer 4 (GPT-4) used to generate responses. The experiments were divided into two main flows: Flow 1 and Flow 2.
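The article does not give the exact data schema, but a single preference record for guardrail adherence might look like the illustrative Python example below; the field names, guardrails, and dialogue text are assumptions, not taken from the paper's dataset.

```python
# Illustrative preference record; the schema and content are assumptions.
preference_example = {
    "prompt": (
        "System: You are a telecom support agent. Never promise refunds and "
        "always verify the account number before discussing billing.\n"
        "User: I was overcharged last month, just refund me now."
    ),
    "chosen": (
        "I'm sorry about the unexpected charge. Could you share your account "
        "number so I can look into last month's bill for you?"
    ),
    "rejected": (
        "No problem, I'll issue a full refund right away."  # violates the guardrail
    ),
}
```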
Flow 1 compared alignment tuning against supervised fine-tuning using two alignment techniques: IPO and KTO. Models were trained and evaluated based on metrics such as adherence, naturalness, and hallucination. Results indicated that alignment post-fine-tuning provided clear benefits, particularly in improving naturalness and adherence scores by approximately 5% and 15%, respectively, compared to supervised fine-tuning alone. IPO generally outperformed KTO in these experiments, showcasing its effectiveness in enhancing model performance across various dimensions.
Flow 2 focused on iterative improvement by aligning models on a feedback set derived from rejected responses after supervised fine-tuning. This approach showed significant performance gains in adherence and naturalness metrics, demonstrating the potential for continuous model enhancement without the risk of catastrophic forgetting associated with fine-tuning alone. Similar trends in hallucination scores were observed across both flows, highlighting consistent model behavior in handling outside information.
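As a rough illustration of this iterative loop, the sketch below pairs responses that were rejected after supervised fine-tuning with corrected, guardrail-compliant replacements to form a new preference set for alignment; the data structures and the review step are assumptions made for illustration, not the authors' pipeline.

```python
def build_feedback_set(sft_responses: dict, corrected_responses: dict) -> list:
    """Assemble a preference set from post-SFT rejections (sketch).

    `sft_responses` maps each prompt to the fine-tuned model's reply;
    `corrected_responses` maps prompts whose replies were rejected to a
    reviewed, guardrail-compliant reply. Prompts present in both become
    chosen/rejected pairs for a further round of alignment training.
    """
    feedback_set = []
    for prompt, model_reply in sft_responses.items():
        if prompt in corrected_responses:
            feedback_set.append({
                "prompt": prompt,
                "chosen": corrected_responses[prompt],
                "rejected": model_reply,
            })
    return feedback_set
```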
In technical details, the experiments employed Mistral-7B-Instruct as the base model due to its commercial license and superior performance compared to other models tested internally, such as the Llama-2 series. The training setups involved learning rates tailored to each technique: supervised fine-tuning (SFT) used a relatively high learning rate of 5e-4, whereas alignment techniques like IPO and KTO required significantly lower rates (2e-6 and 5e-7, respectively). This adjustment aimed to balance model stability and effectiveness in learning from the provided datasets.
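The reported learning rates can be summarized in a small configuration sketch; the three rates come from the article, while the dictionary layout and helper function are illustrative.

```python
# Learning rates reported in the article; the structure is illustrative.
TRAINING_CONFIG = {
    "sft": {"learning_rate": 5e-4},  # supervised fine-tuning
    "ipo": {"learning_rate": 2e-6},  # identity preference optimization
    "kto": {"learning_rate": 5e-7},  # Kahneman-Tversky optimization
}

def learning_rate_for(stage: str) -> float:
    """Look up the learning rate used for a given training stage."""
    return TRAINING_CONFIG[stage]["learning_rate"]
```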
Results from both experiment flows, evaluated using GPT-4, showed that the alignment techniques significantly enhanced adherence and naturalness and reduced hallucination in conversational contexts. Graphical representations in the paper highlighted these improvements across metrics.
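As an illustration of this kind of LLM-based scoring, the sketch below asks a judge model to rate a bot reply on the three reported metrics; the prompt wording, the 1-5 scale, and the `call_llm` helper are hypothetical assumptions, not the paper's evaluation protocol.

```python
JUDGE_PROMPT = """You are grading a customer-care bot reply.
Guardrails: {guardrails}
Conversation so far: {conversation}
Bot reply: {reply}
Rate the reply from 1 to 5 on (a) adherence to the guardrails,
(b) naturalness, and (c) hallucination (reliance on outside information).
Answer with three integers separated by commas."""

def judge_reply(call_llm, guardrails: str, conversation: str, reply: str) -> dict:
    """Score one reply with an LLM judge such as GPT-4 (sketch).

    `call_llm` is a hypothetical function that sends a prompt to the judge
    model and returns its raw text answer.
    """
    answer = call_llm(JUDGE_PROMPT.format(
        guardrails=guardrails, conversation=conversation, reply=reply))
    adherence, naturalness, hallucination = (int(x) for x in answer.split(","))
    return {"adherence": adherence,
            "naturalness": naturalness,
            "hallucination": hallucination}
```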
Conclusion
To sum up, alignment training via IPO demonstrated comparable or superior performance to instruction fine-tuning in enhancing instruction adherence for conversation bots. Because it operates at lower learning rates and optimizes a KL-regularized preference objective against a reference model, alignment proves beneficial for iterative tasks such as feedback-driven improvement and safety alignment. Future research could explore mitigating hallucination and extending alignment techniques to broader customer care tasks, such as intent detection and conversational insights, beyond conversation bots.
Journal reference:
- Preliminary scientific report. Garg, R., et al. (2024). Alignment For Performance Improvement in Conversation Bots. arXiv. DOI: 10.48550/arXiv.2406.18954, https://arxiv.org/abs/2406.18954