In an article recently submitted to the arxiv* server, researchers introduced ToolSword, a framework addressing safety concerns in large language models (LLMs) applied to tool learning. They identified six safety scenarios spanning input, execution, and output stages, including malicious queries (MQ), jailbreak attacks (JA), noise misdirection (NM), risky cues (RC), harmful feedback (HF), and error conflicts (EC).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Experiments on 11 LLMs, including generative pre-trained transformer (GPT)-4, revealed persistent challenges in handling harmful queries, using risky tools, and managing detrimental feedback, highlighting the need for safety considerations in tool learning.
Background
In recent times, tool learning has emerged as a promising strategy for integrating LLMs into practical applications. The process of tool learning involves three key stages: input, execution, and output. Researchers have primarily focused on enhancing LLM capabilities through tool utilization, employing methods such as fine-tuning the base model and promoting generalization skills. However, existing studies have largely overlooked the safety concerns associated with tool learning.
Safety concerns become particularly evident when contrasting standard dialogues with tool learning contexts. In standard dialogues, LLMs exhibit a safe alignment mechanism, recognizing and refusing assistance with unsafe queries. However, the tool-learning context introduces potential disruptions to this safety alignment. LLMs may respond to unsafe queries by invoking relevant tools, and the selection of tools might be influenced by malicious noise.
Prior research has not comprehensively addressed these safety challenges in the tool-learning process, necessitating a focused analysis to fill this critical gap. The introduction of ToolSword responded to this need, presenting a comprehensive framework designed to unveil safety issues throughout the tool learning process. By identifying and examining six safety scenarios at different stages of tool learning, ToolSword aimed to provide a granular understanding of how LLMs manage safety challenges. The researchers conducted experiments involving 11 LLMs, revealing significant safety risks despite their advanced capabilities. The findings underscored the urgency of enhancing safety measures in LLMs engaged in tool learning, contributing valuable insights for future research and development in this domain.
ToolSword
ToolSword was introduced as a comprehensive framework addressing safety challenges in the tool-learning process of LLMs. The framework analyzed LLMs across three stages: input, execution, and output, employing two safety scenarios for each stage. In the input stage, MQ and JA assessed LLMs' ability to decline unsafe requests. MQ involved testing LLMs against malicious queries and associated tools, while JA introduced jailbreak methods to increase the challenge. The execution stage evaluated LLMs' tool selection proficiency through NM and RC. NM examined LLMs' handling of noisy tool names, and RC assessed their ability to avoid unsafe tools. In the output stage, HF evaluated LLMs' capacity to prevent harmful content generation, and EC examined their ability to rectify factual errors or conflicts in feedback.
The scenarios involved real-world situations, such as noisy tool names and unsafe tool selection, to comprehensively assess LLMs' safety performance. Statistical information indicated the diversity and scale of the scenarios tested. ToolSword aimed to fill the gap in existing research by addressing safety concerns in tool learning, which has been largely overlooked. Through these safety scenarios, ToolSword provided a granular understanding of LLMs' challenges, emphasizing the need to enhance safety measures in practical applications. The framework was applied to 11 LLMs, revealing significant safety risks and emphasizing the urgency of addressing these issues. Overall, ToolSword contributed to advancing research in tool learning safety and provides valuable insights for the development and deployment of LLMs in real-world applications.
Experiments
The experiments focused on evaluating 11 LLMs in terms of safety during the tool-learning process. The LLMs, selected from various sources, included ChatGLM-3, ToolLLaMA-2, RoTLLaMA, NexusRaven, Qwen-chat, and GPT models. Three stages of analysis, namely input, execution, and output, each with two safety scenarios, were conducted to assess the models comprehensively.
In the input stage, safety scenarios such as MQ and JA tested the LLMs' ability to recognize and reject unsafe requests. The results revealed challenges in identifying MQs, with significant variations among LLMs and the need for improved safety measures. During the execution stage, the focus was on evaluating LLMs' proficiency in accurate tool selection. Scenarios like NM and RC exposed vulnerabilities in tool selection, especially in the presence of noise and the understanding of tools' potential risks.
In the output stage, LLMs' capacity to filter out harmful information was tested through scenarios like HF and EC. Findings indicated that LLMs struggle to recognize harmful content in tool feedback and cannot often rectify conflicting information.
The experiments underscored the limitations and safety risks associated with current LLMs in tool learning, emphasizing the urgency of improving safety alignment mechanisms and enhancing their comprehension of tool functions. The study revealed variations in performance across different LLMs and highlighted the need for targeted training strategies to address safety concerns in practical applications.
Further Studies
Increasing the size of LLMs did not uniformly enhance safety in tool learning, challenging the assumption that larger models inherently improved performance. In some cases, safety might even diminish with larger sizes. Surprisingly, the smallest model, ChatGLM-3-6B, exhibited better safety in certain scenarios due to its limited tool-learning ability. LLMs, despite safety concerns, could outperform humans in tool learning tasks in noise-free environments. Emphasizing the need for robust safety mechanisms, the authors suggested that, while LLMs demonstrated powerful capabilities, refining their safety remained crucial for practical applications.
Conclusion
In conclusion, the researchers introduced ToolSword, a framework examining safety issues in LLMs during tool learning across three stages. The authors highlighted the need for enhanced safety alignment mechanisms in LLMs. Acknowledging limitations, they identified issues without presenting specific defense strategies. Future research will prioritize addressing these gaps. The analysis focused on a single stage, emphasizing a comprehensive evaluation across all tool-learning interactions for a detailed assessment.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Ye, J., Li, S., Li, G., Huang, C., Gao, S., Wu, Y., Zhang, Q., Gui, T., & Huang, X. (2024, February 16). ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages. ArXiv.org. https://doi.org/10.48550/arXiv.2402.10753, https://arxiv.org/abs/2402.10753